Stage
As with many AI problems, one first needs valid “real life” data that can form a useful baseline. Here at the UK Data Service we used two sources:
queries related to research data management (RDM), submitted by users to the Service via our web-based Helpdesk system
text from our fairly substantive RDM help pages
Once the first set of data was assembled, the next step was to clean and pre-process it so that it could be fed into a mapping function that tries to predict an answer for a given question. We did this by taking the existing emails and putting each one in its own text file. This resulted in a collection of 274 plain text files.
The same was done with all the data management web pages: each page was placed in its own text file. This proved to be a time-consuming task. Because of the initial file format of the data, both knowledge bases had to be generated by hand. Textual data often contains escape characters such as newline “\n” or tab “\t”, which need to be stripped from the text; this was done while separating the emails and web pages into individual files.
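The clean-up described above was done by hand, but the whitespace stripping and the one-document-per-file layout could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function names `clean_text` and `write_documents`, the file-name pattern, and the choice to replace newlines and tabs with a single space (rather than delete them, which would glue words together) are hypothetical and not taken from our actual workflow.

```python
import re
from pathlib import Path


def clean_text(raw: str) -> str:
    """Collapse newlines, tabs and other whitespace runs into single spaces."""
    return re.sub(r"\s+", " ", raw).strip()


def write_documents(documents, out_dir: Path, prefix: str):
    """Write each cleaned document to its own numbered plain text file.

    `documents` is an iterable of strings (e.g. email bodies or page text).
    The naming scheme `<prefix>_001.txt` is an assumption for illustration.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, doc in enumerate(documents, start=1):
        path = out_dir / f"{prefix}_{i:03d}.txt"
        path.write_text(clean_text(doc), encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Hypothetical sample query, not real Helpdesk data.
    print(clean_text("How do I\n\tshare my data?"))
    # prints: How do I share my data?
```

Replacing whitespace runs with a single space keeps word boundaries intact, which matters if the text is later tokenised for matching questions to answers.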