Creating a predictive model training set with Vital AI

This post is part of an ongoing series introducing the Vital AI software for making predictions.

Before a machine learning algorithm can build a predictive model, the source data must be assembled into a dataset.

The Twenty Newsgroup source data consists of around 20,000 individual text files, one per article.
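
As a rough illustration of what reading one of these files involves, here is a minimal Groovy sketch. It assumes each article follows the usual Usenet layout (header lines, a blank line, then the body) and that the subject sits in a "Subject:" header. The helper name parseArticle and the sample text are hypothetical and are not taken from the Vital AI script.

```groovy
// Hypothetical helper: split one article into its subject line and body text.
// Assumes a Usenet-style layout: header lines, one blank line, then the body.
Map parseArticle(String text) {
    def lines = text.readLines()
    // The first blank line separates the headers from the body
    int split = lines.findIndexOf { it.trim().isEmpty() }
    def headers = split >= 0 ? lines.take(split) : lines
    def bodyLines = split >= 0 ? lines.drop(split + 1) : []
    // Take the text after "Subject:", or an empty string if the header is missing
    def subjectLine = headers.find { it.startsWith('Subject:') }
    def subject = subjectLine ? subjectLine.substring('Subject:'.length()).trim() : ''
    return [subject: subject, body: bodyLines.join('\n')]
}

// Made-up sample article, used only to exercise the helper
def sample = '''From: someone@example.com
Subject: Hello world
Lines: 2

First line.
Second line.'''

def parsed = parseArticle(sample)
println "subject=${parsed.subject}"
```

The subject and body values produced this way correspond to the subject and body variables used when populating the data object.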

The Vital AI software uses a standardized data format for datasets, with each data object conforming to the data model.

To convert the source data into the Vital AI data format, we use a simple script.

The key lines of the script are:


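    // Create one data object per article; TwentyNewsDocument is defined by the data model
    def doc = new TwentyNewsDocument()
    // The URI combines the newsgroup name and the article id
    doc.URI = "${newsgroup}/${id}"
    doc.title = subject
    doc.body = body
    // Prepending '' coerces newsgroup to a String
    doc.newsGroup = '' + newsgroup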
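
The lines above rely on newsgroup and id values supplied by the surrounding script. A minimal sketch of how such a loop might derive them is given here, assuming the source data is laid out as one directory per newsgroup with one file per article, and that the file name serves as the id. The function collectArticles, the ISO-8859-1 encoding, and the plain Map standing in for TwentyNewsDocument are all assumptions made for illustration, not part of the Vital AI software.

```groovy
// Hypothetical walk over the source data, assuming one directory per
// newsgroup and one file per article, with the file name used as the id.
List<Map> collectArticles(File root) {
    def records = []
    root.eachDir { File groupDir ->
        def newsgroup = groupDir.name
        groupDir.eachFile { File article ->
            // Skip anything that is not a regular file
            if (!article.isFile()) return
            def id = article.name
            // A plain Map stands in for TwentyNewsDocument in this sketch
            records << [URI      : "${newsgroup}/${id}".toString(),
                        newsGroup: newsgroup,
                        // The encoding is an assumption; adjust to match the data
                        text     : article.getText('ISO-8859-1')]
        }
    }
    return records
}

// Usage (the path is a placeholder):
// collectArticles(new File('/path/to/source-data')).each { println it.URI }
```

Collecting plain maps keeps the sketch independent of the data model classes; in the real script each record would be a TwentyNewsDocument instead.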





The resulting data file is in the “Vital Block” format, so called because data objects can be grouped together into “blocks” for processing.

