Using the Vital AI software to make predictions

In this series of blog posts, I’ll introduce components of the Vital AI software used to make predictions via machine learning models.


We’ll use the venerable “20 Newsgroups” dataset, a standard benchmark for text classification consisting of around 20,000 articles across 20 categories.  The dataset is available here: https://github.com/vital-ai/vital-datasets/tree/master/20news
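To follow along, you can clone the repository containing the dataset:

git clone https://github.com/vital-ai/vital-datasets.git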

The primary steps are:

  • Set up a data model
  • Create the data set
  • Define the prediction model
  • Run the machine learning training
  • Evaluate the trained model
  • Use the model to make ongoing predictions

In this example, the predictions will be the categories assigned to the text, such as “baseball” for an article about baseball.

Next: Introduction to Big Data Models with Vital AI

Introduction to Big Data Models with Vital AI

Part of a series to introduce the Vital AI software used to make predictions.

Go to beginning: Using the Vital AI software to make predictions

At the heart of any data-driven application is a data model, but often the data model is never fully captured in one place.  Instead, it is spread out over many schema files, databases, source code files, and the minds of the developers working on the application.

By capturing the data model in one place:

  • Developers can easily reference it across different software components
  • Developers have a single place to look for data definitions
  • Code can be generated directly from the data model
  • Errors can be detected much more easily by checking data against the model

Typical schema formats are very limited in what they can specify.  To truly capture the full model of a Big Data application, a much richer format is needed.

At Vital AI, we use the OWL standard to describe Big Data Models.  OWL is a standard used to create data models – also known as “ontologies.”

Some background on the standard is here:
http://en.wikipedia.org/wiki/Web_Ontology_Language

And documentation on the standard is here:
http://www.w3.org/TR/owl2-overview/

To edit our data model, we’ll use Protégé, an open-source graphical ontology editor.

It’s available here: http://protege.stanford.edu/

For our data model, we need a single class (type of data object) to represent the articles in the 20 Newsgroup dataset.

Vital AI provides a core data model defining the most fundamental data types, and a base application data model which defines typical data objects such as “User” and “Document”.

For our Twenty Newsgroups dataset, we’ll extend the “Document” class and create the TwentyNewsDocument class.

[Screenshot: the TwentyNewsDocument class defined in Protégé]

The Vital AI “vitalsigns” application generates code from a data model, so our data object definitions can be used in our software.

From the command line we can enter:

vitalsigns generate -o twentynews-1.0.1.owl -p com.twentynews.model -j twentynews-groovy-1.0.2.jar

to create a JAR file which we can then include in our application.
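For instance, assuming a Gradle build (the libs/ path below is just an illustration), the generated jar can be added as a local dependency:

// build.gradle (sketch): put the vitalsigns-generated jar on the classpath
dependencies {
    implementation files('libs/twentynews-groovy-1.0.2.jar')
}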

In our IDE, we can use our new data model directly in our code.

[Screenshot: using the generated data model classes directly in the IDE]
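As a minimal sketch of what that looks like (the URI and property values below are made up for illustration; the package name comes from the vitalsigns command above):

import com.twentynews.model.TwentyNewsDocument

// create an article object and set its properties
def article = new TwentyNewsDocument()
article.URI = "http://example.org/twentynews/rec.sport.baseball/0001"
article.title = "Opening day"
article.body = "First game of the season is today."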

In addition to our data objects, we need to define the categories we want to use in our predictions.

Next: Creating a classification taxonomy with Vital AI

Creating a classification taxonomy with Vital AI

Part of a series to introduce the Vital AI software used to make predictions.

Go to beginning: Using the Vital AI software to make predictions

To categorize data, we need to define the categories and add them into our data model.

We can use a simple text file and helper application to do this.

First, our categories:

[Screenshot: the list of categories in twentynews_taxonomy.txt]
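As a rough sketch, assuming the file simply lists one newsgroup name per line (the exact vitaltaxonomy input format may differ), it might look like:

misc.forsale
rec.sport.baseball
rec.sport.hockey
sci.space
talk.politics.guns
...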

Then:

vitaltaxonomy -i twentynews_taxonomy.txt -o twentynews_categories.owl

The vitaltaxonomy command creates an OWL file that contains the list of categories.

These can be merged into our data model using:

vitalsigns mergeindividuals -o twentynews-1.0.2.owl -i ../taxonomy/twentynews_categories.owl

Now our categories are added into our data model.  We can check this by listing them:

vitalsigns listindividuals -o twentynews-1.0.2.owl

[Screenshot: output of vitalsigns listindividuals showing the category individuals]

And we can see them added into our model in Protégé:

[Screenshot: the category individuals displayed in Protégé]

Next: Adding features and target to a Vital AI predictive model

Adding features and target to a Vital AI predictive model

Part of a series to introduce the Vital AI software used to make predictions.

Go to beginning: Using the Vital AI software to make predictions

We can edit our data model to define a predictive model.

For our predictive model, we define:

  • A unique identifier for the model (URI)
  • A name for the model
  • The features (inputs) to the model, including their datatype. In this example, the inputs will be textual.
  • The target (output) of the model, including its datatype. In this example, the output is categorical (one of a list of options).
  • The machine learning algorithm to use with the predictive model. In this case, we’ll use Complementary Naive Bayes.

The predictive model is defined by a combination of individuals and annotations.

[Screenshot: the predictive model defined via individuals and annotations in Protégé]

Features are specified, such as the “hasBody” property.

[Screenshot: the hasBody property specified as a feature]

The Target property is specified:

[Screenshot: the target property specification]

We can also specify how results of the prediction will be asserted:

[Screenshot: the annotation specifying how prediction results are asserted]

We can check the definition of the model using the “showmodels” option of the vitalpredict command.

vitalpredict showmodels

[Screenshot: output of vitalpredict showmodels]

Now that we have our model defined, we can create our dataset.

Next: Creating a predictive model training set with Vital AI

Creating a predictive model training set with Vital AI

Part of an ongoing series to introduce the Vital AI software used to make predictions.

Go to beginning: Using the Vital AI software to make predictions

To build a predictive model with a machine learning algorithm, we first need to create a dataset.

The Twenty Newsgroups source data consists of around 20,000 individual text files, one per article.

The Vital AI software uses a standardized data format for datasets, with each data object conforming to the data model.

To convert the source data into the Vital AI data format, we use a simple script.

The key lines of the script are:

...

// create a document object for the current article
def doc = new TwentyNewsDocument()

doc.URI = "http://example.org/twentynews/${newsgroup}/${id}"
doc.title = subject
doc.body = body

// the category URI is derived from the newsgroup the article belongs to
doc.newsGroup = 'http://vital.ai/twentynews/Category/' + newsgroup

// write the document out as a single block
writer.startBlock()
writer.writeGraphObject(doc)
writer.endBlock()

} // end of the loop over the article files

The resulting data file is in the “Vital Block” format, so called because data objects can be grouped together into “blocks” for processing.

Next: Processing datasets using Machine Learning on Hadoop with Vital AI

Processing datasets using Machine Learning on Hadoop with Vital AI

Part of an ongoing series to introduce the Vital AI software used to make predictions.

Go to beginning: Using the Vital AI software to make predictions

From the previous steps, we now have:

  • TwentyNews Data Model, including a taxonomy of categories
  • Predictive Model defined for categorization
  • A dataset in the desired format

We can double-check our predictive model using:

vitalpredict showmodels

Now we’re ready to run our machine learning algorithm on the data.

We’ll be using Amazon’s Elastic MapReduce (EMR) service for Hadoop.

The s3cmd tool for accessing Amazon’s S3 service should be installed, as well as the elastic-mapreduce command-line tool for interacting with Amazon’s EMR service.

We can first upload our dataset to Amazon’s S3 service with a command such as:

s3cmd put twentynews-dataset.vital.gz s3://mys3bucket/input/twentynews-dataset.vital.gz
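We can confirm the upload by listing the bucket contents:

s3cmd ls s3://mys3bucket/input/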

Next, we can start a new EMR Hadoop cluster with a command such as:

vitalhadoop start --create --alive --num-instances 2 --master-instance-type m1.xlarge --slave-instance-type m1.medium --vital-bootstrap-action s3://mys3bucket/bootstrap-action/configure-hadoop-vital.sh --log-uri s3n://mys3bucket/logs/ --name "TwentyNewsgroup Example"

We can determine the cluster identifier with the command:

vitalhadoop status

To specify the location of resources in S3, use a config file, such as:

# common configuration settings

command {

    emrJobID = "cluster-id"
    emrJarLocation = "location-to-store-job-jar"
    s3TrainingPath = "path-to-training-data"
    s3OutputModelPath = "path-for-resulting-model"
    domainJarPath = "twentynews-1.0.1.jar"
    modelURI = "http://vital.ai/ontology/twentynews.owl#twentynews_cbayes_model"
    taxonomyRoot = "http://vital.ai/twentynews/Category/Taxonomy"

}

And once our cluster is running, we can run our algorithm with the command:

vitalpredict -c twentynews.config -j *cluster-id*

(Fill in the EMR cluster-id)

We can use the Amazon EMR interface to check our progress, or use:

vitalhadoop status 

[Screenshot: vitalhadoop status output while the job is running]

Once our job is complete, we can view the results:

[Screenshot: job results, including the holdout evaluation and confusion matrix]

This shows the results on the “holdout” data, data set aside during training that the model has never seen before.  The model correctly classified 88.95% of the instances.  Not bad!

The results also show the “confusion matrix,” a matrix of actual category versus predicted category.  A perfect model would have all counts along the diagonal; off-diagonal entries are errors, instances that belonged to one category but were classified by the model into another.
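To make the accuracy arithmetic concrete, here is a small sketch with a hypothetical three-category confusion matrix (all counts are made up):

// rows are actual categories, columns are predicted categories
def matrix = [
    [50,  2,  3],   // actual category A
    [ 4, 45,  6],   // actual category B
    [ 1,  5, 40]    // actual category C
]

// accuracy = correct predictions (the diagonal) / total instances
def correct = (0..<matrix.size()).sum { i -> matrix[i][i] }
def total = matrix.flatten().sum()
println "Accuracy: ${correct / total}"   // 135 / 156, about 0.865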

We can download our model from S3:

s3cmd get s3://mys3bucket/model/model.jar

Next: Using a model to make predictions with Vital AI

Using a model to make predictions with Vital AI

Final part of a series to introduce the Vital AI software used to make predictions.

Go to beginning: Using the Vital AI software to make predictions

Now that we have the predictive model produced in the previous step, we can use it to make new predictions.

To make a prediction, we need an instance of the data object with the properties set that will serve as the input features to the model.

Here is a snippet of code to load the model, instantiate a data object with properties, make the prediction, and output the result.

static main(args) {

    // load the trained model from the jar produced by the training job
    String modelJarPath = "./model/twentynews-model.jar"

    PredictModel model = new PredictModel()
    model.loadJar(new File(modelJarPath))

    // create a document and set the properties used as input features
    TwentyNewsDocument mydoc = new TwentyNewsDocument()
    mydoc.URI = "http://example.org/twentynews/TwentyNewsDocument/123"
    mydoc.title = "Let's play softball in the park!"
    mydoc.body = "Softball game tonight.  Bring your bats!"

    // run the prediction; the results are asserted on the document
    model.predict(mydoc)

    // print the predicted categories
    List<Category> categories = mydoc.getNewsCategories()

    for (Category category : categories) {
        println "${category.name}"
    }
}

Now we can use such code in our application to make ongoing predictions.

marc:twentynewsclassificationapp hadfield$ ./bin/twentynewsclassifier ./model/twentynews-model-container.jar 
using domain jar: /Users/hadfield/LocalStorage/vitalhome/domain-jar/twentynews-1.0.2.jar
Model jar path: ./model/twentynews-model-container.jar

Input:
mydoc.URI = "http://example.org/twentynews/TwentyNewsDocument/123"
mydoc.title = "Let's play softball in the park!"
mydoc.body = "Softball game tonight.  Bring your bats!"

Output:
rec.sport.baseball 0.2307
rec.sport.hockey 0.1717
talk.politics.guns 0.1717
sci.space 0.1717
misc.forsale 0.1224

The “baseball” category was the top choice, with a score of 0.2307.
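If the Category objects also carry their scores (the 'score' property name here is an assumption based on the output above), selecting the top prediction is a one-liner:

// pick the highest-scoring category (assumes a numeric 'score' property)
def top = categories.max { it.score }
println "Top category: ${top.name}"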

Hope you enjoyed this introduction to the Vital AI software.

We will present similar series for Natural Language Processing, Graph Analytics, Logical Inference, and other topics.  Please contact us for additional information and to start using the Vital AI software!

Vital AI:

info@vital.ai
http://vital.ai/#contact