Processing datasets using Machine Learning on Hadoop with Vital AI

Part of an ongoing series to introduce the Vital AI software used to make predictions.

Go to beginning: Using the Vital AI software to make predictions

From the previous steps, we now have:

  • TwentyNews Data Model, including a taxonomy of categories
  • Predictive Model defined for categorization
  • A dataset in the desired format

We can double-check our predictive model using:

vitalpredict showmodels

Now we’re ready to run our machine learning algorithm on the data.

We’ll be using Amazon’s Elastic MapReduce (EMR) service for Hadoop.

The s3cmd tool for accessing Amazon’s S3 service should be installed, as well as the elastic-mapreduce command-line tool for interacting with Amazon’s EMR service.

We can first upload our dataset to Amazon’s S3 service with a command such as:

s3cmd put twentynews-dataset.vital.gz s3://mys3bucket/input/twentynews-dataset.vital.gz

Next, we can start a new EMR Hadoop cluster with a command such as:

vitalhadoop start --create --alive --num-instances 2 --master-instance-type m1.xlarge --slave-instance-type m1.medium --vital-bootstrap-action s3://mys3bucket/bootstrap-action/configure-hadoop-vital.sh --log-uri s3n://mys3bucket/logs/ --name "TwentyNewsgroup Example"

We can determine the cluster identifier with the command:

vitalhadoop status

To specify the location of resources in S3, use a config file, such as:

# common configuration settings
command {
    emrJobID = "cluster-id"
    emrJarLocation = "location-to-store-job-jar"
    s3TrainingPath = "path-to-training-data"
    s3OutputModelPath = "path-for-resulting-model"
    domainJarPath = "twentynews-1.0.1.jar"
    modelURI = "http://vital.ai/ontology/twentynews.owl#twentynews_cbayes_model"
    taxonomyRoot = "http://vital.ai/twentynews/Category/Taxonomy"
}
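Before submitting a job, it can help to confirm that the template placeholders in the config have been replaced with real values. The following is a minimal sketch in Python, not part of the Vital AI tools: the helper names are hypothetical, it assumes the settings are simple key = "value" lines as in the example above, and the S3 path used for s3TrainingPath is only an illustrative value.

```python
import re

# Hypothetical helper, not part of the Vital AI tools.
# Assumes each setting is a simple  key = "value"  line.
SETTING = re.compile(r'^\s*(\w+)\s*=\s*"([^"]*)"\s*$')

def read_settings(text):
    """Collect key/value pairs from lines of the form key = "value"."""
    settings = {}
    for line in text.splitlines():
        match = SETTING.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings

def unfilled(settings, placeholders):
    """Return the keys whose values are still template placeholders."""
    return sorted(key for key, value in settings.items() if value in placeholders)

# Placeholder strings taken from the config template above.
PLACEHOLDERS = {
    "cluster-id",
    "location-to-store-job-jar",
    "path-to-training-data",
    "path-for-resulting-model",
}

# Illustrative config text; the s3TrainingPath value is an assumed example.
example = '''
command {
    emrJobID = "cluster-id"
    emrJarLocation = "location-to-store-job-jar"
    s3TrainingPath = "s3://mys3bucket/input/twentynews-dataset.vital.gz"
    domainJarPath = "twentynews-1.0.1.jar"
}
'''

settings = read_settings(example)
print(unfilled(settings, PLACEHOLDERS))
```

Any key listed in the output still needs a real value before the job is run.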

And once our cluster is running, we can run our algorithm with the command:

vitalpredict -c twentynews.config -j cluster-id

(Fill in the EMR cluster-id)

We can use the Amazon EMR interface to check our progress, or use:

vitalhadoop status 

Once our job is complete, we can view the results:

[Screenshot: job results output]

This shows the results for the “holdout” data, that is, data the model did not see during training.  The model correctly classified 88.95% of the instances.  Not bad!
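The idea of holdout evaluation can be sketched in a few lines of Python. This is only an illustration of the concept, not how the Vital AI job is implemented: the function names, the 20% holdout fraction, and the fixed seed are all assumptions, since the series does not state how the split is made.

```python
import random

def holdout_split(instances, holdout_fraction=0.2, seed=42):
    """Shuffle a copy of the instances and set aside a fraction for evaluation.

    The fraction and seed are illustrative assumptions.
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    # Training set first, holdout set second
    return shuffled[n_holdout:], shuffled[:n_holdout]

def accuracy(predicted, actual):
    """Fraction of holdout instances whose predicted label matches the true label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

docs = [f"doc-{i}" for i in range(10)]
train, holdout = holdout_split(docs)
print(len(train), len(holdout))
```

The model is trained only on the training portion, and the reported percentage is the accuracy measured on the holdout portion.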

The results also show the “confusion matrix.”  This is a matrix of category vs. category.  A perfect model would place all of its counts along the diagonal.  Errors appear off the diagonal: each one is an instance that belongs to one category but that the model assigned to a different category.
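To make the diagonal idea concrete, here is a small Python sketch that builds a confusion matrix and computes the fraction of counts on its diagonal. The labels and predictions are made-up illustrative data, not results from the TwentyNews run, and the orientation (rows for actual categories, columns for predicted ones) is an assumption of this sketch.

```python
def confusion_matrix(actual, predicted, labels):
    """Count instances by (actual category, predicted category).

    Rows are actual categories and columns are predicted categories
    (an orientation chosen for this sketch).
    """
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def diagonal_fraction(matrix):
    """Correct classifications (the diagonal) divided by all instances."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Made-up example data for three categories
labels = ["rec.sport.baseball", "rec.sport.hockey", "sci.space"]
actual = ["rec.sport.baseball", "rec.sport.baseball",
          "rec.sport.hockey", "rec.sport.hockey",
          "sci.space", "sci.space"]
predicted = ["rec.sport.baseball", "rec.sport.hockey",
             "rec.sport.hockey", "rec.sport.hockey",
             "sci.space", "sci.space"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
print(diagonal_fraction(matrix))
```

In this toy data, one baseball document is mistaken for hockey, so a single count sits off the diagonal and five of the six instances are classified correctly.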

We can download our model from S3:

s3cmd get s3://mys3bucket/model/model.jar

Using a model to make predictions with Vital AI

Final part of a series to introduce the Vital AI software used to make predictions.

Now that we have the predictive model produced in the previous step, we can use it to make new predictions.

To make a prediction, we need an instance of the data object with its properties set; these properties are the input features to the model.

Here is a snippet of code to load the model, instantiate a data object with properties, make the prediction, and output the result.

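// Imports for PredictModel, TwentyNewsDocument, and Category are omitted here;
// their packages depend on the Vital AI and domain jars in use.
static main(args) {

    // Load the trained model produced in the previous step
    String modelJarPath = "./model/twentynews-model.jar"
    PredictModel model = new PredictModel()
    model.loadJar(new File(modelJarPath))

    // Create a document and set the properties used as input features
    TwentyNewsDocument mydoc = new TwentyNewsDocument()
    mydoc.URI = "http://example.org/twentynews/TwentyNewsDocument/123"
    mydoc.title = "Let's play softball in the park!"
    mydoc.body = "Softball game tonight.  Bring your bats!"

    // Run the prediction; the results are read back from the document below
    model.predict(mydoc)

    // Print the name of each predicted category
    List<Category> categories = mydoc.getNewsCategories()
    for (Category category : categories) {
        println category.name
    }
}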

Now we can use such code in our application to make ongoing predictions.

marc:twentynewsclassificationapp hadfield$ ./bin/twentynewsclassifier ./model/twentynews-model-container.jar 
using domain jar: /Users/hadfield/LocalStorage/vitalhome/domain-jar/twentynews-1.0.2.jar
Model jar path: ./model/twentynews-model-container.jar

Input:
mydoc.URI = "http://example.org/twentynews/TwentyNewsDocument/123"
mydoc.title = "Let's play softball in the park!"
mydoc.body = "Softball game tonight.  Bring your bats!"

Output:
rec.sport.baseball 0.2307
rec.sport.hockey 0.1717
talk.politics.guns 0.1717
sci.space 0.1717
misc.forsale 0.1224

The “baseball” category was the top choice, with a score of 0.2307.
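If an application needs to pick the top category programmatically, the scored output can be parsed and the highest score selected. The Python sketch below works on the text output shown above. The parsing helper is hypothetical, and it assumes each line is a category name followed by a numeric score separated by whitespace.

```python
# Output lines copied from the classifier run above
output = """\
rec.sport.baseball 0.2307
rec.sport.hockey 0.1717
talk.politics.guns 0.1717
sci.space 0.1717
misc.forsale 0.1224
"""

def parse_scores(text):
    """Parse 'category score' lines into (category, score) pairs."""
    scores = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the last run of whitespace: name on the left, score on the right
        name, score = line.rsplit(None, 1)
        scores.append((name, float(score)))
    return scores

scores = parse_scores(output)
top_name, top_score = max(scores, key=lambda pair: pair[1])
print(top_name, top_score)
```

An application could also compare the top score with the runner-up to decide whether the prediction is confident enough to act on. Any such threshold is a design choice for the application and is not something the series specifies.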

We hope you enjoyed this introduction to the Vital AI software.

We will present similar series for Natural Language Processing, Graph Analytics, Logical Inference, and other topics.  Please contact us for additional information and to start using the Vital AI software!

Vital AI:

info@vital.ai
http://vital.ai/#contact