Processing datasets using Machine Learning on Hadoop with Vital AI

Part of an ongoing series to introduce the Vital AI software used to make predictions.

Go to beginning: Using the Vital AI software to make predictions

From the previous steps, we now have:

  • TwentyNews Data Model, including a taxonomy of categories
  • Predictive Model defined for categorization
  • A dataset in the desired format

We can double-check our predictive model using:

vitalpredict showmodels

Now we’re ready to run our machine learning algorithm on the data.

We’ll be using Amazon’s Elastic MapReduce (EMR) service for Hadoop.

The s3cmd tool for accessing Amazon’s S3 service should be installed, as well as the elastic-mapreduce command-line tool for interacting with Amazon’s EMR service.
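
Before going further, a quick check that both tools are on the PATH can save a failed run later. The following is a minimal sketch in POSIX shell; the helper name missing_tools is made up for illustration and is not part of s3cmd, the EMR tools, or Vital AI.

```shell
# Hypothetical helper: print each named tool that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v exits non-zero when the tool is not found
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

missing=$(missing_tools s3cmd elastic-mapreduce)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line
    echo "Missing tools:" $missing
else
    echo "All required tools found."
fi
```

If either name is reported as missing, install that tool before uploading data or starting a cluster.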

We can first upload our dataset to Amazon’s S3 service with a command such as:

s3cmd put twentynews-dataset.vital.gz s3://mys3bucket/input/twentynews-dataset.vital.gz

Next, we can start a new EMR Hadoop cluster with a command such as:

vitalhadoop start --create --alive --num-instances 2 --master-instance-type m1.xlarge --slave-instance-type m1.medium --vital-bootstrap-action s3://mys3bucket/bootstrap-action/ --log-uri s3n://mys3bucket/logs/ --name "TwentyNewsgroup Example"

We can determine the cluster identifier with the command:

vitalhadoop status

To specify the location of resources in S3, use a config file, such as:

# common configuration settings

command {
        emrJobID = "cluster-id"
        emrJarLocation = "location-to-store-job-jar"
        s3TrainingPath = "path-to-training-data"
        s3OutputModelPath = "path-for-resulting-model"
        domainJarPath = "twentynews-1.0.1.jar"
        modelURI = ""
        taxonomyRoot = ""
}

And once our cluster is running, we can run our algorithm with the command:

vitalpredict -c twentynews.config -j cluster-id

(Replace cluster-id with the EMR cluster identifier reported by vitalhadoop status.)

We can use the Amazon EMR interface to check our progress, or use:

vitalhadoop status 


Once our job is complete, we can view the results:


This shows the results for the “holdout” data: data that the model classified but had not seen during training. The model correctly classified 88.95% of the instances. Not bad!

The results also show the “confusion matrix,” a matrix of actual category vs. predicted category. A perfect result would have every count on the diagonal. An error occurs when an instance belongs to one category but the model assigns it to a different one, and it shows up as a count off the diagonal.
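
To make the link between the confusion matrix and the accuracy figure concrete, here is a small sketch that sums the diagonal and divides by the total. The 3x3 counts are made up for illustration (they are not the TwentyNews results), and matrix_accuracy is a hypothetical helper name, not a Vital AI command.

```shell
# Hypothetical helper: read a square confusion matrix on stdin
# (one row per line, whitespace-separated counts) and print the
# percentage of instances that fall on the diagonal.
matrix_accuracy() {
    awk '{
            for (i = 1; i <= NF; i++) {
                total += $i                   # every classified instance
                if (i == NR) correct += $i    # diagonal cell: correct category
            }
         }
         END { if (total > 0) printf "%.2f\n", 100 * correct / total }'
}

# Made-up counts: 135 of 150 instances lie on the diagonal.
printf '%s\n' "50 2 1" "3 45 4" "0 5 40" | matrix_accuracy   # prints 90.00
```

The off-diagonal cells show which pairs of categories get mixed up, which a single accuracy number hides.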

We can download our model from S3:

s3cmd get s3://mys3bucket/model/model.jar

Next: Using a model to make predictions with Vital AI
