Part of an ongoing series to introduce the Vital AI software used to make predictions.
Go to beginning: Using the Vital AI software to make predictions
Processing datasets using Machine Learning on Hadoop with Vital AI
From the previous steps, we now have:
- TwentyNews Data Model, including a taxonomy of categories
- Predictive Model defined for categorization
- A dataset in the desired format
We can double-check our predictive model using:
Now we’re ready to run our machine learning algorithm on the data.
We’ll be using Amazon’s Elastic MapReduce (EMR) service for Hadoop.
The s3cmd tool for accessing Amazon’s S3 service should be installed, along with the elastic-mapreduce command-line client for interacting with Amazon’s EMR service.
We can first upload our dataset to Amazon’s S3 service with a command such as:
s3cmd put twentynews-dataset.vital.gz s3://mys3bucket/input/twentynews-dataset.vital.gz
Next, we can start a new EMR Hadoop cluster with a command such as:
vitalhadoop start --create --alive --num-instances 2 --master-instance-type m1.xlarge --slave-instance-type m1.medium --vital-bootstrap-action s3://mys3bucket/bootstrap-action/configure-hadoop-vital.sh --log-uri s3n://mys3bucket/logs/ --name "TwentyNewsgroup Example"
We can determine the cluster identifier with the command:
To specify the location of resources in S3, use a config file, such as:
# common configuration settings
emrJobID = "cluster-id"
emrJarLocation = "location-to-store-job-jar"
s3TrainingPath = "path-to-training-data"
s3OutputModelPath = "path-for-resulting-model"
domainJarPath = "twentynews-1.0.1.jar"
modelURI = "http://vital.ai/ontology/twentynews.owl#twentynews_cbayes_model"
taxonomyRoot = "http://vital.ai/twentynews/Category/Taxonomy"
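Before launching a job, it can help to sanity-check the config file locally. The following is a minimal sketch in Python. It assumes the file consists only of simple key = "value" lines and # comments, as in the example above; the parser that vitalpredict actually uses may accept more than this. The helper names (parse_config, missing_keys) are hypothetical, and treating every key in the example as required is also an assumption.

```python
import re

# Keys taken from the example config above.
# Assumption: all of them are required by the job.
REQUIRED_KEYS = [
    "emrJobID",
    "emrJarLocation",
    "s3TrainingPath",
    "s3OutputModelPath",
    "domainJarPath",
    "modelURI",
    "taxonomyRoot",
]

# Matches lines of the form: key = "value"
LINE_PATTERN = re.compile(r'^\s*(\w+)\s*=\s*"(.*)"\s*$')


def parse_config(text):
    """Parse key = "value" lines, skipping blank lines and # comment lines."""
    settings = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        match = LINE_PATTERN.match(line)
        if match is None:
            raise ValueError(f"Unrecognized config line: {line!r}")
        settings[match.group(1)] = match.group(2)
    return settings


def missing_keys(settings):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not settings.get(key)]


EXAMPLE = '''
# common configuration settings
emrJobID = "cluster-id"
emrJarLocation = "location-to-store-job-jar"
s3TrainingPath = "path-to-training-data"
s3OutputModelPath = "path-for-resulting-model"
domainJarPath = "twentynews-1.0.1.jar"
modelURI = "http://vital.ai/ontology/twentynews.owl#twentynews_cbayes_model"
taxonomyRoot = "http://vital.ai/twentynews/Category/Taxonomy"
'''

settings = parse_config(EXAMPLE)
print("Missing keys:", missing_keys(settings))
```

Only lines that start with # are treated as comments, so the # inside the modelURI value is preserved.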
And once our cluster is running, we can run our algorithm with the command:
vitalpredict -c twentynews.config -j *cluster-id*
(Fill in the EMR cluster-id)
We can use the Amazon EMR interface to check our progress, or use:
Once our job is complete, we can view the results:
This shows the results for the “holdout” data: instances that the model did not see during training. The model correctly classified 88.95% of the instances. Not bad!
The results also show the “confusion matrix,” a matrix of actual category vs. predicted category. A perfect result would have all counts along the diagonal. Errors appear off the diagonal, where an instance belonged to one category but the model assigned it to a different one.
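To make the relationship between the confusion matrix and the accuracy figure concrete, here is a small Python sketch. It is not the Vital AI implementation. The labels and predictions are made-up toy data, and placing actual categories on the rows and predicted categories on the columns is an assumption about orientation.

```python
def confusion_matrix(actual, predicted, labels):
    """Count instances by (actual category, predicted category).

    Rows are actual categories, columns are predicted categories
    (an assumed orientation, chosen for this illustration).
    """
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Toy holdout data, invented for illustration only.
labels = ["comp.graphics", "rec.autos", "sci.space"]
actual = [
    "sci.space", "sci.space", "sci.space",
    "rec.autos", "rec.autos",
    "comp.graphics", "comp.graphics", "comp.graphics",
]
predicted = [
    "sci.space", "sci.space", "comp.graphics",
    "rec.autos", "rec.autos",
    "comp.graphics", "sci.space", "comp.graphics",
]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(f"{label:15s} {row}")
print(f"accuracy = {accuracy(matrix):.2%}")
```

In this toy data, six of the eight instances fall on the diagonal, so the accuracy is 75%. The two off-diagonal counts are the misclassifications.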
We can download our model from S3:
s3cmd get s3://mys3bucket/model/model.jar
Next: Using a model to make predictions with Vital AI