Big Data Modeling at NoSQLNow! / Semantic Technology Conference, San Jose 2014

We had a wonderful time in San Jose last week at the NoSQLNow! / Semantic Technology Conference.

Many thanks to the organizers Tony Shaw, Eric Franzon, and the rest of the Dataversity team for putting on a great event!

My presentation on Thursday afternoon was “Big Data Modeling”.

The presentation is available below:

Vital AI: Big Data Modeling from Vital.AI

Data Visualization with Vital AI, Wordnet, and Cytoscape

In this series of blog posts, I’ll provide an example of using the Vital AI Development Kit (VDK) for Data Visualization.

One of my favorite visualization applications is Cytoscape ( http://www.cytoscape.org/ ).  Cytoscape is often used in Life Science research applications, but can be used for any graph visualization need.  I highly recommend giving it a whirl.  In this example, we’ll create a plugin to Cytoscape to connect with the Vital AI software.

Wordnet is a wonderful dataset that captures many types of relationships among words and word categories, including relationships like “part-of” as in “hand is part-of an arm” and “member-of” as in “soldier is member-of an army”.  Wordnet was developed at Princeton University ( http://wordnet.princeton.edu/ ).

Because Wordnet contains relationships between words, it’s an ideal dataset to use for graph visualization.  The technique can be applied to many different types of data.

For this example, we will:

  • Generate a dataset from the source Wordnet data, ready to load into Vital AI
  • Create a plugin to Cytoscape to connect to the Vital AI software via the VDK API
  • Visually explore the Wordnet data, perform some graph analysis, and use the analysis output as part of our visualization

Once complete, the Cytoscape interface viewing the Wordnet data via the underlying Vital AI VDK API will look like this:

[Screenshot: the Cytoscape interface displaying the Wordnet data]

Next Post: https://vitalai.com/2014/04/29/generating-a-wordnet-dataset-using-vital-ai-development-kit/

Generating a Wordnet Dataset using Vital AI Development Kit

Part of a series beginning with:

https://vitalai.com/2014/04/29/data-visualization-with-vital-ai-wordnet-and-cytoscape/

To import a new dataset into Vital AI with the VDK, the first step is to add the classes and properties needed to model the dataset to our data model.

Because we like to use Wordnet as an example, we have added classes and properties for it to the main Vital data model (vital.owl).

The main Node we’ve defined is the SynsetNode, as Wordnet uses “synset” objects for synonym-sets.  This node has sub-classes for Verbs, Adjectives, Adverbs, and Nouns for those different types of words.

[Screenshot: the SynsetNode class and its subclasses in the data model]

To connect the Wordnet SynsetNodes together, we represent the various Wordnet relationship types as Edges (there are quite a few).  Two such relationships are HyperNym and HypoNym, sometimes called the type-of or is-a relationships, as in the relationship between Tiger/Animal or Red/Color.

More information about HyperNyms and HypoNyms is available via Wikipedia here:  http://en.wikipedia.org/wiki/Hyponymy_and_hypernymy.

[Screenshot: the Wordnet relationship Edge types in the data model]

The current version of the Vital AI ontologies is available on GitHub here: https://github.com/vital-ai/vital-ontology/tree/rel-0.1.0

Now that we have our data model ready, we can generate a dataset.

There is an open-source API to access the Wordnet dictionary files via Java available from:  http://projects.csail.mit.edu/jwi/
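
Before we can iterate over Wordnet, the JWI dictionary must be opened against a local Wordnet installation.  A minimal sketch, where the install path is an assumption you would adjust for your system:

import edu.mit.jwi.Dictionary
import edu.mit.jwi.IDictionary

// Point JWI at the "dict" directory of a local Wordnet install
// (this path is an assumed example, not a required location)
File wnDict = new File("/usr/local/WordNet-3.0/dict")

IDictionary _dict = new Dictionary(wnDict)
_dict.open()   // must be called before any lookups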

We can use this API to help generate our dataset with code like this to create all our nodes:

import edu.mit.jwi.item.ISynset
import edu.mit.jwi.item.ISynsetID
import edu.mit.jwi.item.POS

// Assumes _dict is an opened JWI dictionary and writer is a Vital block-file writer.

for(POS p : POS.values()) {

    for( Iterator<ISynset> synsetIterator = _dict.getSynsetIterator(p);
         synsetIterator.hasNext(); ) {

        ISynset next = synsetIterator.next()

        String gloss = next.getGloss()

        List words = next.getWords()

        String word_string = words.toString()

        // Build a stable Wordnet identifier from the part-of-speech tag and synset offset
        String idPart = "${next.getPOS().getTag()}_${((ISynsetID)next.getID()).getOffset()}"

        // cls is the SynsetNode subclass chosen for this part-of-speech
        // (see the selection sketch below)
        SynsetNode sn = cls.newInstance()

        sn.URI = URIGenerator.generateURI("wordnet", cls)
        sn.name = word_string
        sn.gloss = gloss
        sn.wordnetID = idPart

        // Write the node as one block in the dataset file
        writer.startBlock()
        writer.writeGraphObject(sn)
        writer.endBlock()
    }
}

This mainly iterates over the parts-of-speech, iterates over the synonym-sets (“concepts”) in each part-of-speech, collects the words associated with each synonym-set, and adds a new SynsetNode for each synonym-set, setting a URI (unique identifier), the set of words, the gloss (short definition), and the Wordnet identifier.
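
The snippet above leaves the choice of cls implicit.  One way to pick the SynsetNode subclass per part-of-speech is a simple map; this is a minimal sketch, and the subclass names (NounSynsetNode and so on) are assumptions rather than names taken directly from vital.owl:

import edu.mit.jwi.item.POS

// Hypothetical mapping from JWI part-of-speech to SynsetNode subclass;
// the subclass names here are assumed for illustration
def posToClass = [
    (POS.NOUN)      : NounSynsetNode,
    (POS.VERB)      : VerbSynsetNode,
    (POS.ADJECTIVE) : AdjectiveSynsetNode,
    (POS.ADVERB)    : AdverbSynsetNode
]

def cls = posToClass[p]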

and code like this to create all our edges:

import java.util.Map.Entry

import edu.mit.jwi.item.IPointer
import edu.mit.jwi.item.ISynset
import edu.mit.jwi.item.ISynsetID
import edu.mit.jwi.item.POS

// Assumes synsetWords maps each synset ID to the URI of the SynsetNode created above.

for(POS p : POS.values()) {

    for( Iterator<ISynset> synsetIterator = _dict.getSynsetIterator(p);
         synsetIterator.hasNext(); ) {

        ISynset key = synsetIterator.next()

        String uri = synsetWords.get(key.getID())

        // getRelatedMap() returns the related synsets, keyed by relationship (pointer) type
        for( Entry<IPointer, List<ISynsetID>> entry : key.getRelatedMap().entrySet() ) {

            IPointer type = entry.getKey()
            List<ISynsetID> l = entry.getValue()

            for(ISynsetID id : l) {

                String destURI = synsetWords.get(id)

                // cls is the Edge subclass chosen for this pointer type
                // (see the selection sketch below)
                Edge_hasWordnetPointer newEdge = cls.newInstance()

                newEdge.URI = URIGenerator.generateURI("wordnet", cls)
                newEdge.sourceURI = uri
                newEdge.destinationURI = destURI

                writer.startBlock()
                writer.writeGraphObject(newEdge)
                writer.endBlock()
            }
        }
    }
}

This iterates over the parts-of-speech, iterates over all the synsets, gets the set of relationships for each, and adds an Edge of the specific relationship type, such as HyperNym or HypoNym, for each such relationship.
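
As with the nodes, the choice of cls above is implicit.  A minimal sketch of selecting the Edge subclass from the JWI pointer type, falling back to the generic Wordnet pointer edge; the specific edge class names are assumptions for illustration:

import edu.mit.jwi.item.Pointer

// Hypothetical mapping from JWI pointer type to Edge subclass;
// the edge class names here are assumed for illustration
def pointerToClass = [
    (Pointer.HYPERNYM)     : Edge_hasHyperNymEdge,
    (Pointer.HYPONYM)      : Edge_hasHypoNymEdge,
    (Pointer.MERONYM_PART) : Edge_hasPartMeronymEdge
]

def cls = pointerToClass[type] ?: Edge_hasWordnetPointer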

With this we have all our Nodes and Edges written to a dataset file (see previous blog entries for our file “block” format).

We can then import the dataset file into a local or remote Vital Service endpoint instance.

Next Post: https://vitalai.com/2014/04/29/building-a-data-visualization-plugin-with-the-vital-ai-development-kit/

Visualizing Data with Graph Analytics and the Vital AI Development Kit

Part of a series beginning with:

https://vitalai.com/2014/04/29/data-visualization-with-vital-ai-wordnet-and-cytoscape/

In the previous post, we created a Cytoscape App connected to a Vital Service Endpoint containing the Wordnet dataset. The App can search the Wordnet data and “expand” it by adding connected Nodes and Edges to the visualized network.

Now let’s use some graph analytics to help visualize a network. We’ll perform the analysis locally within Cytoscape; for a very large graph we would use a server cluster instead, which the Vital AI VDK and Platform enable by running the analysis within a Hadoop cluster. For this example, however, we’ll use a relatively small subset of the Wordnet data.

First let’s search for “car” (in the sense of “automobile”), add it to our network, and expand to get all nodes and edges up to 3 hops away. This gives us about 1,100 Nodes and around 1,500 Edges. Initially they are in a jumble, sometimes called a “hair-ball”.

[Screenshot: the expanded “car” network as an unorganized “hair-ball”]
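
As a toy illustration of what “up to 3 hops away” means (this is not the App’s actual query code), a breadth-first expansion collects every node within N edges of the start node:

// Toy example: breadth-first expansion over a small adjacency map
def expand(Map<String, List<String>> adj, String start, int maxHops) {
    def visited = [start] as Set
    def frontier = [start]
    maxHops.times {
        // follow edges out of the current frontier, keeping only newly seen nodes
        frontier = frontier.collectMany { adj[it] ?: [] }
                           .findAll { visited.add(it) }
    }
    return visited
}

def adj = ["car"          : ["cab", "motor vehicle"],
           "motor vehicle": ["car", "airplane"],
           "airplane"     : ["motor vehicle", "biplane"]]

assert expand(adj, "car", 3) == ["car", "cab", "motor vehicle", "airplane", "biplane"] as Set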

Now, let’s run our network analysis, available from the “Tools” menu.

[Screenshot: running the network analysis from the “Tools” menu]

By running the network analysis, we calculate various metrics about the network, such as how many edges are associated with each node (its “degree”).  Another such metric is “centrality”, a calculation of how “central” a node is to the network.  Central nodes can be more “important”, such as influencers in a social network.
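
As a toy illustration of the degree metric (not how Cytoscape computes its statistics), counting incident edges per node from an edge list is enough:

// Toy example: compute each node's degree from a small edge list
def edges = [["car", "motor vehicle"], ["car", "cab"], ["motor vehicle", "airplane"]]

def degree = [:].withDefault { 0 }
edges.each { a, b -> degree[a]++ ; degree[b]++ }

println degree   // [car:2, motor vehicle:2, cab:1, airplane:1]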

Next, we associate some of these metrics with the network visualization.  We can map node size to degree and node color to centrality.  The more red a node is, the more “important” it is.

[Screenshot: node size mapped to degree and node color to centrality]

Using options in the “Layout” menu, we use the centrality associated with the edges to lay out the network visually, revealing some of its underlying structure.

[Screenshot: the network after applying the layout]

Next we can zoom in on the middle of the network.

[Screenshot: zoomed view of the middle of the network]

The node representing “car, automobile” is a deep red as it is the most central and important part of the graph.

Panning around, we can find “motor vehicle”:

[Screenshot: the “motor vehicle” node]

“Motor Vehicle” is a reddish-yellow, reflecting that it is important, though not as important as “car, automobile”.

Panning over to “airplane” we see that it’s bright yellow, with its sub-types like “biplane” a bright green, reflecting that they are not central and not “important” by our metric.  This is no surprise, as “airplane” is a bit removed from the rest of the “car, automobile” network (they do have “motor vehicle” in common), and “biplane” is even further removed.

[Screenshot: the “airplane” and “biplane” nodes]

Cytoscape has many layout and visualization features and, paired with a Big Data repository via the Vital AI VDK, makes a compelling data analysis system.

The contextual menu also makes the App a great data exploration tool for discovering new ways the data is connected.

Hope you have enjoyed this series on integrating a Data Visualization application with Vital AI using the Vital AI Development Kit!

Using the Vital AI software to make predictions

In this series of blog posts, I’ll introduce components of the Vital AI software used to make predictions via machine learning models.


We’ll use the venerable “20 Newsgroup” dataset, often used in text classification, which consists of around 20,000 text articles across 20 categories.  The dataset is available here: https://github.com/vital-ai/vital-datasets/tree/master/20news

The primary steps are:

  • Set up a data model
  • Create the data set
  • Define the prediction model
  • Run the machine learning training
  • Evaluate the trained model
  • Use the model to make ongoing predictions

In this example, our predictions will be the categories assigned to the text, such as “baseball” if the text is about baseball.

Next: Introduction to Big Data Models

Introduction to Big Data Models with Vital AI

Part of a series to introduce the Vital AI software used to make predictions.

Go to beginning: Using the Vital AI software to make predictions

At the heart of any data-driven application is a data model, but often the data model is never fully captured.  Instead it is spread out over many schema files, databases, source code files, and the minds of the developers working on the application.

By capturing the data model in one place:

  • Developers can easily reference it across different software components
  • Developers have a single place to look for data definitions
  • Code can be generated directly from the data model
  • Errors can be detected much more easily by checking data against the model

Typical schema formats are very limited in what they can specify; to truly capture the full model for Big Data applications, a much richer format must be used.

At Vital AI, we use the OWL standard to describe Big Data Models.  OWL is a standard used to create data models – also known as “ontologies.”

Some background on the standard is here:
http://en.wikipedia.org/wiki/Web_Ontology_Language

And documentation on the standard is here:
http://www.w3.org/TR/owl2-overview/

We’ll edit our data model with Protégé, an open-source graphical ontology editor.

It’s available here: http://protege.stanford.edu/

For our data model, we need a single class (type of data object) to represent the articles in the 20 Newsgroup dataset.

Vital AI provides a core data model defining the most fundamental data types, and a base application data model which defines typical data objects such as “User” and “Document”.

For our Twenty Newsgroup dataset, we’ll extend the “Document” class and create the TwentyNewsArticle class.

[Screenshot: the TwentyNewsArticle class extending Document in Protégé]

The Vital AI “vitalsigns” application generates code from a data model, so our data object definitions can be used in our software.

From the command line we can enter:

vitalsigns generate -o twentynews-1.0.1.owl -p com.twentynews.model -j twentynews-groovy-1.0.2.jar

to create a JAR file which we can then include in our application.

In our IDE, we can use our new data model directly in our code.

[Screenshot: using the generated data model classes in the IDE]
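
For readers without the screenshot, a minimal sketch of what that usage might look like; the property names subject and body are assumptions based on the model described in this series:

import com.twentynews.model.TwentyNewsArticle

// A sketch only: "subject" and "body" are assumed property names
// on the generated TwentyNewsArticle class
TwentyNewsArticle article = new TwentyNewsArticle()

article.URI = URIGenerator.generateURI("twentynews", TwentyNewsArticle)
article.subject = "Re: spring training"
article.body = "The pitching staff looks strong this year..."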

In addition to our data objects, we need to define the categories we want to use in our predictions.

Next: Defining categories to use in predictions 

Creating a classification Taxonomy with Vital AI

Part of a series to introduce the Vital AI software used to make predictions.

Go to beginning: Using the Vital AI software to make predictions

To categorize data, we need to define the categories and add them into our data model.

We can use a simple text file and helper application to do this.

First, our categories:

[Screenshot: the list of categories in twentynews_taxonomy.txt]
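
For readers without the screenshot, the 20 Newsgroup category names look like this (the taxonomy file format, one category name per line, is an assumption):

alt.atheism
comp.graphics
comp.os.ms-windows.misc
rec.sport.baseball
sci.space
talk.politics.misc
…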

Then:

vitaltaxonomy -i twentynews_taxonomy.txt -o twentynews_categories.owl

The vitaltaxonomy command creates an OWL file that contains the list of categories.

These can be merged into our data model using:

vitalsigns mergeindividuals -o twentynews-1.0.2.owl -i ../taxonomy/twentynews_categories.owl

Now our categories are added into our data model.  We can check it by listing them.

vitalsigns listindividuals -o twentynews-1.0.1.owl

[Screenshot: output of vitalsigns listindividuals]

And we can see them added into our model with Protégé:

[Screenshot: the category individuals shown in Protégé]

Next: Adding features and target to a Vital AI predictive model

Adding features and target to a Vital AI predictive model

Part of a series to introduce the Vital AI software used to make predictions.

Go to beginning: Using the Vital AI software to make predictions

We can edit our data model to define a predictive model.

For our predictive model, we define:

  • A unique identifier for the model (URI)
  • A name for the model
  • The features (inputs) to the model, including their datatype. In this example, the inputs will be textual.
  • The target (output) of the model, including its datatype. In this example, the output is categorical (one of a list of options).
  • The machine learning algorithm to use with the predictive model. In this case, we’ll use Complementary Naive Bayes.

The predictive model is defined by a combination of individuals and annotations.

[Screenshot: the predictive model individuals and annotations in Protégé]

Features are specified, such as the “hasBody” property.

[Screenshot: the “hasBody” property specified as a feature]

The Target property is specified:

[Screenshot: the target property annotation]

We can also specify how results of the prediction will be asserted:

[Screenshot: specifying how prediction results are asserted]

We can check the definition of the model using the “showmodels” option of the vitalpredict command.

vitalpredict showmodels

[Screenshot: output of vitalpredict showmodels]

Now that we have our model defined, we can create our dataset.

Next: Creating a predictive model training set with Vital AI