VDK Release 0.2.304, VitalService Async with Kafka

We released VDK 0.2.304 a couple weeks back, and it contains many new features.

One of the most significant is the asynchronous version of the VitalService API, which uses an underlying realtime distributed messaging system in its implementation.

Quick aside for some definitions: In synchronous systems, one component sends a request and waits for an answer. This usually involves “blocking” — holding on to system resources while waiting. In asynchronous systems, one component sends a request, moves on to other things, and processes the answer when it arrives. This is usually “non-blocking”, with no or limited resources held while a response is pending. Applications usually combine both methods — each has advantages and disadvantages. Generally, asynchronous requests require more overhead but can scale up to large numbers, while synchronous requests can get individual answers much quicker. Most modern websites include asynchronous communication with the server, whereas microservices ( https://en.wikipedia.org/wiki/Microservices ) typically communicate synchronously.

kafka-integration
Architecture of an application using VitalService API Client with an underlying asynchronous implementation using a Kafka cluster and set of Workers to process messages.

While we have often combined VitalService with a realtime messaging system to build fully “reactive” Big Data applications ( http://www.reactivemanifesto.org/ ), this deeper integration enables a much simpler realtime application implementation and a seamless flow across synchronous and asynchronous software components.

So, the advantage in the 0.2.304 release is a simplification and streamlining of development processes via a unification of APIs — resulting in fewer lines of code, quicker development, fewer bugs, lower cost, and less technical debt.

Using the updated API, a developer works with a single API and chooses whether each call is synchronous or asynchronous based on the parameters of the call.

For our messaging implementation we use Kafka ( http://kafka.apache.org/ ); however, we could also use alternatives such as Amazon Kinesis ( https://aws.amazon.com/kinesis/ ).

In the above diagram, we have an application using the VitalService API client.  The application may be processing messages from a user, requesting realtime predictions from a predictive model, querying a database, or any other VitalService API function.

This VitalService API client is using Vital Prime as the underlying implementation.  See: https://console.vital.ai/productdetails/vital-prime-021 and http://www.vital.ai/tech.html for more information about Vital Prime.

Vital Prime acts as both a REST server and a message producer/consumer (publisher/subscriber).

When the VitalService API client authenticates with Prime, it internally learns the details of the distributed messaging cluster (the Kafka cluster), including the connection details and the current set of “topics” (essentially, queues) with their statuses.  Prime coordinates with the Kafka cluster using Zookeeper ( https://zookeeper.apache.org/ ) to track the available brokers and the status of topics.  The VitalService API client can then seamlessly turn incoming API calls into messages, and direct incoming messages to callback functions.

Thus, an API call such as a query has a synchronous version and an asynchronous version that is identical except for a callback function parameter (the callback function can be empty for “fire-and-forget” API calls).  If the synchronous version is used, a blocking REST call is made to Prime to fulfill the request.  If the asynchronous version is used, the call is directed into the messaging system.
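A rough sketch in Groovy of what the two styles look like from the application’s point of view (the asynchronous signature shown here is schematic, since the release notes only describe the callback parameter; the service and query objects are assumed to be set up as in the VitalService examples later on this page):

// Sketch only: service and q are a connected VitalService endpoint and a VitalBuilder
// query, set up as in the Beaker Notebook example later on this page.

// Synchronous: blocks on a REST call to Prime and returns the ResultList directly.
ResultList results = service.query(q)
results.each { println it }

// Asynchronous (schematic): the call is turned into a Kafka message and returns
// immediately; the callback closure runs when the response message arrives.
service.query(q) { ResultList asyncResults ->
    asyncResults.each { println it }
}

// "Fire-and-forget": pass an empty callback when no response handling is needed.
service.query(q) { }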

In our example application, we have three pictured “workers” which are processing messages consumed from Kafka, coordinating with Prime as needed.  By this method, work can be distributed across the cluster according to whatever scale is needed.  Such workers can be implemented with instances of Prime, a Spark Cluster ( http://spark.apache.org/ with http://aspen.vital.ai/ ), or any scalable compute service such as Amazon’s Lambda ( https://aws.amazon.com/lambda/ ).
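The wire format and topic names that Prime uses internally are not part of the public API, but conceptually a worker is just a standard Kafka consumer in a loop.  Below is a minimal sketch using the Kafka Java client from Groovy; the broker list, group id, topic name, and the handleRequest closure are all illustrative placeholders rather than VDK specifics.

import org.apache.kafka.clients.consumer.KafkaConsumer

// Stand-in for real work: hand the serialized API call to application logic,
// for example a Prime datascript or a Spark job.
def handleRequest = { String payload -> println "processing: $payload" }

def props = new Properties()
props.put("bootstrap.servers", "kafka1:9092,kafka2:9092")   // illustrative broker list
props.put("group.id", "vitalservice-workers")               // illustrative consumer group
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

def consumer = new KafkaConsumer<String, String>(props)
consumer.subscribe(["vitalservice-requests"])               // illustrative topic name

while (true) {
    // poll for new request messages; each record is one asynchronous API call
    def records = consumer.poll(1000)
    records.each { record ->
        handleRequest(record.value())
    }
}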

This deeper integration with realtime messaging systems, especially Kafka, has been spurred on by the upcoming release of our Haley AI Assistant service, which can be used to implement AI Assistants for different domains, as well as related “chatbot” services.

More information on Haley AI to come!

 

 

Using the Beaker Notebook with Vital Service

In this post I’ll describe using the Beaker data science notebook with Vital back-end components for data exploration and analysis, using the Wordnet dataset as an example.

At Vital AI we use many tools to explore and analyze data, and chief among them are data science notebooks.  Examples include IPython/Jupyter and Zeppelin, plus similar products/services such as Databricks and RStudio.

One that has become a recent favorite is the Beaker Notebook ( http://beakernotebook.com/ ).  Beaker is open-source with a git repo on github ( https://github.com/twosigma/beaker-notebook ) under very active development.

Beaker fully embraces polyglot programming and supports many programming languages including Javascript, Python, R, and JVM languages including Groovy, Java, and Scala.

Scala is especially nice for integration with Apache Spark.  R of course is great for stats and visualization, and JavaScript is convenient for visualization and web dashboards, especially when using visualization libraries like D3.

At Vital AI we typically use JVM for production server applications in combination with Apache Spark, so having all options available in a single notebook makes data analysis a very agile process.

About Data Models

At Vital AI we model data by creating data models (aka ontologies) to capture the meaning of the data, and use these data models within our code.  This allows the meaning of the data to guide our analysis, as well as enable strong data standards – saving a huge amount of manual effort.

We create data models using the open standard OWL, and then generate code using the VitalSigns tool.  This data model code is then utilized within all data analysis and workflows.

At runtime, VitalSigns loads data models into the JVM in one of two ways: from the classpath specified when the JVM started (using the ServiceLoader API), or dynamically via a dynamic classloader.

By using the dynamic method, we can use the Vital Prime server as a “data model server” so that data models are discovered and loaded at run-time from the Prime server.  Thus the data models are kept in sync with data managed by the Prime server, so data analysis is always working with the latest data definitions.

As Groovy is a dynamic language on the JVM, we use Groovy for many data analysis scripts that use data models.

About Wordnet

One of our favorite datasets to use is Wordnet.  From the Wordnet website ( http://wordnet.princeton.edu ):

WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet’s structure makes it a useful tool for computational linguistics and natural language processing.

As Wordnet has the form of a graph – words linked to words linked to other words – it is very convenient for visualization.

VitalService API

The VitalService API is a standard API that includes methods for data queries, running analysis scripts (aka datascripts), and reading, saving, updating, and deleting data (so-called “CRUD” operations).  We use the VitalService API for working with data locally, accessing a database, or using a remote service.  This means we use the same API calls when we switch from working with data locally to working with a production service, so we can use the same code library throughout.
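As a rough sketch of what this looks like in Groovy (heavily hedged: the save/get/delete method names below are approximate stand-ins for the CRUD operations, while query() and close() appear in the full example later in this post):

// Sketch only: method names for the CRUD calls are approximate.
def node = new SynsetNode().generateURI()   // a typed domain object generated from the data model
node.name = "example"

service.save(node)                          // create / update
def fetched = service.get(node.URI)         // read back by URI
service.delete(node.URI)                    // delete

ResultList list = service.query(q)          // query, with q built via VitalBuilder as shown below
service.close()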

Application Architecture

vital-arch1

A full application stack may include a web application layer, a VitalService implementation such as Prime, a database such as DynamoDB, and an analysis environment based on Apache Spark and Hadoop.  Above, Prime is managing the data models (the “gear” icons) and provides “datascripts” to process data via a scripting interface.

 

vital-arch2

The above diagram focuses on the current case of Beaker Notebook where we are connecting to VitalService Prime as a client, synchronizing the data models, and sending queries to an underlying database.  In our example, the database contains the Wordnet data.

Some sample code to generate the Wordnet dataset is here: https://github.com/vital-ai/vital-examples/blob/master/vital-samples/src/main/groovy/ai/vital/samples/SampleWordnetGenerate.groovy

And some sample code to load the Wordnet data into VitalService is here:  https://github.com/vital-ai/vital-examples/blob/master/vital-samples/src/main/groovy/ai/vital/samples/SampleWordnetImport.groovy

Or, more generally, the vitalimport utility could be used, found here: https://github.com/vital-ai/vital-utils

Back to the Beaker Notebook

Now that we have a few definitions out of the way, we can get back to using Beaker.

The previous version of Beaker had some limitations around loading JVM classes (see: https://github.com/twosigma/beaker-notebook/issues/2276 ), which are now fixed in Beaker’s GitHub repository but not yet included in a released version.  We’re currently using a patched version here: https://github.com/vital-ai/beaker-notebook until the next release.

For this example, let’s query some data using VitalService and then visualize the resulting data using D3 with JavaScript.

Our example is based on the one found here: https://pub.beakernotebook.com/#/publications/560c9f9b-14e6-4d95-8e78-cc0a60bf4e5a?fullscreen=true

Our example will include three cells: a Groovy cell to perform a query, a JavaScript cell to run D3 over the data, and an HTML cell to display the resulting graph.

The Groovy cell first connects to VitalService, like so:

VitalSigns vs = VitalSigns.get()

VitalServiceKey key = new VitalServiceKey().generateURI()
key.key = vs.getConfig("analyticsKey")

def service = VitalServiceFactory.openService(key, "prime", "AnalyticsService")

The code above initializes VitalSigns, sets an authentication key based upon a value in a configuration file, and connects to the VitalService endpoint.  Prime requires an authentication key for security.


vs.pipeline { ->

    def builder = new VitalBuilder()

    VitalGraphQuery q = builder.query {

        // query for graphs like:
        // node1(name:happy) ---edge---> node2

        GRAPH {

            value segments: ["wordnet"]
            value inlineObjects: true

            ARC {
                // bind this node to name "node1"
                node_bind { "node1" }

                // include subclasses of SynsetNode: Noun, Verb, Adjective, Adverb
                node_constraint { SynsetNode.expandSubclasses(true) }
                node_constraint { SynsetNode.props().name.equalTo("happy") }

                ARC {
                    // bind the node and edge to names "node2" and "edge"
                    edge_bind { "edge" }
                    node_bind { "node2" }
                }
            }
        }
    }.toQuery()

    ResultList list = service.query( q )

    // count the results
    def j = 1

    list.each {

        // Use the binding names to get the URI values out of GraphMatch

        def node1_uri = it."node1".toString()
        def edge_uri = it."edge".toString()
        def node2_uri = it."node2".toString()

        // inlineObjects is true, which embeds unseen objects into the results
        // if not found in the cache, get the graph object out of the GraphMatch results
        // graph objects referenced via the URI

        def node1 = vs.getFromCache(node1_uri) ?: it."$node1_uri"
        def edge = vs.getFromCache(edge_uri) ?: it."$edge_uri"
        def node2 = vs.getFromCache(node2_uri) ?: it."$node2_uri"

        // add new ones into cache, doesn't hurt to refresh existing ones
        vs.addToCache([node1, edge, node2])

        // print out node1 --edge--> node2, with edge type (minus the namespace)
        println j++ + ": " + node1.name + "---" + (edge.vitaltype.toString() - "http://vital.ai/ontology/vital-wordnet#") + "-->" + node2.name
    }

}

service.close()

The above code performs a query for all Wordnet entries with the name “happy”, and then follows all links from those to other words, putting the results into a cache, as well as printing them out.

Note the use of data model objects in the code above, such as “SynsetNode”, “VITAL_Node”, and “VITAL_Edge”.  Using these avoids any code that directly parses data – the analysis code receives data objects which are “typed” according to the data model.

A screenshot:

querygraph

The result of the “println” statements is:

1: happy---Edge_WordnetSimilarTo-->laughing, riant
2: happy---Edge_WordnetAlsoSee-->joyful
3: happy---Edge_WordnetAlsoSee-->joyous
4: happy---Edge_WordnetSimilarTo-->golden, halcyon, prosperous
5: happy---Edge_WordnetAttribute-->happiness, felicity
6: happy---Edge_WordnetAlsoSee-->euphoric
7: happy---Edge_WordnetAlsoSee-->elated
8: happy---Edge_WordnetAlsoSee-->cheerful
9: happy---Edge_WordnetAlsoSee-->felicitous
10: happy---Edge_WordnetAlsoSee-->glad
11: happy---Edge_WordnetAlsoSee-->contented, content
12: happy---Edge_WordnetSimilarTo-->blissful
13: happy---Edge_WordnetSimilarTo-->blessed
14: happy---Edge_WordnetSimilarTo-->bright
15: happy---Edge_WordnetAttribute-->happiness

We then take all the nodes and edges in the cache and turn them into JSON data as D3 expects.


def nodes = []
def links = []

Iterator i = vs.getCacheIterator()

def c = 0

while( i.hasNext() ) {

    GraphObject g = i.next()

    if(g.isSubTypeOf(VITAL_Node)) {

        // assign each node an index, used below to reference it in the links
        g."local:index" = c

        nodes.add( "{\"name\": \"$g.name\", \"group\": $c}" )

        c++
    }
}

def max = c

i = vs.getCacheIterator()

while( i.hasNext() ) {

    GraphObject g = i.next()

    if(g.isSubTypeOf(VITAL_Edge)) {

        def srcURI = g.sourceURI
        def destURI = g.destinationURI

        // look up the source and destination nodes to get their indexes
        def source = vs.getFromCache(srcURI)
        def destination = vs.getFromCache(destURI)

        def sourceIndex = source."local:index"
        def destinationIndex = destination."local:index"

        links.add( "{\"source\": $sourceIndex, \"target\": $destinationIndex, \"value\": 10}" )
    }
}

println "Graph:" + "{\"nodes\": $nodes, \"links\": $links}"

beaker.graph = "{\"nodes\": $nodes, \"links\": $links}"

 

The last line above puts the data into a “beaker” object, which is the handoff point to other languages.

A screenshot of the results and the JSON:

resultsgraph

Then in a JavaScript cell:


var graphstr = JSON.stringify(beaker.graph);

var graph = JSON.parse(graphstr)


var width = 800,
    height = 300;

var color = d3.scale.category20();

var force = d3.layout.force()
    .charge(-120)
    .linkDistance(100)
    .size([width, height]);

var svg = d3.select("#fdg").append("svg")
    .attr("width", width)
    .attr("height", height);

var drawGraph = function(graph) {
  force
      .nodes(graph.nodes)
      .links(graph.links)
      .start();

  var link = svg.selectAll(".link")
      .data(graph.links)
    .enter().append("line")
      .attr("class", "link")
      .style("stroke-width", function(d) { return Math.sqrt(d.value); });

  var gnodes = svg.selectAll('g.gnode')
     .data(graph.nodes)
     .enter()
     .append('g')
     .classed('gnode', true);
    
  var node = gnodes.append("circle")
      .attr("class", "node")
      .attr("r", 10)
      .style("fill", function(d) { return color(d.group); })
      .call(force.drag);

  var labels = gnodes.append("text")
      .text(function(d) { return d.name; });

  
  force.on("tick", function() {
    link.attr("x1", function(d) { return d.source.x; })
        .attr("y1", function(d) { return d.source.y; })
        .attr("x2", function(d) { return d.target.x; })
        .attr("y2", function(d) { return d.target.y; });

    gnodes.attr("transform", function(d) { 
        return 'translate(' + [d.x, d.y] + ')'; 
    });
      
    
      
  });
};

drawGraph(graph);

Screenshot:

jsgraph

Note the handoff of the “beaker.graph” object at the beginning of the JavaScript code.  It may be a bit tricky to get the data exchanges right so that JSON produced on the Groovy side is interpreted as JSON on the JavaScript side, and vice versa.  Beaker provides auto-translation for various data structures including DataFrames, but it still takes some trial and error to get it right.

The above JavaScript code comes from the Beaker example project, plus this Stack Overflow question, which discusses adding labels to graphs: http://stackoverflow.com/questions/18164230/add-text-label-to-d3-node-in-force-directed-graph-and-resize-on-hover

In the last Beaker cell, we include the HTML to be the “target” of the JavaScript code:


<style>
.node {
  stroke: #fff;
  stroke-width: 1.5px;
}

.link {
  stroke: #999;
  stroke-opacity: .6;
}
</style>
<div id="fdg"></div>

And a screenshot of the HTML cell with the resulting D3 graph.

happygraph2

Hope you have enjoyed this walkthrough of using Beaker with the VitalService interface, and visualizing query results in a graph with D3.

Please ask any questions in the comments section, or send them to us at info@vital.ai.

Happy New Year!

Vital AI

Optimizing the Data Supply Chain for Data Science

I gave a talk at the Enterprise Dataversity conference in Chicago in November.

The title of the talk was:

“Optimizing the Data Supply Chain for Data Science”.

data-supply-chain-edv2015-hadfield-submitted.001

Below are the slides from that presentation.

Here is a quick summary of the talk:

The Data Supply Chain is the next step in the progression of large-scale data management: starting with a “traditional” Data Warehouse, moving to a Hadoop-based environment such as a Data Lake, then to a Microservice Oriented Architecture (microservices across a set of independently managed Hadoop clusters, “Micro-SOA”), and now to the Data Supply Chain, which adds additional data management and coordination processes to produce high quality Data Products across independently managed environments.

A Data Product can be any data service, such as an eCommerce recommendation system, a Financial Services fraud/compliance predictive service, or an Internet of Things (IoT) logistics optimization service.  As a specific example, loading the Amazon.com website triggers more than 170 Data Products predicting consumer sentiment, likely purchases, and much more.

The “Data Supply Chain” (DSC) is a useful metaphor for how a “Data Product” is created and delivered.  Just like a physical “Supply Chain”, data is sourced from a variety of suppliers.  The main difference is that a Data Product can be a real-time combination of all the suppliers at once as compared to a physical product which moves linearly along the supply chain.  However, very often data does flow linearly across the supply chain and becomes more refined downstream.

Each participant of a DSC may be an independent organization, a department within a large organization, or a combination of internal and external data suppliers — such as combining internal sales data with social media data.

As each participant in the DSC may have its own model of data, combining data from many sources can be very challenging due to incompatible assumptions.  As a simple example, a “car engine supplier” considers a “car engine” as a finished “product“, whereas a “car manufacturer” considers a “car engine” to be a “car part” and a finished car as a “product“, therefore the definitions of “product” and “car engine” are inconsistent.

As there is no central definition of data and each data supplier operates independently, there must be an independent mechanism to capture metadata that assists in moving data across the DSC.

At Vital AI, we use semantic data models to capture data models across the DSC.  The models capture all the implicit assumptions in the data, and facilitate moving data across the DSC and building Data Products.

We generate code from the semantic data models which then automatically drives ETL processes, data mapping, queries, machine learning, and predictive analytics — allowing data products to be created and maintained with minimal effort while data sources continue to evolve.

Creating semantic data models not only facilitates creating Data Products, but also provides a mechanism to develop good data standards — Data Governance — across the DSC.  Data Governance is a critical part of high quality Data Science.

As code generated from semantic data models is included at all levels of the software stack, semantic data models also provide a mechanism to keep the interpretation of data consistent across the stack including in User Interfaces, Data Infrastructure (databases), and Data Science including predictive models.

As infrastructure costs continue to fall, the primary cost component of high quality Data Products is human labor.  The use of technologies such as semantic data models to optimize the Data Supply Chain and minimize human labor becomes more and more critical.

To learn more about the Data Supply Chain and Data Products, including how to apply semantic data models to minimize the effort, please contact us at Vital AI!

— Marc Hadfield

Email: info@vital.ai
Telephone: 1.917.463.4776

Vital AI Dev Kit and Product Release 255, including Vital DynamoDB Vitalservice Implementation

VDK 0.2.255 was recently released, as well as corresponding releases for each product.

dynamodb-logo

The main new addition is the Vital DynamoDB Vitalservice Implementation, now available in the dashboard.

https://dashboard.vital.ai

The Vital DynamoDB product provides an implementation of the Vitalservice API using Amazon’s DynamoDB Database-as-a-Service as the data repository.

DynamoDB provides a scalable NoSQL database service.  The Vitalservice implementation allows all the capabilities of the Vitalservice API such as Graph queries and use of VitalSigns domain model objects with DynamoDB as the underlying database.

More information about DynamoDB is available from Amazon here: https://aws.amazon.com/dynamodb/
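As a quick sketch of what this looks like from the developer’s side (the endpoint name, config key, and service name below are placeholders; the actual values come from the dashboard-provided configuration), a DynamoDB-backed service is opened and then used through exactly the same API calls as any other Vitalservice endpoint:

// Sketch only: follows the openService() pattern used in the Beaker Notebook post above;
// names and config values here are placeholders.
VitalSigns vs = VitalSigns.get()

VitalServiceKey key = new VitalServiceKey().generateURI()
key.key = vs.getConfig("dynamodbKey")

// open a Vitalservice endpoint backed by DynamoDB; everything after this point,
// including graph queries and VitalSigns domain objects, is identical to any
// other Vitalservice implementation
def service = VitalServiceFactory.openService(key, "dynamodb", "MyDynamoDBService")

// ... run the same VitalBuilder queries shown in the Beaker Notebook post above ...

service.close()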

New artifacts are in the maven repository:

https://github.com/vital-ai/vital-public-mvn-repo/tree/releases/vital-ai

Code is in the public github repos for public projects:

https://github.com/orgs/vital-ai

Vital DynamoDB Highlights:

  • Vitalservice implementation using Amazon’s DynamoDB as the underlying data repository
  • IndexDB updates to use DynamoDB as the backing database, combined with the Index for fast queries
  • VitalPrime updates to support IndexDB using DynamoDB
  • Support for transactions
  • Support for all query types including Select, Graph, Path, and Aggregation
  • Support for all query constraints including subclasses and subproperties
  • Vitalservice administrative scripts available for managing Apps and Segments
  • Vital Utility script updates for imports, exports, upgrades, downgrades, and other data processing with DynamoDB

Vital AI example apps for prediction using AlchemyAPI (IBM Bluemix), Metamind.io, and Apache Spark

Along with our recent release of VDK 0.2.254, we’ve added a few new example apps to help developers get started with the VDK.

By starting with one of these examples, you can quickly build applications for prediction, classification, and recommendation with a JavaScript web application front end, and prediction models on the server.  The examples use prediction models trained using Apache Spark or an external service such as AlchemyAPI (IBM Bluemix), or Metamind.io.

There is also an example app for various queries of a document database containing the Enron Email dataset.  Some details on this dataset are here: https://www.cs.cmu.edu/~./enron/

The example applications have the same architecture.

6a00e5510ddf1e883301bb086472c6970d-800wi

The components are:

  • JavaScript front end, using asynchronous messages to communicate with the server.  Messaging and domain model management are provided by the VitalService-JS library.
  • VertX application server, making use of the Vital-Vertx module.
  • VitalPrime server using DataScripts to implement server-side functionality, such as generating predictions using a Prediction Model.
  • Prediction Models to make predictions or recommendations.  A Prediction Model can be trained based on a training set, or it could interface to an external prediction service.  If trained, we often use Apache Spark with the Aspen library to create the trained prediction model.
  • A Database such as DynamoDB, Allegrograph, MongoDB, or other to store application data.

Here is a quick overview of some of the examples.

We’ll post detailed instructions on each app in followup blog entries.

MetaMind Image Classification App:

Source Code:

https://github.com/vital-ai/vital-examples/tree/master/metamind-app

Demo Link:

https://demos.vital.ai/metamind-app/index.html

Screenshot:

6a00e5510ddf1e883301b7c7c02e88970b-800wi

This example uses a MetaMind ( https://www.metamind.io/ ) prediction model to classify an image.

AlchemyAPI/IBM Bluemix Document Classification App

Source Code:

https://github.com/vital-ai/vital-examples/tree/master/alchemyapi-app

Demo Link:

https://demos.vital.ai/alchemyapi-app/index.html

Screenshot:

6a00e5510ddf1e883301b8d14a0496970c-800wi

This example app uses an AlchemyAPI (IBM Bluemix) prediction model to classify a document.

Movie Recommendation App

Source Code (Web Application):

https://github.com/vital-ai/vital-examples/tree/master/movie-recommendations-js-app

Source Code (Training Prediction Model):

https://github.com/vital-ai/vital-examples/tree/master/movie-recommendations

Demo Link:

https://demos.vital.ai/movie-recommendations-js-app/index.html

Screenshot:

6a00e5510ddf1e883301b7c7c03038970b-800wi

This example uses a prediction model trained on the MovieLens data to recommend movies based on a user’s current movie ratings.  The prediction model uses the Collaborative Filtering algorithm trained using an Apache Spark job.  Each user has a user-id such as “1010” in the screenshot above.

Spark’s collaborative filtering implementation is described here:

http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html

The MovieLens data can be found here:

http://grouplens.org/datasets/movielens/
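For readers who want a feel for the training side, here is a minimal, self-contained Groovy sketch of MLlib’s ALS-based collaborative filtering.  It is not the actual Aspen training job; the tiny in-memory rating list stands in for the real MovieLens data.

import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating

// Local Spark context for the sketch; a real job would run on the cluster.
def sc = new JavaSparkContext(new SparkConf().setAppName("movie-recs-sketch").setMaster("local[2]"))

// (user, movie, rating) triples; in practice these come from the MovieLens files.
def ratings = sc.parallelize([
    new Rating(1010, 1, 5.0d),
    new Rating(1010, 2, 3.0d),
    new Rating(1011, 1, 4.0d),
    new Rating(1011, 3, 1.0d)
])

// rank = 10 latent factors, 10 iterations, lambda = 0.01 regularization
def model = ALS.train(ratings.rdd(), 10, 10, 0.01d)

println("predicted rating for user 1010, movie 3: " + model.predict(1010, 3))

sc.stop()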

Enron Document Search App

Source Code:

https://github.com/vital-ai/vital-examples/tree/master/enron-js-app

Demo Link:

https://demos.vital.ai/enron-js-app/index.html

Screenshot:

6a00e5510ddf1e883301b7c7c0314e970b-800wi

This example demonstrates how to implement different queries against a database, such as a “select” query — find all documents with certain keywords, and a “graph” query — find documents that are linked to users.
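As a hedged sketch of what the “graph” query can look like, it follows the same VitalBuilder pattern used in the Wordnet example earlier on this page; the class name (EnronDocument), property, and segment below are hypothetical stand-ins for the actual Enron domain model.

// Illustrative only: the domain class, property, and segment are hypothetical;
// the builder structure matches the Wordnet graph query above.
VitalGraphQuery q = new VitalBuilder().query {

    GRAPH {

        value segments: ["enron"]
        value inlineObjects: true

        ARC {
            // bind matching documents to "doc"
            node_bind { "doc" }
            node_constraint { EnronDocument.props().title.equalTo("budget") }

            ARC {
                // follow edges from each document to the users linked to it
                edge_bind { "edge" }
                node_bind { "user" }
            }
        }
    }
}.toQuery()

ResultList list = service.query(q)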

Example Data Visualizations:

The Cytoscape graph visualization tool can be used to visualize the above sample data using the Vital AI Cytoscape plugin.

The Cytoscape plugin is available from:

https://github.com/vital-ai/vital-cytoscape

An example of visualizing the MovieLens data:

6a00e5510ddf1e883301b8d14a0660970c-800wi

An example of visualizing the Wordnet Dataset, viewing the graph centered on “Red Wine”:

6a00e5510ddf1e883301b8d14a07de970c-800wi

For generating and importing the Wordnet data, see sample code here:

https://github.com/vital-ai/vital-examples/tree/master/vital-samples/src/main/groovy/ai/vital/samples

Information about Wordnet is available here:

https://wordnet.princeton.edu/

Another example of the Wordnet data, with some additional visual styles added:

6a00e5510ddf1e883301b7c7c03384970b-800wi

Join Vital AI at NY Tech Day Tomorrow!

Come to NY Tech Day tomorrow (Thursday, April 23rd) and stop by our booth!

e70FYXSbyU5qqtkkvoZmCMmlPEeElI7iYk1P0pKC2jU

NY Tech Day is an annual start-up extravaganza with over 400 companies presenting and over 10,000 attendees.  It’s always a very exciting day.

It’s free to attend and is held at Pier 92 on the west side of Manhattan (around 54th Street, on the Hudson River).

More details are available at: https://techdayhq.com/new-york

Hope to see you there!

Tracking Big Data Models in OWL with Git Version Control

In my presentation this year at NoSQL Now! / Semantic Technology Conference, I discussed Big Data Modeling.

A key point is using the same Data Model throughout an application stack, so data can be collected, stored, and analyzed in a streamlined way without introducing data inconsistencies, which otherwise inevitably occur during manual data transformations.  Ideally the Data Model can be used to integrate additional components into your application stack with no additional manual integration effort, such as adding Machine Learning Analyzers with the Data Model specifying data elements to use in the analysis.

I presented OWL Ontologies ( http://www.w3.org/TR/owl2-overview/ ) as a great means of capturing Data Models, which can then be automatically transformed into the “schema” needed by different elements of the application stack, such as NoSQL databases or Machine Learning Analyzers.  At Vital AI, we use our tool VitalSigns to transform OWL Ontologies into code and schema files for a variety of components like HBase and Hadoop MapReduce/Spark Jobs.

You can see the full presentation here:
https://vitalai.com/2014/08/26/big-data-modeling-at-nosqlnow-semantic-technology-conference-san-jose-2014/

An OWL Data Model used in this way is part of your codebase, and should be managed in the same way as the rest of your code.

Git is a wonderful code management tool — let’s use OWL and Git together!

Git can be used as a service from providers such as Github and Bitbucket.  Whether you use git internally or via a service provider, it’s a great way to keep developers organized while still working in a distributed and independent way.

As part of Vital AI’s VitalSigns tool, we’ve integrated Git and OWL in the following way:

Within our “home” directory, we keep a directory of domain ontologies in OWL at:

{home}/domain-ontology/

Previous versions of an ontology get moved to an archive directory at:

{home}/domain-ontology/archive/

We keep a strict naming convention of the ontologies:

{Domain}-{version}.owl

The Domain is kept unique and is the key element in the Ontology URI, such as:

http://www.vital.ai/ontology/nycschools/NYCSchoolRecommendation.owl

with “NYCSchoolRecommendation” as the Domain in this case, with “http://www.vital.ai/ontology/nycschools/” providing a unique namespace for an application.

The version follows the Semantic Versioning standard described here:

http://semver.org/

with a value like “0.1.8”

This value is also in the OWL ontology, specified like:

<owl:versionInfo>0.1.8</owl:versionInfo>

This makes the filename of this OWL ontology:

NYCSchoolRecommendation-0.1.8.owl

When we want to modify an ontology we first increase the patch number using a script:

vitalsigns upversion NYCSchoolRecommendation-0.1.8.owl

which increases the version to 0.1.9, moves the old file to the archive, and creates a new version:

NYCSchoolRecommendation-0.1.9.owl

that is ready to be modified.

We keep the previous versions of the Ontology in the archive so that we can easily “roll back” to a previous version.  This is especially helpful as we may have data conformant to older versions of the Ontology — we can use the older Ontology version to interpret these data sets.  We may have years’ worth of data in our Data Warehouse (such as in a Hadoop cluster), and we don’t want to lose what the data means by losing our data model.

To update the ontology files, basic git commands such as “git add” and “git mv” are used, so that the git repository is aware of the new ontology and of the moved old version.

Updating the git repository is then just a matter of using the git commands such as “git push” to push updates to a remote repository, and “git pull” to bring in updates from a remote repository.  By making modifications and using git push and pull, your entire development team can keep update-to-date with the latest versions of the OWL ontologies.

Full Git integration requires a few more steps.

When a file is moved into the archive, we add the username to the filename — this avoids clashes in the archive if two (or more) users independently move the same OWL ontology into the archive.  Thus, in the archive, we may have an OWL file with a name like:

NYCSchoolRecommendation-johnsmith-0.1.8.owl

when the user “johnsmith” moved it into the archive.  This won’t collide with a file like:

NYCSchoolRecommendation-maryjones-0.1.8.owl

if “maryjones” also was working on that version of the file.

Git compares files to determine if they are different or the same using a command called “diff” (coming from “differences”).  The “diff” command compares files line by line to find how they differ.  Software source code is generally in linear order (Step 1, followed by Step 2, followed by Step 3, …), so this is a very natural way to find differences in source code.  However, order is not necessarily important in OWL files — the data model can be defined in any order.  If we define classes A and then B, this is the same as defining classes B and then A.  Thus, diff does not work well with OWL files — unless you give it a little help.

OWL is made up of definitions of classes, properties, annotations, and other elements.  Each of these has a unique identifier (a URI) associated with it.

This identifier gives us a way to sort the OWL ontology so we can always put it in the same order.  Once in the same order, we can compare the elements of the OWL ontology, such as class to class, property to property, to detect differences.

So, with a little help, we can continue to use an updated version of “diff” to find the differences between OWL ontologies, which is a  key part of tracking changes.
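A minimal Groovy sketch of that “little help”: canonicalize the RDF/XML by sorting top-level elements by URI before handing files to diff.  This is a simplification; the real tooling (see the owl2vcs project linked at the end of this post) is considerably more thorough.

import groovy.xml.XmlUtil

// Put an OWL (RDF/XML) file into a canonical order by sorting its top-level elements
// by rdf:about URI, so that textual diff lines up class-to-class and property-to-property.
def parser = new XmlParser(false, false)   // namespace-unaware, so 'rdf:about' stays a plain attribute name
def root = parser.parse(new File(args[0]))

// sort element children by their rdf:about URI, falling back to the element name
root.children().sort { child ->
    (child instanceof Node) ? (child.attributes()['rdf:about'] ?: child.name().toString()) : child.toString()
}

// print the canonicalized document; run this on both versions and diff the output
println XmlUtil.serialize(root)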

The final addition to git required for supporting OWL ontology files is to the “merge” operation.  Git uses “merge” to merge changes between two versions of a file to create a new file.  Similar to the case with “diff”, the files are expected to be starting from the same order.  So, for an OWL merge, we must first sort the elements like we did with diff, and compare them one by one to merge changes into a merged file.

To summarize, to use OWL files and Git together we must:

  • Enforce a naming convention using the version number in both the filename and the version annotation so that our archive will have historical versions of the OWL ontologies — we can easily “roll back” to a previous version, especially when interpreting data that may be conformant to an earlier version of the Ontology.
  • The naming convention should incorporate the username of the user making the change to prevent clashes in the archive
  • Update diff to put OWL files in sorted order to line up differences
  • Update merge to use sorted OWL files to help merging differences

For helpful code covering the diff and merge cases above, check out the open-source project:

https://github.com/utapyngo/owl2vcs

VitalSigns makes use of this and the other mentioned methods to integrate OWL and Git.

Please contact us to help your team use Git and OWL Ontologies together!

http://vital.ai/#contact