Tracking Big Data Models in OWL with Git Version Control

In my presentation this year at NoSQL Now! / Semantic Technology Conference, I discussed Big Data Modeling.

A key point is using the same Data Model throughout an application stack, so data can be collected, stored, and analyzed in a streamlined way without introducing data inconsistencies, which otherwise inevitably occur during manual data transformations.  Ideally the Data Model can be used to integrate additional components into your application stack with no additional manual integration effort, such as adding Machine Learning Analyzers with the Data Model specifying data elements to use in the analysis.

I presented OWL Ontologies ( http://www.w3.org/TR/owl2-overview/ ) as a great means of capturing Data Models, which can then be automatically transformed into the “schema” needed by different elements of the application stack, such as NoSQL databases or Machine Learning Analyzers.  At Vital AI, we use our tool VitalSigns to transform OWL Ontologies into code and schema files for a variety of components like HBase and Hadoop MapReduce/Spark Jobs.

You can see the full presentation here:
https://vitalai.com/2014/08/26/big-data-modeling-at-nosqlnow-semantic-technology-conference-san-jose-2014/

An OWL Data Model used in this way is part of your codebase, and should be managed in the same way as the rest of your code.

Git is a wonderful code management tool — let’s use OWL and Git together!

Git can be used as a service from providers such as GitHub and Bitbucket.  Whether you use git internally or via a service provider, it’s a great way to keep developers organized while still working in a distributed and independent way.

As part of Vital AI’s VitalSigns tool, we’ve integrated Git and OWL in the following way:

Within our “home” directory, we keep a directory of domain ontologies in OWL at:

{home}/domain-ontology/

Previous versions of an ontology get moved to an archive directory at:

{home}/domain-ontology/archive/

We keep a strict naming convention of the ontologies:

{Domain}-{version}.owl

The Domain is kept unique and is the key element in the Ontology URI, such as:

http://www.vital.ai/ontology/nycschools/NYCSchoolRecommendation.owl

where “NYCSchoolRecommendation” is the Domain and “http://www.vital.ai/ontology/nycschools/” provides a unique namespace for an application.

The version follows the Semantic Versioning standard described here:

http://semver.org/

with a value like “0.1.8”

This value is also in the OWL ontology, specified like:

<owl:versionInfo>0.1.8</owl:versionInfo>

This makes the filename of this OWL ontology:

NYCSchoolRecommendation-0.1.8.owl
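
Because the version lives in two places (the filename and the owl:versionInfo annotation), it is easy for them to drift apart.  Below is a minimal sketch in Python of a consistency check.  It is not VitalSigns code: the function names and the filename pattern are assumptions made for illustration, and it assumes the ontology is serialized as RDF/XML.

```python
import re
import xml.etree.ElementTree as ET

OWL_NS = "http://www.w3.org/2002/07/owl#"

# Assumed pattern for {Domain}-{version}.owl with a MAJOR.MINOR.PATCH version.
FILENAME_RE = re.compile(
    r"^(?P<domain>[A-Za-z][A-Za-z0-9_]*)-(?P<version>\d+\.\d+\.\d+)\.owl$"
)


def parse_ontology_filename(filename):
    """Return (domain, version), or raise ValueError if the name breaks the convention."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a {{Domain}}-{{version}}.owl name: {filename}")
    return match.group("domain"), match.group("version")


def version_info(owl_xml):
    """Return the text of the first owl:versionInfo element in an RDF/XML string, or None."""
    root = ET.fromstring(owl_xml)
    node = root.find(f".//{{{OWL_NS}}}versionInfo")
    if node is None or node.text is None:
        return None
    return node.text.strip()


def filename_matches_ontology(filename, owl_xml):
    """True when the version in the filename equals the owl:versionInfo value."""
    _, version = parse_ontology_filename(filename)
    return version == version_info(owl_xml)
```

A check like this could run as a pre-commit hook, so that a mismatched file never reaches the shared repository.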

When we want to modify an ontology we first increase the patch number using a script:

vitalsigns upversion NYCSchoolRecommendation-0.1.8.owl

which increases the version to 0.1.9, moves the old file to the archive, and creates a new version:

NYCSchoolRecommendation-0.1.9.owl

that is ready to be modified.
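
The renaming half of this step can be sketched in a few lines of Python.  This is only an illustration of the patch bump under the naming convention above, not the actual VitalSigns script; the helper names are hypothetical.

```python
def bump_patch(version):
    """Increase the PATCH part of a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


def next_filename(filename):
    """Return the {Domain}-{version}.owl name for the next patch version."""
    if not filename.endswith(".owl"):
        raise ValueError(f"expected an .owl file: {filename}")
    stem = filename[: -len(".owl")]
    # The version is everything after the last hyphen in the stem.
    domain, version = stem.rsplit("-", 1)
    return f"{domain}-{bump_patch(version)}.owl"
```

Comparing the parts as integers matters: a patch of 9 becomes 10, which a plain string increment would get wrong.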

We keep the previous versions of the Ontology in the archive so that we can easily “roll back” to a previous version.  This is especially helpful because we may have data conformant to older versions of the Ontology, and we can use the older Ontology version to interpret these data sets.  We may have years’ worth of data in our Data Warehouse (such as in a Hadoop cluster), and we don’t want to lose what the data means by losing our data model.

To update the ontology files, basic git commands such as “git add” and “git mv” are used, so that the git repository is aware of both the new ontology and the moved old version.

Updating the git repository is then just a matter of using git commands such as “git push” to push updates to a remote repository, and “git pull” to bring in updates from a remote repository.  By making modifications and using git push and pull, your entire development team can keep up to date with the latest versions of the OWL ontologies.
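
As a rough sketch of the git side of an upversion, the Python below builds the commands that a wrapper script might run.  The function name and the commit message are assumptions for illustration, and the paths are relative to the {home} directory described above.  It assumes the new ontology file has already been written to disk before “git add” runs.

```python
def upversion_git_commands(old_name, new_name, archived_name):
    """Build the git commands that record one upversion step.

    archived_name is whatever name the old file takes inside the archive.
    """
    return [
        # Move the old version into the archive, keeping its history.
        ["git", "mv",
         f"domain-ontology/{old_name}",
         f"domain-ontology/archive/{archived_name}"],
        # Start tracking the new version.
        ["git", "add", f"domain-ontology/{new_name}"],
        # Record both changes together.
        ["git", "commit", "-m", f"Upversion {old_name} to {new_name}"],
    ]
```

Each inner list can be passed to subprocess.run(command, check=True) from within the repository.  Returning the commands instead of running them keeps the sketch easy to inspect and test.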

Full Git integration requires a few more steps.

When a file is moved into the archive, we add the username to the filename.  This avoids clashes in the archive if two (or more) users independently move the same OWL ontology into the archive.  Thus, in the archive, we may have an OWL file with a name like:

NYCSchoolRecommendation-johnsmith-0.1.8.owl

when the user “johnsmith” moved it into the archive.  This won’t collide with a file like:

NYCSchoolRecommendation-maryjones-0.1.8.owl

if “maryjones” was also working on that version of the file.
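
Building the archived name is a small string operation.  Here is a minimal Python sketch; the function name is hypothetical, and the {home} value is passed in by the caller.

```python
from pathlib import PurePosixPath


def archive_path(home, filename, username):
    """Return the archive location for an ontology file, tagged with the username."""
    if not filename.endswith(".owl"):
        raise ValueError(f"expected an .owl file: {filename}")
    stem = filename[: -len(".owl")]
    # Split {Domain}-{version} at the last hyphen and insert the username between them.
    domain, version = stem.rsplit("-", 1)
    archived = f"{domain}-{username}-{version}.owl"
    return PurePosixPath(home) / "domain-ontology" / "archive" / archived
```

For example, archive_path("/opt/vital", "NYCSchoolRecommendation-0.1.8.owl", "johnsmith") ends in NYCSchoolRecommendation-johnsmith-0.1.8.owl, where “/opt/vital” is just a placeholder home directory.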

Git determines whether files are the same or different using a command called “diff” (short for “differences”).  The “diff” command compares files line by line to find how they differ.  Software source code is almost always in linear order (Step 1, followed by Step 2, followed by Step 3, …), so this is a very natural way to find differences in source code.  However, order is not necessarily important in OWL files: the data model can be defined in any order.  If we define classes A and then B, this is the same as defining classes B and then A.  Thus, diff does not work well with OWL files unless you give it a little help.

OWL is made up of definitions of classes, properties, annotations, and other elements.  Each of these has a unique identifier (a URI) associated with it.

This identifier gives us a way to sort the OWL ontology so we can always put it in the same order.  Once in the same order, we can compare the elements of the OWL ontology, such as class to class, property to property, to detect differences.

So, with a little help, we can continue to use an updated version of “diff” to find the differences between OWL ontologies, which is a key part of tracking changes.
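
To make the sorting idea concrete, here is a minimal Python sketch.  It is not how owl2vcs or VitalSigns work internally: it assumes the ontology is RDF/XML whose top-level elements carry an rdf:about URI, sorts those elements by URI, and hands the result to an ordinary line-based diff.  The function names are hypothetical.

```python
import difflib
import xml.etree.ElementTree as ET

RDF_ABOUT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about"

# Use a readable prefix when elements are serialized.
ET.register_namespace("owl", "http://www.w3.org/2002/07/owl#")


def sorted_definitions(owl_xml):
    """Serialize each top-level element, ordered by (rdf:about URI, tag)."""
    root = ET.fromstring(owl_xml)
    ordered = sorted(root, key=lambda el: (el.get(RDF_ABOUT, ""), el.tag))
    lines = []
    for element in ordered:
        element.tail = None  # drop whitespace between definitions
        lines.extend(ET.tostring(element, encoding="unicode").splitlines())
    return lines


def owl_diff(old_xml, new_xml):
    """Line diff of two ontologies after putting both in the same order."""
    return list(difflib.unified_diff(
        sorted_definitions(old_xml), sorted_definitions(new_xml),
        "old", "new", lineterm="",
    ))
```

With this ordering, two files that define the same classes in a different order produce an empty diff, while an added or removed class shows up as a single changed block.  A real tool would also need to normalize the contents inside each definition, which this sketch leaves alone.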

The final addition to git required for supporting OWL ontology files is to the “merge” operation.  Git uses “merge” to merge changes between two versions of a file to create a new file.  Similar to the case with “diff”, the files are expected to start in the same order.  So, for an OWL merge, we must first sort the elements as we did with diff, and then compare them one by one to merge changes into a merged file.
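
The element-by-element comparison can be sketched as a three-way merge keyed by URI.  The Python below is an illustration under simplifying assumptions, not the owl2vcs or VitalSigns implementation: each ontology version is represented as a dictionary mapping an element URI to its definition text, and “base” is the common ancestor of the two edited versions.

```python
def merge_definitions(base, ours, theirs):
    """Three-way merge of {uri: definition} maps.

    Returns (merged, conflicts), where conflicts lists the URIs that both
    sides changed in different ways.
    """
    merged, conflicts = {}, []
    for uri in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(uri), ours.get(uri), theirs.get(uri)
        if o == t:
            winner = o  # both sides agree (including both deleting it)
        elif o == b:
            winner = t  # only their side changed this element
        elif t == b:
            winner = o  # only our side changed this element
        else:
            conflicts.append(uri)  # both changed it differently
            winner = o  # keep ours as a placeholder until resolved by hand
        if winner is not None:
            merged[uri] = winner
    return merged, conflicts
```

Because elements are matched by URI and not by position in the file, one user adding a class and another editing a different class merge cleanly no matter where each definition sits in the file.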

To summarize, to use OWL files and Git together we must:

  • Enforce a naming convention that uses the version number in both the filename and the version annotation, so that our archive holds historical versions of the OWL ontologies.  We can then easily “roll back” to a previous version, especially when interpreting data that may be conformant to an earlier version of the Ontology.
  • Incorporate the username of the user making the change into the archived filename to prevent clashes in the archive.
  • Update diff to put OWL files in sorted order to line up differences.
  • Update merge to use sorted OWL files to help merge differences.

For helpful code for the diff and merge cases above, check out the open-source project:

https://github.com/utapyngo/owl2vcs

VitalSigns makes use of this and the other mentioned methods to integrate OWL and Git.

Please contact us to help your team use Git and OWL Ontologies together!

http://vital.ai/#contact

Big Data Modeling at NoSQLNow! / Semantic Technology Conference, San Jose 2014

We had a wonderful time in San Jose last week at the NoSQLNow! / Semantic Technology Conference.

Many thanks to the organizers Tony Shaw, Eric Franzon, and the rest of the Dataversity team for putting on a great event!

My presentation on Thursday afternoon was “Big Data Modeling”.

The presentation is available below:

Vital AI: Big Data Modeling from Vital.AI