Optimizing the Data Supply Chain for Data Science

I gave a talk at the Enterprise Dataversity conference in Chicago in November.

The title of the talk was:

“Optimizing the Data Supply Chain for Data Science”.


Below are the slides from that presentation.

Here is a quick summary of the talk:

The Data Supply Chain is the next step in the progression of large-scale data management.  It starts with a “traditional” Data Warehouse, moves to a Hadoop-based environment such as a Data Lake, then to a Microservice Oriented Architecture (microservices across a set of independently managed Hadoop clusters, “Micro-SOA”), and now to the Data Supply Chain, which adds further data management and coordination processes to produce high quality Data Products across independently managed environments.

A Data Product can be any data service such as an eCommerce recommendation system, a Financial Services fraud/compliance predictive service, or an Internet of Things (IoT) logistics optimization service.  As a specific example, loading the Amazon.com website triggers more than 170 Data Products predicting consumer sentiment, likely purchases, and much more.

The “Data Supply Chain” (DSC) is a useful metaphor for how a “Data Product” is created and delivered.  Just like a physical “Supply Chain”, data is sourced from a variety of suppliers.  The main difference is that a Data Product can be a real-time combination of all the suppliers at once as compared to a physical product which moves linearly along the supply chain.  However, very often data does flow linearly across the supply chain and becomes more refined downstream.

Each participant in a DSC may be an independent organization, a department within a large organization, or a combination of internal and external data suppliers, such as combining internal sales data with social media data.

As each participant in the DSC may have its own model of the data, combining data from many sources can be very challenging due to incompatible assumptions.  As a simple example, a “car engine supplier” considers a “car engine” to be a finished “product”, whereas a “car manufacturer” considers a “car engine” to be a “car part” and a finished car to be a “product”; the definitions of “product” and “car engine” are therefore inconsistent.

As there is no central definition of the data, with each data supplier operating independently, there must be an independent mechanism to capture metadata that helps data flow across the DSC.

At Vital AI, we use semantic data models to capture data models across the DSC.  The models capture all the implicit assumptions in the data, and facilitate moving data across the DSC and building Data Products.

We generate code from the semantic data models which then automatically drives ETL processes, data mapping, queries, machine learning, and predictive analytics — allowing data products to be created and maintained with minimal effort while data sources continue to evolve.
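
As a rough illustration of what this kind of code generation can look like (a hypothetical sketch, not the Vital AI generator), the snippet below uses the open-source rdflib library to read an OWL model and print a Python class stub per owl:Class, with attributes for its datatype properties; the file name model.owl is a placeholder.

# Minimal sketch of code generation from an OWL model (illustration only,
# not the Vital AI generator). Assumes rdflib; "model.owl" is a placeholder.
from rdflib import Graph, URIRef
from rdflib.namespace import RDF, RDFS, OWL

def local_name(uri):
    # Use the URI fragment or last path segment as an identifier.
    return str(uri).split("#")[-1].rstrip("/").split("/")[-1]

g = Graph()
g.parse("model.owl", format="xml")  # OWL serialized as RDF/XML

for cls in sorted(c for c in g.subjects(RDF.type, OWL.Class) if isinstance(c, URIRef)):
    props = sorted(p for p in g.subjects(RDF.type, OWL.DatatypeProperty)
                   if (p, RDFS.domain, cls) in g)
    print(f"class {local_name(cls)}:")
    for p in props:
        print(f"    {local_name(p)} = None  # datatype property of {local_name(cls)}")
    if not props:
        print("    pass")
    print()

The same traversal can just as easily emit ETL mappings or query templates, which is what lets generated artifacts track the model as it evolves.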

Creating semantic data models not only facilitates creating Data Products, but also provides a mechanism to develop good data standards — Data Governance — across the DSC.  Data Governance is a critical part of high quality Data Science.

As code generated from the semantic data models is included at all levels of the software stack, the models also provide a mechanism to keep the interpretation of data consistent across the stack, including User Interfaces, Data Infrastructure (databases), and Data Science components such as predictive models.

As infrastructure costs continue to fall, the primary cost component of high quality Data Products is human labor.  The use of technologies such as semantic data models to optimize the Data Supply Chain and minimize human labor becomes more and more critical.

To learn more about the Data Supply Chain and Data Products, including how to apply semantic data models to minimize the effort, please contact us at Vital AI!

— Marc Hadfield

Email: info@vital.ai
Telephone: 1.917.463.4776

Vital AI Dev Kit and Products Release 254

VDK 0.2.254 was recently released, as well as corresponding releases for each product.

The new release is available via the Dashboard:

https://dashboard.vital.ai

Artifacts are in the maven repository:

https://github.com/vital-ai/vital-public-mvn-repo/tree/releases/vital-ai

Code is in the public github repos for public projects:

https://github.com/orgs/vital-ai

Highlights of the release include:

Vital AI Development Kit:

  • Support for deep domain model dependencies.
  • Full support for dynamic domain models (OWL to JVM and JSON-Schema).
  • Synchronization of domain models between local and remote vitalservice instances.
  • Service Operations DSL for version upgrade and downgrade to facilitate updating datasets during a domain model change.
  • Support for loading older/newer versions of a domain model to facilitate upgrading/downgrading datasets.
  • Configuration option to specify enforcement of version differences (strict, tolerant, lenient).
  • Able to specify preferred version of imported domain models.
  • Able to specify backward compatibility with prior domain model versions.
  • Support for deploy directories to cleanly separate domain models under development from those deployed in applications.

VitalPrime:

  • Full dynamic domain support.
  • Synchronization of domain models between client and server.
  • Datascripts to support domain model operations.
  • Support for segment to segment data upgrade/downgrade for domain model version changes.

Aspen:

  • Prediction models to support externally defined taxonomies.
  • Support for the AlchemyAPI prediction model.
  • Support for the MetaMind prediction model.
  • Support for dynamic domain loading in Spark.
  • Added jobs for upgrading/downgrading datasets for version change.

Vital Utilities:

  • Import and Export scripts using bulk operations of VitalService.
  • Data migration script for updating datasets upon a version change.

Vital Vertx and VitalService-JS:

  • Support for dynamic domain models in JSON-Schema.
  • Asynchronous stream support, including multi-part data transfers (file upload/download in parts).

Vital Triplestore:

  • Support for the EQ_CASE_INSENSITIVE comparator.
  • Support for AllegroGraph 4.14.1.

Tracking Big Data Models in OWL with Git Version Control

In my presentation this year at NoSQL Now! / Semantic Technology Conference, I discussed Big Data Modeling.

A key point is using the same Data Model throughout an application stack, so data can be collected, stored, and analyzed in a streamlined way without introducing the data inconsistencies that otherwise inevitably occur during manual data transformations.  Ideally the Data Model can also be used to integrate additional components into your application stack with no extra manual integration effort, such as adding Machine Learning Analyzers, with the Data Model specifying which data elements to use in the analysis.

I presented OWL Ontologies ( http://www.w3.org/TR/owl2-overview/ ) as a great means of capturing Data Models, which can then be automatically transformed into the “schema” needed by different elements of the application stack, such as NoSQL databases or Machine Learning Analyzers.  At Vital AI, we use our tool VitalSigns to transform OWL Ontologies into code and schema files for a variety of components like HBase and Hadoop MapReduce/Spark Jobs.
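
As a hypothetical illustration of that kind of transformation (not the actual VitalSigns output), the sketch below derives a minimal JSON-Schema-style description of a single OWL class by mapping the XSD ranges of its datatype properties to JSON types; the model.owl file and the class URI are placeholders, and the mapping table covers only a few common XSD types.

# Hypothetical sketch: derive a minimal JSON-Schema-style description of one
# OWL class. Not the VitalSigns transformation; file and class URI are placeholders.
import json
from rdflib import Graph, URIRef
from rdflib.namespace import RDF, RDFS, OWL, XSD

XSD_TO_JSON = {XSD.string: "string", XSD.integer: "integer",
               XSD.boolean: "boolean", XSD.double: "number"}

def local_name(uri):
    return str(uri).split("#")[-1].rstrip("/").split("/")[-1]

g = Graph()
g.parse("model.owl", format="xml")

cls = URIRef("http://example.org/model#School")  # placeholder class URI
schema = {"title": local_name(cls), "type": "object", "properties": {}}
for prop in g.subjects(RDFS.domain, cls):
    if (prop, RDF.type, OWL.DatatypeProperty) in g:
        rng = g.value(prop, RDFS.range)
        schema["properties"][local_name(prop)] = {"type": XSD_TO_JSON.get(rng, "string")}

print(json.dumps(schema, indent=2))

A similar walk over object properties and their ranges can feed column layouts for a NoSQL store or field lists for Spark jobs.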

You can see the full presentation here:
https://vitalai.com/2014/08/26/big-data-modeling-at-nosqlnow-semantic-technology-conference-san-jose-2014/

An OWL Data Model used in this way is part of your codebase, and should be managed in the same way as the rest of your code.

Git is a wonderful code management tool — let’s use OWL and Git together!

Git can be used as a service from providers such as Github and Bitbucket.  Whether you use git internally or via a service provider, it’s a great way to keep developers organized while still working in a distributed and independent way.

As part of Vital AI’s VitalSigns tool, we’ve integrated Git and OWL in the following way:

Within our “home” directory, we keep a directory of domain ontologies in OWL at:

{home}/domain-ontology/

Previous versions of an ontology get moved to an archive directory at:

{home}/domain-ontology/archive/

We keep a strict naming convention for the ontologies:

{Domain}-{version}.owl

The Domain is kept unique and is the key element in the Ontology URI, such as:

http://www.vital.ai/ontology/nycschools/NYCSchoolRecommendation.owl

In this case, “NYCSchoolRecommendation” is the Domain, and “http://www.vital.ai/ontology/nycschools/” provides a unique namespace for the application.

The version follows the Semantic Versioning standard described here:

http://semver.org/

with a value like “0.1.8”

This value is also in the OWL ontology, specified like:

<owl:versionInfo>0.1.8</owl:versionInfo>

This makes the filename of this OWL ontology:

NYCSchoolRecommendation-0.1.8.owl

When we want to modify an ontology, we first increase the patch number using a script:

vitalsigns upversion NYCSchoolRecommendation-0.1.8.owl

which increases the version to 0.1.9, moves the old file to the archive, and creates a new version:

NYCSchoolRecommendation-0.1.9.owl

that is ready to be modified.
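
The sketch below shows roughly what such an upversion step could do; it is a simplified, hypothetical stand-in for the actual vitalsigns script, handles only the patch number, and assumes the naming and owl:versionInfo conventions described above.

# Hypothetical sketch of an "upversion" step (the real vitalsigns script may
# differ): bump the patch number, archive the old file, write the new one.
import re
import shutil
from pathlib import Path

def upversion(owl_path):
    path = Path(owl_path)
    m = re.fullmatch(r"(?P<domain>.+)-(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)\.owl",
                     path.name)
    if not m:
        raise ValueError("expected {Domain}-{major}.{minor}.{patch}.owl")
    old = f"{m['major']}.{m['minor']}.{m['patch']}"
    new = f"{m['major']}.{m['minor']}.{int(m['patch']) + 1}"

    # Keep the owl:versionInfo annotation in sync with the new file name.
    text = path.read_text().replace(f"<owl:versionInfo>{old}</owl:versionInfo>",
                                    f"<owl:versionInfo>{new}</owl:versionInfo>")

    # Move the old version into the archive, then write the bumped copy.
    archive = path.parent / "archive"
    archive.mkdir(exist_ok=True)
    shutil.move(str(path), str(archive / path.name))
    new_path = path.with_name(f"{m['domain']}-{new}.owl")
    new_path.write_text(text)
    return new_path

# Example: upversion("domain-ontology/NYCSchoolRecommendation-0.1.8.owl")
#          creates domain-ontology/NYCSchoolRecommendation-0.1.9.owl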

We keep the previous versions of the Ontology in the archive so that we can easily “roll back” to a previous version.  This is especially helpful because we may have data conformant to older versions of the Ontology, and we can use the older Ontology version to interpret those data sets.  We may have years' worth of data in our Data Warehouse (such as in a Hadoop cluster), and we don’t want to lose what the data means by losing our data model.

To update the ontology files, basic git commands such as “git add” and “git mv” are used, so that the git repository is aware of the new ontology and the moved old version.

Updating the git repository is then just a matter of using git commands such as “git push” to push updates to a remote repository and “git pull” to bring in updates from a remote repository.  By making modifications and using git push and pull, your entire development team can stay up-to-date with the latest versions of the OWL ontologies.

Full Git integration requires a few more steps.

When a file is moved into the archive, we add the username to the filename; this avoids clashes in the archive if two (or more) users independently moved the OWL ontology into the archive.  Thus, in the archive, we may have an OWL file with a name like:

NYCSchoolRecommendation-johnsmith-0.1.8.owl

when the user “johnsmith” moved it into the archive.  This won’t collide with a file like:

NYCSchoolRecommendation-maryjones-0.1.8.owl

if “maryjones” also was working on that version of the file.

Git compares files to determine whether they are the same or different using a command called “diff” (short for “differences”).  The “diff” command compares files line by line to find how they differ.  Software source code is generally in a linear order (Step 1, followed by Step 2, followed by Step 3, …), so this is a very natural way to find differences in source code.  However, order is not necessarily important in OWL files; the data model can be defined in any order.  If we define classes A and then B, this is the same as defining classes B and then A.  Thus, diff does not work well with OWL files, unless you give it a little help.

OWL is made up of definitions of classes, properties, annotations, and other elements.  Each of these has a unique identifier (a URI) associated with it.

This identifier gives us a way to sort the OWL ontology so we can always put it in the same order.  Once in the same order, we can compare the elements of the OWL ontology, such as class to class, property to property, to detect differences.

So, with a little help, we can continue to use an updated version of “diff” to find the differences between OWL ontologies, which is a key part of tracking changes.
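
A minimal sketch of that canonicalization, assuming the open-source rdflib library and setting aside the complication of blank nodes: serialize the ontology as N-Triples, which puts one statement per line, sort the lines, and an ordinary line-based diff of the two sorted outputs then lines up class with class and property with property.

# Hypothetical sketch: put an OWL file into a stable, sorted statement-per-line
# form so a plain line-based diff can compare two versions (blank nodes aside).
import sys
from rdflib import Graph

def canonical_lines(owl_path):
    g = Graph()
    g.parse(owl_path, format="xml")
    # N-Triples puts one statement per line; sorting gives a stable order.
    nt = g.serialize(format="nt")  # rdflib 6+ returns a str here
    return sorted(line for line in nt.splitlines() if line.strip())

if __name__ == "__main__":
    for line in canonical_lines(sys.argv[1]):
        print(line)

Running this over two versions of an ontology and diffing the results is the simplest workflow; git can also be pointed at a converter like this through its textconv diff configuration, so that “git diff” shows the sorted view automatically.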

The final addition to git required for supporting OWL ontology files is to the “merge” operation.  Git uses “merge” to combine the changes in two versions of a file into a new file.  As with “diff”, the files are expected to start from the same order, so for an OWL merge we must first sort the elements as we did for diff, then compare them one by one to merge the changes into a single merged file.
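
Here is a hypothetical sketch of that idea at the statement level, again assuming rdflib and leaving out conflict detection and blank-node handling: a triple from the common ancestor survives only if both sides kept it, and anything either side added is brought in.

# Hypothetical sketch of a three-way merge at the statement (triple) level.
# Real merges also need conflict handling and blank-node care; omitted here.
from rdflib import Graph

def load_triples(path):
    g = Graph()
    g.parse(path, format="xml")
    return set(g)  # a Graph iterates as (subject, predicate, object) triples

def merge(base_path, ours_path, theirs_path):
    base, ours, theirs = (load_triples(p) for p in (base_path, ours_path, theirs_path))
    kept = base & ours & theirs              # drop anything either side deleted
    added = (ours - base) | (theirs - base)  # keep anything either side added
    out = Graph()
    for triple in kept | added:
        out.add(triple)
    return out.serialize(format="xml")       # back to RDF/XML for the merged file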

To summarize, to use OWL files and Git together we must:

  • Enforce a naming convention that uses the version number in both the filename and the version annotation, so the archive holds historical versions of the OWL ontologies and we can easily “roll back” to a previous version, especially when interpreting data that conforms to an earlier version of the Ontology.
  • Incorporate the username of the user making the change into archived filenames to prevent clashes in the archive.
  • Update diff to put OWL files in a sorted order so that differences line up.
  • Update merge to work from sorted OWL files when merging differences.

For helpful code for the diff and merge cases above, check out the open-source owl2vcs project:

https://github.com/utapyngo/owl2vcs

VitalSigns makes use of this and the other methods described above to integrate OWL and Git.

Please contact us to help your team use Git and OWL Ontologies together!

http://vital.ai/#contact