Optimizing the Data Supply Chain for Data Science

I gave a talk at the Enterprise Dataversity conference in Chicago in November.

The title of the talk was:

“Optimizing the Data Supply Chain for Data Science”.


Below are the slides from that presentation.

Here is a quick summary of the talk:

The Data Supply Chain is the next step in the progression of large-scale data management: starting with a “traditional” Data Warehouse, moving to a Hadoop-based environment such as a Data Lake, then to a Microservice Oriented Architecture (microservices across a set of independently managed Hadoop clusters, “Micro-SOA”), and now to the Data Supply Chain, which adds further data management and coordination processes to produce high-quality Data Products across independently managed environments.

A Data Product can be any data service, such as an eCommerce recommendation system, a Financial Services fraud/compliance predictive service, or an Internet of Things (IoT) logistics optimization service. As a specific example, loading the Amazon.com website triggers more than 170 Data Products predicting consumer sentiment, likely purchases, and much more.

The “Data Supply Chain” (DSC) is a useful metaphor for how a “Data Product” is created and delivered.  Just like a physical “Supply Chain”, data is sourced from a variety of suppliers.  The main difference is that a Data Product can be a real-time combination of all the suppliers at once as compared to a physical product which moves linearly along the supply chain.  However, very often data does flow linearly across the supply chain and becomes more refined downstream.

Each participant in a DSC may be an independent organization, a department within a large organization, or a combination of internal and external data suppliers, such as when internal sales data is combined with social media data.

As each participant in the DSC may have its own model of the data, combining data from many sources can be very challenging due to incompatible assumptions. As a simple example, a car engine supplier considers a “car engine” to be a finished “product,” whereas a car manufacturer considers a “car engine” to be a “car part” and a finished car to be the “product”; the definitions of “product” and “car engine” are therefore inconsistent.
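As a purely illustrative sketch (plain Groovy classes, not actual Vital AI domain models; all names are hypothetical), the same engine ends up under two incompatible definitions of “product”:

// The engine supplier's model: the engine itself is the finished "product".
class SupplierProduct {
    String engineSerialNumber
}

// The car manufacturer's model: the finished car is the "product";
// the same engine appears only as one "part" among many.
class ManufacturerProduct {
    String vin
    List<String> partSerialNumbers
}

Reconciling these two views means making the hidden assumptions explicit, which is what the semantic data models described below are used for.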

As there is no central definition of the data, with each data supplier operating independently, there must be an independent mechanism to capture metadata that helps move data across the DSC.

At Vital AI, we use semantic data models to capture data models across the DSC.  The models capture all the implicit assumptions in the data, and facilitate moving data across the DSC and building Data Products.

We generate code from the semantic data models which then automatically drives ETL processes, data mapping, queries, machine learning, and predictive analytics — allowing data products to be created and maintained with minimal effort while data sources continue to evolve.

Creating semantic data models not only facilitates creating Data Products, but also provides a mechanism to develop good data standards — Data Governance — across the DSC.  Data Governance is a critical part of high quality Data Science.

As code generated from semantic data models is included at all levels of the software stack, semantic data models also provide a mechanism to keep the interpretation of data consistent across the stack including in User Interfaces, Data Infrastructure (databases), and Data Science including predictive models.

As infrastructure costs continue to fall, the primary cost component of high quality Data Products is human labor.  The use of technologies such as semantic data models to optimize the Data Supply Chain and minimize human labor becomes more and more critical.

To learn more about the Data Supply Chain and Data Products, including how to apply semantic data models to minimize the effort, please contact us at Vital AI!

— Marc Hadfield

Email: info@vital.ai
Telephone: 1.917.463.4776

Vital AI Dev Kit and Product Release 255, including Vital DynamoDB Vitalservice Implementation

VDK 0.2.255 was recently released, as well as corresponding releases for each product.


The main new addition is the Vital DynamoDB Vitalservice Implementation, now available in the dashboard.

https://dashboard.vital.ai

The Vital DynamoDB product provides an implementation of the Vitalservice API using Amazon’s DynamoDB Database-as-a-Service as the data repository.

DynamoDB provides a scalable NoSQL database service.  The Vitalservice implementation allows all the capabilities of the Vitalservice API such as Graph queries and use of VitalSigns domain model objects with DynamoDB as the underlying database.
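As a rough illustration (segment and class names below are hypothetical; the SELECT pattern follows the Groovy query builder shown in our other posts), a query against a DynamoDB-backed Vitalservice is built the same way it would be for any other backend:

// Hypothetical example: select up to 100 Customer objects from a DynamoDB-backed segment.
VitalSelectQuery customerQuery = new VitalBuilder().query {
    SELECT {
        value segments: ['customer-data']    // hypothetical segment name
        value offset: 0
        value limit: 100
        node_constraint { Customer.class }   // hypothetical VitalSigns domain class
    }
}.toQuery()
// The query object is then executed through the Vitalservice instance that is
// configured to use DynamoDB (service setup and the execution call are omitted here).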

More information about DynamoDB is available from Amazon here: https://aws.amazon.com/dynamodb/

New artifacts are in the maven repository:

https://github.com/vital-ai/vital-public-mvn-repo/tree/releases/vital-ai

Code is in the public github repos for public projects:

https://github.com/orgs/vital-ai

Vital DynamoDB Highlights:

  • Vitalservice implementation using Amazon’s DynamoDB as the underlying data repository
  • IndexDB updates to use DynamoDB as the backing database, combined with the Index for fast queries
  • VitalPrime updates to support IndexDB using DynamoDB
  • Support for transactions
  • Support for all query types including Select, Graph, Path, and Aggregation
  • Support for all query constraints including subclasses and subproperties
  • Vitalservice administrative scripts for managing Apps and Segments
  • Vital Utility script updates for imports, exports, upgrades, downgrades, and other data processing with DynamoDB

SmartData, NoSQL Now! conference talk: MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

I’m excited for my talk tomorrow at the NoSQL Now! conference! I hope those in the San Francisco/San Jose area can make it; if you do, please join me on Tuesday at 2pm!

— Marc Hadfield

Further details:

Tuesday, August 18, 2015

02:00 PM – 02:45 PM

SmartData, NoSQL Now! conference in San Jose, California

SmartData conference: http://smartdata2015.dataversity.net/

NoSQL Now conference: http://nosql2015.dataversity.net/

MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

Each database has its strengths and weaknesses for different data access profiles, and we should endeavor to use the right tool for the right job.

However, adding another infrastructure component greatly increases not only the management effort, but also the development effort to integrate and maintain connections across multiple data repositories, let alone keeping the data synchronized.

In this talk, we’ll discuss MetaQL, a common query layer across database technologies including NoSQL, SQL, Sparql, and Spark.

Using a common query layer lessens the burden on developers, allows using the right database for the right job, and opens up data to additional analysis that would previously have been unavailable, providing new and unexpected value.

In this talk, we will discuss:

  • Database Data Model and Schema
  • Java/JVM Query Builder, driven by Schema
  • Query constructs for “Select”, “Graph”, “Path”, and “Aggregation” Queries
  • NoSQL & SQL Databases
  • Sparql RDF Databases, including Allegrograph
  • Apache Spark, including SparkSQL and the new DataFrame API
  • ETL, Transactions, and Data Synchronization
  • Seamless queries across databases
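To make the idea concrete, here is a rough sketch in the Groovy builder style used elsewhere on this blog: one query object, built once from the schema, handed to differently backed endpoints. The domain class, segment name, service handles, and the query(...) call are illustrative assumptions.

// Build one query using the schema-driven builder...
VitalSelectQuery productQuery = new VitalBuilder().query {
    SELECT {
        value segments: ['catalog']          // hypothetical segment name
        value limit: 50
        node_constraint { Product.class }    // hypothetical class generated from the schema
    }
}.toQuery()

// ...then execute the same object against different repositories.
// The service variables and the query(...) call are assumptions for illustration.
def nosqlResults  = nosqlService.query(productQuery)    // e.g. a DynamoDB-backed endpoint
def sparqlResults = sparqlService.query(productQuery)   // e.g. an Allegrograph-backed endpoint
def sparkResults  = sparkService.query(productQuery)    // e.g. a Spark/SparkSQL-backed endpoint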

More details: http://nosql2015.dataversity.net/sessionPop.cfm?confid=90&proposalid=7823

Vital AI example apps for prediction using AlchemyAPI (IBM Bluemix), Metamind.io, and Apache Spark

Along with our recent release of VDK 0.2.254, we’ve added a few new example apps to help developers get started with the VDK.

By starting with one of these examples, you can quickly build applications for prediction, classification, and recommendation with a JavaScript web application front end and prediction models on the server. The examples use prediction models trained with Apache Spark or provided by an external service such as AlchemyAPI (IBM Bluemix) or Metamind.io.

There is also an example app for various queries of a document database containing the Enron Email dataset.  Some details on this dataset are here: https://www.cs.cmu.edu/~./enron/

The example applications have the same architecture.

[architecture diagram]

The components are:

  • JavaScript front end, using asynchronous messages to communicate with the server.  Messaging and domain model management are provided by the VitalService-JS library.
  • VertX application server, making use of the Vital-Vertx module.
  • VitalPrime server using DataScripts to implement server-side functionality, such as generating predictions using a Prediction Model.
  • Prediction Models to make predictions or recommendations.  A Prediction Model can be trained based on a training set, or it could interface to an external prediction service.  If trained, we often use Apache Spark with the Aspen library to create the trained prediction model.
  • A database such as DynamoDB, Allegrograph, MongoDB, or another store for application data.

Here is a quick overview of some of the examples.

We’ll post detailed instructions on each app in followup blog entries.

MetaMind Image Classification App:

Source Code:

https://github.com/vital-ai/vital-examples/tree/master/metamind-app

Demo Link:

https://demos.vital.ai/metamind-app/index.html

Screenshot:

[screenshot of the image classification app]

This example uses a MetaMind ( https://www.metamind.io/ ) prediction model to classify an image.

AlchemyAPI/IBM Bluemix Document Classification App

Source Code:

https://github.com/vital-ai/vital-examples/tree/master/alchemyapi-app

Demo Link:

https://demos.vital.ai/alchemyapi-app/index.html

Screenshot:

[screenshot of the document classification app]

This example app uses an AlchemyAPI (IBM Bluemix) prediction model to classify a document.

Movie Recommendation App

Source Code (Web Application):

https://github.com/vital-ai/vital-examples/tree/master/movie-recommendations-js-app

Source Code (Training Prediction Model):

https://github.com/vital-ai/vital-examples/tree/master/movie-recommendations

Demo Link:

https://demos.vital.ai/movie-recommendations-js-app/index.html

Screenshot:

[screenshot of the movie recommendation app]

This example uses a prediction model trained on the MovieLens data to recommend movies based on a user’s current movie ratings. The prediction model uses the collaborative filtering algorithm and is trained by an Apache Spark job. Each user has a user id, such as “1010” in the screenshot above.

Spark’s collaborative filtering implementation is described here:

http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
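For reference, here is a minimal local sketch of the kind of ALS training step those docs describe, written in Groovy against Spark’s MLlib Java API. The toy ratings, parameters, and user id are placeholders; our actual training job uses the Aspen library and the VDK domain model.

import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating

// Local Spark context for a toy run; the real training job runs as a cluster job.
def conf = new SparkConf().setAppName('movie-recommendation-sketch').setMaster('local[*]')
def sc = new JavaSparkContext(conf)

// Toy ratings: (userId, movieId, rating). Real input would be the MovieLens dataset.
JavaRDD<Rating> ratings = sc.parallelize([
    new Rating(1010, 1, 5.0d),
    new Rating(1010, 2, 3.0d),
    new Rating(1011, 1, 4.0d),
    new Rating(1011, 3, 5.0d)
])

// Train a matrix factorization model with ALS; rank, iterations, and lambda are illustrative.
MatrixFactorizationModel model = ALS.train(JavaRDD.toRDD(ratings), 10, 10, 0.01d)

// Recommend 3 movies for the example user id 1010.
model.recommendProducts(1010, 3).each { Rating r ->
    println "movie ${r.product()} predicted rating ${r.rating()}"
}

sc.stop()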

The MovieLens data can be found here:

http://grouplens.org/datasets/movielens/

Enron Document Search App

Source Code:

https://github.com/vital-ai/vital-examples/tree/master/enron-js-app

Demo Link:

https://demos.vital.ai/enron-js-app/index.html

Screenshot:

[screenshot of the document search app]

This example demonstrates how to implement different queries against a database, such as a “select” query (find all documents containing certain keywords) and a “graph” query (find documents that are linked to users).
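As a rough sketch of the “select” case, using the same Groovy builder pattern shown elsewhere on this blog (the segment and class names are hypothetical, and the keyword and link constraints are only indicated in comments rather than reproducing the exact constraint syntax):

VitalSelectQuery emailQuery = new VitalBuilder().query {
    SELECT {
        value segments: ['enron']             // hypothetical segment holding the Enron dataset
        value offset: 0
        value limit: 100
        node_constraint { Document.class }    // hypothetical domain class for the email documents
        // a property constraint on the document body would restrict results to the
        // requested keywords; the corresponding "graph" query adds constraints that
        // follow edges from documents to the users linked to them
    }
}.toQuery()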

Example Data Visualizations:

The Cytoscape graph visualization tool can be used to visualize the above sample data using the Vital AI Cytoscape plugin.

The Cytoscape plugin is available from:

https://github.com/vital-ai/vital-cytoscape

An example of visualizing the MovieLens data:

[graph visualization of the MovieLens data]

An example of visualizing the Wordnet Dataset, viewing the graph centered on “Red Wine”:

[graph visualization of the Wordnet dataset, centered on “Red Wine”]

For generating and importing the Wordnet data, see sample code here:

https://github.com/vital-ai/vital-examples/tree/master/vital-samples/src/main/groovy/ai/vital/samples

Information about Wordnet is available here:

https://wordnet.princeton.edu/

Another example of the Wordnet data, with some additional visual styles added:

[Wordnet graph visualization with additional visual styles]

Vital AI Dev Kit and Products Release 254

VDK 0.2.254 was recently released, as well as corresponding releases for each product.

The new release is available via the Dashboard:

https://dashboard.vital.ai

Artifacts are in the maven repository:

https://github.com/vital-ai/vital-public-mvn-repo/tree/releases/vital-ai

Code is in the public github repos for public projects:

https://github.com/orgs/vital-ai

Highlights of the release include:

Vital AI Development Kit:

  • Support for deep domain model dependencies.
  • Full support for dynamic domain models (OWL to JVM and JSON-Schema)
  • Synchronization of domain models between local and remote vitalservice instances.
  • Service Operations DSL for version upgrade and downgrade to facilitate updating datasets during a domain model change.
  • Support for loading older/newer versions of a domain model to facilitate upgrading/downgrading datasets.
  • Configuration option to specify enforcement of version differences (strict, tolerant, lenient).
  • Ability to specify a preferred version of imported domain models.
  • Ability to specify backward compatibility with prior domain model versions.
  • Support for deploy directories to cleanly separate domain models under development from those deployed in applications.

VitalPrime:

  • Full dynamic domain support
  • Synchronization of domain models between client and server
  • Datascripts to support domain model operations
  • Support for segment to segment data upgrade/downgrade for domain model version changes.

Aspen:

  • Prediction models to support externally defined taxonomies.
  • Support of AlchemyAPI prediction model
  • Support of MetaMind prediction model
  • Support for dynamic domain loading in Spark
  • Added jobs for upgrading/downgrading datasets for version change.

Vital Utilities:

  • Import and Export scripts using bulk operations of VitalService
  • Data migration script for updating dataset upon version change

Vital Vertx and VitalService-JS:

  • Support for dynamic domain models in JSON-Schema.
  • Asynchronous stream support, including multi-part data transfers (file upload/download in parts).

Vital Triplestore:

  • Support for EQ_CASE_INSENSITIVE comparator
  • Support for Allegrograph 4.14.1

Amazon Echo tells “Yo Mama…” jokes.

To experiment with the Amazon Echo API, we created a little app called “Funnybot”.

The details of the app can be found in the previous post here:

https://vitalai.com/2015/07/building-an-amazon-echo-app-with-vital-ai.html 

 

All the source code of the app can be found on github here:

https://github.com/vital-ai/vital-examples/tree/master/amazon-echo-humor-app

The Vital AI components are available here: https://dashboard.vital.ai/

You may notice a Raspberry Pi in the video also — we’re in the midst of integrating the Echo with the Raspberry Pi for a home automation application.

Building an Amazon Echo App with Vital AI

A recent delivery from Amazon brought us an Amazon Echo. After some initial fun experimentation, including hooking it up to a Belkin Wemo switch and my Pandora account, we dove into the developer API to hook it up to the Vital AI platform and see what we could do.

Note: All the code discussed below can be found on github: https://github.com/vital-ai/vital-examples/tree/master/amazon-echo-humor-app

The Voice Interface configuration is a bit similar to the Wit.AI API, and setting things up on the Amazon side was easy enough. Some Amazon-provided Java code gave us a good start on the webservice backend.

We often use Vert.X (http://vertx.io/) for web applications, and we created the REST webservice with it.

We configured it to communicate with our Vital Prime server, which in turn is configured with a database for storage.

So, the final architecture is:

[architecture diagram]

When the Echo gets a request like “Alexa, launch <your-app-name>”, it communicates with the Amazon Echo service, which in turn communicates with the app, in our case the Vert.X webservice.

The Webservice makes an API call to the Prime server, which uses a datascript to fulfill the request.

The datascript picks a random joke from those in its database, and replies to Echo with the joke.

We loaded in a few hundred “Yo Mama” jokes as a starting point, and called the app “Funnybot” in homage to a rather terrible episode of South Park.

In this case the backend was a pretty simple database lookup, but a more realistic example would include a “serving” layer as well as a scalable streaming and “analytics” layer.

In this case, the architecture would look like:

[architecture diagram with streaming and analytics layers]

Here we are using Apache Spark for both the streaming layer (Spark Streaming) and the analytics layer (such as Spark GraphX or MLlib). Instead of Spark, one could also use Apache Storm for streaming with the same basic architecture.

Aspen Datawarehouse is an open-source collection of software on top of Apache Spark that helps connect streaming and analysis on Spark to the front end of the architecture, mainly by keeping the data model consistent and providing integration hooks, giving us a clean handoff among Vert.X, Prime, and Spark.

Datascripts are scripts running within Prime, typically implementing logic for an application that is close to the data.

In this case, the datascript is doing a query for all the Jokes in the database, caching them, and randomly picking one from the cached list.

The query is:

VitalSelectQuery vsq = new VitalBuilder().query {

    SELECT {
        // query the 'humor-app' segment, returning up to 10,000 results
        value segments: ['humor-app']
        value offset: 0
        value limit: 10000
        // restrict results to objects of the Joke domain class
        node_constraint { Joke.class }
    }

}.toQuery()

We’ve created a simple domain model which includes a class for “Joke”, and by constraining the query to “Joke” objects, we get back all the jokes that are within the “humor-app” database.
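The caching and random selection that the datascript performs on top of that query can be sketched as follows (a minimal illustration; the result-handling details of the actual Vitalservice API are omitted):

import groovy.transform.Field

// Cache of Joke objects returned by the query above; populated once when the
// datascript first runs (iterating the query results is omitted here).
@Field List jokesCache = []

// Pick one cached joke at random for each request and return it to the Echo.
def pickRandomJoke() {
    int index = new Random().nextInt(jokesCache.size())
    return jokesCache[index]
}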

All the implementation code can be found on github here: https://github.com/vital-ai/vital-examples/tree/master/amazon-echo-humor-app

The Vital AI components mentioned above are available from Vital AI here: https://dashboard.vital.ai/

Please contact us at info@vital.ai if you would like assistance creating an Amazon Echo application.

In my next post I’ll show a quick video of the result.

We’re currently in the process of hooking the Echo to a Raspberry Pi via the Vital AI platform for a home automation application — stay tuned!