Using the Beaker Notebook with Vital Service

In this post I’ll describe using the Beaker data science notebook with Vital back-end components for data exploration and analysis, using the Wordnet dataset as an example.

At Vital AI we use many tools to explore and analyze data, and chief among them are data science notebooks.  Examples include IPython/Jupyter and Zeppelin, plus similar products/services such as Databricks and RStudio.

One that has become a recent favorite is the Beaker Notebook ( http://beakernotebook.com/ ).  Beaker is open source, with a Git repo on GitHub ( https://github.com/twosigma/beaker-notebook ) under very active development.

Beaker fully embraces polyglot programming and supports many programming languages, including JavaScript, Python, R, and JVM languages such as Groovy, Java, and Scala.

Scala is especially nice for integration with Apache Spark.  R of course is great for stats and visualization, and JavaScript is convenient for visualization and web dashboards, especially when using visualization libraries like D3.

At Vital AI we typically use the JVM for production server applications in combination with Apache Spark, so having all of these options available in a single notebook makes data analysis a very agile process.

About Data Models

At Vital AI we model data by creating data models (aka ontologies) to capture the meaning of the data, and use these data models within our code.  This allows the meaning of the data to guide our analysis and enables strong data standards, saving a huge amount of manual effort.

We create data models using the open standard OWL, and then generate code using the VitalSigns tool.  This data model code is then utilized within all data analysis and workflows.

At runtime, VitalSigns loads data models into the JVM in one of two ways: from the classpath specified when the JVM started, using the ServiceLoader API, or dynamically at runtime via a classloader.
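As a rough illustration of the first (classpath) route, the standard java.util.ServiceLoader mechanism discovers implementations declared in META-INF/services on the classpath at JVM startup.  This is only a sketch of the general pattern; the DomainModelProvider interface below is hypothetical and stands in for whatever provider interface VitalSigns actually registers:

import java.util.ServiceLoader

// Hypothetical provider interface, used here only to illustrate the pattern.
interface DomainModelProvider {
    String getOntologyIRI()
}

// Discover every provider declared in META-INF/services on the classpath
// and report the data model it contributes.
ServiceLoader.load(DomainModelProvider).each { provider ->
    println "Discovered data model: ${provider.ontologyIRI}"
}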

By using the dynamic method, we can use the Vital Prime server as a “data model server”: data models are discovered and loaded at runtime from Prime and kept in sync with the data that Prime manages, so data analysis is always working with the latest data definitions.

As Groovy is a dynamic language on the JVM, we use Groovy for many data analysis scripts that use data models.

About Wordnet

One of our favorite datasets to use is Wordnet.  From the Wordnet website ( http://wordnet.princeton.edu ):

WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet’s structure makes it a useful tool for computational linguistics and natural language processing.

As Wordnet has the form of a graph – words linked to words linked to other words – it is very convenient for visualization.

VitalService API

The VitalService API is a standard API that includes methods for data queries, running analysis scripts (aka datascripts), and reading, saving, updating, and deleting data (so-called “CRUD” operations).  We use the same VitalService API whether we are working with data locally, accessing a database, or calling a remote service, so the same code works unchanged when we switch from local experimentation to a production service.
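As a minimal sketch of that idea (not a verbatim VitalService example), the snippet below reuses only the openService, query, and close calls that appear later in this post; the “local” profile name is an assumption, and key and q stand for the authentication key and query built in the walkthrough below:

// Hedged sketch: the analysis code takes whatever service it is handed.
def runAnalysis = { service, query ->
    def results = service.query(query)   // same query call for any endpoint
    results.each { println it }
    service.close()
}

// Only the openService arguments change between a hypothetical "local"
// profile and the production Prime endpoint used later in this post;
// the analysis code itself stays the same.
runAnalysis(VitalServiceFactory.openService(key, "local", "AnalyticsService"), q)
runAnalysis(VitalServiceFactory.openService(key, "prime", "AnalyticsService"), q)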

Application Architecture

[Figure: vital-arch1]

A full application stack may include a web application layer, a VitalService implementation such as Prime, a database such as DynamoDB, and an analysis environment based on Apache Spark and Hadoop.  In the diagram above, Prime manages the data models (the “gear” icons) and provides “datascripts” to process data via a scripting interface.

 

[Figure: vital-arch2]

The diagram above focuses on our current case: the Beaker Notebook connects to VitalService Prime as a client, synchronizes the data models, and sends queries to an underlying database.  In our example, the database contains the Wordnet data.

Some sample code to generate the Wordnet dataset is here: https://github.com/vital-ai/vital-examples/blob/master/vital-samples/src/main/groovy/ai/vital/samples/SampleWordnetGenerate.groovy

And some sample code to load the Wordnet data into VitalService is here:  https://github.com/vital-ai/vital-examples/blob/master/vital-samples/src/main/groovy/ai/vital/samples/SampleWordnetImport.groovy

Or, more generally, the vitalimport utility could be used, found here: https://github.com/vital-ai/vital-utils

Back to the Beaker Notebook

Now that we have a few definitions out of the way, we can get back to using Beaker.

The previous version of Beaker had some limitations in loading JVM classes (see: https://github.com/twosigma/beaker-notebook/issues/2276 ), which are now fixed in Beaker’s GitHub repo but not yet included in a released version.  Until the next release, we’re using a patched version here: https://github.com/vital-ai/beaker-notebook

For this example, let’s query some data using VitalService and then visualize the resulting data using D3 with JavaScript.

Our example is based on the one found here: https://pub.beakernotebook.com/#/publications/560c9f9b-14e6-4d95-8e78-cc0a60bf4e5a?fullscreen=true

Our example includes three cells: a Groovy cell to run the query, a JavaScript cell to run D3 over the data, and an HTML cell to display the resulting graph.

The Groovy cell first connects to VitalService, like so:

VitalSigns vs = VitalSigns.get()

VitalServiceKey key = new VitalServiceKey().generateURI()
key.key = vs.getConfig("analyticsKey")

def service = VitalServiceFactory.openService(key, "prime", "AnalyticsService")

The code above initializes VitalSigns, sets an authentication key based upon a value in a configuration file, and connects to the VitalService endpoint.  Prime requires an authentication key for security.


vs.pipeline { ->

    def builder = new VitalBuilder()

    VitalGraphQuery q = builder.query {

        // query for graphs like:
        // node1(name:happy) ---edge--->node2

        GRAPH {

            value segments: ["wordnet"]
            value inlineObjects: true

            ARC {
                // bind this node to name "node1"
                node_bind { "node1" }

                // include subclasses of SynsetNode: Noun, Verb, Adjective, Adverb
                node_constraint { SynsetNode.expandSubclasses(true) }
                node_constraint { SynsetNode.props().name.equalTo("happy") }

                ARC {
                    // bind the node and edge to names "node2" and "edge"
                    edge_bind { "edge" }
                    node_bind { "node2" }
                }
            }
        }
    }.toQuery()

    ResultList list = service.query( q )

    // count the results
    def j = 1

    list.each {

        // Use the binding names to get the URI values out of GraphMatch

        def node1_uri = it."node1".toString()
        def edge_uri = it."edge".toString()
        def node2_uri = it."node2".toString()

        // inlineObjects is true, which embeds unseen objects into the results;
        // if the object is not yet in the cache, get it out of the GraphMatch
        // results, where graph objects are referenced via their URI

        def node1 = vs.getFromCache(node1_uri) ?: it."$node1_uri"
        def edge = vs.getFromCache(edge_uri) ?: it."$edge_uri"
        def node2 = vs.getFromCache(node2_uri) ?: it."$node2_uri"

        // add new objects into the cache; it doesn't hurt to refresh existing ones
        vs.addToCache([node1, edge, node2])

        // print out node1 --edge--> node2, with edge type (minus the namespace)
        println j++ + ": " + node1.name + "---" + (edge.vitaltype.toString() - "http://vital.ai/ontology/vital-wordnet#") + "-->" + node2.name
    }

}

service.close()

The above code queries for all Wordnet entries with the name “happy”, then follows all links from those entries to other words, putting the results into a cache and printing them out.

Note the use of data model objects in the code, such as “SynsetNode”, “VITAL_Node”, and “VITAL_Edge”.  Using them avoids having any code that directly parses data: the analysis code receives data objects that are “typed” according to the data model.

A screenshot:

[Screenshot: querygraph]

The result of the “println” statements is:

1: happy---Edge_WordnetSimilarTo-->laughing, riant
2: happy---Edge_WordnetAlsoSee-->joyful
3: happy---Edge_WordnetAlsoSee-->joyous
4: happy---Edge_WordnetSimilarTo-->golden, halcyon, prosperous
5: happy---Edge_WordnetAttribute-->happiness, felicity
6: happy---Edge_WordnetAlsoSee-->euphoric
7: happy---Edge_WordnetAlsoSee-->elated
8: happy---Edge_WordnetAlsoSee-->cheerful
9: happy---Edge_WordnetAlsoSee-->felicitous
10: happy---Edge_WordnetAlsoSee-->glad
11: happy---Edge_WordnetAlsoSee-->contented, content
12: happy---Edge_WordnetSimilarTo-->blissful
13: happy---Edge_WordnetSimilarTo-->blessed
14: happy---Edge_WordnetSimilarTo-->bright
15: happy---Edge_WordnetAttribute-->happiness

We then take all the nodes and edges in the cache and turn them into JSON data as D3 expects.


def nodes = []
def links = []

Iterator i = vs.getCacheIterator()

def c = 0

// first pass: give each node a "local:index" and emit its JSON entry
while( i.hasNext() ) {

    GraphObject g = i.next()

    if(g.isSubTypeOf(VITAL_Node)) {

        g."local:index" = c

        nodes.add( "{\"name\": \"$g.name\", \"group\": $c}" )

        c++
    }
}

def max = c

i = vs.getCacheIterator()

// second pass: emit a link for each edge, using the node indexes assigned above
while( i.hasNext() ) {

    GraphObject g = i.next()

    if(g.isSubTypeOf(VITAL_Edge)) {

        def srcURI = g.sourceURI
        def destURI = g.destinationURI

        def source = vs.getFromCache(srcURI)
        def destination = vs.getFromCache(destURI)

        def sourceIndex = source."local:index"
        def destinationIndex = destination."local:index"

        links.add( "{\"source\": $sourceIndex, \"target\": $destinationIndex, \"value\": 10}" )
    }
}

println "Graph:" + "{\"nodes\": $nodes, \"links\": $links}"

beaker.graph = "{\"nodes\": $nodes, \"links\": $links}"

 

The last line above puts the data into a “beaker” object, which is the handoff point to other languages.

A screenshot of the results and the JSON:

[Screenshot: resultsgraph]

Then, in a JavaScript cell:


var graphstr = JSON.stringify(beaker.graph);
var graph = JSON.parse(graphstr);


var width = 800,
    height = 300;

var color = d3.scale.category20();

var force = d3.layout.force()
    .charge(-120)
    .linkDistance(100)
    .size([width, height]);

var svg = d3.select("#fdg").append("svg")
    .attr("width", width)
    .attr("height", height);

var drawGraph = function(graph) {
  force
      .nodes(graph.nodes)
      .links(graph.links)
      .start();

  var link = svg.selectAll(".link")
      .data(graph.links)
    .enter().append("line")
      .attr("class", "link")
      .style("stroke-width", function(d) { return Math.sqrt(d.value); });

  var gnodes = svg.selectAll('g.gnode')
     .data(graph.nodes)
     .enter()
     .append('g')
     .classed('gnode', true);
    
  var node = gnodes.append("circle")
      .attr("class", "node")
      .attr("r", 10)
      .style("fill", function(d) { return color(d.group); })
      .call(force.drag);

  var labels = gnodes.append("text")
      .text(function(d) { return d.name; });

  
  force.on("tick", function() {
    link.attr("x1", function(d) { return d.source.x; })
        .attr("y1", function(d) { return d.source.y; })
        .attr("x2", function(d) { return d.target.x; })
        .attr("y2", function(d) { return d.target.y; });

    gnodes.attr("transform", function(d) { 
        return 'translate(' + [d.x, d.y] + ')'; 
    });
      
    
      
  });
};

drawGraph(graph);

Screenshot:

[Screenshot: jsgraph]

Note the handoff of the “beaker.graph” object at the beginning of the JavaScript code.  It may be a bit tricky to get the data exchanges right so that JSON produced on the Groovy side is interpreted as JSON on the JavaScript side, or vice versa.  Beaker provides auto-translation for various data structures, including DataFrames, but it still takes some trial and error to get it right.
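One way to cut down the trial and error on the Groovy side is to build plain lists and maps and let Groovy’s bundled groovy.json.JsonOutput produce the JSON string, rather than hand-escaping quotes as in the cell above.  This is just a sketch of that alternative (the node and link values are made up for illustration); the handoff to beaker.graph works the same way:

import groovy.json.JsonOutput

// Build the D3 structure as plain Groovy maps and lists...
def nodes = [[name: "happy", group: 0], [name: "glad", group: 1]]
def links = [[source: 0, target: 1, value: 10]]

// ...and let JsonOutput handle all quoting and escaping.
beaker.graph = JsonOutput.toJson([nodes: nodes, links: links])

// prints: {"nodes":[{"name":"happy","group":0},{"name":"glad","group":1}],"links":[{"source":0,"target":1,"value":10}]}
println beaker.graph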

The above JavaScript code comes from the Beaker example project, plus this Stack Overflow post, which discusses adding labels to force-directed graphs:  http://stackoverflow.com/questions/18164230/add-text-label-to-d3-node-in-force-directed-graph-and-resize-on-hover

In the last Beaker cell, we include the HTML to be the “target” of the JavaScript code:


<style>
.node {
  stroke: #fff;
  stroke-width: 1.5px;
}

.link {
  stroke: #999;
  stroke-opacity: .6;
}
</style>
<div id="fdg"></div>

And a screenshot of the HTML cell with the resulting D3 graph.

[Screenshot: happygraph2]

We hope you have enjoyed this walkthrough of using Beaker with the VitalService interface and visualizing query results as a graph with D3.

Please ask any questions in the comments section, or send them to us at info@vital.ai.

Happy New Year!

Vital AI

Optimizing the Data Supply Chain for Data Science

I gave a talk at the Enterprise Dataversity conference in Chicago in November.

The title of the talk was:

“Optimizing the Data Supply Chain for Data Science”.

[Slide image: data-supply-chain-edv2015-hadfield-submitted.001]

Below are the slides from that presentation.

Here is a quick summary of the talk:

The Data Supply Chain is the next step in the progression of large-scale data management: starting with a “traditional” Data Warehouse, moving to a Hadoop-based environment such as a Data Lake, then to a Microservice-Oriented Architecture (microservices across a set of independently managed Hadoop clusters, “Micro-SOA”), and now to the Data Supply Chain, which adds data management and coordination processes to produce high-quality Data Products across independently managed environments.

A Data Product can be any data service, such as an eCommerce recommendation system, a Financial Services fraud/compliance predictive service, or an Internet of Things (IoT) logistics optimization service.  As a specific example, loading the Amazon.com website triggers more than 170 Data Products predicting consumer sentiment, likely purchases, and much more.

The “Data Supply Chain” (DSC) is a useful metaphor for how a “Data Product” is created and delivered.  Just as in a physical supply chain, data is sourced from a variety of suppliers.  The main difference is that a Data Product can be a real-time combination of all the suppliers at once, whereas a physical product moves linearly along the supply chain.  That said, data very often does flow linearly across the supply chain and becomes more refined downstream.

Each participant in a DSC may be an independent organization, a department within a large organization, or a combination of internal and external data suppliers, such as combining internal sales data with social media data.

As each participant in the DSC may have its own model of the data, combining data from many sources can be very challenging due to incompatible assumptions.  As a simple example, a car engine supplier considers a “car engine” to be a finished “product”, whereas a car manufacturer considers a “car engine” to be a “car part” and a finished car to be a “product”; the two participants’ definitions of “product” and “car engine” are inconsistent.

As there is no central definition of the data, with each data supplier operating independently, there must be an independent mechanism to capture the metadata that helps data flow across the DSC.

At Vital AI, we use semantic data models to describe the data across the DSC.  The models capture the implicit assumptions in the data, and they facilitate moving data across the DSC and building Data Products.

We generate code from the semantic data models, and that code automatically drives ETL processes, data mapping, queries, machine learning, and predictive analytics, allowing Data Products to be created and maintained with minimal effort while data sources continue to evolve.

Creating semantic data models not only facilitates building Data Products, but also provides a mechanism to develop good data standards (Data Governance) across the DSC.  Data Governance is a critical part of high-quality Data Science.

As code generated from the semantic data models is included at all levels of the software stack, the models also keep the interpretation of data consistent across the stack: in user interfaces, in data infrastructure (databases), and in data science, including predictive models.

As infrastructure costs continue to fall, the primary cost component of high-quality Data Products is human labor.  The use of technologies such as semantic data models to optimize the Data Supply Chain and minimize that labor becomes more and more critical.

To learn more about the Data Supply Chain and Data Products, including how to apply semantic data models to minimize the effort, please contact us at Vital AI!

— Marc Hadfield

Email: info@vital.ai
Telephone: 1.917.463.4776