NY Tech Day 2016

We had a great time at NY Tech Day and we hope you did too!


Here is Marc discussing some new Artificial Intelligence Apps with some guests at our booth.


Thanks to everyone for dropping by our booth to learn more about building artificial intelligence applications using the Vital AI Development Kit!

We look forward to keeping in touch with all the awesome people we met yesterday.

Also, many thanks to the organizers for such a wonderful event!  Great job again, and looking forward to next time!

Updating the “If” statement in the JVM for Truth Value Logic

The Vital Development Kit (VDK) provides development tools and an API for Artificial Intelligence (AI) applications and data processing.  This includes a “Domain Specific Language” (DSL) for working with data.

To this DSL we’ve recently added an extension to the venerable “If” statement in the JVM to handle Truth Values (for background on Truth Values, see: Beyond True and False: Introducing Truth to the JVM).

The “If” statement is the workhorse of computer programming.  If this, do that.  If something is so, then do some action.  The “If” statement evaluates if some condition is “True”, and if so, takes some action.  If the condition is “False”, then it may take some other action.

The condition of an “If” statement yields a Boolean True or False and typically involves tests of variables, such as:  height > 72, speed < 50, name == “John”.

The “If” statement is a special case of the “switch” statement, such that:

if(name == "John") { do something }
else { do something different }

is the same as:

switch (name == "John") {
    case true: { do something; break }
    case false: { do something different; break }
}

In the VDK we have an extension of Boolean in the JVM called Truth.  Truth may take four values: YES, NO, UNKNOWN, or MU compared to the Boolean TRUE or FALSE.  YES and NO are the familiar TRUE and FALSE, with UNKNOWN providing a value for when a condition cannot be determined because of unknown inputs, and MU providing a value for when a condition is unknowable because it contains a false premise.

For example, for UNKNOWN, the color of a traffic light might be red, green, or yellow but its value is currently UNKNOWN.  And for MU, the favorite color of a traffic light is MU because inanimate objects don’t have favorite colors.
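
As a tiny illustration in code (the trafficlight object here is hypothetical), the same two questions look like:

trafficlight.color == RED          // YES, NO, or UNKNOWN, depending on what we currently know
trafficlight.favoriteColor == RED  // MU: the question contains a false premise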

UNKNOWN and MU extend the familiar Boolean truth tables.  For instance, YES AND YES yields YES, whereas YES AND UNKNOWN yields UNKNOWN.

Details of the Truth implementation in the JVM can be found in the blog post: Beyond True and False: Introducing Truth to the JVM

Because Truth has four values, we need a way to handle four cases when we test a condition.

As above, we could use a “switch” statement like so:

switch (Truth Condition) {
    case YES: { handle YES; break }
    case NO: { handle NO; break }
    case UNKNOWN: { handle UNKNOWN; break }
    case MU: { handle MU; break }
}

This is a little verbose, so we’ve introduced a friendlier statement: consider.

consider (Truth Condition) {
    YES: { handle YES }
    NO: { handle NO }
    UNKNOWN: { handle UNKNOWN }
    MU: { handle MU }
}

So we can have code like:

consider (trafficlight.color == GREEN) {
    YES: { car.drive() }
    NO: { car.stop() }
    UNKNOWN: { car.stop() } // better be safe and look around
    MU: { car.stop(); runDiagnostics(); } // error! error!
}

In the above code, if evaluating our truth condition results in an UNKNOWN value (perhaps a sensor cannot “see” the traffic light), we can take some safe action.  If we get a MU value, then we have some significant error, such as the “trafficlight” object not actually being a traffic light, perhaps some variable mixup.  We can also take some defensive measures in this case.

We can also stick with using “If” and use exceptions for the cases of UNKNOWN and MU:

try {
    if(trafficlight.color == GREEN) { car.drive() }
    else { car.stop() }
} catch(Exception ex) {
    // handle UNKNOWN and MU
}

This works because Truth values are coerced to Boolean True or False for the cases of YES or NO.  This coercion throws an exception for the cases of UNKNOWN or MU.  JVM exceptions are a bit ugly and should not be used for normal program control flow (exceptions as flow control is often an anti-pattern), so the consider statement is much preferred.
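
As a rough sketch of how such a coercion could work on the JVM (an illustration of the idea only, not the actual VDK implementation): Groovy calls an object’s asBoolean() method whenever it is used as an “if” condition, so a Truth type can return true for YES, false for NO, and throw for the other two values.

enum Truth {
    YES, NO, UNKNOWN, MU

    boolean asBoolean() {
        if (this == YES) return true
        if (this == NO) return false
        throw new IllegalStateException("Truth value ${name()} has no Boolean equivalent")
    }
}

try {
    if (Truth.UNKNOWN) { println "drive" } else { println "stop" }
} catch (IllegalStateException ex) {
    // reached for UNKNOWN and MU conditions
    println "could not decide: ${ex.message}"
}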

The logic of Truth is very helpful in defining Rules to process realtime dynamic data, and answer dynamic data queries.  The consider statement allows such rules to be quite succinct and explicitly handle unknown data or queries with non-applicable conditions.

For instance, if we query an API for the status of traffic lights and ask how many are currently yellow, we might get back a reply of 0 (zero).  We might wonder: are there really zero yellow traffic lights at present, is the API not functioning and always returning zero, or does it simply not track yellow lights?  It would be better to get a reply of UNKNOWN if the API was not functioning.  If we asked how many traffic lights were displaying purple, a reply of zero would be correct, but it would be better to get a reply of MU, since there is no such thing as a purple traffic light in a world of Red/Yellow/Green lights.
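
A sketch of how that reply might be handled with consider (the trafficAPI object and the handler functions here are hypothetical):

consider (trafficAPI.yellowLightCount() > 0) {
    YES: { rerouteAroundCongestion() }
    NO: { proceedNormally() }
    UNKNOWN: { scheduleRetry() } // the API is down or has no current data
    MU: { reportBadQuery() } // the API does not track that color at all
}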

As AI and data-driven applications incorporate more dynamic data models and data sources, instances of missing or incorrect knowledge are more the rule rather than the exception, so the software flow should treat these as normal cases to consider rather than exceptions.

We hope you have enjoyed learning about how the Vital Development Kit has extended “If” to handle Truth Values.  Please post any questions or comments below, or get in touch with us at info@vital.ai.

Beyond True and False: Introducing Truth to the JVM

In this post we introduce a new type for “Truth” to the JVM in the Vital Development Kit (VDK), to include cases that don’t fit into Boolean True and False — in particular a value for Unknown and Mu (nonexistence).  We use this new Truth type in the VDK for logical and conditional expressions, especially in rules and inference.

Computer hardware has binary — ones and zeros.
Computer software has boolean — true and false.
This seems a perfect match, immutable, a Platonic Ideal.
But there are some wrinkles.

Programming languages deeply use booleans to control the flow of a program:

if this is true, then do that
if this is false, then do something else

What if the software doesn’t know a particular value at that moment?

What if the value doesn’t make sense in the current context?

For instance, in the code:

if(TrafficLight.color == RED) then { Stop() }
else { Go() }

What if the TrafficLight color is unknown?  The software would drive through the traffic intersection.  (Hope for the best!)

Or, what if we had code like:

if(Penguin.flightspeed < 10) then { ThrowAFish() }

Should this code work, even if penguins don’t have a flight speed?

In programming languages like Java, these problems lead to a lot of custom workarounds and error checking, so much so that you can no longer use “normal” boolean expressions without a bunch of checks to ensure that they are “safe”.

In rule-based systems or logic languages, often “unknown” is treated as False — this is because a rule like isTall(John) tries to prove that John is tall, and if it can’t it returns “false” or “no”, meaning “I can not prove that”.

But, if the code re-uses that result like:

isShort(X) := NOT isTall(X)

then it incorrectly combines the Is-Not-Tall case with the I-Don’t-Know case, causing software errors — that is, if the height is unknown, then the person is concluded to be short, which is quite a leap of logic to be sure.
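
As a minimal sketch of the problem in plain Boolean terms (the Person class and the 72-inch threshold are just for illustration):

class Person { Double height }

boolean isTall(Person p) {
    if (p.height == null) return false   // "I can not prove that" collapses to false
    return p.height > 72.0
}

boolean isShort(Person p) {
    return !isTall(p)   // an unknown height now comes out as "short"
}

assert isShort(new Person(height: null))   // passes, even though the height is simply unknown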

Three Value Logic

This is not a new problem.  SQL tries to solve this with a third logic value called NULL to mean “UNKNOWN”.  A description of this is here: https://en.wikipedia.org/wiki/Null_(SQL)

Three value logic (3VL) for True/False/Unknown is well established, with a full description here: https://en.wikipedia.org/wiki/Three-valued_logic

Unfortunately 3VL is not built into languages like Java.  And, it gets worse.

Java has low-level base types that are efficient, like int for integers, and object-oriented classes, like Integer, which have some additional Object overhead but are more “friendly”; there are ways to convert between Objects and base types.  So far, so good.

All objects (instances of classes) may be set to the value of “null”.  There is a class for Boolean which may be set to True, False, or Null.  And, there is the low-level type “boolean” which may only have the values of true and false.  So a Boolean object has a value (“null”) which can not be set in the base type boolean.  It is like having a base type for integer that you can’t set to zero.
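
To illustrate the mismatch with plain Java/Groovy (nothing VDK-specific here):

Boolean objectValue = null       // the Boolean class can hold true, false, or null
objectValue = Boolean.TRUE
objectValue = Boolean.FALSE

boolean primitiveValue = true    // the primitive type only ever holds true or false
primitiveValue = false
// In Java, "boolean primitiveValue = null;" does not even compile,
// so the object's third "value" has no home in the primitive type.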

(Mis)Using Null as a Value

In Java, “null” means uninitialized and is not a true distinct “value”.   Some code uses (abuses) “null” to mean “unknown”, but this means we then can’t tell apart an uninitialized value from an initialized “unknown” value — similar to confusing “I-can’t-prove-it” with “false”.  Moreover, we can’t store “null” in the low level boolean type anyway, which again can only be true or false.

“If that is okay, please give me absolutely no sign.”

The value of “null” also occurs in lots of error situations — the network connection failed, the database connection timed out, the password is incorrect, no memory is available, and on and on.

Using “null” as a value reminds me of this Simpsons scene, with Homer interpreting the absence of a message (“null”) as a message.

So, the addition of “null” to Java’s Boolean doesn’t provide a way to represent “unknown” unambiguously, and it causes even more confusion with uninitialized objects and various error conditions.

A Value for Nonexistence

Besides the “unknown” case there is also the case above of: Penguin.flightspeed < 10.

Since we know that a penguin can’t have a flightspeed, this isn’t a case of “unknown.”  We could argue that this statement of less than 10 is “true”, since “0” is the absence of a speed and 0 is less than 10, but that requires embedding domain knowledge about how speeds work.  Is the temperature of a song absolute zero (-459.67F)?  Is the color of an integer black?  There isn’t a universal way to assign these True or False when the question doesn’t make sense.


And so, we need a fourth value to handle the case of “nonexistence” or “non-applicable”.

We’ve chosen to use the symbol Mu for this, as it has popularly been used in this sense.  One notable example of its use is in Douglas R. Hofstadter’s wonderful book:

Gödel, Escher, Bach: An Eternal Golden Braid.  (http://www.amazon.com/G%C3%B6del-Escher-Bach-Eternal-Golden/dp/0465026567)

Some more details regarding Mu are in its wikipedia article here: https://en.wikipedia.org/wiki/Mu_(negative)

 

Truth implementation in the VDK

In the VDK, we use the Groovy language for scripting on the JVM.  Truth was added as a new type to Groovy with the values: YES, NO, UNKNOWN, and MU.  We use YES and NO instead of True and False to avoid confusion with the existing Boolean values and reserved words.

We follow the truth tables as specified here for 3VL: https://en.wikipedia.org/wiki/Three-valued_logic#Kleene_and_Priest_logics

Some example logic statements using AND, OR, and NOT with Truth:

  • YES AND UNKNOWN := UNKNOWN
  • YES OR UNKNOWN := YES
  • NO OR UNKNOWN := UNKNOWN
  • NOT UNKNOWN := UNKNOWN

For Mu, any logic expression with Mu yields MU:

  • NOT MU := MU
  • YES OR MU := MU
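
As a rough sketch of this behavior (illustrative only, not the actual VDK source), a four-valued Truth type following these tables might look like the code below, with Groovy mapping the & and | operators to the and() and or() methods, and with MU propagating through every operation:

enum Truth {
    YES, NO, UNKNOWN, MU

    // Kleene AND, with MU propagating
    Truth and(Truth other) {
        if (this == MU || other == MU) return MU
        if (this == NO || other == NO) return NO
        if (this == UNKNOWN || other == UNKNOWN) return UNKNOWN
        return YES
    }

    // Kleene OR, with MU propagating
    Truth or(Truth other) {
        if (this == MU || other == MU) return MU
        if (this == YES || other == YES) return YES
        if (this == UNKNOWN || other == UNKNOWN) return UNKNOWN
        return NO
    }

    // Kleene NOT, with MU propagating
    Truth not() {
        if (this == YES) return NO
        if (this == NO) return YES
        if (this == UNKNOWN) return UNKNOWN
        return MU
    }
}

assert (Truth.YES & Truth.UNKNOWN) == Truth.UNKNOWN
assert (Truth.YES | Truth.UNKNOWN) == Truth.YES
assert (Truth.NO | Truth.UNKNOWN) == Truth.UNKNOWN
assert Truth.UNKNOWN.not() == Truth.UNKNOWN
assert (Truth.YES | Truth.MU) == Truth.MU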

Using our new Truth definition, we can now write or generate code that handles the cases of Unknown or Mu without any ambiguity.

For example:


Truth isTall(Person p) {
    if( p.height == UNKNOWN ) { return UNKNOWN }
    if( p.height > 72.0 ) { return YES }
    else { return NO }
}

Truth isShort(Person p) {
    if( p.height == UNKNOWN ) { return UNKNOWN }
    if( p.height < 60.0 ) { return YES }
    else { return NO }
}

Truth averageHeight(Person p) {
    return !( isTall(p) || isShort(p) )
}

In the above code, the “averageHeight()” function returns a Truth value and is completely defined by the NOT and OR operations together with the isShort() and isTall() functions; it returns UNKNOWN if either of those functions returns UNKNOWN.  The NOT (!) now works properly, instead of the previous Boolean definition which mistakenly combined the “I can’t prove it” case with the “false” case.

Instead of an “if…then” statement in code, we can use a “switch” statement to handle YES, NO, UNKNOWN, and MU, like so:

switch( averageHeight(p) || isShort(p) ) {
    case YES:
        println "YES"; break
    case NO:
        println "NO"; break
    case UNKNOWN:
        println "UNKNOWN"; break
    case MU:
        println "MU"; break
}

We’d like to add a little DSL syntactic sugar to the switch statement, so the above could be written a bit more succinctly, something along the lines of:

/* idea for improvement to Truth DSL */
if (averageHeight(p) || isShort(p) ) {
    YES: { println "YES" }
    NO: { println "NO" }
    UNKNOWN: { println "UNKNOWN" }
    MU: { println "MU" }
}

which would try to follow the pattern of if()…then()…else(), but be if()…yes()…no()…unknown()…mu(), with any of the blocks optional.

Please comment if you like this proposed addition to the DSL, or have any suggestions!

Feedback

I hope you have enjoyed learning about how Truth is implemented in the VDK to provide a richer logic representation than the Boolean True/False.

Please post any comments or questions in the comment section, or contact Vital AI directly at info@vital.ai

AI hacking at the Jibo Hackathon

I am very fortunate to be among the first few members of the nascent Jibo developer community, which kicked off today at the first Jibo Hackathon.

The Hackathon was held on the MIT campus, where social robotics was born.

After getting our development environments set up, we got our hands on the Jibo simulator, the SDK, and of course the early Jibo robots.

The Jibo development environment will be familiar to any web application developer, with some added screens reminiscent of Disney cel animation.

We spent some time with some sample code and the simulator.

And then, with a simple shell command on my Mac of ‘jibo run’, my newly created skill (Jibo-speak for “app”) is deployed to my robot friend for the day, and Jibo comes alive.

We got to experiment with a number of Jibo features: animating the Jibo body, Voice Recognition, Natural Language Understanding, Text-to-Speech, Dialogs, Face Tracking.

My first skill was pretty simplistic, but included a bit of all the major Jibo features of the SDK, including snapping a photo, displaying it on the screen, and asking if I liked it.  Plus some Jibo dance moves.  I was in the process of connecting the image up to a Deep Learning image classification API, which sort of worked except for my forgetfulness of JavaScript syntax, when we ran low on time and all happily retired to the local pub.

The ease of working with the simulator and SDK must truly be emphasized.  There is a magic in creating an arc of motion in the simulator, hitting the “Run” button, and having Jibo swing into motion.

Looking forward to the arrival of Jibo in early Spring!  At Vital we’ll be honing our skills in the meanwhile.


Is it really equal? Introducing Semantic Equality to the JVM.

One of the most fundamental functions of a programming language is to decide if two things are “the same” or are “different”.

The determination of “sameness” can be quite tricky, and can introduce subtle software errors or require a significant amount of code to check many cases.

As a simple example, imagine two separate database queries, one for all people with “John” in their name, and another for all people with “Smith” in their name — how to tell that a “John Smith” from the first query is the same as a “John Smith” in the second query, without custom code?

The Vital Development Kit (VDK) includes domain specific language (DSL) bindings for data comparison, inference, and manipulation.  Recently we introduced a new feature called Semantic Equality to the VDK.

With the VDK and Semantic Equality, Data Scientists and Developers can write less code, have fewer bugs, and more easily work with large amounts of diverse data.

Background

The VDK, using the VitalSigns component, manages the domain models of your application and generates code to interact with different types of databases, data predictive components, and user interfaces.  This makes it easy to combine different components into a unified application: such as NoSQL Databases, SQL Databases, Apache Spark, and JavaScript Web Applications.

Since the JVM compares objects “by reference” — the “reference” is a pointer to the bit of memory used to store the object — the following code will typically not be true if the objects were loaded or created at different times (like the “John Smith” objects mentioned before):

/* true only if they have the same memory reference */
if(object1 == object2) { }

To mitigate this, it’s common to write custom code to override the “equals” method in the JVM so that objects can be compared by their data values.  Frameworks such as Object-Relational Mapping tools often generate such “equals” methods, but this only covers application-to-database interactions, and even more custom code needs to be written to incorporate other components like machine learning.
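
For example, a typical hand-written, value-based equals override looks something like the sketch below (the Person class here is hypothetical, and Groovy’s @EqualsAndHashCode annotation can also generate this). It is exactly the kind of per-class boilerplate that accumulates:

class Person {
    String name
    Date birthday

    @Override
    boolean equals(Object other) {
        if (!(other instanceof Person)) return false
        Person p = (Person) other
        return name == p.name && birthday == p.birthday
    }

    @Override
    int hashCode() {
        return Objects.hash(name, birthday)
    }
}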

The VDK takes a more general approach.

Each VDK data object has a globally unique identifier, called a URI, associated with it.  So determining if one object refers to the same identical thing as another object is as simple as:

/* they refer to the same thing! */
if(object1.URI == object2.URI) {  }

This is universally true, regardless of the source of the data or the types of objects being compared.

But, what if you want to compare data fields of the objects, like determining if two people have the same birthday?

/* they have the same birthday! */
if(person1.birthday == person2.birthday) {  }

This works when the values of the “birthday” fields match because the VDK handles the “equals” methods.

Terminology note: we call the “birthday” data field a “property” of the “Person” class.  The “Person” class and properties like “birthday” are specified in an external data model, with code generated for the JVM (or JavaScript) using vitalsigns.

What if we tried to do:

/* the dates match! */
if(person1.hireDate == person2.birthday) {  }

If the values were the same, this would be true, but it looks like it might be a programming error as we’re comparing hiring dates with birthdays — apples to oranges instead of apples to apples.

Bugs such as this can be difficult enough with developer created code, but it gets much worse in data analysis and machine learning with comparisons like:

/* data driven action */
/* such as increase likelihood of customer retention */
if(property537 == property675) {  }

which are typically generated through an automated process, where it is very difficult to track the meaning of the many thousands of properties being analyzed.

Adding Semantics and Semantic Equality

In the VDK, both classes like “Person” and properties like “birthday” have a semantic marker to specify what they “mean”.  So in addition to “birthday” being associated with the “Date” data type, it also has a semantic marker like:

http://vital.ai/ontology/vital-examples#birthday

This URI places “birthday” into a domain model, which can then be used to see if comparisons are “compatible” with another property.  Using such logic we can compare fields like “birthday” and “age” since we can convert one such property to another.

Implementation Note: we use the “trait” language capability of the JVM (Java/Groovy/Scala) to semantically “mark” objects.  Some documentation about the Groovy implementation of traits can be found here: http://docs.groovy-lang.org/next/html/documentation/core-traits.html
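
As a very rough sketch of the idea (the trait, class, and URI below are illustrative, not the actual VDK markers), a value can be given a semantic marker at runtime using Groovy’s trait coercion:

trait BirthdayMarker {
    String getSemanticURI() { 'http://vital.ai/ontology/vital-examples#birthday' }
}

class DateValue {
    Date value
}

// runtime trait coercion wraps the object with the marker
def birthday = new DateValue(value: new Date()) as BirthdayMarker

assert birthday.semanticURI == 'http://vital.ai/ontology/vital-examples#birthday'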

With these URIs associated with properties, we modified “==” to take into account whether two properties (or classes) can be compared semantically.

We call the redefined “==” symbol: Semantic Equality.

For example, let’s say we have a property “name” and a subproperty “nickName” and another subproperty of “name” for “familyName”, so a property hierarchy like:

name
+---- nickName
+---- familyName

Then we can have:

/* this could be true */
if(person1.name == person2.nickName) {  }

/* this could be true */
if(person1.name == person2.familyName) {  }

/* this can't be true! */
if(person1.nickName == person2.familyName) {  }

The last case can’t be true because familyName is not an ancestor of nickName or vice versa according to the property hierarchy.
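
A sketch of that rule (the hierarchy map and helper functions here are illustrative, not the VDK internals): the check boils down to walking the property hierarchy in both directions.

// parent of each property URI in the hierarchy above
def parentOf = [
    'http://vital.ai/ontology/vital-examples#nickName'  : 'http://vital.ai/ontology/vital-examples#name',
    'http://vital.ai/ontology/vital-examples#familyName': 'http://vital.ai/ontology/vital-examples#name'
]

boolean isAncestorOf(String ancestor, String property, Map parents) {
    def current = property
    while (current != null) {
        if (current == ancestor) return true
        current = parents[current]
    }
    return false
}

boolean semanticallyComparable(String a, String b, Map parents) {
    isAncestorOf(a, b, parents) || isAncestorOf(b, a, parents)
}

def NAME   = 'http://vital.ai/ontology/vital-examples#name'
def NICK   = 'http://vital.ai/ontology/vital-examples#nickName'
def FAMILY = 'http://vital.ai/ontology/vital-examples#familyName'

assert semanticallyComparable(NAME, NICK, parentOf)      // could be equal
assert semanticallyComparable(NAME, FAMILY, parentOf)    // could be equal
assert !semanticallyComparable(NICK, FAMILY, parentOf)   // siblings: never equal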

This helps us catch bugs like:

/* now is always false! */
if(person1.hireDate == person2.birthday) {  }

by making them never evaluate to true because “birthday” and “hireDate” are not semantically compatible.  In the same way, your favorite food can not be “armchair” because “armchair” is not a food.

The Semantic Equality operator is similar to the JavaScript “===”, except stronger.  In JavaScript, the “==” operator will try to convert one type to another, like a number to a string so that two things can be compared with a common type, so it is forgiving of type differences, i.e. weakly typed.  This can be handy, but often leads to bugs. The JavaScript “===” operator on the other hand, does not do type conversion, so it “strongly” enforces data type comparison.  The VDK Semantic Equality adds one more “level” to this by enforcing that the compared data is semantically compatible.

Comparing Values without Semantics

Now, let’s say we really want to compare the values and not take into account the semantics of the properties.

We introduced an operator for this case, “^=”, by redefining the XOR assignment operator.  Mainly this is because we don’t use XOR assignment often, but also because the caret “^” is sometimes used to indicate negation, so we thought it would be a good match for when the semantics are “NOT” a match.

So, if we really want to compare the values of birthday and hireDate, we can do:

/* true if values match */
if(person1.hireDate ^= person2.birthday) {  }

which is true when the hireDate and birthday values match, ignoring the semantics of the properties.  This is analogous to the JavaScript difference between “===” (strong) and “==” (weak), except that in JavaScript it is datatypes being enforced (or not), while with the VDK it is semantics being enforced (or not).

VDK Groovy Language Extensions

Semantic Equality is part of the language extensions and DSL (domain specific language) incorporated into the Groovy JVM language with the Vital Development Kit (VDK) to make it easier to work with diverse data.

Feedback!

I hope you have enjoyed learning about the Semantic Equality feature of the VDK!  Please post your comments and questions here, or follow up with us at Vital AI via info@vital.ai.

Running shell commands in Beaker Notebook

Data Science Notebooks like Beaker Notebook are a great way to not only explore and analyze data but also record the steps, so that the next Data Scientist can reproduce the results — just by clicking “Run”.

Few people like to spend time meticulously documenting their data analysis steps, so to the degree that Data Science Notebooks can be “self documenting”, they make things a lot easier.

As Beaker Notebook can mix many programming languages, like R, Python, and Groovy, within one Notebook, most steps can be captured completely in a Notebook.

One case that is missing from Beaker, however, is the command line shell.  Bash is the default shell on Mac OSX, but the same gap applies to other shells, including the Windows shell and other Unix shells such as “csh” or “tcsh”.

Oftentimes shell commands are used for running data manipulation programs (like awk, perl, or sed), or running compiler processes (like maven or ant).

At Vital AI we use shell commands to run “vitalsigns” which compiles a data model into code, which is then used in data analysis, database queries, and machine learning processes (running inside Apache Spark).

Running these within Beaker is nice not only as a convenient way to avoid switching from Beaker to a terminal screen and back, but also as a way to document these steps for reproducibility.

Fortunately with a little helper class it’s easy to run shell commands from Groovy cells in the Beaker Notebook.

The helper class is “RunBash.groovy” and is found on github here:
https://github.com/vital-ai/vital-data-utils/blob/master/src/main/groovy/ai/vital/data/utils/RunBash.groovy

Once a jar with this class is made available to Groovy via the Language Manager (see screenshot below), it can be used in a Groovy cell to run Bash scripts, like so:

import ai.vital.data.utils.RunBash

RunBash.enable() // hook .bash() to strings

//bash script begins here, in a Groovy multi-line string:
"""
echo \$VITAL_HOME

vitalsigns generate -o \${VITAL_HOME}/domain-ontology/vital-samples-0.1.0.owl -or
"""
.bash() // this runs the script

Here’s a screenshot of that running in Beaker:

runbash

Here’s a screenshot of the Language Manager (from the Notebook menu), with jars added to the Beaker Notebook classpath for Groovy.

beaker-languagemanager

 

Vital AI Cytoscape App in App Store

We recently published our Cytoscape App to the Cytoscape App Store.

Cytoscape is a wonderful graph visualization tool that is open-source, available on Desktops, and quite handy for graph analysis and visualization.

Our plugin allows Cytoscape to connect to databases, servers, and Apache Spark/Hadoop using the VitalService API.

The plugin is available directly in Cytoscape, or here: http://apps.cytoscape.org/apps/vitalaigraphvisualization

Cytoscape is available here: http://cytoscape.org/

Prior to using the plugin, the Vital AI software must be installed and configured.  The Vital AI software can be downloaded from here: http://vital.ai/#download

The Cytoscape plugin uses the VITAL_HOME environment variable to find the Vital AI software and configuration files.

For those using Mac OSX, OSX needs some extra help for desktop applications like Cytoscape to use environment variables.

Here is a good StackOverflow answer which helps Mac OSX use environment variables: http://stackoverflow.com/a/32405815/2138426

The first tab in the Vital AI plugin enables selecting which VitalService endpoint to connect to.  These come from the VitalService configuration file found at:

$VITAL_HOME/vital-config/vitalservice/vitalservice.config

For Prime endpoints, an authorization key is used to connect.  For convenience, you can put such keys into your vitalsigns configuration file, like so:

config: {
    local-key: key1-key1-key1
}

The naming convention is: “vitalservicename”-key

So, the corresponding VitalService configuration entry would be:

profile.local {
    type = VitalPrime
    appID = analytics
    VitalPrime {
        endpointURL = "http://127.0.0.1:9081/java"
    }
}

 

Now back to the plugin…

cytoscape-screenshot1

Here is the Connection tab.  Select the desired VitalService endpoint from the drop-down and hit “Connect”.

Now that we have connected, we can use the “Search” tab to search.

cytoscape-screenshot4

We can select which databases (“Segments”) to include, as well as what property to search.  In this case the “Wordnet” database is selected and the “name” property.

Let’s put all the results into a network.

cytoscape-screenshot5

 

Now let’s select them all and find what is connected to them.  This is called an “expansion” query and looks for everything connected to the starting node up to two hops (edges) away.

cytoscape-screenshot6

Starting the expansion…

cytoscape-screenshot7

Expanding all the selected nodes…  The “Paths” tab is used to select whether the expansion will be for one hop (one edge) or two hops (two edges), the direction of the desired edges (forward, backward, or both), and which “Segments” to include in the expansion.

cytoscape-screenshot8

And now we have some results!

cytoscape-screenshot9

Let’s zoom in on part of the network.  We can then further analyze the results, continue to explore and expand the network, or tune the visualization.

 

cytoscape-screenshot10

 

If we are connected to a Prime endpoint, we can use the Paths tab to select the node and edge types to filter with during an expansion query.

cytoscape-screenshot11

 

Also with a Prime endpoint, we can use the “DataScripts” tab to run datascripts on the server.  Datascripts can be used to analyze data, trigger Spark or Hadoop jobs, use a prediction model, or anything you like.

Please send along any comments or questions, and we hope you enjoy using Cytoscape and our plugin to visualize your data.

 

Using the Beaker Notebook with Vital Service

In this post I’ll describe using the Beaker data science notebook with Vital back-end components for data exploration and analysis, using the Wordnet dataset as an example.

At Vital AI we use many tools to explore and analyze data, and chief among them are data science notebooks.  Examples include IPython/Jupyter and Zeppelin, plus similar products/services such as Databricks and RStudio.

One that has become a recent favorite is the Beaker Notebook ( http://beakernotebook.com/ ).  Beaker is open-source with a git repo on github ( https://github.com/twosigma/beaker-notebook ) under very active development.

Beaker fully embraces polyglot programming and supports many programming languages including Javascript, Python, R, and JVM languages including Groovy, Java, and Scala.

Scala is especially nice for integration with Apache Spark.  R of course is great for stats and visualization, and JavaScript is convenient for visualization and web dashboards, especially when using visualization libraries like D3.

At Vital AI we typically use JVM for production server applications in combination with Apache Spark, so having all options available in a single notebook makes data analysis a very agile process.

About Data Models

At Vital AI we model data by creating data models (aka ontologies) to capture the meaning of the data, and use these data models within our code.  This allows the meaning of the data to guide our analysis, as well as enable strong data standards – saving a huge amount of manual effort.

We create data models using the open standard OWL, and then generate code using the VitalSigns tool.  This data model code is then utilized within all data analysis and workflows.

At runtime, VitalSigns loads data models into the JVM in one of two ways: from the classpath that was specified when the JVM started via the ServiceLoader API or dynamically via a dynamic classloader.

By using the dynamic method, we can use the Vital Prime server as a “data model server” so that data models are discovered and loaded at run-time from the Prime server.  Thus the data models are kept in sync with data managed by the Prime server, so data analysis is always working with the latest data definitions.

As Groovy is a dynamic language on the JVM, we use Groovy for many data analysis scripts that use data models.

About Wordnet

One of our favorite datasets to use is Wordnet.  From the Wordnet website ( http://wordnet.princeton.edu ):

WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet’s structure makes it a useful tool for computational linguistics and natural language processing.

As Wordnet has the form of a graph – words linked to words linked to other words – it is very convenient for visualization.

VitalService API

The VitalService API is a standard API that includes methods for data queries, running analysis scripts (aka datascripts), and reading, saving, updating, and deleting data (so-called “CRUD” operations).  We use the VitalService API for working with data locally, accessing a database, or using a remote service.  This means we use the same API calls when we switch from working with data locally to working with a production service, so we can use the same code library throughout.

Application Architecture

vital-arch1

A full application stack may include a web application layer, a VitalService implementation such as Prime, a database such as DynamoDB, and an analysis environment based on Apache Spark and Hadoop.  Above, Prime is managing the data models (the “gear” icons) and provides “datascripts” to process data via a scripting interface.

 

vital-arch2

The above diagram focuses on the current case of Beaker Notebook where we are connecting to VitalService Prime as a client, synchronizing the data models, and sending queries to an underlying database.  In our example, the database contains the Wordnet data.

Some sample code to generate the Wordnet dataset is here: https://github.com/vital-ai/vital-examples/blob/master/vital-samples/src/main/groovy/ai/vital/samples/SampleWordnetGenerate.groovy

And some sample code to load the Wordnet data into VitalService is here:  https://github.com/vital-ai/vital-examples/blob/master/vital-samples/src/main/groovy/ai/vital/samples/SampleWordnetImport.groovy

Or more generally the vitalimport utility could be used, found in here: https://github.com/vital-ai/vital-utils

Back to the Beaker Notebook

Now that we have a few definitions out of the way, we can get back to using Beaker.

The previous version of Beaker had some limitations to loading JVM classes (see: https://github.com/twosigma/beaker-notebook/issues/2276 ) which are now fixed in Beaker’s github but not yet included in a released version.  We’re currently using a patched version here: https://github.com/vital-ai/beaker-notebook until the next release.

For this example, let’s query some data using VitalService and then visualize the resulting data using D3 with JavaScript.

Our example is based on the one found here: https://pub.beakernotebook.com/#/publications/560c9f9b-14e6-4d95-8e78-cc0a60bf4e5a?fullscreen=true

Our example will include three cells: a Groovy cell to do a query, a JavaScript cell to run D3 over the data, and an HTML cell to display the resulting graph.

The Groovy cell first connects to VitalService, like so:

VitalSigns vs = VitalSigns.get()

VitalServiceKey key = new VitalServiceKey().generateURI()
key.key = vs.getConfig("analyticsKey")

def service = VitalServiceFactory.openService(key, "prime", "AnalyticsService")

The code above initializes VitalSigns, sets an authentication key based upon a value in a configuration file, and connects to the VitalService endpoint.  Prime requires an authentication key for security.


vs.pipeline { ->

def builder = new VitalBuilder()

VitalGraphQuery q = builder.query {

// query for graphs like:
// node1(name:happy) ---edge--->node2

GRAPH {

value segments: ["wordnet"]
value inlineObjects: true

ARC {
// bind this node to name "node1"
node_bind { "node1" }

// include subclasses of SynsetNode: Noun, Verb, Adjective, Adverb
node_constraint { SynsetNode.expandSubclasses(true) }
node_constraint { SynsetNode.props().name.equalTo("happy") }

ARC {
// bind the node and edge to names "node2" and "edge"
edge_bind { "edge" }
node_bind { "node2" }
}
}
}
}.toQuery()

ResultList list = service.query( q )

// count the results
def j = 1

list.each {

// Use the binding names to get the URI values out of GraphMatch

def node1_uri = it."node1".toString()
def edge_uri = it."edge".toString()
def node2_uri = it."node2".toString()

// inlineObjects is true, which embeds unseen objects into the results
// if cache is null, get graph object out of GraphMatch results
// graph objects referenced via the URI

def node1 = vs.getFromCache(node1_uri) ?: it."$node1_uri"
def edge = vs.getFromCache(edge_uri) ?: it."$edge_uri"
def node2 = vs.getFromCache(node2_uri) ?: it."$node2_uri"

// add new ones into cache, doesn't hurt to refresh existing ones
vs.addToCache([node1, edge, node2])

// print out node1 --edge--> node2, with edge type (minus the namespace)
println j++ + ": " + node1.name + "---" + (edge.vitaltype.toString() - "http://vital.ai/ontology/vital-wordnet#") + "-->" + node2.name
}

}

service.close()

The above code performs a query for all Wordnet entries with the name “happy”, and then follows all links from those to other words, putting the results into a cache, as well as printing them out.

Note the use of data model objects in the code above, like “SynsetNode”, “VITAL_Node”, and “VITAL_Edge”.  Using these avoids having any code which directly parses data – the analysis code receives data objects which are “typed” according to the data model.

A screenshot:

querygraph

The result of the “println” statements is:

1: happy---Edge_WordnetSimilarTo-->laughing, riant
2: happy---Edge_WordnetAlsoSee-->joyful
3: happy---Edge_WordnetAlsoSee-->joyous
4: happy---Edge_WordnetSimilarTo-->golden, halcyon, prosperous
5: happy---Edge_WordnetAttribute-->happiness, felicity
6: happy---Edge_WordnetAlsoSee-->euphoric
7: happy---Edge_WordnetAlsoSee-->elated
8: happy---Edge_WordnetAlsoSee-->cheerful
9: happy---Edge_WordnetAlsoSee-->felicitous
10: happy---Edge_WordnetAlsoSee-->glad
11: happy---Edge_WordnetAlsoSee-->contented, content
12: happy---Edge_WordnetSimilarTo-->blissful
13: happy---Edge_WordnetSimilarTo-->blessed
14: happy---Edge_WordnetSimilarTo-->bright
15: happy---Edge_WordnetAttribute-->happiness

We then take all the nodes and edges in the cache and turn them into JSON data as D3 expects.


def nodes = []
  def links = []
   
  Iterator i = vs.getCacheIterator()
  
  def c = 0
  
  while(i.hasNext() ) {
     
    GraphObject g = i.next()
    
    if(g.isSubTypeOf(VITAL_Node)) {
        
      g."local:index" = c
      
      nodes.add ( "{\"name\": \"$g.name\", \"group\": $c}" )
      
      c++
      
    }
       
  }
   
  def max = c
  
  i = vs.getCacheIterator()
  
  while(i.hasNext() ) {
     
  GraphObject g = i.next()
    
    if(g.isSubTypeOf(VITAL_Edge)) {
       
        def srcURI = g.sourceURI
        def destURI = g.destinationURI
      
        def source = vs.getFromCache(srcURI)
        def destination = vs.getFromCache(destURI)
        
        def sourceIndex = source."local:index"
        def destinationIndex = destination."local:index"
      
      
      links.add (  "{\"source\": $sourceIndex, \"target\": $destinationIndex, \"value\": 10}"   ) 
         
    }
    
  }
  
  println "Graph:" + "{\"nodes\": $nodes, \"links\": $links}"
  
  beaker.graph = "{\"nodes\": $nodes, \"links\": $links}"

 

The last line above puts the data into a “beaker” object, which is the handoff point to other languages.

A screenshot of the results and the JSON:

resultsgraph

Then in a Javascript cell:


var graphstr = JSON.stringify(beaker.graph);

var graph = JSON.parse(graphstr)


var width = 800,
    height = 300;

var color = d3.scale.category20();

var force = d3.layout.force()
    .charge(-120)
    .linkDistance(100)
    .size([width, height]);

var svg = d3.select("#fdg").append("svg")
    .attr("width", width)
    .attr("height", height);

var drawGraph = function(graph) {
  force
      .nodes(graph.nodes)
      .links(graph.links)
      .start();

  var link = svg.selectAll(".link")
      .data(graph.links)
    .enter().append("line")
      .attr("class", "link")
      .style("stroke-width", function(d) { return Math.sqrt(d.value); });

  var gnodes = svg.selectAll('g.gnode')
     .data(graph.nodes)
     .enter()
     .append('g')
     .classed('gnode', true);
    
  var node = gnodes.append("circle")
      .attr("class", "node")
      .attr("r", 10)
      .style("fill", function(d) { return color(d.group); })
      .call(force.drag);

  var labels = gnodes.append("text")
      .text(function(d) { return d.name; });

  
  force.on("tick", function() {
    link.attr("x1", function(d) { return d.source.x; })
        .attr("y1", function(d) { return d.source.y; })
        .attr("x2", function(d) { return d.target.x; })
        .attr("y2", function(d) { return d.target.y; });

    gnodes.attr("transform", function(d) { 
        return 'translate(' + [d.x, d.y] + ')'; 
    });
      
    
      
  });
};

drawGraph(graph);

Screenshot:

jsgraph

Note the handoff of the “beaker.graph” object in the beginning of the JavaScript code.  It may be a bit tricky to get the data exchanges right so that JSON produced on the groovy side is interpreted as JSON on the JavaScript side, or vice-versa.  Beaker provides auto-translation for various data structures including DataFrames, but it still takes some trial and error to get it right.
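
One way to reduce that trial and error on the Groovy side is to build the structure as plain maps and lists and let Groovy’s bundled groovy.json.JsonOutput produce the JSON string, instead of concatenating strings by hand.  A small sketch with made-up node data:

import groovy.json.JsonOutput

def nodes = [ [name: 'happy', group: 0], [name: 'joyful', group: 1] ]
def links = [ [source: 0, target: 1, value: 10] ]

// the beaker object is available inside a Beaker Groovy cell
beaker.graph = JsonOutput.toJson([nodes: nodes, links: links])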

The above JavaScript code comes from the Beaker example project, plus this Stack Overflow answer which discusses adding labels to graphs:  http://stackoverflow.com/questions/18164230/add-text-label-to-d3-node-in-force-directed-graph-and-resize-on-hover

In the last Beaker cell, we include the HTML to be the “target” of the JavaScript code:


<style>
.node {
  stroke: #fff;
  stroke-width: 1.5px;
}

.link {
  stroke: #999;
  stroke-opacity: .6;
}
</style>
<div id="fdg"></div>

And a screenshot of the HTML cell with the resulting D3 graph.

happygraph2

Hope you have enjoyed this walkthrough of using Beaker with the VitalService interface, and visualizing query results in a graph with D3.

Please ask any questions in the comments section, or send them to us at info@vital.ai.

Happy New Year!

Vital AI

Optimizing the Data Supply Chain for Data Science

I gave a talk at the Enterprise Dataversity conference in Chicago in November.

The title of the talk was:

“Optimizing the Data Supply Chain for Data Science”.


Below are the slides from that presentation.

Here is a quick summary of the talk:

The Data Supply Chain is the next step in the progression of large scale data management: starting with a “traditional” Data Warehouse, moving to a Hadoop-based environment such as a Data Lake, then to a Microservice Oriented Architecture (microservices across a set of independently managed Hadoop clusters, “Micro-SOA”), and now to the Data Supply Chain, which adds additional data management and coordination processes to produce high quality Data Products across independently managed environments.

A Data Product can be any data service such as an eCommerce recommendation system, a Financial Services fraud/compliance predictive service, or Internet of Things (IoT) logistics optimization service.  As a specific example, loading the Amazon.com website triggers more than 170 Data Products predicting consumer sentiment, likely purchases, and much more.

The “Data Supply Chain” (DSC) is a useful metaphor for how a “Data Product” is created and delivered.  Just like a physical “Supply Chain”, data is sourced from a variety of suppliers.  The main difference is that a Data Product can be a real-time combination of all the suppliers at once as compared to a physical product which moves linearly along the supply chain.  However, very often data does flow linearly across the supply chain and becomes more refined downstream.

Each participant of a DSC may be an independent organization, a department within a large organization, or a combination of internal and external data suppliers — such as combining internal sales data with social media data.

As each participant in the DSC may have its own model of data, combining data from many sources can be very challenging due to incompatible assumptions.  As a simple example, a “car engine supplier” considers a “car engine” as a finished “product“, whereas a “car manufacturer” considers a “car engine” to be a “car part” and a finished car as a “product“, therefore the definitions of “product” and “car engine” are inconsistent.

As there is no central definition of data as each data supplier is operating independently, there must be an independent mechanism to capture metadata to assist flowing data across the DSC.

At Vital AI, we use semantic data models to capture data models across the DSC.  The models capture all the implicit assumptions in the data, and facilitate moving data across the DSC and building Data Products.

We generate code from the semantic data models which then automatically drives ETL processes, data mapping, queries, machine learning, and predictive analytics — allowing data products to be created and maintained with minimal effort while data sources continue to evolve.

Creating semantic data models not only facilitates creating Data Products, but also provides a mechanism to develop good data standards — Data Governance — across the DSC.  Data Governance is a critical part of high quality Data Science.

As code generated from semantic data models is included at all levels of the software stack, semantic data models also provide a mechanism to keep the interpretation of data consistent across the stack including in User Interfaces, Data Infrastructure (databases), and Data Science including predictive models.

As infrastructure costs continue to fall, the primary cost component of high quality Data Products is human labor.  The use of technologies such as semantic data models to optimize the Data Supply Chain and minimize human labor becomes more and more critical.

To learn more about the Data Supply Chain and Data Products, including how to apply semantic data models to minimize the effort, please contact us at Vital AI!

— Marc Hadfield

Email: info@vital.ai
Telephone: 1.917.463.4776