I gave a talk at the Enterprise Dataversity conference in Chicago in November.
The title of the talk was:
“Optimizing the Data Supply Chain for Data Science”.
Below are the slides from that presentation.
Here is a quick summary of the talk:
The Data Supply Chain is the next step in the progression of large-scale data management: starting with a “traditional” Data Warehouse, moving to a Hadoop-based environment such as a Data Lake, then to a Microservice Oriented Architecture (microservices across a set of independently managed Hadoop clusters, “Micro-SOA”), and now to the Data Supply Chain, which adds data management and coordination processes to produce high-quality Data Products across independently managed environments.
A Data Product can be any data service such as an eCommerce recommendation system, a Financial Services fraud/compliance predictive service, or an Internet of Things (IoT) logistics optimization service. As a specific example, loading the Amazon.com website triggers more than 170 Data Products predicting consumer sentiment, likely purchases, and much more.
The “Data Supply Chain” (DSC) is a useful metaphor for how a “Data Product” is created and delivered. Just as in a physical supply chain, data is sourced from a variety of suppliers. The main difference is that a Data Product can be a real-time combination of all the suppliers at once, whereas a physical product moves linearly along the supply chain. In practice, though, data often does flow linearly across the supply chain, becoming more refined downstream.
Each participant in a DSC may be an independent organization, a department within a large organization, or a combination of internal and external data suppliers — such as combining internal sales data with social media data.
As each participant in the DSC may have its own data model, combining data from many sources can be very challenging due to incompatible assumptions. As a simple example, a car engine supplier considers a “car engine” a finished “product”, whereas a car manufacturer considers a “car engine” a “part” and the finished car a “product”; the two participants’ definitions of “product” and “car engine” are therefore inconsistent.
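This kind of mismatch can be sketched as a mapping problem: each participant has a local vocabulary, and a shared layer resolves local terms to supply-chain-wide concepts. The names below (participant keys, `dsc:` concept identifiers, the `resolve` helper) are illustrative assumptions, not Vital AI's actual model or APIs:

```python
# Hypothetical sketch: the same term means different things to two DSC participants.

# Each participant's local model classifies the term differently.
engine_supplier_model = {"car engine": "Product"}
car_maker_model = {"car engine": "Part", "car": "Product"}

# A shared mapping layer resolves a participant's local term to a
# supply-chain-wide concept, so "car engine" is unambiguous downstream.
shared_concepts = {
    ("engine_supplier", "car engine"): "dsc:CarEngine",
    ("car_maker", "car engine"): "dsc:CarEngine",
    ("car_maker", "car"): "dsc:Car",
}

def resolve(participant: str, term: str) -> str:
    """Map a participant's local term to the shared concept identifier."""
    return shared_concepts[(participant, term)]

# Both participants' "car engine" now refers to the same shared concept,
# even though their local classifications (Product vs. Part) differ.
assert resolve("engine_supplier", "car engine") == resolve("car_maker", "car engine")
```

A semantic data model plays the role of `shared_concepts` here, but at the level of whole schemas rather than single terms.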
Because each data supplier operates independently, there is no central definition of the data; an independent mechanism is therefore needed to capture metadata and assist in moving data across the DSC.
At Vital AI, we use semantic data models to capture data models across the DSC. The models capture all the implicit assumptions in the data, and facilitate moving data across the DSC and building Data Products.
We generate code from the semantic data models which then automatically drives ETL processes, data mapping, queries, machine learning, and predictive analytics — allowing data products to be created and maintained with minimal effort while data sources continue to evolve.
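As a rough illustration of generating code from a data model, the sketch below builds a Python class from a minimal model definition, validating property types on construction. The model format, the `Customer` class, and the `generate_class` helper are all hypothetical stand-ins, not Vital AI's actual model files or generated code:

```python
# Hypothetical sketch: a minimal semantic model as a class with typed properties.
model = {
    "class": "Customer",
    "properties": {"name": str, "lifetime_value": float},
}

def generate_class(model):
    """Generate a Python class whose attributes mirror the model's
    properties, rejecting unknown names and wrong types."""
    props = model["properties"]

    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            if key not in props:
                raise AttributeError(f"unknown property: {key}")
            if not isinstance(value, props[key]):
                raise TypeError(f"{key} must be {props[key].__name__}")
            setattr(self, key, value)

    return type(model["class"], (), {"__init__": __init__})

Customer = generate_class(model)
c = Customer(name="Acme Corp", lifetime_value=125000.0)
```

Because ETL jobs, queries, and predictive models would all consume classes generated from the same model, a schema change propagates by regenerating the code rather than hand-editing each layer.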
Creating semantic data models not only facilitates creating Data Products, but also provides a mechanism to develop good data standards — Data Governance — across the DSC. Data Governance is a critical part of high quality Data Science.
As code generated from semantic data models is included at all levels of the software stack, semantic data models also keep the interpretation of data consistent across the stack, from User Interfaces to Data Infrastructure (databases) to Data Science, including predictive models.
As infrastructure costs continue to fall, human labor becomes the primary cost component of high-quality Data Products. The use of technologies such as semantic data models to optimize the Data Supply Chain and minimize human labor therefore becomes increasingly critical.
To learn more about the Data Supply Chain and Data Products, including how to apply semantic data models to minimize the effort, please contact us at Vital AI!
— Marc Hadfield