Chris Ré: 2017 Plenary Session


Tuesday, April 11, 2017
Location: McCaw Hall, Arrillaga Alumni Center

"DeepDive and Snorkel: Dark Data Systems"



Building applications that can read and analyze a wide variety of data may change the way we do science, make business decisions, and develop policy. However, building such applications is challenging: real-world data is expressed in natural language, images, or other "dark" data formats that are fraught with imprecision and ambiguity and are therefore difficult for machines to understand. This talk describes DeepDive, a new type of system designed to cope with Dark Data by combining extraction, integration, and prediction into one system. For some paleobiology and materials science tasks, DeepDive-based systems have surpassed human volunteers in quantity and quality (recall and precision) of extracted information. DeepDive is in daily use by scientists in areas including genomics and drug repurposing, by a number of companies involved in various forms of search, and by law enforcement in the fight against human trafficking.

This talk will also describe Snorkel, whose goal is to make routine Dark Data tasks dramatically easier. At its core, Snorkel focuses on a key bottleneck in the development of machine learning systems: the lack of large training datasets. In Snorkel, a user implicitly creates large training sets by writing simple programs that label data, instead of performing manual feature engineering or tedious hand-labeling of individual data items. We'll describe our preliminary evidence that the Snorkel approach allows a broader set of users to write dark data programs more efficiently than previous approaches. We will also describe the underlying theory, in particular our recent work on new convergence guarantees for Gibbs sampling and large-scale non-convex optimization which play a key role in enabling Snorkel to scale.
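To make the idea of "simple programs that label data" concrete, here is a minimal sketch in plain Python. It does not use the Snorkel library or its API. The task (deciding whether a sentence describes a spouse relationship), the three rule functions, and the majority_vote helper are all hypothetical names chosen for illustration. Majority voting is a deliberate simplification standing in for Snorkel's approach, which instead learns how much to trust each labeling program.

```python
from collections import Counter

# Illustrative label values (assumptions, not Snorkel constants).
POSITIVE = 1
NEGATIVE = 0
ABSTAIN = None


def lf_mentions_married(sentence):
    """Hypothetical rule: vote positive if the sentence mentions marriage."""
    return POSITIVE if "married" in sentence.lower() else ABSTAIN


def lf_mentions_spouse_word(sentence):
    """Hypothetical rule: vote positive on the words 'wife' or 'husband'."""
    text = sentence.lower()
    return POSITIVE if ("wife" in text or "husband" in text) else ABSTAIN


def lf_mentions_sibling(sentence):
    """Hypothetical rule: vote negative if a sibling word appears."""
    text = sentence.lower()
    return NEGATIVE if ("brother" in text or "sister" in text) else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_married, lf_mentions_spouse_word, lf_mentions_sibling]


def majority_vote(sentence, labeling_functions=LABELING_FUNCTIONS):
    """Combine the noisy votes; abstain when there are no votes or a tie."""
    votes = [lf(sentence) for lf in labeling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tied vote: no training label for this sentence
    return ranked[0][0]


def build_training_set(sentences):
    """Keep only the sentences that received a label."""
    labeled = [(s, majority_vote(s)) for s in sentences]
    return [(s, y) for s, y in labeled if y is not ABSTAIN]
```

Calling build_training_set on a list of unlabeled sentences yields (sentence, label) pairs that could then train an ordinary classifier. The rules are cheap to write and individually unreliable, and the value comes from applying many of them to far more data than could be labeled by hand.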

DeepDive and Snorkel are open source on GitHub and available from DeepDive.Stanford.Edu and


Chris Ré is an associate professor in the InfoLab who is affiliated with the Statistical Machine Learning Group, PPL, and SAIL. He works on the foundations of the next generation of data analytics systems. These systems extend ideas from databases, machine learning, and theory, and our group is active in all areas. A major application of our work is to make it dramatically easier to build high-quality machine learning systems to process dark data including text, images, and video, e.g., Snorkel.

The DeepDive project is commercialized as Lattice. Our code is on GitHub.