2015 Data Science Workshop


Wed, April 29, 2015
Location: Fisher Conference Center, Arrillaga Alumni Center

"Mapping the “Social Genome"


The initial research plan is built around three interrelated levels of analysis: individual, group, and society. At each level, we are investigating the interplay between static and dynamic properties, and paying special attention to the ethical and economic issues that arise when confronting major scientific challenges like this one. Our ultimate goal is to identify ways in which scientists, engineers, community builders, and community leaders can contribute to the development of more productive, vibrant, and informed teams, online and offline communities, and societies.

The goal of this project is to develop data science tools and statistical models that bring networks and language together in order to make more and better predictions about both. Our focus is on joint models of language and network structure. This approach brings natural language processing and social network analysis together to provide a detailed picture not only of what is being said in a community, but also of who is saying it, how the information is transmitted through the network, how that transmission affects network structure, and, coming full circle, how those evolving structures affect linguistic expression. We plan to develop statistical models using diverse data sets, including not only online social networks (Twitter, Reddit, Facebook), but also hyperlink networks of news outlets (using massive corpora we collect on an ongoing basis) and networks of political groups, labs, and corporations.

Leskovec maintains a large collection of network and language data sets at the website for the Stanford Network Analysis Project (SNAP, http://snap.stanford.edu). The pilot work described in general terms here relies mainly on resources that have been posted on SNAP for public use. (In some cases, privacy or business concerns preclude such distribution.) Moreover, we have access to several powerful, comprehensive data sets: (i) cell phone call traces of entire countries; (ii) complete article commenting and voting records from sites such as CNN, NPR, and FOX; (iii) a near-complete picture of U.S. media: 10 billion blog posts and news articles (5 million per day over the last six years); (iv) complete Twitter, LinkedIn, and Facebook data (through direct collaboration with these companies); and (v) five years of email logs from a medium-sized company.