Arvind Narayanan: 2012 Security Session


Monday, April 2, 2012
Location: Fisher Conference Center, Arrillaga Alumni Center

"Is Writing Style Sufficient to Deanonymize Material Posted Online?"
1:30pm - 2:00pm


I will present the results of a recent Stanford-Berkeley research collaboration on identifying an anonymous author via linguistic stylometry, i.e., by comparing the writing style against a corpus of texts of known authorship. We experimentally demonstrated the effectiveness of our techniques with as many as 100,000 candidate authors. Given the increasing availability of writing samples online, our result has serious implications for anonymity and free speech: an anonymous blogger or whistleblower may be unmasked unless they take steps to obfuscate their writing style.

While there is a huge body of literature on authorship recognition based on writing style, prior to our work almost none of it studied corpora of more than a few hundred authors. The problem becomes qualitatively different with large datasets, and techniques from prior work fail to scale in terms of both accuracy and performance. We studied a variety of classifiers and showed how to handle the huge number of classes (authors). We also developed novel techniques for confidence estimation of classifier outputs. Finally, we demonstrated stylometric authorship recognition on texts written in different contexts.
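
To make the idea of stylometric comparison concrete, here is a minimal sketch in Python. It is not the method used in this work: the feature set (a short, hypothetical list of English function words), the cosine-similarity scoring, and the names `features`, `cosine`, and `rank_authors` are all assumptions chosen for illustration, and a toy like this would not scale to 100,000 authors.

```python
import math
import re
from collections import Counter

# Hypothetical feature set: a handful of common English function words.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in",
                  "that", "is", "i", "it", "but", "so"]


def features(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = max(len(words), 1)  # avoid division by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_authors(anonymous_text, corpus):
    """Rank candidate authors by similarity to the anonymous text.

    corpus maps each author name to a list of texts of known authorship.
    Returns (author, score) pairs, most similar first.
    """
    target = features(anonymous_text)
    scores = {
        author: cosine(target, features(" ".join(texts)))
        for author, texts in corpus.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


corpus = {
    "A": ["the cat of the house is in the garden and the dog is in the yard"],
    "B": ["i think that it is so but i know that it is not so"],
}
ranking = rank_authors(
    "the bird of the forest is in the tree and the fox is in the den", corpus
)
print(ranking[0][0])  # author "A" has the closest function-word profile
```

The ranked list also gives a natural notion of "top k guesses": the correct author may not be first but can still appear among the highest-scoring candidates.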

In over 20% of cases, our classifiers can correctly identify an anonymous author given a corpus of texts from 100,000 authors; in about 35% of cases the correct author is one of the top 20 guesses. If we allow the classifier to abstain from guessing, confidence estimation lets us increase the precision of the top guess from 20% to over 80% while only halving recall.
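
The precision and recall trade-off can be checked with a little arithmetic. The sketch below assumes the usual definitions (precision is the share of guesses that are correct, recall is the share of all anonymous texts that are correctly identified) and uses round figures of 20% and 80% in place of "over 20%" and "over 80%". The toy data, the threshold value, and the function names are hypothetical and are not taken from the experiments.

```python
def precision_recall(predictions, threshold):
    """Compute precision and recall when the classifier may abstain.

    predictions is a list of (confidence, is_correct) pairs, one per
    anonymous text. The classifier abstains when confidence < threshold.
    """
    answered = [ok for conf, ok in predictions if conf >= threshold]
    correct = sum(answered)
    precision = correct / len(answered) if answered else 0.0
    recall = correct / len(predictions) if predictions else 0.0
    return precision, recall


def answered_fraction(precision, recall):
    """Share of cases on which a guess is made: recall / precision."""
    return recall / precision


# Toy data: 40 anonymous texts, 8 of which the top guess gets right.
predictions = (
    [(0.95, True)] * 4      # confident and correct
    + [(0.92, False)] * 1   # confident but wrong
    + [(0.50, True)] * 4    # correct, but low confidence
    + [(0.30, False)] * 31  # wrong, low confidence
)

always_guess = precision_recall(predictions, threshold=0.0)
with_abstention = precision_recall(predictions, threshold=0.9)
print(always_guess)     # precision and recall are both 8/40
print(with_abstention)  # precision 4/5, recall 4/40
print(answered_fraction(*with_abstention))
```

Under these assumptions, raising precision from 20% to 80% while recall falls from 20% to 10% means the classifier commits to a guess on only about one case in eight and abstains on the rest.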

Joint work with Hristo Paskov, Neil Zhenqiang Gong, John Bethencourt, Emil Stefanov, Eui Chul Richard Shin, Dawn Song.


Arvind Narayanan is a post-doctoral computer science researcher at Stanford and a junior affiliate scholar at the Stanford Law School Center for Internet and Society. He completed his Ph.D. at UT Austin in 2009. He studies information privacy and security, and moonlights in policy.

Narayanan's doctoral work exposed the problems with data anonymization. His paper on deanonymization of large datasets won the 2008 Privacy Enhancing Technologies award. Narayanan's more recent work has focused on privacy-conscious system design in the areas of online behavioral advertising, including Do Not Track, and location privacy.