
Annual Meeting: 2006 InfoLab Workshop


March 22, 2006 

The InfoLab Workshop was held on March 22, 2006, in the Frances C. Arrillaga Alumni Center at Stanford University, and was chaired by Professors Hector Garcia-Molina and Jennifer Widom. The workshop brings together our industrial partners, alumni, and academic colleagues with research interests covering a variety of areas related to information management. We report on Stanford's latest research projects and hear about key problems and issues in industry.

Time Agenda PDF
8:30 AM Check-in & Continental Breakfast  
9:00 AM Welcome & InfoLab Overview
Professor Hector Garcia-Molina
9:15 AM Utkarsh Srivastava
Query Processing in a Web Service Management System

Web services are rapidly taking hold as a standard method of sharing data and functionality among loosely coupled systems, not only across the web but also within enterprises. At Stanford, we have begun developing a general-purpose Web Service Management System (WSMS) whose goal is to enable querying multiple web services in an integrated and efficient fashion. This talk discusses the first step toward this general goal: optimizing Select-Project-Join queries spanning multiple web services. I will describe algorithms we devised for arranging web services into a pipelined execution plan that minimizes the total running time of the query. I will also report some experimental results with our initial prototype and outline many remaining challenges in realizing a general-purpose WSMS.


Utkarsh Srivastava is a Computer Science Ph.D. candidate in the InfoLab at Stanford University, currently leading the Web Service Management System (WSMS) project. He also contributed to the Stanford Data Stream project during his graduate work. He holds a B.Tech. in Computer Science from IIT Kanpur and is a recipient of a Stanford Graduate Fellowship and a Microsoft Graduate Fellowship.

9:45 AM Zoltan Gyongyi
Link Spam Detection Based on Mass Estimation

Link spamming aims to mislead search engines and trigger an artificially high link-based ranking of specific target web pages. In this talk I introduce the concept of spam mass, a measure of the impact of link spamming on a page's ranking. I discuss how to estimate spam mass and how the estimates can help identify pages that benefit significantly from link spamming. I conclude by presenting our experiments on the host-level Yahoo! web graph, in which we use spam mass estimates to successfully identify tens of thousands of instances of heavy-weight link spamming.


Joint work with Pavel Berkhin and Jan Pedersen from Yahoo! and Hector Garcia-Molina from Stanford. 

Zoltan Gyongyi is a PhD candidate in Computer Science at Stanford University. His primary research interest is web search. Recently he has been working on improving the relevance of search results by combating various types of web spam.

10:45 AM Professor Jennifer Widom
Trio: A System for Data, Uncertainty, and Lineage

Trio is a new type of database system that manages uncertainty and lineage of data as first-class concepts, along with the data itself. Uncertainty and lineage arise in a variety of data-intensive applications, including data cleaning, data integration, scientific and sensor data management, and information extraction. This talk will provide an overview of: the new "ULDB" model upon which the Trio system is built; Trio's SQL-based query language (TriQL); a variety of new theoretical challenges and results; Trio's initial prototype implementation; and finally our overall research plan.


Jennifer Widom is a Professor in the Computer Science and Electrical Engineering Departments at Stanford University. She received her Bachelor's degree from the Indiana University School of Music in 1982 and her Computer Science Ph.D. from Cornell University in 1987. She was a Research Staff Member at the IBM Almaden Research Center before joining the Stanford faculty in 1993. Her research interests span many aspects of nontraditional data management. She is an ACM Fellow and a member of the National Academy of Engineering, was a Guggenheim Fellow, and has served on a variety of program committees, advisory boards, and editorial boards.

11:15 AM Omar Benjelloun
Entity Resolution in SERF

In the SERF project, we consider the Entity Resolution (ER) problem, in which records determined to represent the same real-world entity (e.g., a person, or a product) are successively located and merged.


Our approach treats the functions for comparing and merging records as black boxes, which permits generic, extensible ER solutions. This talk will give an overview of our generic framework for ER, and show how simple properties satisfied by the black-box functions enable efficient and deterministic ER algorithms. I will also sketch extensions of our framework to handle numerical confidences associated with the records, and to reduce the cost of ER by parallelizing the computation across multiple processors.

Omar Benjelloun is currently a postdoc in the Stanford InfoLab, working hard on tough data management questions such as: "Where does uncertainty come from?" (TRIO project) or: "Are we who the data says we are?" (SERF project). Omar did his PhD in Computer Science at INRIA, France, developing Active XML, a framework for distributed data management based on XML and Web Services. Previously, he worked as a software developer for the Klee Group, a French "E-business solutions provider", for INRIA, and for the French Navy (because he had to). He also holds an engineering degree in telecommunications from Telecom Paris (ENST).

11:45 AM Lunch  
1:30 PM Andreas Paepcke
BioACT Project

I will introduce our BioACT project. This NSF-funded effort brings together computer scientists and biodiversity researchers. One frequent challenge in this area of biology is that information is generated and consumed out in the field, where daylight hours are precious, battery replenishment is intermittent, and the benefits of computation are weighed against the strain of carrying the associated equipment. We are developing tools that allow biodiversity researchers to identify species and generate field notes under these challenging conditions, yet bring the full power of computation to bear when it is available back in the lab.


Dr. Andreas Paepcke is a Senior Research Scientist and director of the Digital Library and BioACT Projects at Stanford University. His interests include user interfaces for small devices, novel Web search facilities, and browsing facilities for digital artifacts that are difficult to index. With his group of students he has designed and implemented WebBase, an experimental storage and high-speed dissemination system for Web content. His work on small devices has focused on novel methods for summarizing and transforming Web pages, and on browsing images on small displays. Dr. Paepcke has served on numerous program committees, including a position as Vice Program Chair, heading the World-Wide Web Conference's 'Browsers and User Interfaces' program track. He has served on several National Science Foundation proposal evaluation panels. Dr. Paepcke received BS and MS degrees in applied mathematics from Harvard University, and a Ph.D. in Computer Science from the University of Karlsruhe, Germany. Previously, he worked as a researcher at Hewlett-Packard Laboratory, and as a research consultant at Xerox PARC.

2:00 PM Jeff Klingner
Current Visualization Research in the Stanford Graphics Lab

I will survey current information visualization research projects in the Stanford Graphics Lab. We're working on analysis and visualization of large graphs; a collaboration with the sociology department studying the spread and development of news stories; work on visual interfaces for associative semantic dataspaces; a medical imaging project looking at how to locate and interact with brain pathways; and a large project on network intrusion detection including visualization and analytical support for situational awareness and intrusion forensics. I will highlight connections with Stanford database research.


Jeff Klingner is a Ph.D. candidate in the Stanford Graphics Lab working with Pat Hanrahan. His work focuses on the interactive analysis and visualization of large graphs. He has been awarded the Stanford Graduate Fellowship and the National Science Foundation Graduate Research Fellowship.

3:00 PM Industry Panel
Exciting Problems in Search and the Web
Panelists include:
Andrei Broder, Yahoo!
Jim Gray, Microsoft Research
Alon Halevy, Google
Anant Jhingran, IBM
Anand Rajaraman, Cambrian Ventures

Broder: PDF

Halevy: PDF

Jhingran: PDF

Rajaraman: PDF