Annual Meeting: 2006 InfoLab Workshop
March 22, 2006
The InfoLab Workshop was held on March 22 in the Frances C. Arrillaga Alumni Center at Stanford University, chaired by Professors Hector Garcia-Molina and Jennifer Widom. The workshop brings together our industrial partners, alumni, and academic colleagues with research interests covering a variety of areas related to information management. We report on Stanford's latest research projects and hear about key problems and issues in industry.
|8:30 AM||Check-in & Continental Breakfast|
|9:00 AM||Welcome & InfoLab Overview
Professor Hector Garcia-Molina
|9:15 AM||Utkarsh Srivastava
Query Processing in a Web Service Management System
Web services are rapidly taking hold as a standard method of sharing data and functionality among loosely-coupled systems, not only across the web but also within enterprises. At Stanford, we have begun developing a general-purpose Web Service Management System (WSMS) whose goal is to enable querying multiple web services in an integrated and efficient fashion. This talk discusses the first step toward this general goal: optimizing Select-Project-Join queries spanning multiple web services. I will describe algorithms we devised for arranging web services into a pipelined execution plan that minimizes the total running time of the query. I will also report some experimental results with our initial prototype and outline many remaining challenges in realizing a general-purpose WSMS.
|9:45 AM||Zoltan Gyongyi
Link Spam Detection Based on Mass Estimation
Link spamming is intended to mislead search engines and trigger an artificially high link-based ranking of specific target web pages. In this talk I introduce the concept of spam mass, a measure of the impact of link spamming on a page's ranking. I discuss how to estimate spam mass and how the estimates can help identify pages that benefit significantly from link spamming. I conclude by presenting our experiments on the host-level Yahoo! web graph, in which we use spam mass estimates to successfully identify tens of thousands of instances of heavy-weight link spamming.
Joint work with Pavel Berkhin and Jan Pedersen from Yahoo! and Hector Garcia-Molina from Stanford.
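The abstract does not define spam mass, so the following is only a rough sketch under stated assumptions, not necessarily the formulation presented in the talk. It assumes a small hand-made graph, a hypothetical "good core" of trusted nodes, and a simplified linear PageRank. A page's relative spam mass is taken here to be the fraction of its PageRank that does not originate from the good core.

```python
def linear_pagerank(out_links, jump, damping=0.85, iterations=100):
    """Simplified linear PageRank; `jump` gives each node's random-jump weight."""
    score = dict(jump)
    for _ in range(iterations):
        # Every node first receives its share of the random jump.
        nxt = {node: (1.0 - damping) * jump[node] for node in out_links}
        # Then each node splits its damped score evenly among its out-links.
        for node, targets in out_links.items():
            if targets:
                share = damping * score[node] / len(targets)
                for t in targets:
                    nxt[t] += share
        score = nxt
    return score


# Hypothetical toy graph: three good pages and a small link farm.
out_links = {
    "g1": ["g2", "g3"],
    "g2": ["g3"],
    "g3": ["g1"],
    "s1": ["target"],
    "s2": ["target"],
    "s3": ["target"],
    "target": ["s1", "s2", "s3"],
}
good_core = {"g1", "g2"}  # assumption: a known set of trusted pages

n = len(out_links)
uniform_jump = {node: 1.0 / n for node in out_links}
core_jump = {node: (1.0 / n if node in good_core else 0.0) for node in out_links}

total = linear_pagerank(out_links, uniform_jump)   # ordinary PageRank
from_core = linear_pagerank(out_links, core_jump)  # PageRank reachable from the core

# Relative spam mass: share of a page's score not explained by the good core.
relative_mass = {
    node: (total[node] - from_core[node]) / total[node] for node in out_links
}
```

In this toy graph the link-farm target receives no score from the good core, so its relative mass is the maximum possible, while the good page outside the core scores lower. A real estimator would also have to cope with an incomplete good core, which this sketch ignores.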
|10:45 AM||Professor Jennifer Widom
Trio: A System for Data, Uncertainty, and Lineage
Trio is a new type of database system that manages uncertainty and lineage of data as first-class concepts, along with the data itself. Uncertainty and lineage arise in a variety of data-intensive applications, including data cleaning, data integration, scientific and sensor data management, and information extraction. This talk will provide an overview of: the new "ULDB" model upon which the Trio system is built; Trio's SQL-based query language (TriQL); a variety of new theoretical challenges and results; Trio's initial prototype implementation; and finally our overall research plan.
|11:15 AM||Omar Benjelloun
Entity Resolution in SERF
In the SERF project, we consider the Entity Resolution (ER) problem, in which records determined to represent the same real-world entity (e.g., a person, or a product) are successively located and merged.
Our approach treats the functions for comparing and merging records as black boxes, which permits generic, extensible ER solutions. This talk will give an overview of our generic framework for ER, and show how simple properties satisfied by the black-box functions enable efficient and deterministic ER algorithms. I will also sketch extensions of our framework to handle numerical confidences associated with the records, and to reduce the cost of ER by parallelizing the computation across multiple processors.
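As an illustration of treating comparison and merging as black boxes, here is a minimal sketch of a generic match-and-merge loop. It is not claimed to be the SERF algorithm itself; the record representation and the `share_email` and `union_merge` functions are hypothetical stand-ins for whatever black-box functions an application would supply.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

# Assumption: a record is a set of (attribute, value) pairs.
Record = FrozenSet[Tuple[str, str]]


def resolve(
    records: Iterable[Record],
    match: Callable[[Record, Record], bool],
    merge: Callable[[Record, Record], Record],
) -> List[Record]:
    """Repeatedly merge matching records until no two results match."""
    pending = list(records)
    resolved: List[Record] = []
    while pending:
        current = pending.pop()
        # Look for an already-resolved record that matches the current one.
        partner = next((r for r in resolved if match(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            # The merged record may match others, so it is re-examined.
            resolved.remove(partner)
            pending.append(merge(current, partner))
    return resolved


def share_email(a: Record, b: Record) -> bool:
    """Hypothetical match function: two records match if they share an email."""
    emails_a = {value for key, value in a if key == "email"}
    emails_b = {value for key, value in b if key == "email"}
    return bool(emails_a & emails_b)


def union_merge(a: Record, b: Record) -> Record:
    """Hypothetical merge function: keep every attribute value seen so far."""
    return a | b


records = [
    frozenset({("name", "J. Doe"), ("email", "jd@example.com")}),
    frozenset({("name", "John Doe"), ("email", "jd@example.com"),
               ("email", "john@example.org")}),
    frozenset({("name", "Johnny"), ("email", "john@example.org")}),
    frozenset({("name", "Alice"), ("email", "alice@example.net")}),
]
entities = resolve(records, share_email, union_merge)
```

The loop never inspects record contents directly; it only calls `match` and `merge`, which is what makes the approach generic. Because each merge reduces the number of records by one, the loop terminates.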
|1:30 PM||Andreas Paepcke
I will introduce our BioACT project. This NSF-funded effort brings together computer scientists and biodiversity researchers. One frequent challenge in this area of biology is that information is generated and consumed out in the field, where daylight hours are precious, battery replenishment is intermittent, and the benefits of computation are weighed against the strain of carrying the associated equipment. We are developing tools that allow biodiversity researchers to identify species and generate field notes under these challenging conditions, yet bring the full power of computation to bear when it is available back in the lab.
|2:00 PM||Jeff Klingner
Current Visualization Research in the Stanford Graphics Lab
I will survey current information visualization research projects in the Stanford Graphics Lab. We're working on analysis and visualization of large graphs; a collaboration with the sociology department studying the spread and development of news stories; work on visual interfaces for associative semantic dataspaces; a medical imaging project looking at how to locate and interact with brain pathways; and a large project on network intrusion detection including visualization and analytical support for situational awareness and intrusion forensics. I will highlight connections with Stanford database research.
|3:00 PM||Industry Panel
Exciting Problems in Search and the Web
Andrei Broder, Yahoo!
Jim Gray, Microsoft Research
Alon Halevy, Google
Anant Jhingran, IBM
Anand Rajaraman, Cambrian Ventures