2014 Poster Sessions : Efficient Similarity Queries via Lossy Compression

Student Name : Idoia Ochoa
Advisor : Tsachy Weissman
Research Areas: Information Systems
The generation of new databases and the increase in data on existing ones is growing exponentially. As a result, executing queries on existing databases is becoming a challenging
task. With this in mind, we study the problem of compressing sequences in a database so that similarity queries can still be
perform efficiently on the compressed database. The fundamental limits of this problem characterize the tradeoff between compression rate and the reliability of the queries performed on the compressed data. While those asymptotic limits have been studied in the past, it is unclear how to implement such schemes in practice. In this paper we propose a method for this task,
based in part on existing lossy compression algorithms.

Specifically, we consider queries of the form: “which sequences in the database are similar to a given sequence y?”. We demonstrate the performance of our scheme when the Hamming distortion measures the similarity between sequences, and show that our scheme’s performance is comparable to that predicted by the fundamental limits. Furthermore, we test our scheme on a sample database of actual DNA sequences, and show significant compression while still allowing highly reliable query answers.

Idoia Ochoa received his telecommunications engineering degree (a combined B.Sc. and M.Sc. electrical engineering degree) in 2009 from the University of Navarra, Spain. She also received her Masters in Electrical Engineering from Stanford University in 2012. She is currently working towards the PhD degree in the Electrical Engineering department under the supervision of Prof. Tsachy Weissman. Her research focuses on compression of genomic data, information theory and coding.