2016 Poster Sessions : Fast Retrieval from Compressed Collections of Genomic Variants

Student Name : Mikel Hernaez, Idoia Ochoa, Kedar Tatwawadi
Advisor : Tsachy Weissman
Research Areas: Information Systems
The dramatic decrease in the cost of sequencing has resulted in the
generation of huge amounts of genomic data, as evidenced by projects such as the UK10K and the Million Veteran Project (MVP), with the numbers of sequenced genomes ranging in order from 10K to 1M. Due to the large redundancies among genomic sequences of individuals from the same species, most of the medical research deals with the variants in the sequences as compared with a reference sequence, rather than with the complete genomic sequences. Consequently, millions of genomes rep-resented as variants are stored in databases which are constantly updated and queried to extract information such as the common variants among individuals or groups of individuals. Previous algorithms for compression of this type of databases lack ecient random access capabilities, rendering querying the database for particular variants and/or individuals extremely inecient, to the point where compression is often relinquished altogether.

We present a new algorithm for this task that achieves signi cant compression while allowing fast random querying from the compressed database. We make use of and adapt techniques from information theory and compression, such as a specialized Lempel-Ziv compressor and tailored succinct data structures.