2014 Poster Sessions : SociaLite: High-level Query Language for Big Data Analysis

Student Name : Jiwon Seo
Advisor : Monica Lam
Research Areas: Computer Systems
Abstract:
SociaLite is a high-level language for big data analysis. It keeps big data analysis simple while its compiler optimizations deliver fast performance, often more than three orders of magnitude faster than Hadoop MapReduce programs. For example, the PageRank algorithm can be implemented in just two lines of SociaLite, and the resulting query runs nearly as fast as an optimal C implementation.
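For readers unfamiliar with the computation mentioned above, the following is a minimal sketch of PageRank by power iteration in plain Python. It is not SociaLite code and not the authors' implementation; the function name `pagerank`, the damping factor of 0.85, the iteration count, and the toy graph are all assumptions chosen for illustration. The sketch also assumes every node has at least one outgoing edge.

```python
def pagerank(edges, num_nodes, damping=0.85, iterations=30):
    """Power-iteration PageRank over a list of (src, dst) edges.

    Assumes every node has at least one outgoing edge (no dangling nodes).
    """
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1

    # Start from a uniform distribution over the nodes.
    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        # Every node begins each round with its teleport share.
        new_rank = [(1.0 - damping) / num_nodes] * num_nodes
        # Each edge passes an equal share of the source's rank to its target.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Hypothetical three-node graph used only for illustration.
edges = [(0, 1), (1, 2), (2, 0), (2, 1)]
ranks = pagerank(edges, num_nodes=3)
```

The inner loop is a join between the edge table and the current rank table followed by a sum grouped by target node, which is the kind of computation a declarative query can plausibly state in a line or two.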

High-level abstractions in SociaLite simplify the implementation of distributed data analysis algorithms. For example, its distributed in-memory tables allow large data sets to be stored across multiple machines, and with minimal user annotations, fast distributed join operations can be performed on them. SociaLite also integrates with Python in two directions: embedding, which allows SociaLite queries to be used directly in Python code, and extending, which allows Python functions to be called from SociaLite queries. This integration makes it easy to implement data mining algorithms such as PageRank, k-means, and logistic regression in SociaLite and Python.
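The idea of sharded tables with cheap distributed joins can be sketched in plain Python as follows. This is an illustration under stated assumptions, not SociaLite's actual mechanism: it assumes the user annotation amounts to naming the column a table is partitioned on, it simulates machines as in-process lists, and the names `NUM_SHARDS`, `shard_of`, `partition`, and `co_partitioned_join` are hypothetical.

```python
NUM_SHARDS = 4  # hypothetical number of machines, simulated in one process


def shard_of(key):
    # Integer keys hash to themselves in CPython, so placement is deterministic.
    return hash(key) % NUM_SHARDS


def partition(rows, key_index):
    """Split rows into shards according to the column at key_index."""
    shards = [[] for _ in range(NUM_SHARDS)]
    for row in rows:
        shards[shard_of(row[key_index])].append(row)
    return shards


def co_partitioned_join(left_shards, right_shards):
    """Join (key, a) rows with (key, b) rows one shard pair at a time."""
    results = []
    for left, right in zip(left_shards, right_shards):
        # Build a local hash index on the right side of this shard.
        index = {}
        for key, value in right:
            index.setdefault(key, []).append(value)
        # Probe it with the left rows; no other shard is consulted.
        for key, a in left:
            for b in index.get(key, []):
                results.append((key, a, b))
    return results


# Hypothetical data: an edge table and a per-node rank table.
edges = [(1, 2), (1, 3), (2, 3), (5, 1)]           # (src, dst)
ranks = [(1, 0.4), (2, 0.3), (3, 0.2), (5, 0.1)]   # (node, rank)
joined = co_partitioned_join(partition(edges, 0), partition(ranks, 0))
```

Because both tables are partitioned on the join key, matching rows always land in the same shard, so each shard can be joined locally without moving data between machines.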

SociaLite's high-level queries achieve fast performance through compiler optimizations. The queries are compiled to Java bytecode, with optimizations such as prioritization and pipelined evaluation applied. In addition, the runtime system masks network latency with smart task scheduling and uses an optimized memory allocator to reduce both memory allocation time and garbage collection (GC) time. Together, the compiler optimizations and the runtime system yield performance that is often close to that of optimal C implementations.
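The abstract does not define prioritization, so the following plain Python sketch shows only one plausible reading: when a recursive query computes a minimum, such as shortest-path distance, processing candidate facts in order of increasing value lets each result be finalized the first time it is reached. The function name `shortest_paths` and the toy graph are assumptions, and the code is the familiar Dijkstra-style evaluation rather than SociaLite's compiled output.

```python
import heapq


def shortest_paths(edges, source):
    """Single-source shortest paths; edges are (src, dst, weight), weight >= 0."""
    adjacency = {}
    for src, dst, weight in edges:
        adjacency.setdefault(src, []).append((dst, weight))

    best = {}                  # finalized distance for each reached node
    frontier = [(0, source)]   # candidate facts, ordered by distance
    while frontier:
        dist, node = heapq.heappop(frontier)
        if node in best:
            continue  # a distance at least as short was already finalized
        best[node] = dist
        for nxt, weight in adjacency.get(node, []):
            if nxt not in best:
                heapq.heappush(frontier, (dist + weight, nxt))
    return best


# Hypothetical weighted graph used only for illustration.
edges = [("a", "b", 4), ("a", "c", 1), ("c", "b", 2), ("b", "d", 5)]
distances = shortest_paths(edges, "a")
```

An unprioritized evaluation would first derive the distance 4 for node b via the direct edge and propagate it to d before the shorter path through c corrected both; ordering the work by distance avoids that wasted derivation.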

Bio
Jiwon Seo is a PhD student at Stanford working with Professor Monica Lam. He is interested in distributed systems, big data mining, and graph analysis. He has been working on SociaLite, a distributed query language for data analysis. SociaLite eases distributed programming with its high-level queries, which achieve fast performance through compiler optimizations.