2013 Poster Sessions : ERSA: Error Resilient System Architecture for Probabilistic Applications

Student Name: Hyungmin Cho
Advisor: Subhasish Mitra
Research Areas: Computer Systems
Abstract:
There is growing concern about the increasing vulnerability of future computing systems to errors in the underlying hardware. Traditional redundancy techniques are too expensive for designing energy-efficient systems that are resilient to high error rates. We present Error Resilient System Architecture (ERSA), a robust system architecture that targets emerging killer applications with inherent error resilience, such as recognition, mining, and synthesis (RMS), and ensures a high degree of resilience at low cost. Using the concept of configurable reliability, ERSA may also be adapted for general-purpose applications that are less resilient to errors (but at higher cost). While the resilience of RMS applications to errors in low-order bits of data is well known, executing such applications on error-prone hardware significantly degrades output quality (due to high-order bit errors and crashes). ERSA achieves high resilience to high-order bit errors and control flow errors (in addition to low-order bit errors) using a judicious combination of the following key ideas: 1) asymmetric reliability in many-core architectures; 2) error resilient algorithms at the core of probabilistic applications; and 3) intelligent software optimizations. Error injection experiments on a multicore ERSA hardware prototype demonstrate that, even at very high error rates, ERSA maintains 90% or better accuracy of output results, with minimal impact on execution time, for probabilistic applications such as K-Means clustering, LDPC decoding, and Bayesian network inference.
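
The contrast the abstract draws between low-order and high-order bit errors can be illustrated with a small sketch. It is not the error injection method used on the ERSA prototype. It assumes data stored as IEEE 754 single-precision values, and the helper names `flip_bit` and `relative_error` are hypothetical, introduced only for this illustration.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of the IEEE 754 single-precision encoding of `value`.

    Bit 0 is the least significant mantissa bit, bits 23-30 hold the
    exponent, and bit 31 is the sign. (Illustrative helper, not from ERSA.)
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range [0, 32)")
    # Reinterpret the float as a 32-bit unsigned integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # Flip the requested bit and reinterpret the result as a float.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def relative_error(value: float, bit: int) -> float:
    """Relative change in a nonzero `value` caused by flipping `bit`."""
    return abs(flip_bit(value, bit) - value) / abs(value)


if __name__ == "__main__":
    sample = 0.75  # exactly representable in single precision
    for bit in (0, 10, 23, 30):
        print(f"bit {bit:2d}: relative error = {relative_error(sample, bit):.3e}")
```

In this sketch, a flip in a low-order mantissa bit changes the value by a tiny fraction, which a clustering or inference algorithm can plausibly absorb, while a flip in an exponent bit changes it by orders of magnitude. That gap is one way to see why high-order bit errors need separate handling.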

Bio:
Hyungmin Cho received the B.S. degree in computer science and engineering from Seoul National University, Seoul, Korea, in 2005, and the M.S. degree in electrical engineering from Stanford University, Stanford, CA, in 2010. He is currently pursuing the Ph.D. degree in electrical engineering at Stanford University. He was a Technical Intern with Samsung Data Systems, Seoul, Korea, in 2003, with NEC Laboratories America, Princeton, NJ, in 2009, and with Texas Instruments, Dallas, TX, in 2011. His current research interests include reliable computer architecture and computing models for robust systems.