ABSTRACT
Silent Data Corruption (SDC) is a serious reliability issue in many domains, including embedded systems. However, current protection techniques are brittle and do not allow programmers to trade off performance for SDC coverage. Further, many of them require tens of thousands of fault injection experiments, which are highly time-intensive. In this paper, we propose SDCTune, an empirical model that predicts the SDC proneness of a program's data. SDCTune is based solely on static and dynamic features of the program and does not require fault injections to be performed. We then develop an algorithm that uses SDCTune to selectively protect the most SDC-prone data in the program, subject to a given performance overhead bound. Our results show that our technique is highly accurate at predicting the relative SDC rate of an application and outperforms full duplication by a factor of 0.83x to 1.87x in efficiency of detection (i.e., the ratio of SDC coverage provided to performance overhead).
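As a rough illustration of selecting data to protect under an overhead bound, the sketch below uses a simple greedy heuristic. It is not the algorithm from the paper, whose details the abstract does not give. The names `Candidate`, `sdc_score`, `overhead`, `select_for_protection`, and `detection_efficiency` are hypothetical, and the scores stand in for whatever a model such as SDCTune would predict. Only the efficiency ratio (SDC coverage divided by performance overhead) is taken directly from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A protectable program element (hypothetical representation)."""
    name: str
    sdc_score: float  # assumed: predicted share of SDCs attributable to this element
    overhead: float   # assumed: estimated slowdown fraction if it is protected


def select_for_protection(candidates, overhead_bound):
    """Greedily pick candidates by predicted SDC score per unit of overhead.

    This is an illustrative heuristic, not the paper's selection algorithm.
    Returns the chosen candidates and the total overhead spent.
    """
    for c in candidates:
        if c.overhead <= 0:
            raise ValueError(f"overhead must be positive: {c.name}")

    # Highest predicted benefit per unit cost first.
    ranked = sorted(candidates, key=lambda c: c.sdc_score / c.overhead, reverse=True)

    chosen = []
    spent = 0.0
    for c in ranked:
        # Skip anything that would push total overhead past the bound.
        if spent + c.overhead <= overhead_bound:
            chosen.append(c)
            spent += c.overhead
    return chosen, spent


def detection_efficiency(sdc_coverage, performance_overhead):
    """Efficiency of detection: SDC coverage divided by performance overhead."""
    return sdc_coverage / performance_overhead


if __name__ == "__main__":
    # Made-up numbers purely for demonstration.
    pool = [
        Candidate("a", sdc_score=0.50, overhead=0.10),
        Candidate("b", sdc_score=0.30, overhead=0.20),
        Candidate("c", sdc_score=0.15, overhead=0.05),
    ]
    chosen, spent = select_for_protection(pool, overhead_bound=0.20)
    coverage = sum(c.sdc_score for c in chosen)
    print([c.name for c in chosen], round(spent, 2))
    print(round(detection_efficiency(coverage, spent), 2))
```

Ranking by score per unit of overhead is one plausible way to spend a fixed budget. A real implementation would also have to account for dependencies between protected elements, which this sketch ignores.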
Index Terms
- SDCTune: a model for predicting the SDC proneness of an application for configurable protection