Abstract
Safety-critical embedded systems may either use specialized hardware or rely on Software-Implemented Hardware Fault Tolerance (SIHFT) to meet soft error resilience requirements. SIHFT has the advantage that it can be used with low-cost, off-the-shelf components such as standard Micro-Controller Units. To this end, SIHFT methods apply redundancy in software computation and special checker codes to detect transient errors, so-called soft errors, that corrupt either the data flow or the control flow of the software and may lead to Silent Data Corruption (SDC). So far, this has been done by applying separate SIHFT methods for data flow and control flow protection, which leads to large overheads in computation time.
In contrast, this work presents REPAIR, a method that exploits the checks of the SIHFT data flow protection to also detect control flow errors, thereby yielding higher SDC resilience with less computational overhead. The data flow protection methods duplicate the computation and place checks strategically throughout the program. These checks ensure that the two redundant computation paths, which work on two different parts of the register file, yield the same result. The REPAIR method updates the pairing between the registers used in the primary computation path and the registers used in the duplicated computation path, so that these checks also fail with high coverage when a control flow error leads to an illegal jump. Extensive RTL fault injection simulations are carried out to accurately quantify soft error resilience, evaluating MiBench programs along with an embedded case study running on an OpenRISC processor. On average, our method performs slightly better in terms of soft error resilience than the best state-of-the-art method while requiring significantly lower overheads. These results show that REPAIR is a valuable addition to the set of known SIHFT methods.
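The register pairing idea can be illustrated with a toy model. The sketch below is not the authors' implementation: the abstract does not say how pairings are chosen or updated, so the block names, the `PAIRING` table, and the helpers `update_pairing`, `checks_pass`, and `run` are all hypothetical. It only shows why a check that compares a primary register with its paired shadow register can fail when a jump skips the pairing update.

```python
from typing import Dict, List

NUM_REGS = 8

# Hypothetical pairings (primary register -> shadow register) per basic block.
# Block B expects the shadow registers in a different order than block A.
PAIRING: Dict[str, Dict[int, int]] = {
    "A": {0: 4, 1: 5},
    "B": {0: 5, 1: 4},
}


def update_pairing(regs: List[int], src: str, dst: str) -> None:
    """Move shadow copies from the pairing of block src to that of block dst."""
    old = {primary: regs[shadow] for primary, shadow in PAIRING[src].items()}
    for primary, shadow in PAIRING[dst].items():
        regs[shadow] = old[primary]


def checks_pass(regs: List[int], block: str) -> bool:
    """True if every primary register equals its shadow under the block's pairing."""
    return all(regs[p] == regs[s] for p, s in PAIRING[block].items())


def run(illegal_jump: bool) -> bool:
    """Execute blocks A and B; return whether the checks at the end of B pass."""
    regs = [0] * NUM_REGS

    # Block A: duplicated computation under A's pairing (r0 <-> r4, r1 <-> r5).
    regs[0], regs[4] = 12, 12
    regs[1], regs[5] = 3, 3

    # Legal edge A -> B performs the pairing update. An illegal jump into B
    # (assumed here to skip the update) leaves the shadows in A's layout.
    if not illegal_jump:
        update_pairing(regs, "A", "B")

    # Block B: duplicated subtraction, with shadow operands taken from B's pairing.
    s0, s1 = PAIRING["B"][0], PAIRING["B"][1]
    regs[0] = regs[0] - regs[1]
    regs[s0] = regs[s0] - regs[s1]

    return checks_pass(regs, "B")
```

In this toy setup, `run(False)` passes the checks, while `run(True)` fails them because block B reads its shadow operands from registers that still hold block A's layout. A stale shadow value could match by coincidence in other cases, which is consistent with the abstract claiming high coverage rather than complete coverage.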
Index Terms
- REPAIR: Control Flow Protection based on Register Pairing Updates for SW-Implemented HW Fault Tolerance