Abstract
The current trend in computing systems is to build solutions on multi-core and many-core processors. Commercial off-the-shelf (COTS) processors are preferred because they offer high performance and low power consumption at an affordable price. These devices have lately been adopted in High Performance Computing systems owing to their massive parallelism and low power budget. Over the last decade, industrial and academic partners have worked together to overcome dependability issues so that these devices can also be used in embedded systems. Despite multiple proposals for improving multi-core reliability, their use has not yet been validated for critical tasks. This chapter describes a fault-tolerance approach called NMR-MPar, based on N-Modular Redundancy and M-Partitions, that improves the reliability of applications running on these devices. An evaluation of NMR-MPar on two complementary benchmark applications running on the 28 nm CMOS MPPA-256 many-core processor shows that the approach can be considered for mixed-criticality systems. Finally, the chapter analyses the overhead of the approach in terms of power consumption and energy.
Based on “NMR-MPar: A Fault-Tolerance Approach for Multi-Core and Many-Core Processors,” by V. Vargas, P. Ramos, Jean-Francois Mehaut, and Raoul Velazco, published in Appl. Sci., 8(3), 465, 2018.
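The abstract names N-Modular Redundancy without describing how replica outputs are reconciled. As a generic illustration only, and not the chapter's implementation, the following sketch assumes each of the N replicas (for instance, one per partition) returns a hashable result and that a strict majority decides the output. The function name `majority_vote` and the return convention are hypothetical.

```python
from collections import Counter


def majority_vote(results):
    """Generic N-modular-redundancy voter (illustrative assumption).

    results: list of hashable outputs, one per replica.
    Returns (value, True) when more than half of the replicas agree,
    otherwise (None, False) to signal an uncorrectable disagreement.
    """
    if not results:
        raise ValueError("at least one replica result is required")
    # Most frequent output and the number of replicas that produced it.
    value, count = Counter(results).most_common(1)[0]
    if count > len(results) // 2:
        return value, True
    return None, False


# Example: three replicas, one corrupted by a transient fault.
winner, agreed = majority_vote([42, 42, 7])
```

With three replicas, a single faulty result is masked; with an even split or all replicas disagreeing, the voter reports failure so that a higher-level recovery action can be taken.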
Notes
1. In this context, the Calcul Parallèle pour Applications Critiques en Temps et Sûreté (CAPACITES) project gathers French academic and industrial partners for building a many-core processor platform for critical embedded applications, including avionics.
Acknowledgments
This work was supported in part by the Universidad de las Fuerzas Armadas ESPE and by the Secretaria de Educación Superior, Ciencia, Tecnología e Innovación del Ecuador (SENESCYT) through grant PIC-2017-EXT-004 and the STIC-AmSud (Science et Technologie de l’Information et de la Communication en Amérique du Sud) project Energy-aware Scheduling and Fault Tolerance Techniques for the Exascale Era (EnergySFE), PIC-16-ESPE-STIC-001, and by the French authorities through the “Investissements d’Avenir” program (CAPACITES project). The authors thank Stephané Gailhard from the Société Kalray for his valuable contribution to solving the MPPA programming issues.
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Vargas, V., Ramos, P., Méhaut, JF., Velazco, R. (2019). Improving Reliability of Multi-/Many-Core Processors by Using NMR-MPar Approach. In: Velazco, R., McMorrow, D., Estela, J. (eds) Radiation Effects on Integrated Circuits and Systems for Space Applications. Springer, Cham. https://doi.org/10.1007/978-3-030-04660-6_8
DOI: https://doi.org/10.1007/978-3-030-04660-6_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-04659-0
Online ISBN: 978-3-030-04660-6
eBook Packages: Engineering (R0)