Improving Reliability of Multi-/Many-Core Processors by Using NMR-MPar Approach

Radiation Effects on Integrated Circuits and Systems for Space Applications

Abstract

The current trend in computing systems is to build solutions on multi-core and many-core processors. COTS processors are preferred because they offer high performance with low power consumption at an affordable price. Lately, these devices have been used in High Performance Computing systems owing to their massive parallelism and low power budget. Over the last decade, industrial and academic partners have worked together to overcome dependability issues so that these devices can also be used in embedded systems. Despite multiple proposals for improving multi-core reliability, their use has not yet been validated for critical tasks. This chapter describes a new fault-tolerance approach called NMR-MPar, which is based on N-Modular Redundancy and M-Partitions, to improve the reliability of applications running on these devices. An evaluation of the NMR-MPar approach on two complementary benchmark applications running on the 28 nm CMOS MPPA-256 many-core processor shows that this approach can be considered for mixed-criticality systems. Finally, this chapter analyses the overhead of the approach in terms of power consumption and energy.
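
As a rough illustration of the N-Modular Redundancy idea named in the abstract, the sketch below runs the same task N times and keeps the result returned by a strict majority of replicas. It is not the authors' implementation: the function names (`majority_vote`, `run_nmr`, `flaky_sum`) are hypothetical, the replicas run sequentially in one process, and the partition isolation that NMR-MPar relies on is represented only by an index passed to each replica.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def majority_vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of replicas, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    # A strict majority needs more than half of the replicas to agree.
    return value if count > len(results) // 2 else None


def run_nmr(task: Callable[[int], Hashable], n: int = 3) -> Optional[Hashable]:
    """Run `task` once per replica (hypothetical partition index 0..n-1), then vote."""
    return majority_vote([task(partition) for partition in range(n)])


def flaky_sum(partition: int) -> int:
    """Toy workload: partition 1 simulates a single bit flip in its result."""
    total = sum(range(10))
    return total ^ 0x4 if partition == 1 else total


if __name__ == "__main__":
    # With N = 3, one corrupted replica is outvoted by the two correct ones.
    print(run_nmr(flaky_sum, n=3))
```

With N = 3 the voter masks one faulty replica; when no value reaches a strict majority it returns `None`, which a caller could treat as a detected but uncorrected error.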

Based on “NMR-MPar: A Fault-Tolerance Approach for Multi-Core and Many-Core Processors,” by V. Vargas, P. Ramos, Jean-Francois Mehaut, and Raoul Velazco, published in Appl. Sci., 8(3), 465, 2018.

Notes

  1. In this context, the Calcul Parallèle pour Applications Critiques en Temps et Sûreté (CAPACITES) project gathers French academic and industrial partners for building a many-core processor platform for critical embedded applications, including avionics.

Acknowledgments

This work was supported in part by the Universidad de las Fuerzas Armadas ESPE and by the Secretaria de Educación Superior, Ciencia, Tecnología e Innovación del Ecuador (SENESCYT) through the grant PIC-2017-EXT-004 and STIC-AmSud (Science et Technologie de l’Information et de la Communication en Amérique du Sud), Energy-aware Scheduling and Fault Tolerance Techniques for the Exascale Era (EnergySFE) Project PIC-16-ESPE-STIC-001, and by the French authorities through the “Investissements d’Avenir” program (CAPACITES project). The authors thank Stephané Gailhard from the Société Kalray for his valuable contribution to solving the MPPA programming issues.

Author information

Corresponding author

Correspondence to Vanessa Vargas.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Vargas, V., Ramos, P., Méhaut, JF., Velazco, R. (2019). Improving Reliability of Multi-/Many-Core Processors by Using NMR-MPar Approach. In: Velazco, R., McMorrow, D., Estela, J. (eds) Radiation Effects on Integrated Circuits and Systems for Space Applications. Springer, Cham. https://doi.org/10.1007/978-3-030-04660-6_8

  • DOI: https://doi.org/10.1007/978-3-030-04660-6_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-04659-0

  • Online ISBN: 978-3-030-04660-6

  • eBook Packages: Engineering, Engineering (R0)
