End-to-End Resilience for HPC Applications

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 11501))

Abstract

A plethora of resilience techniques have been investigated to protect application kernels. If, however, such techniques are combined and interact across kernels, new windows of vulnerability are created. This work contributes the idea of end-to-end resilience, which protects the windows of vulnerability between kernels guarded by different resilience techniques. It introduces the live vulnerability factor (LVF), a new metric that quantifies any lack of end-to-end protection for a given data structure. The work further promotes end-to-end application protection across kernels via a pragma-based specification of diverse resilience schemes with minimal programming effort. This lifts the data-protection burden from application programmers, allowing them to focus solely on algorithms and performance while resilience is specified and subsequently embedded into the code through the compiler/library and supported by the runtime system. In experiments with case studies and benchmarks, end-to-end resilience incurs an overhead over kernel-specific resilience of less than \(3\%\) on average and increases protection against bit flips by a factor of three to four.

This work was supported in part by a subcontract from Lawrence Berkeley National Laboratory and NSF grants 1525609, 1058779, and 0958311. Three of the authors are with Lawrence Berkeley National Laboratory, which operates under Contract No. DE-AC02-05CH11231 with the U.S. Department of Energy. The U.S. Government retains, and the publisher, by accepting the article for publication, acknowledges that the U.S. Government retains, a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or to allow others to do so, for U.S. Government purposes.


Notes

  1.

    Bit flips in code (instruction bits) create unpredictable outcomes (most of the time segmentation faults or crashes but sometimes also incorrect but legal jumps) and are out of the scope of this work.

  2.

    Extra checks are added to guarantee the correctness of data stored in a safe region. A safe region is assumed to neither be subject to bit flips nor data corruption from the application viewpoint—yet, the techniques to make the region safe remain transparent to the programmer. In other words, a safe region is simply one subject to data protection/verification via checking.

  3.

    Inputs are read from disk and stored in globals or on the heap; they may be recovered by re-reading from disk. Other globals are calculated in the program and can only be recovered by re-calculation or ABFT schemes.


Author information

Correspondence to Frank Mueller.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Rezaei, A., Khetawat, H., Patil, O., Mueller, F., Hargrove, P., Roman, E. (2019). End-to-End Resilience for HPC Applications. In: Weiland, M., Juckeland, G., Trinitis, C., Sadayappan, P. (eds) High Performance Computing. ISC High Performance 2019. Lecture Notes in Computer Science(), vol 11501. Springer, Cham. https://doi.org/10.1007/978-3-030-20656-7_14

  • DOI: https://doi.org/10.1007/978-3-030-20656-7_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20655-0

  • Online ISBN: 978-3-030-20656-7

