End-to-End Resilience for HPC Applications

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 11501))

Abstract

A plethora of resilience techniques have been investigated to protect application kernels. If, however, such techniques are combined and interact across kernels, new windows of vulnerability are created. This work contributes the idea of end-to-end resilience, which protects the windows of vulnerability between kernels guarded by different resilience techniques. It introduces the live vulnerability factor (LVF), a new metric that quantifies any lack of end-to-end protection for a given data structure. The work further promotes end-to-end application protection across kernels via a pragma-based specification of diverse resilience schemes with minimal programming effort. This lifts the data-protection burden from application programmers, allowing them to focus solely on algorithms and performance while resilience is specified and subsequently embedded into the code through the compiler/library and supported by the runtime system. In experiments with case studies and benchmarks, end-to-end resilience incurs an overhead over kernel-specific resilience of less than \(3\%\) on average and increases protection against bit flips by a factor of three to four.

This work was supported in part by a subcontract from Lawrence Berkeley National Laboratory and NSF grants 1525609, 1058779, and 0958311. Three of the authors are with Lawrence Berkeley National Laboratory, which operates under Contract No. DE-AC02-05CH11231 with the U.S. Department of Energy. The U.S. Government retains, and the publisher, by accepting the article for publication, acknowledges that the U.S. Government retains, a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or to allow others to do so, for U.S. Government purposes.


Notes

  1.

    Bit flips in code (instruction bits) create unpredictable outcomes (most of the time segmentation faults or crashes but sometimes also incorrect but legal jumps) and are out of the scope of this work.

  2.

    Extra checks are added to guarantee the correctness of data stored in a safe region. A safe region is assumed to neither be subject to bit flips nor data corruption from the application viewpoint—yet, the techniques to make the region safe remain transparent to the programmer. In other words, a safe region is simply one subject to data protection/verification via checking.

  3.

    Inputs are read from disk and stored in globals or on the heap; they may be recovered by re-reading from disk. Other globals are calculated in the program and can only be recovered by re-calculation or ABFT schemes.


Author information

Correspondence to Frank Mueller.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Rezaei, A., Khetawat, H., Patil, O., Mueller, F., Hargrove, P., Roman, E. (2019). End-to-End Resilience for HPC Applications. In: Weiland, M., Juckeland, G., Trinitis, C., Sadayappan, P. (eds) High Performance Computing. ISC High Performance 2019. Lecture Notes in Computer Science(), vol 11501. Springer, Cham. https://doi.org/10.1007/978-3-030-20656-7_14

  • DOI: https://doi.org/10.1007/978-3-030-20656-7_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20655-0

  • Online ISBN: 978-3-030-20656-7

