Abstract
Checkpointing is one of the key elements required for writing self-healing applications for distributed and dynamic computing environments. It is a mechanism that makes an application resilient to failures by periodically storing its state to disk. The main goal of this research is to enable non-invasive reengineering of existing applications to insert an Application-Level Checkpointing (ALC) mechanism. The Domain-Specific Language (DSL) developed in this research serves this purpose: it is used to obtain ALC specifications from end users. These specifications are then used to generate the actual checkpointing code and insert it into the existing application. The performance of an application with the generated checkpointing code is comparable to that of the same application with manually inserted checkpointing code. With slight modifications, the DSL can be used to specify the ALC mechanism in several base languages (e.g., C/C++, Java, and FORTRAN).
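To make the idea of application-level checkpointing concrete, the following is a minimal sketch in Python. It is not the DSL described in the abstract, nor the code that DSL generates. It only illustrates the general pattern of periodically saving application state to disk and resuming from it after a failure. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the `fail_at` parameter used to simulate a crash are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to disk atomically: write a temp file, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the file in one step, so a crash during the write
    # leaves the previous checkpoint intact.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical knob that simulates a crash at that step.
    """
    # Resume from the last checkpoint if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

In this sketch, a run that fails partway through loses only the work done since the last checkpoint. Calling `run` again with the same path picks up from the saved step rather than from zero. Note that the checkpointing calls here are written by hand inside the application loop, which is the kind of manual insertion the abstract aims to avoid.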
Cite this article
Arora, R., Bangalore, P. & Mernik, M. A technique for non-invasive application-level checkpointing. J Supercomput 57, 227–255 (2011). https://doi.org/10.1007/s11227-010-0383-5
DOI: https://doi.org/10.1007/s11227-010-0383-5