A technique for non-invasive application-level checkpointing

The Journal of Supercomputing

Abstract

Checkpointing is one of the key elements required for writing self-healing applications for distributed and dynamic computing environments. It makes an application resilient to failures by periodically saving its state to disk. The main goal of this research is to enable non-invasive reengineering of existing applications so that an Application-Level Checkpointing (ALC) mechanism can be inserted into them. The Domain-Specific Language (DSL) developed in this research serves this purpose: end-users write their ALC specifications in it, and these specifications are then used to generate the actual checkpointing code and insert it into the existing application. The performance of an application containing the generated checkpointing code is comparable to that of the same application with manually inserted checkpointing code. With slight modifications, the DSL can be used to specify the ALC mechanism for several base languages (e.g., C/C++, Java, and FORTRAN).
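
To make the idea of application-level checkpointing concrete, the following is a minimal hand-written sketch in Python. It is not the DSL described in the article, nor the code that the DSL generates; it only illustrates the kind of save-and-resume logic that would otherwise be inserted manually. All names (`save_checkpoint`, `load_checkpoint`, `run`, the `fail_at` parameter, and the JSON file format) are assumptions made for this illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then atomically rename it over path.

    The rename ensures a crash during the write never leaves a
    half-written checkpoint in place of the previous good one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical knob that simulates a crash at that step.
    """
    # Resume from the last checkpoint if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    try:
        run(ckpt, total_steps=10, interval=5, fail_at=7)
    except RuntimeError:
        pass  # the first run "crashes" after the checkpoint at step 5
    # The second run resumes from step 5 instead of starting over.
    print(run(ckpt, total_steps=10, interval=5))
```

In this sketch the checkpointing calls are interleaved with the computation inside `run`. That interleaving is the invasive part that a specification-driven, generative approach is meant to avoid.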

Author information

Corresponding author

Correspondence to Ritu Arora.

About this article

Cite this article

Arora, R., Bangalore, P. & Mernik, M. A technique for non-invasive application-level checkpointing. J Supercomput 57, 227–255 (2011). https://doi.org/10.1007/s11227-010-0383-5

  • DOI: https://doi.org/10.1007/s11227-010-0383-5
