Application-Level Checkpointing Techniques for Parallel Programs

  • Conference paper
Distributed Computing and Internet Technology (ICDCIT 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4317)

Abstract

In its simplest form, checkpointing is the act of saving a program’s computation state in a form external to the running program, for example to a filesystem. The checkpoint files can then be used to resume computation if the original process(es) fail, ideally with minimal loss of completed work. A checkpoint can be taken using a variety of techniques at every level of the system, from special hardware and architectural checkpointing features to modification of the user’s source code. This survey discusses the techniques used in application-level checkpointing, with special attention to techniques for checkpointing parallel and distributed applications.
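
As an illustration of the basic idea only, and not of any specific technique surveyed in the paper, the following minimal sketch shows a sequential loop that periodically saves its state to a file and resumes from that file after a failure. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` parameter used to simulate a crash, and the choice of `pickle` as the on-disk format are all assumptions made for this example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the target.

    The rename keeps a crash during the write from corrupting the
    previous checkpoint. (Hypothetical helper for this sketch.)
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, checkpoint_every=100, fail_at=None):
    """Sum 0..n_steps-1, checkpointing every `checkpoint_every` steps.

    `fail_at` is an assumption of this sketch: it raises an error at
    the given step to stand in for a process failure.
    """
    # Resume from the last checkpoint if there is one.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same path after a failure repeats only the steps taken since the last checkpoint. The application decides what goes into `state`, which is the defining trait of the application-level approach; a parallel program would additionally need to coordinate the checkpoints of its processes, which this sketch does not attempt.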

This research was supported in part by NSF IGERT grant 9987598 and the Institute for Scientific Computing at Wayne State University.

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Walters, J.P., Chaudhary, V. (2006). Application-Level Checkpointing Techniques for Parallel Programs. In: Madria, S.K., Claypool, K.T., Kannan, R., Uppuluri, P., Gore, M.M. (eds) Distributed Computing and Internet Technology. ICDCIT 2006. Lecture Notes in Computer Science, vol 4317. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11951957_21

  • DOI: https://doi.org/10.1007/11951957_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68379-7

  • Online ISBN: 978-3-540-68380-3

  • eBook Packages: Computer Science, Computer Science (R0)
