Abstract
In its simplest form, checkpointing is the act of saving a program’s computation state in a form external to the running program, for example by writing it to a filesystem. The checkpoint files can then be used to resume computation after the original process(es) fail, ideally with minimal loss of computing work. A checkpoint can be taken using a variety of techniques at every level of the system, from special hardware/architectural checkpointing features to modification of the user’s source code. This survey discusses the various techniques used in application-level checkpointing, with special attention paid to techniques for checkpointing parallel and distributed applications.
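As an illustration of the simplest form described above, the following is a minimal sketch of application-level checkpointing in Python, not a technique taken from the survey itself. It assumes a single sequential process whose entire state fits in a small dictionary; the function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` parameter used to simulate a crash, and the choice of `pickle` as the on-disk format are all hypothetical choices made for this example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then atomically rename it over path.

    The rename means a crash during the write never leaves a half-written
    checkpoint under the real name.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint file exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None


def run(path, n_steps, every=100, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `every` steps.

    `fail_at` is a hypothetical hook that simulates a process failure
    at the given step.
    """
    # Resume from the last checkpoint if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If a run fails partway through, calling `run` again with the same path resumes from the most recent checkpoint, so only the work done since that checkpoint is repeated. Checkpointing parallel and distributed applications additionally requires the saved states of the cooperating processes to be mutually consistent, which this single-process sketch does not address.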
This research was supported in part by NSF IGERT grant 9987598 and the Institute for Scientific Computing at Wayne State University.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Walters, J.P., Chaudhary, V. (2006). Application-Level Checkpointing Techniques for Parallel Programs. In: Madria, S.K., Claypool, K.T., Kannan, R., Uppuluri, P., Gore, M.M. (eds) Distributed Computing and Internet Technology. ICDCIT 2006. Lecture Notes in Computer Science, vol 4317. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11951957_21
DOI: https://doi.org/10.1007/11951957_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68379-7
Online ISBN: 978-3-540-68380-3
eBook Packages: Computer Science (R0)