Lightweight checkpointing for concurrent ML

LUKASZ ZIAREK; SURESH JAGANNATHAN

doi:10.1017/S0956796810000067

Part of: JFP Research Articles

Published online by Cambridge University Press: 19 March 2010

Affiliation (both authors): Department of Computer Science, Purdue University, 305 N. University Street, West Lafayette, IN 47907-2107, USA (e-mail: lziarek@cs.purdue.edu, suresh@cs.purdue.edu)

Abstract

Transient faults that arise in large-scale software systems can often be repaired by reexecuting the code in which they occur. Ascribing a meaningful semantics to safe reexecution in multithreaded code is not obvious, however. For a thread to correctly reexecute a region of code, it must ensure that all other threads that have witnessed its unwanted effects within that region are also reverted to a meaningful earlier state. If this is not done properly, data inconsistencies and other undesirable behavior might result. However, automatically determining what constitutes a consistent global checkpoint is not straightforward because thread interactions are a dynamic property of the program. In this paper, we present a safe and efficient checkpointing mechanism for Concurrent ML (CML) that can be used to recover from transient faults. We introduce a new linguistic abstraction, called stabilizers, that permits the specification of per-thread monitors and the restoration of globally consistent checkpoints. Safe global states are computed through lightweight monitoring of communication events among threads (e.g., message-passing operations or updates to shared variables). We present a formal characterization of the design of stabilizers and provide a detailed description of their implementation within MLton, a whole-program optimizing compiler for Standard ML. Our experimental results on microbenchmarks as well as several realistic, multithreaded, server-style CML applications, including a web server and a windowing toolkit, show that the overheads of using stabilizers are small, and lead us to conclude that they are a viable mechanism for defining safe checkpoints in concurrent functional programs.
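The idea of computing a consistent global checkpoint from monitored communication events can be illustrated with a small toy model. The sketch below is not the stabilizer API or the algorithm from the paper, which targets CML inside MLton; it is a simplified Python illustration under stated assumptions: communication is a synchronous rendezvous between two threads, events are ordered by a single logical clock, and the names `Monitor`, `checkpoint`, `communicate`, and `rollback_plan` are hypothetical. When one thread reverts to its latest checkpoint, every thread that communicated with it after that checkpoint must revert to a checkpoint taken before the communication, and so on transitively.

```python
from collections import defaultdict


class Monitor:
    """Toy log of per-thread checkpoints and synchronous communication events.

    Hypothetical illustration only; not the paper's stabilizer implementation.
    """

    def __init__(self):
        self.clock = 0                        # single logical clock (assumption)
        self.checkpoints = defaultdict(list)  # thread -> checkpoint times, ascending
        self.comms = []                       # (time, thread_a, thread_b)

    def _tick(self):
        self.clock += 1
        return self.clock

    def checkpoint(self, thread):
        """Record a checkpoint for `thread` and return its logical time."""
        t = self._tick()
        self.checkpoints[thread].append(t)
        return t

    def communicate(self, a, b):
        """Record a rendezvous between threads `a` and `b`."""
        self.comms.append((self._tick(), a, b))

    def rollback_plan(self, thread):
        """Return {thread: checkpoint_time} describing a consistent rollback.

        Time 0 stands for "restart the thread from its beginning".
        """
        plan = {thread: (self.checkpoints[thread] or [0])[-1]}
        changed = True
        while changed:  # iterate to a fixed point; targets only ever decrease
            changed = False
            for time, a, b in self.comms:
                for src, dst in ((a, b), (b, a)):
                    if src in plan and time > plan[src]:
                        # src undoes this rendezvous, so dst must restore
                        # a checkpoint taken before it happened.
                        earlier = [c for c in self.checkpoints[dst] if c < time]
                        target = max(earlier) if earlier else 0
                        if dst not in plan or target < plan[dst]:
                            plan[dst] = target
                            changed = True
        return plan


m = Monitor()
m.checkpoint("A")        # time 1
m.checkpoint("B")        # time 2
m.communicate("A", "B")  # time 3
m.checkpoint("B")        # time 4
m.communicate("B", "C")  # time 5
m.checkpoint("C")        # time 6

print(m.rollback_plan("A"))
print(m.rollback_plan("C"))
```

In this example, reverting A drags B back to its checkpoint before the A/B rendezvous, which in turn undoes the later B/C rendezvous and forces C to restart. Reverting C alone affects no other thread, because its checkpoint was taken after its only communication.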

Type: Articles
Information: Journal of Functional Programming, Volume 20, Issue 2, March 2010, pp. 137–173

DOI: https://doi.org/10.1017/S0956796810000067
Copyright: Copyright © Cambridge University Press 2010
