Abstract
We design and implement ChaRM64, a high-availability parallel run-time system that provides checkpoint-based rollback recovery and migration for parallel programs running on a cluster of IA-64 computers. We first describe our user-level, single-process checkpoint/recovery library for IA-64 systems. ChaRM64 is built on this library and implements user-transparent coordinated checkpointing and rollback recovery (CRR), quasi-asynchronous migration, and dynamic reconfiguration. Combined with efficient error detection, these techniques allow ChaRM64 to handle node crashes and transient hardware faults in an IA-64 cluster. ChaRM64 for PVM has been implemented on Linux, and an MPI version is under construction. To our knowledge, few similar projects have been completed for the IA-64 architecture.
Supported by the High Technology and Development Program of China (No. 2002AA1Z2103).
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, Y., Wang, D., Zheng, W. (2004). Parallel Checkpoint/Recovery on Cluster of IA-64 Computers. In: Cao, J., Yang, L.T., Guo, M., Lau, F. (eds) Parallel and Distributed Processing and Applications. ISPA 2004. Lecture Notes in Computer Science, vol 3358. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30566-8_26
DOI: https://doi.org/10.1007/978-3-540-30566-8_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24128-7
Online ISBN: 978-3-540-30566-8
eBook Packages: Computer Science (R0)