Abstract
As high-performance platforms (clusters, grids, etc.) continue to grow in size, the mean time between failures (MTBF) decreases to a critical level. An efficient and reliable fault tolerance protocol therefore plays a key role in High Performance Computing. Rollback recovery is the most common fault tolerance technique used in High Performance Computing, especially in MPI applications. This technique relies on the reliability of the checkpoint storage: most rollback recovery protocols assume that the checkpoint server machines are reliable. However, in a grid environment any unit can fail at any moment, including the components used to connect different administrative domains. Such a failure leads to the loss of a whole set of machines, including the more reliable machines used to store the checkpoints in that administrative domain. It is thus not safe to rely on the high MTBF of specific machines to store the checkpoint images.
This paper introduces a new protocol that ensures the reliability of checkpoint storage even if one or more checkpoint servers fail. To provide this reliability, the protocol is based on a replication process. We evaluate our solution through simulations against several criteria: scalability, topology, and reliability of the nodes. We also compare two replication strategies to decide which one should be used in the implementation.
Copyright information
© 2008 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Bouabache, F., Herault, T., Fedak, G., Cappello, F. (2008). A Distributed and Replicated Service for Checkpoint Storage. In: Making Grids Work. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-78448-9_24
DOI: https://doi.org/10.1007/978-0-387-78448-9_24
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-78447-2
Online ISBN: 978-0-387-78448-9
eBook Packages: Computer Science, Computer Science (R0)