Evaluations of domino-free communication-induced checkpointing protocols

https://doi.org/10.1016/S0020-0190(98)00183-5

Abstract

We give a detailed analysis of communication-induced checkpointing protocols that are free of the domino effect. We investigate the validity of a common intuition in the literature and demonstrate that no on-line domino-free protocol is optimal in terms of the number of forced checkpoints. We also give formal proofs comparing existing protocols in the literature.


Cited by (23)

  • An efficient validation approach for quasi-synchronous checkpointing oriented to distributed diagnosability

    2016, Journal of Systems and Software
    Citation Excerpt:

    Finding a method to construct a consistent snapshot in a ZCF system has been an open problem. The impossibility of designing an optimal ZCF quasi-synchronous checkpointing algorithm has been treated by Tsai et al. (1998). Recently, some algorithms which are ZCF have been developed, for example, the Fully Informed (FI) algorithm of Helary et al. (2000), the Fully Informed aNd Efficient (FINE) algorithm of Luo and Manivannan (2009), the Delayed Communication-Induced Checkpointing (DCFI) algorithm (Simon et al., 2013a) and the Scalable Fully Informed (SF-I) algorithm (Simon et al., 2013b) of Calixto et al.

  • Theoretical and experimental evaluation of communication-induced checkpointing protocols in F_E and F_Lazy-E families

    2011, Performance Evaluation
    Citation Excerpt:

    A consistent global checkpoint (also called a recovery line) is a state to which the whole system can roll back in case of a failure. “the HMNR1 protocol must force at least one checkpoint between any two consecutive forced checkpoints taken by protocol CP in a process” [25]. The HMNR2 protocol must force at least one checkpoint between any two consecutive forced checkpoints taken by the FINE protocol in a process.

  • Quantifying rollback propagation in distributed checkpointing

    2004, Journal of Parallel and Distributed Computing
  • Interval consistency of asynchronous distributed computations

    2002, Journal of Computer and System Sciences

Tsai and Kuo's work is supported by the National Science Council, Taiwan, ROC, under Grant NSC 87-2213-E-259-007.
