Evaluations of domino-free communication-induced checkpointing protocols

doi:10.1016/S0020-0190(98)00183-5

Information Processing Letters

Volume 69, Issue 1, 15 January 1999, Pages 31-37

https://doi.org/10.1016/S0020-0190(98)00183-5

Abstract

We give a detailed analysis of communication-induced checkpointing protocols that are free of the domino effect. We investigate the validity of a common intuition in the literature and demonstrate that there is no optimal on-line domino-free protocol in terms of the number of forced checkpoints. We also give formal proofs comparing existing protocols from the literature.
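To make the notion of a forced checkpoint concrete, the following is a minimal sketch of an index-based communication-induced checkpointing rule, loosely modelled on the classic scheme in which each message carries the sender's checkpoint index. It is an illustration under stated assumptions, not one of the protocols analysed in the article: the `Process` class and its method names are hypothetical, and message delivery is reduced to a function call.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """One process running a hypothetical index-based CIC rule."""
    name: str
    index: int = 0                      # index of the latest local checkpoint
    checkpoints: list = field(default_factory=list)
    forced: int = 0                     # number of checkpoints the rule forced

    def basic_checkpoint(self):
        # Checkpoint taken on the application's own initiative.
        self.index += 1
        self.checkpoints.append(("basic", self.index))

    def send(self):
        # No control message is sent: the index is piggybacked
        # on the application message.
        return self.index

    def receive(self, piggybacked_index):
        # Force a checkpoint before delivery if the sender's index is ahead.
        if piggybacked_index > self.index:
            self.index = piggybacked_index
            self.checkpoints.append(("forced", self.index))
            self.forced += 1
        # The message would be delivered to the application here.


p, q = Process("p"), Process("q")
p.basic_checkpoint()        # p's index becomes 1
q.receive(p.send())         # q is behind, so it takes a forced checkpoint
q.receive(p.send())         # indices now match, no further checkpoint
print(q.forced, q.checkpoints)
```

Counting `forced` over a run is the kind of measure the abstract refers to when it compares protocols by their number of forced checkpoints.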

References (11)

Y.M. Wang et al., Consistent global checkpoints based on direct dependency tracking, Inform. Process. Lett. (1994)
D. Briatico et al., A distributed domino effect free recovery algorithm
K.M. Chandy et al., Distributed snapshots: determining global states of distributed systems, ACM Trans. Comput. Syst. (1985)
E.N. Elnozahy et al., A survey of rollback-recovery protocols in message-passing systems
J.M. Helary et al., Communication-based prevention of useless checkpoints in distributed computations

There are more references available in the full text version of this article.

Cited by (23)

An efficient validation approach for quasi-synchronous checkpointing oriented to distributed diagnosability
2016, Journal of Systems and Software
Citation Excerpt:
Finding a method to construct a consistent snapshot in a ZCF system has been an open problem. The impossibility of designing an optimal ZCF quasi-synchronous checkpointing algorithm has been treated by Tsai et al. (1998). Recently, some algorithms which are ZCF have been developed, for example, the Fully Informed (FI) algorithm of Helary et al. (2000), the Fully Informed aNd Efficient (FINE) algorithm of Luo and Manivannan (2009), the Delayed Communication-Induced Checkpointing (DCFI) algorithm (Simon et al., 2013a) and the Scalable Fully Informed (SF-I) algorithm (Simon et al., 2013b) of Calixto et al.
The autonomic computing paradigm is oriented towards enabling complex distributed systems to manage themselves, even in faulty situations. The diagnosability analysis is a priori a study through which a system can be self-aware about its current state. It is from the determination of a consistent state that a system can take some action to repair or reconfigure itself. Nevertheless, in a distributed system it is hard to determine consistent states since we cannot observe simultaneously all the local variables of different processes. In this context, the challenge is to efficiently monitor the system execution over time to capture trace information in order to determine if the system accomplishes both functional and non-functional requirements. Quasi-synchronous checkpointing is a technique that collects information from which a system can establish consistent snapshots. Based on this technique, several checkpointing algorithms have been developed. According to the checkpoint properties detected and ensured, they are classified into: Strictly Z-Path Free (SZPF), Z-Path Free (ZPF) and Z-Cycle Free (ZCF). Generally, the method adopted for the performance evaluation of checkpointing algorithms involves simulation. However, few works have been designed to validate their correctness. In this paper, we propose an efficient validation approach based on a graph transformation oriented towards the automatic detection of the previously mentioned properties. To achieve this, we took the vector clocks resulting from an algorithm execution, and we modeled them into the happened-before graph and the immediate dependency graph (which is the minimal causal graph). Then, we designed a set of transformation rules to verify if in these graphs, the algorithm is exempt from non-desirable patterns, such as Z-paths or Z-cycles, according to the case.
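As a rough illustration of the first modelling step described above, the sketch below derives a happened-before graph from vector clocks. It assumes each event is labelled with a fixed-length tuple of integers, one entry per process; the event names, the sample clocks and the function names are invented for the example and do not come from the cited article. The reduction to the immediate dependency graph and the transformation rules are not shown.

```python
from itertools import permutations


def happened_before(u, v):
    """True if vector clock u causally precedes vector clock v."""
    return u != v and all(a <= b for a, b in zip(u, v))


def concurrent(u, v):
    """True if neither clock causally precedes the other."""
    return u != v and not happened_before(u, v) and not happened_before(v, u)


def happened_before_graph(clocks):
    """Return the sorted list of edges (e, f) where e happened before f."""
    return sorted(
        (e, f)
        for e, f in permutations(clocks, 2)
        if happened_before(clocks[e], clocks[f])
    )


# Two processes: a and b occur on process 0, c and d on process 1,
# and d is the receipt of a message sent at a.
clocks = {"a": (1, 0), "b": (2, 0), "c": (0, 1), "d": (1, 2)}
edges = happened_before_graph(clocks)
print(edges)  # [('a', 'b'), ('a', 'd'), ('c', 'd')]
```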
Theoretical and experimental evaluation of communication-induced checkpointing protocols in $F_{E}$ and $F_{Lazy-E}$ families
2011, Performance Evaluation
Citation Excerpt:
A consistent global checkpoint (also called a recovery line) is where the whole system can roll back to in case of a failure. “the HMNR1 protocol must force at least one checkpoint between any two consecutive forced checkpoints taken by protocol CP in a process” [25]. The HMNR2 protocol must force at least one checkpoint between any two consecutive forced checkpoints taken by the FINE protocol in a process.
Communication-induced checkpointing (CIC) protocols help in bounding rollback propagation by ensuring that each checkpoint taken is useful, while at the same time allowing each process to take checkpoints independently. In this paper, we focus on the evaluation of CIC protocols belonging to two families, namely, the $F_{E}$ family and the $F_{Lazy-E}$ family. We present both theoretical and experimental evaluation of the protocols belonging to these two families. The results of our experimental evaluation not only confirm the theoretical comparison but also reveal the fine differences between these protocols.
FINE: A Fully Informed aNd Efficient communication-induced checkpointing protocol for distributed systems
2009, Journal of Parallel and Distributed Computing
Communication-Induced Checkpointing (CIC) protocols are classified into two categories in the literature: Index-based and Model-based. In this paper, we discuss two data structures being used in these two kinds of CIC protocols, and their different roles in helping the checkpointing algorithms to enforce Z-cycle Free (ZCF) property. Then, we present our Fully Informed aNd Efficient (FINE) communication-induced checkpointing algorithm, which not only has less checkpointing overhead than the well-known Fully Informed (FI) CIC protocol proposed by Helary et al. but also has less message overhead. Performance evaluation indicates that our protocol performs better than many of the other existing CIC protocols.
Quantifying rollback propagation in distributed checkpointing
2004, Journal of Parallel and Distributed Computing
This paper proposes a new classification of executions with checkpoints based on the amount of rollback during recovery. Specifically, an execution is k-rollback, if k indicates the maximal number of checkpoints that have to be rolled back. It is shown that coordinated checkpointing, SZPF, and ZPF are 1-rollback, while ZCF is (n−1)-rollback, where n is the number of participants in an execution.
A new class of executions, called d-bounded cycles (in short, d-BC), is introduced, and is shown to be ((n−1)·d)-rollback (ZCF is a special case of d-BC for d=1).
Finally, a protocol is presented whose executions are d-bounded cycles. A nice property of this protocol is that it does not impose any control information overhead on application messages, yet sends only a few control messages of its own. Moreover, the protocol maintains information that enables very efficient discovery of a recent recovery line that existed shortly before the failure.
Interval consistency of asynchronous distributed computations
2002, Journal of Computer and System Sciences
An interval of a sequential process is a sequence of consecutive events of this process. The set of intervals defined on a distributed computation defines an abstraction of this distributed computation, and the traditional causality relation on events induces a relation on the set of intervals that we call I-precedence. An important question is then, “Is the interval-based abstraction associated with a distributed computation consistent?” To answer this question, this paper introduces a consistency criterion named interval consistency (IC). Intuitively, this criterion states that an interval-based abstraction of a distributed computation is consistent if its I-precedence relation does not contradict the sequentiality of each process. More formally, IC is defined as a property of a precedence graph. Interestingly, the IC criterion can be operationally characterized in terms of timestamps (whose values belong to a lattice). The paper uses this characterization to design a versatile protocol that, given intervals defined by a daemon whose behavior is unpredictable, breaks them (in a nontrivial manner) in order to produce an abstraction satisfying the IC criterion. Applications to communication-induced checkpointing are suggested.
Impossibility of scalar clock-based communication-induced checkpointing protocols ensuring the RDT property
2001, Information Processing Letters
Communication-induced checkpointing protocols constitute an interesting approach to the on-line determination of checkpoint and communication patterns enjoying desirable properties such as domino-effect freedom. They do not add control messages to the computation, but instead may attach control information to computation messages. Among these protocols, scalar clock-based protocols are particularly attractive as they use a single integer as control information.
An interesting property of checkpoint and communication patterns is Rollback-Dependency Trackability, which ensures that all local checkpoint dependencies are on-the-fly trackable. So, it would be nice to design scalar clock-based communication-induced checkpointing protocols providing the RDT property, a previously open question. This paper shows that the design of such protocols is impossible.


¹: Tsai and Kuo's work is supported by the National Science Council, Taiwan, ROC, under Grant NSC 87-2213-E-259-007.
