doi:10.1016/j.peva.2007.09.001    

Copyright © 2007 Elsevier Ltd All rights reserved.

Model-based performance evaluation of distributed checkpointing protocols ☆


Adnan Agbaria a,* and Roy Friedman b

a IBM Haifa Research Lab, Mount Carmel, Haifa 31905, Israel

b Computer Science Department, Technion - Israel Institute of Technology, Haifa 32000, Israel


Received 20 June 2006; 
revised 13 March 2007; 
accepted 16 September 2007. 
Available online 26 September 2007.

Abstract

A large number of distributed checkpointing protocols have appeared in the literature. However, to make informed decisions about which protocol performs best for a given environment, one must use an objective measure for comparing them. Obviously, a distributed checkpointing protocol could be the best in a specific environment, but not in another environment. This paper presents an objective measure, called overhead ratio, for evaluating distributed checkpointing protocols. This measure extends previous evaluation schemes by incorporating several additional parameters that are inherent in distributed environments. In particular, we take into account the rollback propagation of the protocol, which impacts the length of the recovery process, and therefore the expected program run-time in executions that involve failures and recoveries. Using the objective measure as an evaluation technique, the paper also analyses several known protocols and compares their overhead ratios.

Keywords: Distributed checkpoint/restart; Rollback propagation; Performance analysis; Markov models
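
The abstract does not give the definition of the overhead ratio, so the following is only a minimal sketch of the general idea, under assumptions that are not taken from the paper: a single process, failures arriving as a Poisson process with rate `failure_rate`, a fixed `checkpoint_cost` added to each interval of `work` time units, a fixed `recovery_cost` paid after every failure (with no failures during recovery), and a restart of the whole interval after each failure. Under those assumptions the expected time to complete an interval of length x = work + checkpoint_cost is (e^(λx) - 1)(1/λ + R), and the ratio is that expected time divided by the useful work, minus one. The function and parameter names are hypothetical. Rollback propagation across processes, which is what the paper's measure adds, is deliberately not modelled here.

```python
import math


def expected_interval_time(work, checkpoint_cost, recovery_cost, failure_rate):
    """Expected wall-clock time to finish one checkpoint interval.

    Assumed model (not from the paper): Poisson failures with rate
    `failure_rate`; each failure costs `recovery_cost` and then the
    interval (work + checkpoint) restarts from the beginning.
    """
    length = work + checkpoint_cost
    if failure_rate == 0:
        # Failure-free case: only the checkpoint overhead is added.
        return length
    # Closed form of the restart-on-failure renewal equation.
    return math.expm1(failure_rate * length) * (1.0 / failure_rate + recovery_cost)


def overhead_ratio(work, checkpoint_cost, recovery_cost, failure_rate):
    """Relative increase in expected run-time over the useful work."""
    total = expected_interval_time(work, checkpoint_cost, recovery_cost, failure_rate)
    return total / work - 1.0


if __name__ == "__main__":
    # Illustrative values only; they are not measurements from the paper.
    for rate in (0.0, 1e-4, 1e-3):
        print(rate, overhead_ratio(100.0, 1.0, 2.0, rate))
```

In this simplified model the ratio reduces to checkpoint_cost / work when there are no failures and grows with the failure rate. A protocol with longer rollbacks would be expected to lose more work per failure than this single-interval restart captures.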

Article Outline

1. Introduction
2. Preliminaries
2.1. System model
2.2. Definitions and notations
2.3. Assumptions and limitations
3. The overhead ratio
3.1. Computing Γ_k using Markov chains
3.1.1. Computing Γ_0
3.1.2. Computing Γ_1
3.1.3. Computing Γ_2
3.1.4. Computing Γ_k
3.2. Computing the overhead ratio
3.3. Numerical results
4. Performance analysis of checkpointing protocols
4.1. Distributed checkpointing protocols
4.1.1. Sync-and-Stop (SaS) [25]
4.1.2. Chandy–Lamport (C–L) [11]
4.1.3. Baldoni, Helary, Mostefaoui and Raynal (BHMR) [7]
4.1.4. Fixed-Dependency-Interval (FDI) [40]
4.1.5. Briatico, Ciuffoletti and Simoncini (BCS) [10]
4.1.6. Baldoni, Quaglia and Ciciani (BQC) [8]
4.1.7. Helary, Mostefaoui and Raynal (HMR) [14]
4.1.8. Manivannan–Singhal (M–S) [17]
4.1.9. d-Bounded Cycles (d-BC) [1]
4.2. Comparing distributed checkpointing protocols
5. Related work
6. Conclusions
Acknowledgements
References
Vitae

☆ This research is supported by the Bar-Nir Bergreen Software Technology Centre of Excellence.


* Corresponding author.

 