Performance Evaluation
Volume 56, Issues 1-4, March 2004, Pages 213-248
Dependable Systems and Networks - Performance and Dependability Symposium (DSN-PDS) 2002: Selected Papers
 
doi:10.1016/j.peva.2003.07.007    

Copyright © 2003 Elsevier B.V. All rights reserved.

Improving availability with recursive microreboots: a soft-state system case study


George Candea (corresponding author), James Cutler and Armando Fox

Stanford University, Stanford, CA, USA


Available online 2 October 2003.

Abstract

Even after decades of software engineering research, complex computer systems still fail. This paper makes the case for increasing research emphasis on dependability and, specifically, on improving availability by reducing time-to-recover.

All software fails at some point, so systems must be able to recover from failures. Recovery itself can fail too, so systems must know how to intelligently retry their recovery. We present here a recursive approach, in which a minimal subset of components is recovered first; if that does not work, progressively larger subsets are recovered. Our domain of interest is Internet services; these systems experience primarily transient or intermittent failures, which can typically be resolved by rebooting. Conceding that failure-free software will continue to elude us for years to come, we undertake a systematic investigation of fine-grained component-level restarts (microreboots) as high-availability medicine. Building and maintaining an accurate model of large Internet systems is nearly impossible, due to their scale and constantly evolving nature, so we take an application-generic approach that relies on empirical observations to manage recovery.
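
The recursive approach described above (recover a minimal subset first, then progressively larger subsets) can be illustrated with a minimal sketch. This is not the authors' implementation: the `RebootNode` class, the `recursive_recover` function, the `restart` and `is_healthy` callbacks, and the component names in `demo` are all hypothetical, chosen only to show the escalation pattern on a small tree of components.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(eq=False)
class RebootNode:
    """Hypothetical tree node; restarting a node restarts its whole subtree."""
    name: str
    children: List["RebootNode"] = field(default_factory=list)
    parent: Optional["RebootNode"] = field(default=None, repr=False)

    def __post_init__(self) -> None:
        # Link each child back to this node so recovery can escalate upward.
        for child in self.children:
            child.parent = self

    def components(self) -> List[str]:
        """Names of every component in this node's subtree."""
        names = [self.name]
        for child in self.children:
            names.extend(child.components())
        return names


def recursive_recover(
    failed: RebootNode,
    restart: Callable[[List[str]], None],
    is_healthy: Callable[[], bool],
) -> Tuple[List[str], bool]:
    """Restart the smallest subtree first, then escalate to larger ones.

    Returns the names of the subtree roots that were restarted, in order,
    and whether the health check eventually passed.
    """
    attempts: List[str] = []
    node: Optional[RebootNode] = failed
    while node is not None:
        restart(node.components())
        attempts.append(node.name)
        if is_healthy():
            return attempts, True
        node = node.parent  # recovery did not help: widen the restart scope
    return attempts, False  # even a full restart did not cure the failure


def demo() -> Tuple[List[str], bool]:
    # Assumed example system; these component names are illustrative only.
    tuner = RebootNode("tuner")
    tracker = RebootNode("tracker")
    radio = RebootNode("radio", [tuner, tracker])
    RebootNode("system", [radio, RebootNode("logger")])
    state = {"healthy": False}

    def restart(names: List[str]) -> None:
        # Simulated fault: cured only when tuner and tracker restart together.
        if {"tuner", "tracker"} <= set(names):
            state["healthy"] = True

    return recursive_recover(tuner, restart, lambda: state["healthy"])
```

In this sketch, restarting only the failed leaf does not clear the simulated fault, so the restart scope widens to its parent subtree, where the health check passes. A real system would rely on observed health signals rather than a simulated flag.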

We apply recursive microreboots to Mercury, a commercial off-the-shelf (COTS)-based satellite ground station that is based on an Internet service platform. Mercury has been in successful operation for over 3 years. From our experience with Mercury, we draw design guidelines and lessons for the application of recursive microreboots to other software systems. We also present a set of guidelines for building systems amenable to recursive reboots, known as “crash-only software systems.”

Author Keywords: Microreboots; High availability; Recovery-oriented computing

Article Outline

1. Introduction
1.1. The true cost of performance
1.2. Recovery-oriented computing (ROC)
2. Recursive recovery and microreboots
2.1. No a priori models
2.2. RR systems
2.2.1. Monitoring system health
2.2.2. Fault propagation and recovery maps
2.2.3. The recovery process
3. Case study: the Mercury ground station
3.1. Overview
3.2. Ground station architecture
3.3. Adding failure monitoring to Mercury
3.4. The reboot tree
3.5. Evolving Mercury’s reboot tree
3.5.1. Simple depth augmentation
3.5.2. Subtree depth augmentation
3.5.3. Consolidating dependent nodes
3.5.4. Promoting high-MTTR nodes
3.6. Lessons
3.6.1. Moving boundaries
3.6.2. Not all downtime is the same
4. Crash-only software
4.1. Why crash-only design?
4.1.1. Crash-only and fault model enforcement
4.2. Properties of crash-only software
4.2.1. Intra-component properties
4.2.2. Extra-component properties
4.3. A restart/retry architecture
4.4. Discussion
5. Related work
5.1. Detection
5.2. Containment
5.3. Recovery
6. Conclusion
Acknowledgements
References
Vitae








