Copyright © 2003 Elsevier B.V. All rights reserved.
Improving availability with recursive microreboots: a soft-state system case study
Available online 2 October 2003.
Abstract
Even after decades of software engineering research, complex computer systems still fail. This paper makes the case for increasing research emphasis on dependability and, specifically, on improving availability by reducing time-to-recover.
All software fails at some point, so systems must be able to recover from failures. Recovery itself can fail too, so systems must know how to retry their recovery intelligently. We present here a recursive approach, in which a minimal subset of components is recovered first; if that does not work, progressively larger subsets are recovered. Our domain of interest is Internet services; these systems experience primarily transient or intermittent failures that can typically be resolved by rebooting. Conceding that failure-free software will continue to elude us for years to come, we undertake a systematic investigation of fine-grained, component-level restarts (microreboots) as high-availability medicine. Building and maintaining an accurate model of large Internet systems is nearly impossible, due to their scale and constantly evolving nature, so we take an application-generic approach that relies on empirical observations to manage recovery.
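The recursive approach described above can be sketched roughly as follows. This is only an illustrative sketch, not the implementation used in the paper: the names `RebootNode`, `recursive_recover`, `restart`, and `is_healthy` are hypothetical, and the sketch assumes that components are arranged in a tree whose leaves are individual components and whose inner nodes are groups that are restarted together.

```python
from typing import Callable, List, Optional


class RebootNode:
    """A node in a hypothetical reboot tree (leaf = one component)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.parent: Optional["RebootNode"] = None
        self.children: List["RebootNode"] = []

    def add(self, child: "RebootNode") -> "RebootNode":
        """Attach a child node and return it for convenient chaining."""
        child.parent = self
        self.children.append(child)
        return child

    def components(self) -> List[str]:
        """Names of all leaf components in the subtree rooted here."""
        if not self.children:
            return [self.name]
        names: List[str] = []
        for child in self.children:
            names.extend(child.components())
        return names


def recursive_recover(
    node: RebootNode,
    restart: Callable[[List[str]], None],
    is_healthy: Callable[[], bool],
) -> List[List[str]]:
    """Restart the smallest subtree first, then progressively larger ones.

    `restart` and `is_healthy` are caller-supplied (assumed) hooks: the
    first restarts a group of components, the second reports whether the
    observed failure has gone away. Returns the groups that were restarted.
    """
    attempts: List[List[str]] = []
    current: Optional[RebootNode] = node
    while current is not None:
        group = current.components()
        restart(group)
        attempts.append(group)
        if is_healthy():
            return attempts
        # Recovery did not cure the failure: widen to the parent's subtree.
        current = current.parent
    raise RuntimeError("failure persists after restarting the whole system")
```

For example, with a tree in which hypothetical components `a` and `b` share a parent group, a failure in `a` that is only cured by restarting both would lead to two attempts: first `["a"]`, then `["a", "b"]`. Components outside that group are left untouched, which is the point of keeping each restart as small as possible.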
We apply recursive microreboots to Mercury, a commercial off-the-shelf (COTS)-based satellite ground station that is based on an Internet service platform. Mercury has been in successful operation for over 3 years. From our experience with Mercury, we draw design guidelines and lessons for the application of recursive microreboots to other software systems. We also present a set of guidelines for building systems amenable to recursive reboots, known as “crash-only software systems.”
Author Keywords: Microreboots; High availability; Recovery-oriented computing
Article Outline
- 1. Introduction
- 2. Recursive recovery and microreboots
- 3. Case study: the Mercury ground station
- 3.1. Overview
- 3.2. Ground station architecture
- 3.3. Adding failure monitoring to Mercury
- 3.4. The reboot tree
- 3.5. Evolving Mercury’s reboot tree
- 3.5.1. Simple depth augmentation
- 3.5.2. Subtree depth augmentation
- 3.5.3. Consolidating dependent nodes
- 3.5.4. Promoting high-MTTR nodes
- 3.6. Lessons
- 4. Crash-only software
- 4.1. Why crash-only design?
- 4.2. Properties of crash-only software
- 4.3. A restart/retry architecture
- 4.4. Discussion
- 5. Related work
- 6. Conclusion
- Acknowledgements
- References
- Vitae





