Algorithm-dependent fault tolerance for distributed computing
Abstract
Large-scale distributed systems assembled from commodity parts, like CPlant, have become common tools in the distributed computing world. Because of their size and diversity of parts, these systems are prone to failures. Applications that are being run on these systems have not been equipped to efficiently deal with failures, nor is there vendor support for fault tolerance. Thus, when a failure occurs, the application crashes. While most programmers make use of checkpoints to allow for restarting of their applications, this is cumbersome and incurs substantial overhead. In many cases, there are more efficient and more elegant ways in which to address failures. The goal of this project is to develop a software architecture for the detection of and recovery from faults in a cluster computing environment. The detection phase relies on the latest techniques developed in the fault tolerance community. Recovery is being addressed in an application-dependent manner, thus allowing the programmer to take advantage of algorithmic characteristics to reduce the overhead of fault tolerance. This architecture will allow large-scale applications to be more robust in high-performance computing environments that are comprised of clusters of commodity computers such as CPlant and SMP clusters.
- Authors:
- Hough, P D; Goldsby, M E; Walsh, E J
- Publication Date:
- 2000-02-01
- Research Org.:
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Sandia National Lab. (SNL-CA), Livermore, CA (United States)
- Sponsoring Org.:
- US Department of Energy (US)
- OSTI Identifier:
- 754901
- Report Number(s):
- SAND2000-8219; TRN: AH200021%%336
- DOE Contract Number:
- AC04-94AL85000
- Resource Type:
- Technical Report
- Resource Relation:
- Other Information: PBD: 1 Feb 2000
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; DISTRIBUTED DATA PROCESSING; FAULT TOLERANT COMPUTERS; ALGORITHMS; COMPUTER ARCHITECTURE
Citation Formats
Hough, P D, Goldsby, M E, and Walsh, E J. Algorithm-dependent fault tolerance for distributed computing. United States: N. p., 2000. Web. doi:10.2172/754901.
Hough, P D, Goldsby, M E, & Walsh, E J. Algorithm-dependent fault tolerance for distributed computing. United States. https://doi.org/10.2172/754901
Hough, P D, Goldsby, M E, and Walsh, E J. 2000. "Algorithm-dependent fault tolerance for distributed computing". United States. https://doi.org/10.2172/754901. https://www.osti.gov/servlets/purl/754901.
@article{osti_754901,
title = {Algorithm-dependent fault tolerance for distributed computing},
author = {Hough, P D and Goldsby, M E and Walsh, E J},
abstractNote = {Large-scale distributed systems assembled from commodity parts, like CPlant, have become common tools in the distributed computing world. Because of their size and diversity of parts, these systems are prone to failures. Applications that are being run on these systems have not been equipped to efficiently deal with failures, nor is there vendor support for fault tolerance. Thus, when a failure occurs, the application crashes. While most programmers make use of checkpoints to allow for restarting of their applications, this is cumbersome and incurs substantial overhead. In many cases, there are more efficient and more elegant ways in which to address failures. The goal of this project is to develop a software architecture for the detection of and recovery from faults in a cluster computing environment. The detection phase relies on the latest techniques developed in the fault tolerance community. Recovery is being addressed in an application-dependent manner, thus allowing the programmer to take advantage of algorithmic characteristics to reduce the overhead of fault tolerance. This architecture will allow large-scale applications to be more robust in high-performance computing environments that are comprised of clusters of commodity computers such as CPlant and SMP clusters.},
doi = {10.2172/754901},
url = {https://www.osti.gov/biblio/754901},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2000},
month = {2}
}