
Title: Algorithm-dependent fault tolerance for distributed computing

Abstract

Large-scale distributed systems assembled from commodity parts, like CPlant, have become common tools in the distributed computing world. Because of their size and diversity of parts, these systems are prone to failures. Applications that are being run on these systems have not been equipped to efficiently deal with failures, nor is there vendor support for fault tolerance. Thus, when a failure occurs, the application crashes. While most programmers make use of checkpoints to allow for restarting of their applications, this is cumbersome and incurs substantial overhead. In many cases, there are more efficient and more elegant ways in which to address failures. The goal of this project is to develop a software architecture for the detection of and recovery from faults in a cluster computing environment. The detection phase relies on the latest techniques developed in the fault tolerance community. Recovery is being addressed in an application-dependent manner, thus allowing the programmer to take advantage of algorithmic characteristics to reduce the overhead of fault tolerance. This architecture will allow large-scale applications to be more robust in high-performance computing environments that are comprised of clusters of commodity computers such as CPlant and SMP clusters.

Authors:
Hough, P D; Goldsby, M E; Walsh, E J
Publication Date:
2000-02-01
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Sandia National Lab. (SNL-CA), Livermore, CA (United States)
Sponsoring Org.:
US Department of Energy (US)
OSTI Identifier:
754901
Report Number(s):
SAND2000-8219
TRN: AH200021%%336
DOE Contract Number:  
AC04-94AL85000
Resource Type:
Technical Report
Resource Relation:
Other Information: PBD: 1 Feb 2000
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; DISTRIBUTED DATA PROCESSING; FAULT TOLERANT COMPUTERS; ALGORITHMS; COMPUTER ARCHITECTURE

Citation Formats

Hough, P D, Goldsby, M E, and Walsh, E J. Algorithm-dependent fault tolerance for distributed computing. United States: N. p., 2000. Web. doi:10.2172/754901.
Hough, P D, Goldsby, M E, & Walsh, E J. Algorithm-dependent fault tolerance for distributed computing. United States. https://doi.org/10.2172/754901
Hough, P D, Goldsby, M E, and Walsh, E J. 2000. "Algorithm-dependent fault tolerance for distributed computing". United States. https://doi.org/10.2172/754901. https://www.osti.gov/servlets/purl/754901.
@article{osti_754901,
title = {Algorithm-dependent fault tolerance for distributed computing},
author = {Hough, P D and Goldsby, M E and Walsh, E J},
abstractNote = {Large-scale distributed systems assembled from commodity parts, like CPlant, have become common tools in the distributed computing world. Because of their size and diversity of parts, these systems are prone to failures. Applications that are being run on these systems have not been equipped to efficiently deal with failures, nor is there vendor support for fault tolerance. Thus, when a failure occurs, the application crashes. While most programmers make use of checkpoints to allow for restarting of their applications, this is cumbersome and incurs substantial overhead. In many cases, there are more efficient and more elegant ways in which to address failures. The goal of this project is to develop a software architecture for the detection of and recovery from faults in a cluster computing environment. The detection phase relies on the latest techniques developed in the fault tolerance community. Recovery is being addressed in an application-dependent manner, thus allowing the programmer to take advantage of algorithmic characteristics to reduce the overhead of fault tolerance. This architecture will allow large-scale applications to be more robust in high-performance computing environments that are comprised of clusters of commodity computers such as CPlant and SMP clusters.},
doi = {10.2172/754901},
url = {https://www.osti.gov/biblio/754901},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2000},
month = {2}
}