OSTI.GOV
U.S. Department of Energy, Office of Scientific and Technical Information

Title: An Exploration in Implementing Fault Tolerance in Scientific Simulation Application Software

Technical Report
DOI: https://doi.org/10.2172/811162 · OSTI ID: 811162

The ability of scientific simulation software to detect and recover from errors and failures in supporting hardware and software layers is becoming more important because of the pressure to shift from large, specialized, multimillion-dollar ASCI computing platforms to smaller, less expensive interconnected machines built from off-the-shelf hardware. As the CPlant™ experience shows, fault tolerance can be necessary even on such a homogeneous system, and it may also prove useful in the next generation of ASCI platforms. This report describes a research effort intended to study, implement, and test the feasibility of various fault tolerance mechanisms controlled at the simulation code level. Errors and failures would be detected by underlying software layers, communicated to the application through a convenient interface, and then handled by the simulation code itself. Targeted faults included corrupt communication messages, processor node dropouts, and unacceptable slowdown of service from processing nodes. Recovery techniques such as re-sending communication messages and dynamically reallocating failing processor nodes were considered. However, most fault tolerance mechanisms rely on underlying software layers, which were found to be lacking to such a degree that mechanisms at the application level could not be implemented. This research effort has been postponed and shifted to these supporting layers.

Research Organization:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Sandia National Lab. (SNL-CA), Livermore, CA (United States)
Sponsoring Organization:
US Department of Energy (US)
DOE Contract Number:
AC04-94AL85000
OSTI ID:
811162
Report Number(s):
SAND2003-1651; TRN: US200311%%211
Resource Relation:
Other Information: PBD: 1 May 2003
Country of Publication:
United States
Language:
English