DRAM errors in the wild: a large-scale field study

Authors:
Bianca Schroeder, University of Toronto, Toronto, Canada
Eduardo Pinheiro, Google Inc., Mountain View, CA
Wolf-Dietrich Weber, Google Inc., Mountain View, CA

Communications of the ACM, Volume 54, Issue 2, February 2011, pp. 100–107
https://doi.org/10.1145/1897816.1897844

Published: 1 February 2011

Abstract

Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters. In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of dual in-line memory module (DIMM) days.

The goal of this paper is to answer questions such as the following: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology, and DIMM age?

We find that DRAM error behavior in the field differs in many key aspects from commonly held assumptions. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mb and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors rather than soft errors, which previous work suspected to be the dominant error mode. We find that temperature, known to strongly affect DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field once all other factors are taken into account. Finally, contrary to common fears, we do not observe any indication that newer generations of DIMMs have worse error behavior.
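To give a rough sense of scale for the quoted rate, the following sketch converts "errors per billion device hours per Mb" into an expected number of errors per DIMM per year. It is an illustration only, not the authors' method: the function name, the 1 GiB DIMM capacity, the reading of Mb as 2^20 bits, and the 365-day year are all assumptions made here, and the calculation treats the rate as uniform across DIMMs.

```python
# Back-of-the-envelope conversion of the abstract's error rate.
# All names and the DIMM capacity are hypothetical, chosen for illustration.

HOURS_PER_YEAR = 24 * 365   # assumes a 365-day year
MBIT_PER_GIB = 8 * 1024     # assumes 1 Mb = 2**20 bits, 1 GiB = 2**30 bytes


def errors_per_dimm_year(errors_per_1e9_hours_per_mbit, dimm_gib):
    """Expected errors per DIMM per year, assuming a uniform error rate."""
    mbit = dimm_gib * MBIT_PER_GIB
    return errors_per_1e9_hours_per_mbit * mbit * HOURS_PER_YEAR / 1e9


if __name__ == "__main__":
    # Lower and upper ends of the range quoted in the abstract.
    for rate in (25_000, 70_000):
        mean = errors_per_dimm_year(rate, dimm_gib=1.0)
        print(f"{rate} per 1e9 device-hours per Mb -> {mean:.0f} errors per DIMM-year")
```

The result is a fleet-wide mean. Since the abstract also reports that only somewhat more than 8% of DIMMs see any error in a year, the errors must be concentrated on a minority of DIMMs, so the uniform-rate assumption should be read as a simplification.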

Published in

Communications of the ACM Volume 54, Issue 2
February 2011
115 pages
ISSN:0001-0782
EISSN:1557-7317
DOI:10.1145/1897816
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States