Asteroid: Scalable Online Memory Diagnostics for Multi-core, Multi-socket Servers

International Journal of Parallel Programming

Abstract

Memory diagnostics are important for improving the resilience of DRAM main memory. As bit cell sizes approach physical limits, DRAM becomes more likely to suffer both transient and permanent errors. Memory diagnostics that operate online can be one component of a comprehensive strategy to mitigate these errors. This paper presents a novel approach, Asteroid, that integrates online memory diagnostics into workload execution. The approach supports diagnostics that adapt at runtime to workload behavior and resource availability, maximizing test quality while reducing performance overhead. We describe Asteroid’s design and how it can be efficiently integrated with a hierarchical memory allocator in modern operating systems. We also show how the framework enables control policies to configure a diagnostic dynamically. With an adaptive policy on a 16-core server, Asteroid incurs a modest overhead of 1–4% for workloads ranging from low to high memory demand. For these workloads, Asteroid’s adaptive policy achieves good error coverage and tests memory thoroughly.

Notes

  1. We do not show a sensitivity study of the parameters because the parameters should be tuned to a given target system with offline profiling. This tuning is orthogonal to our contribution and relatively uninteresting.

Author information

Corresponding author

Correspondence to Bruce R. Childers.

Additional information

This material is based upon work supported by the National Science Foundation under Grant Numbers CCF-1422331 and CNS-1012070.

About this article

Cite this article

Rahman, M., Childers, B.R. Asteroid: Scalable Online Memory Diagnostics for Multi-core, Multi-socket Servers. Int J Parallel Prog 44, 949–974 (2016). https://doi.org/10.1007/s10766-016-0400-2

  • DOI: https://doi.org/10.1007/s10766-016-0400-2
