research-article

ZSim: fast and accurate microarchitectural simulation of thousand-core systems

Published: 23 June 2013

Abstract

Architectural simulation is time-consuming, and the trend towards hundreds of cores is making sequential simulation even slower. Existing parallel simulation techniques either scale poorly due to excessive synchronization, or sacrifice accuracy by allowing event reordering and using simplistic contention models. As a result, most researchers use sequential simulators and model small-scale systems with 16-32 cores. With 100-core chips already available, developing simulators that scale to thousands of cores is crucial.

We present three novel techniques that, together, make thousand-core simulation practical. First, we speed up detailed core models (including OOO cores) with instruction-driven timing models that leverage dynamic binary translation. Second, we introduce bound-weave, a two-phase parallelization technique that scales parallel simulation on multicore hosts efficiently with minimal loss of accuracy. Third, we implement lightweight user-level virtualization to support complex workloads, including multiprogrammed, client-server, and managed-runtime applications, without the need for full-system simulation, sidestepping the lack of scalable OSs and ISAs that support thousands of cores.
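To make the two-phase idea concrete, here is a toy Python sketch of an interval simulated in two passes. It is not zsim's implementation: the abstract gives no algorithmic detail, so the single shared bank, the constants `ZERO_LOAD_LATENCY` and `SERVICE_TIME`, and all function names are illustrative assumptions. The first pass simulates each core in isolation with uncontended latencies and records its accesses; the second replays those accesses in time order to add contention delays. In a real parallel simulator the first pass would run across host threads; here it runs sequentially for clarity.

```python
import heapq

ZERO_LOAD_LATENCY = 10  # assumed uncontended access latency, in cycles
SERVICE_TIME = 4        # assumed cycles the shared bank stays busy per access


def bound_phase(gaps, start_cycle=0):
    """Simulate one core in isolation for an interval.

    `gaps` holds the compute cycles before each shared-memory access.
    Every access is charged only the zero-load latency.
    Returns (end_cycle, list of access issue cycles).
    """
    cycle = start_cycle
    trace = []
    for gap in gaps:
        cycle += gap
        trace.append(cycle)
        cycle += ZERO_LOAD_LATENCY
    return cycle, trace


def weave_phase(traces):
    """Replay recorded accesses in time order through one shared bank.

    `traces` maps core id -> issue cycles recorded by the bound phase.
    Returns core id -> extra cycles of delay caused by contention.
    """
    delay = {core: 0 for core in traces}
    heap = [(trace[0], core, 0) for core, trace in traces.items() if trace]
    heapq.heapify(heap)
    bank_free = 0
    while heap:
        issue, core, i = heapq.heappop(heap)
        start = max(issue, bank_free)      # wait if the bank is still busy
        delay[core] += start - issue
        bank_free = start + SERVICE_TIME
        if i + 1 < len(traces[core]):
            # Later accesses from this core shift by the delay accumulated so far.
            heapq.heappush(heap, (traces[core][i + 1] + delay[core], core, i + 1))
    return delay


def simulate_interval(workloads):
    """Run both phases and return each core's corrected end cycle."""
    bound = {core: bound_phase(gaps) for core, gaps in workloads.items()}
    delays = weave_phase({core: trace for core, (_, trace) in bound.items()})
    return {core: end + delays[core] for core, (end, _) in bound.items()}
```

For example, `simulate_interval({0: [5, 5], 1: [5, 5]})` lets both cores issue their first access at cycle 5; the second core waits for the bank and finishes the interval later than the first.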

We use these techniques to build zsim, a fast, scalable, and accurate simulator. On a 16-core host, zsim models a 1024-core chip at speeds of up to 1,500 MIPS using simple cores and up to 300 MIPS using detailed OOO cores, 2-3 orders of magnitude faster than existing parallel simulators. Simulator performance scales well with both the number of modeled cores and the number of host cores. We validate zsim against a real Westmere system on a wide variety of workloads, and find performance and microarchitectural events to be within a narrow range of the real system.
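The quoted speeds are easier to interpret when normalized. The short calculation below assumes the MIPS figures are aggregate rates summed over all simulated cores (the abstract does not say so explicitly) and divides them by the 16 host cores and the 1024 modeled cores. The helper name `per_unit_mips` is hypothetical.

```python
SIM_CORES = 1024   # modeled cores, from the abstract
HOST_CORES = 16    # host cores, from the abstract

# Peak simulation speeds quoted in the abstract, assumed to be aggregate MIPS.
PEAK_MIPS = {"simple cores": 1500, "detailed OOO cores": 300}


def per_unit_mips(aggregate_mips, units):
    """Split an aggregate MIPS figure evenly across `units` cores."""
    return aggregate_mips / units


for model, mips in PEAK_MIPS.items():
    per_host = per_unit_mips(mips, HOST_CORES)
    per_sim = per_unit_mips(mips, SIM_CORES)
    print(f"{model}: {per_host:.2f} MIPS per host core, "
          f"{per_sim:.2f} MIPS per simulated core")
```

Under that assumption, each host core sustains roughly 94 MIPS with simple cores and roughly 19 MIPS with detailed OOO cores at peak.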


  • Published in

    ACM SIGARCH Computer Architecture News, Volume 41, Issue 3
    ISCA '13
    June 2013
    666 pages
    ISSN: 0163-5964
    DOI: 10.1145/2508148

    • ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture
      June 2013
      686 pages
      ISBN: 9781450320795
      DOI: 10.1145/2485922

    Copyright © 2013 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States
