ZSim: fast and accurate microarchitectural simulation of thousand-core systems

Authors: Daniel Sanchez (Massachusetts Institute of Technology), Christos Kozyrakis (Stanford University)

ACM SIGARCH Computer Architecture News, Volume 41, Issue 3, June 2013, pp. 475–486. https://doi.org/10.1145/2508148.2485963

Published: 23 June 2013

Abstract

Architectural simulation is time-consuming, and the trend towards hundreds of cores is making sequential simulation even slower. Existing parallel simulation techniques either scale poorly due to excessive synchronization, or sacrifice accuracy by allowing event reordering and using simplistic contention models. As a result, most researchers use sequential simulators and model small-scale systems with 16-32 cores. With 100-core chips already available, developing simulators that scale to thousands of cores is crucial.

We present three novel techniques that, together, make thousand-core simulation practical. First, we speed up detailed core models (including OOO cores) with instruction-driven timing models that leverage dynamic binary translation. Second, we introduce bound-weave, a two-phase parallelization technique that scales parallel simulation on multicore hosts efficiently with minimal loss of accuracy. Third, we implement lightweight user-level virtualization to support complex workloads, including multiprogrammed, client-server, and managed-runtime applications, without the need for full-system simulation, sidestepping the lack of scalable OSs and ISAs that support thousands of cores.

We use these techniques to build zsim, a fast, scalable, and accurate simulator. On a 16-core host, zsim models a 1024-core chip at speeds of up to 1,500 MIPS using simple cores and up to 300 MIPS using detailed OOO cores, 2-3 orders of magnitude faster than existing parallel simulators. Simulator performance scales well with both the number of modeled cores and the number of host cores. We validate zsim against a real Westmere system on a wide variety of workloads, and find performance and microarchitectural events to be within a narrow range of the real system.
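
To put the headline figures in perspective, the short sketch below divides the peak aggregate rates quoted above by the number of host cores and by the number of simulated cores. It assumes the load is spread evenly across cores, which the abstract does not state, and the function and variable names are illustrative rather than part of zsim.

```python
# Figures quoted in the abstract; both rates are "up to" (peak) values.
HOST_CORES = 16
SIMULATED_CORES = 1024
PEAK_MIPS = {"simple": 1500.0, "detailed_ooo": 300.0}


def per_unit_rates(aggregate_mips, host_cores=HOST_CORES,
                   simulated_cores=SIMULATED_CORES):
    """Split an aggregate simulation rate evenly across cores.

    The even split is an assumption made for illustration only; real
    workloads rarely balance perfectly.
    """
    return {
        "mips_per_host_core": aggregate_mips / host_cores,
        "mips_per_simulated_core": aggregate_mips / simulated_cores,
    }


if __name__ == "__main__":
    for model, mips in PEAK_MIPS.items():
        rates = per_unit_rates(mips)
        print(f"{model}: "
              f"{rates['mips_per_host_core']:.2f} MIPS per host core, "
              f"{rates['mips_per_simulated_core']:.3f} MIPS per simulated core")
```

Under that even-split assumption, each host core sustains about 94 MIPS with simple cores and about 19 MIPS with detailed OOO cores, while each simulated core advances at roughly 1.5 and 0.3 MIPS respectively.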

Published in

ACM SIGARCH Computer Architecture News, Volume 41, Issue 3 (ISCA '13), June 2013, 666 pages. ISSN: 0163-5964. DOI: 10.1145/2508148

ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture, June 2013, 686 pages. ISBN: 9781450320795. DOI: 10.1145/2485922

General Chair: Avi Mendelson (Technion)

Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 June 2013
Qualifiers
- research-article