Abstract
Architectural simulation is time-consuming, and the trend towards hundreds of cores is making sequential simulation even slower. Existing parallel simulation techniques either scale poorly due to excessive synchronization, or sacrifice accuracy by allowing event reordering and using simplistic contention models. As a result, most researchers use sequential simulators and model small-scale systems with 16-32 cores. With 100-core chips already available, developing simulators that scale to thousands of cores is crucial.
We present three novel techniques that, together, make thousand-core simulation practical. First, we speed up detailed core models (including OOO cores) with instruction-driven timing models that leverage dynamic binary translation. Second, we introduce bound-weave, a two-phase parallelization technique that scales parallel simulation on multicore hosts efficiently with minimal loss of accuracy. Third, we implement lightweight user-level virtualization to support complex workloads, including multiprogrammed, client-server, and managed-runtime applications, without the need for full-system simulation, sidestepping the lack of scalable OSs and ISAs that support thousands of cores.
We use these techniques to build zsim, a fast, scalable, and accurate simulator. On a 16-core host, zsim models a 1024-core chip at speeds of up to 1,500 MIPS using simple cores and up to 300 MIPS using detailed OOO cores, 2-3 orders of magnitude faster than existing parallel simulators. Simulator performance scales well with both the number of modeled cores and the number of host cores. We validate zsim against a real Westmere system on a wide variety of workloads, and find performance and microarchitectural events to be within a narrow range of the real system.
- Computer architecture simulation and modeling. IEEE Micro Special Issue, 26(4), 2006.Google Scholar
- A. Alameldeen and D. Wood. IPC considered harmful for multiprocessor workloads. IEEE Micro, 26(4), 2006. Google ScholarDigital Library
- C. Bienia, S. Kumar, J. P. Singh, et al. The PARSEC benchmark suite: Characterization and architectural implications. In PACT-17, 2008. Google ScholarDigital Library
- N. Binkert, B. Beckmann, G. Black, et al. The gem5 simulator. SIGARCH Comp. Arch. News, 39(2), 2011. Google ScholarDigital Library
- N. Binkert, R. Dreslinski, L. Hsu, et al. The M5 simulator: Modeling networked systems. IEEE Micro, 26(4), 2006. Google ScholarDigital Library
- E. Blem, J. Menon, and K. Sankaralingam. Power Struggles: Revisiting the RISC vs CISC Debate on Contemporary ARM and x86 Architectures. In HPCA-19, 2013. Google ScholarDigital Library
- S. Boyd-Wickizer, H. Chen, R. Chen, et al. Corey: An operating system for many cores. In OSDI-8, 2008. Google ScholarDigital Library
- T. Carlson, W. Heirman, and L. Eeckhout. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation. In Supercomputing, 2011. Google ScholarDigital Library
- S. Chandrasekaran and M. D. Hill. Optimistic simulation of parallel architectures using program executables. In PADS, 1996. Google ScholarDigital Library
- J. Chen, L. K. Dabbiru, D. Wong, et al. Adaptive and speculative slack simulations of CMPs on CMPs. In MICRO-43, 2010. Google ScholarDigital Library
- M. Chidester and A. George. Parallel simulation of chip-multiprocessor architectures. TOMACS, 12(3), 2002. Google ScholarDigital Library
- D. Chiou, D. Sunwoo, J. Kim, et al. FPGA-accelerated simulation technologies (FAST): Fast, full-system, cycle-accurate simulators. In MICRO-40, 2007. Google ScholarDigital Library
- A. Fog. Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs, http://www.agner.org/optimize/.Google Scholar
- R. Fujimoto. Parallel discrete event simulation. CACM, 33--10, 1990. Google ScholarDigital Library
- D. R. Hower, P. Montesinos, L. Ceze, et al. Two hardware-based approaches for deterministic multiprocessor replay. CACM, 52--6, 2009. Google ScholarDigital Library
- X. Huang, J. Moss, K. McKinley, et al. Dynamic simplescalar: Simulating java virtual machines. Technical report, UT Austin, 2003.Google Scholar
- Intel. Intel Xeon E3-1200 Family. Datasheet, 2011.Google Scholar
- A. Jaleel, R. Cohn, C. Luk, and B. Jacob. CMPSim: A Pin-based on-the-fly multi-core cache simulator. In MoBS-4, 2008.Google Scholar
- A. Khan, M. Vijayaraghavan, S. Boyd-Wickizer, and Arvind. Fast cycle-accurate modeling of a multicore processor. In ISPASS, 2012. Google ScholarDigital Library
- G. Kurian, J. Miller, J. Psota, et al. ATAC: A 1000-core cache-coherent processor with on-chip optical network. In PACT-19, 2010. Google ScholarDigital Library
- R. Liu, K. Klues, S. Bird, et al. Tessellation: Spacetime partitioning in a manycore client os. In HotPar, 2009. Google ScholarDigital Library
- C.-K. Luk, R. Cohn, R. Muth, et al. Pin: Building customized program analysis tools with dynamic instrumentation. In PLDI, 2005. Google ScholarDigital Library
- K. T. Malladi, B. C. Lee, F. A. Nothaft, et al. Towards energy-proportional datacenter memory with mobile DRAM. In ISCA-39, 2012. Google ScholarDigital Library
- K. T. Malladi, I. Shaeffer, L. Gopalakrishnan, et al. Rethinking DRAM power modes for energy proportionality. In MICRO-45, 2012. Google ScholarDigital Library
- M. Martin, D. Sorin, B. Beckmann, et al. Multi-facet's general execution driven multiprocessor simulator (gems) toolset. Comp. Arch. News, 33--4, 2005. Google ScholarDigital Library
- C. J. Mauer, M. D. Hill, and D. A. Wood. Full-system timing-first simulation. In SIGMETRICS conf., 2002. Google ScholarDigital Library
- J. Miller, H. Kasture, G. Kurian, et al. Graphite: A distributed parallel simulator for multicores. In HPCA-16, 2010.Google Scholar
- H. Pan, B. Hindman, and K. Asanovic. Lithe: Enabling efficient composition of parallel libraries. HotPar, 2009. Google ScholarDigital Library
- A. Patel, F. Afram, S. Chen, and K. Ghose. MARSS: A full system simulator for multicore x86 CPUs. In DAC-48, 2011. Google ScholarDigital Library
- A. Patel, F. Afram, K. Ghose, et al. MARSS: Micro Architectural Systems Simulator. In ISCA tutorial 6, 2012.Google Scholar
- M. Pellauer, M. Adler, M. Kinsy, et al. HAsim: FPGA-based high detail multicore simulation using time-division multiplexing. In HPCA-17, 2011. Google ScholarDigital Library
- A. Pesterev, J. Strauss, N. Zeldovich, and R. Morris. Improving network connection locality on multicore systems. In EuroSys-7, 2012. Google ScholarDigital Library
- S. K. Reinhardt, M. D. Hill, J. R. Larus, et al. The Wisconsin Wind Tunnel: virtual prototyping of parallel computers. In SIGMETRICS conf., 1993. Google ScholarDigital Library
- P. Ren, M. Lis, M. Cho, et al. HORNET: A Cycle-Level Multicore Simulator. IEEE TCAD, 31(6), 2012.Google Scholar
- P. Rosenfeld, E. Cooper-Balis, and B. Jacob. DRAM-Sim2: A Cycle Accurate Memory System Simulator. CAL, 10(1), 2011. Google ScholarDigital Library
- D. Sanchez and C. Kozyrakis. The ZCache: Decoupling Ways and Associativity. In MICRO-43, 2010. Google ScholarDigital Library
- D. Sanchez and C. Kozyrakis. Vantage: Scalable and Efficient Fine-Grain Cache Partitioning. In ISCA-38, 2011. Google ScholarDigital Library
- D. Sanchez and C. Kozyrakis. Scalable and Efficient Fine-Grained Cache Partitioning with Vantage. IEEE Micro's Top Picks, 32(3), 2012. Google ScholarDigital Library
- D. Sanchez and C. Kozyrakis. SCD: A Scalable Coherence Directory with Flexible Sharer Set Encoding. In HPCA-18, 2012. Google ScholarDigital Library
- D. Sanchez, D. Lo, R. Yoo, et al. Dynamic Fine-Grain Scheduling of Pipeline Parallelism. In PACT-20, 2011. Google ScholarDigital Library
- D. Sanchez, G. Michelogiannakis, and C. Kozyrakis. An Analysis of Interconnection Networks for Large Scale Chip-Multiprocessors. TACO, 7(1), 2010. Google ScholarDigital Library
- E. Schnarr and J. R. Larus. Fast out-of-order processor simulation using memoization. In ASPLOS-8, 1998. Google ScholarDigital Library
- E. C. Schnarr, M. D. Hill, and J. R. Larus. Facile: A language and compiler for high-performance processor simulators. In PLDI, 2001. Google ScholarDigital Library
- T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In ASPLOS-10, 2002. Google ScholarDigital Library
- J. Shin, K. Tam, D. Huang, et al. A 40nm 16-core 128-thread CMT SPARC SoC processor. In ISSCC, 2010.Google Scholar
- S. Srinivasan, L. Zhao, B. Ganesh, et al. CMP Memory Modeling: How Much Does Accuracy Matter? In MoBS-5, 2009.Google Scholar
- Z. Tan, A. Waterman, R. Avizienis, et al. RAMP Gold: An FPGA-based architecture simulator for multiprocessors. In DAC-47, 2010. Google ScholarDigital Library
- Tilera. TILE-Gx 3000 Series Overview. Technical report, 2011.Google Scholar
- T. von Eicken, A. Basu, V. Buch, et al. U-net: a user-level network interface for parallel and distributed computing. In SOSP-15, 1995. Google ScholarDigital Library
- J. Wawrzynek, D. Patterson, M. Oskin, et al. RAMP: Research accelerator for multiple processors. IEEE Micro, 27(2), 2007. Google ScholarDigital Library
- T. Wenisch, R. Wunderlich, M. Ferdman, et al. Simflex: statistical sampling of computer system simulation. IEEE Micro, 26(4), 2006. Google ScholarDigital Library
- E. Witchel and M. Rosenblum. Embra: Fast and flexible machine simulation. In SIGMETRICS Perf. Eval. Review, volume 24, 1996. Google ScholarDigital Library
- R. Wunderlich, T. Wenisch, B. Falsafi, and J. Hoe. SMARTS: Accelerating microarchitecture simulation via rigorous statistical sampling. In ISCA-30, 2003. Google ScholarDigital Library
Recommendations
ZSim: fast and accurate microarchitectural simulation of thousand-core systems
ISCA '13: Proceedings of the 40th Annual International Symposium on Computer ArchitectureArchitectural simulation is time-consuming, and the trend towards hundreds of cores is making sequential simulation even slower. Existing parallel simulation techniques either scale poorly due to excessive synchronization, or sacrifice accuracy by ...
Table-based modeling of delta-sigma modulators using ZSIM
ZSIM, a nonlinear Z -domain simulator for sampled-data systems, is presented and verified. ZSIM integrates analytic tools, a difference equation simulator, a table-based nonlinear Z -domain simulator, and digital signal processing into a workstation ...
Evaluation of Rodinia Codes on Intel Xeon Phi
ISMS '13: Proceedings of the 2013 4th International Conference on Intelligent Systems, Modelling and SimulationHigh performance computing (HPC) is a niche area where various parallel benchmarks are constantly used to explore and evaluate the performance of Heterogeneous computing systems on the horizon. The Rodinia benchmark suite, a collection of parallel ...
Comments