ABSTRACT
While architecture simulation is often treated as a methodology issue, it is at the core of most processor architecture research, and simulation speed is often the bottleneck of the typical trial-and-error research process. To speed up simulation and get trends faster, researchers usually reduce the trace size. More sophisticated techniques such as trace sampling or distributed simulation are rarely used because they are considered unreliable and complex, owing to their impact on accuracy and the associated warm-up issues.

In this article, we present DiST, a practical distributed simulation scheme. Unlike other simulation techniques that trade accuracy for speed, DiST relieves the user of most accuracy issues through an automatic, dynamic mechanism that adjusts the warm-up interval size. Moreover, the mechanism is designed to always favor accuracy over speedup. The speedup scales with the amount of available computing resources: on 10 machines, DiST achieves an average speedup of 7.35 with an average IPC error of 1.81% and a maximum IPC error of 5.06%.

Besides proposing a solution to the warm-up issues in distributed simulation, we show experimentally that our technique is significantly more accurate than trace size reduction or trace sampling for identical speedups. We also show not only that the error always remains small for IPC and other metrics, but also that a researcher can reliably base research decisions on DiST simulation results. Finally, we explain how the DiST tool is designed to plug easily into existing architecture simulators with very few modifications.
Index Terms
- DiST: a simple, reliable and scalable method to significantly reduce processor architecture simulation time