ABSTRACT
This paper describes a toolkit for semi-automatically measuring and modeling static and dynamic characteristics of applications in an architecture-neutral fashion. For predictable applications, models of dynamic characteristics have a convex and differentiable profile. Our toolkit operates on application binaries and succeeds in modeling the key application characteristics that determine program performance. We use these characterizations to explore the interactions between an application and a target architecture. We apply our toolkit to SPARC binaries to develop architecture-neutral models of the computation and memory access patterns of the ASCI Sweep3D benchmark and the NAS SP, BT and LU benchmarks. From our models, we predict L1 and L2 cache miss counts and TLB miss counts, as well as the overall execution time of these applications on an SGI Origin 2000 system. We evaluate our predictions by comparing them against measurements collected with hardware performance counters.
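The memory models above build on the classic stack-distance result that, for a fully associative LRU cache of C blocks, a reference misses exactly when its reuse (LRU stack) distance is at least C, so one distance histogram predicts miss counts for every capacity at once. The following sketch illustrates that criterion; the function names, block size, and toy trace are illustrative assumptions, not part of the paper's toolkit:

```python
# Illustrative sketch of the stack-distance criterion behind reuse-distance
# cache models: a reference to a fully associative LRU cache of C blocks
# misses iff its LRU stack distance is >= C. Not the paper's implementation.

def stack_distances(trace, block_size=64):
    """Return the LRU stack distance of each reference (inf = cold miss)."""
    stack = []          # most-recently-used block at the front
    dists = []
    for addr in trace:
        block = addr // block_size
        if block in stack:
            d = stack.index(block)   # depth in the LRU stack
            stack.remove(block)
        else:
            d = float("inf")         # first touch: compulsory miss
        stack.insert(0, block)       # becomes most recently used
        dists.append(d)
    return dists

def predicted_misses(dists, capacity_blocks):
    """Misses in a fully associative LRU cache of the given capacity."""
    return sum(1 for d in dists if d >= capacity_blocks)
```

Because the distances are computed once and thresholded per capacity, the same trace characterization can be replayed against different cache sizes, which is the sense in which such models are architecture-neutral.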
Index Terms
- Cross-architecture performance predictions for scientific applications using parameterized models