ABSTRACT
Suppose one is considering the purchase of a computer equipped with accelerators, or already has access to such a computer and is considering porting code to take advantage of the accelerators. Is there reason to believe the purchase cost or programmer effort will be worth it? It would be useful to be able to estimate the expected improvement before spending money or time. We present an analytical framework and tool set that provides such estimates: the tools first search for user-defined idioms, patterns of computation and data access identified in advance as likely candidates for acceleration by specialized hardware. A performance model is then applied to estimate how much faster each idiom would run if ported to the accelerator; a recommendation is made as to whether each idiom is worth the porting effort, and an estimate is provided of the overall application speedup that would result.
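For readers unfamiliar with the terminology, the two idioms studied below can be sketched in a few lines. This is an illustrative sketch only, not the recognizer's actual pattern specification; the function names `gather` and `scatter` are our own.

```python
def gather(src, idx):
    """Gather idiom: indirect read, a[i] = b[idx[i]].

    The index array idx drives irregular, data-dependent loads
    from src -- the access pattern the recognizer looks for.
    """
    return [src[j] for j in idx]

def scatter(dst, idx, vals):
    """Scatter idiom: indirect write, a[idx[i]] = b[i].

    Values are stored to data-dependent locations in dst.
    """
    for i, j in enumerate(idx):
        dst[j] = vals[i]
    return dst

# Example: gather elements 3 and 0, then scatter two values.
g = gather([10, 20, 30, 40], [3, 0])      # [40, 10]
s = scatter([0, 0, 0, 0], [2, 0], [7, 8])  # [8, 0, 7, 0]
```

Because the load/store addresses are only known at run time, these loops defeat hardware prefetchers on conventional processors, which is why special-purpose G/S hardware can pay off.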
As a proof of concept we focus our investigation on Gather/Scatter (G/S) operations and the means of accelerating them available on the Convey HC-1, which provides a special-purpose "personality" for accelerating G/S. We tested the methodology on two large-scale HPC applications. The idiom-recognizer tool saves weeks of programmer effort compared to having the programmer visually inspect the code for idioms; the performance models save further time by rank-ordering the best candidates for porting; and the models are accurate, predicting the G/S runtime speedup resulting from porting to within 10% of the speedup actually achieved. The G/S hardware on the Convey sped up these operations by 20x, improving total application runtime by as much as 21%.
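The relationship between per-idiom speedup and overall application speedup follows Amdahl's law: only the fraction of runtime spent in accelerated idioms improves. A minimal sketch (the function name and the example fraction of runtime spent in G/S are our own illustrative assumptions, not figures from the paper):

```python
def overall_speedup(accel_fraction, accel_speedup):
    """Amdahl's law: only accel_fraction of runtime is sped up
    by accel_speedup; the rest runs at the original rate."""
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_speedup)

# Illustration: if roughly 22% of runtime were spent in G/S and the
# hardware sped those operations up 20x, total runtime would shrink
# by about 21%, consistent with the magnitudes reported above.
s = overall_speedup(0.221, 20.0)
runtime_reduction = 1.0 - 1.0 / s   # ~0.21
```

This also shows why rank-ordering candidates matters: an idiom consuming 1% of runtime cannot improve the application by more than 1%, no matter how fast the accelerator is.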