An idiom-finding tool for increasing productivity of accelerators

Published: 31 May 2011

ABSTRACT

Suppose one is considering the purchase of a computer equipped with accelerators. Or suppose one has access to such a computer and is considering porting code to take advantage of the accelerators. Is there reason to believe the purchase cost or programmer effort will be worth it? It would be nice to be able to estimate the expected improvements in advance of paying money or time. We exhibit an analytical framework and tool-set for providing such estimates: the tools first look for user-defined idioms, patterns of computation and data access identified in advance as possibly able to benefit from accelerator hardware. A performance model is then applied to estimate how much faster these idioms would run if they were ported to the accelerators. A recommendation is made as to whether each idiom is worth the porting effort, along with an estimate of the overall application speedup that would result.
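As a generic illustration (not the paper's actual model), the final overall-speedup estimate described above is in spirit an Amdahl's-law calculation: if a fraction f of runtime falls in idioms the accelerator speeds up by a local factor s, the whole application speeds up by 1/((1-f) + f/s).

```python
def overall_speedup(f, s):
    """Amdahl's law: whole-application speedup when a fraction f of
    runtime is accelerated by a local speedup factor s."""
    assert 0.0 <= f <= 1.0 and s > 0.0
    return 1.0 / ((1.0 - f) + f / s)

# e.g. if 30% of runtime is in accelerable idioms and the
# accelerator runs them 20x faster:
print(round(overall_speedup(0.30, 20.0), 3))  # → 1.399
```

Note how quickly the non-accelerated fraction dominates: even an infinite local speedup on 30% of runtime caps the overall gain at about 1.43x, which is why rank-ordering candidate idioms by their runtime share matters.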

As a proof of concept we focus our investigation on gather/scatter (G/S) operations and the means available to accelerate them on the Convey HC-1, which has a special-purpose "personality" for accelerating G/S. We test the methodology on two large-scale HPC applications. The idiom recognizer tool saves weeks of programmer effort compared to having the programmer visually examine the code for idioms; the performance models save yet more time by rank-ordering the best candidates for porting; and the models are accurate, predicting the G/S runtime speedup from porting to within 10% of the speedup actually achieved. The G/S hardware on the Convey sped up these operations 20x, and improved total application runtime by as much as 21%.
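For readers unfamiliar with the idiom, gather and scatter are indirect array accesses of the following shape; this is a generic sketch, not code from the applications studied:

```python
def gather(x, idx):
    """Gather: y[i] = x[idx[i]] -- indirect reads through an index array."""
    return [x[j] for j in idx]

def scatter_add(x, idx, y):
    """Scatter: x[idx[i]] += y[i] -- indirect updates through an index array."""
    for i, j in enumerate(idx):
        x[j] += y[i]

vals = [10.0, 20.0, 30.0, 40.0]
print(gather(vals, [3, 0, 2]))  # → [40.0, 10.0, 30.0]
```

Because the index array makes the access pattern data-dependent, these loops stride unpredictably through memory and defeat conventional caches and prefetchers, which is why dedicated G/S hardware support can pay off.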


Published in

ICS '11: Proceedings of the International Conference on Supercomputing
May 2011, 398 pages
ISBN: 9781450301022
DOI: 10.1145/1995896

Copyright © 2011 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 584 of 2,055 submissions, 28%
