ABSTRACT
Hardware trends indicate that supercomputers will see fast-growing intra-node parallelism, and future programming models will need to carefully manage the interaction between inter- and intra-node parallelism to cope with this evolution. Many existing programming models expose both levels of parallelism, but they scale poorly as per-node thread counts rise: threading and communication layers interoperate only weakly, adding software overhead and generating avoidable communication. Addressing this requires understanding the limitations of current models and developing new approaches.
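The interoperability problem above can be illustrated with a minimal sketch (hypothetical, not taken from any system in the paper): worker threads funnel all outgoing "messages" through one lock-protected queue, mimicking runtime designs that serialize communication through a single channel. As thread counts grow, every send contends on the same lock, which is the kind of software overhead the abstract refers to.

```cpp
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical sketch: num_workers threads funnel "messages" through one
// lock-protected queue, as in designs where a single communication channel
// (e.g., an MPI_THREAD_FUNNELED-style setup) serializes all sends.
std::mutex queue_lock;
std::queue<int> send_queue;

int funneled_send(int num_workers, int msgs_per_worker) {
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w) {
        workers.emplace_back([=] {
            for (int m = 0; m < msgs_per_worker; ++m) {
                // This lock is contended by every worker; contention grows
                // with the per-node thread count.
                std::lock_guard<std::mutex> g(queue_lock);
                send_queue.push(w * msgs_per_worker + m);
            }
        });
    }
    for (auto& t : workers) t.join();
    // Drain the queue, as a dedicated communication thread would.
    int drained = 0;
    while (!send_queue.empty()) { send_queue.pop(); ++drained; }
    return drained;
}
```

The sketch is functionally correct at any thread count; the scaling concern is that throughput is bounded by the single serialized queue rather than by the network.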
We propose a new runtime system design, PPL, which abstracts the important high-level concepts of a typical parallel system for distributed-memory machines. By modularizing these elements, individual layers can be tested in isolation to better understand the needs of future programming models. We present the design and an initial C++11 implementation of PPL, and evaluate the performance of several different module implementations through micro-benchmarks and three applications: Barnes-Hut, Monte Carlo particle tracking, and a sparse triangular solver.
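The modular design described above can be sketched as follows. This is a hypothetical illustration, not PPL's actual API: runtime concerns sit behind abstract interfaces (here a scheduler and a transport, names assumed for the example), so alternative implementations of each module can be swapped in and benchmarked independently.

```cpp
#include <memory>
#include <string>

// Hypothetical sketch (not PPL's actual API): the runtime is assembled from
// swappable modules behind abstract interfaces.
struct Scheduler {
    virtual ~Scheduler() = default;
    virtual std::string name() const = 0;
};
struct Transport {
    virtual ~Transport() = default;
    virtual std::string name() const = 0;
};

// Two interchangeable scheduler modules and one transport module.
struct WorkStealingScheduler : Scheduler {
    std::string name() const override { return "work-stealing"; }
};
struct FifoScheduler : Scheduler {
    std::string name() const override { return "fifo"; }
};
struct SharedMemTransport : Transport {
    std::string name() const override { return "shared-memory"; }
};

// The runtime sees only the interfaces; which module is used is a
// construction-time decision, so each layer can be evaluated in isolation.
struct Runtime {
    std::unique_ptr<Scheduler> sched;
    std::unique_ptr<Transport> net;
    std::string describe() const { return sched->name() + "+" + net->name(); }
};
```

A benchmark harness built this way can hold the application fixed while varying one module at a time, which matches the evaluation strategy the abstract describes.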
Index Terms
- PPL: an abstract runtime system for hybrid parallel programming