ABSTRACT
Today's processors become fatter, not faster. However, exploiting these massively parallel compute resources remains a challenge for many traditional HPC applications with regard to scalability, portability and programmability. To tackle this challenge, several parallel programming approaches such as loop parallelism and task parallelism are being researched in the form of languages, libraries and frameworks. Task parallelism, as provided by OpenMP, HPX, StarPU, Charm++ and Kokkos, is the most promising approach to overcome the challenges of ever-increasing parallelism. These technologies enable scalability for a broad range of algorithms with coarse-grained tasks, e.g. in linear algebra and classical N-body simulation. However, they do not fully address the performance bottlenecks of algorithms with fine-grained tasks and the resulting large task graphs. Additionally, we found the description of large task graphs to be cumbersome with the common approach of providing in-, out- and inout-dependencies. We introduce event-based task parallelism to solve the performance and programmability issues of algorithms that exhibit fine-grained task parallelism and contain repetitive task patterns. With user-defined event lists, the approach provides a more convenient and compact way to describe large task graphs. Furthermore, we show how these event lists are processed by a task engine that reuses user-defined, algorithmic data structures. As a use case, we describe the implementation of a fast multipole method for molecular dynamics with event-based task parallelism. The performance analysis reveals that the event-based implementation is 52% faster than a classical loop-parallel implementation with OpenMP.
Eventify: Event-Based Task Parallelism for Strong Scaling