DOI: 10.1145/3394277.3401858
research-article

Eventify: Event-Based Task Parallelism for Strong Scaling

Published: 29 June 2020

ABSTRACT

Today's processors are becoming fatter, not faster. However, exploiting these massively parallel compute resources remains a challenge for many traditional HPC applications with regard to scalability, portability and programmability. To tackle this challenge, several parallel programming approaches such as loop parallelism and task parallelism are being researched in the form of languages, libraries and frameworks. Task parallelism as provided by OpenMP, HPX, StarPU, Charm++ and Kokkos is the most promising approach to overcoming the challenges of ever-increasing parallelism. The aforementioned parallel programming technologies enable scalability for a broad range of algorithms with coarse-grained tasks, e.g. in linear algebra and classical N-body simulation. However, they do not fully address the performance bottlenecks of algorithms with fine-grained tasks and the resulting large task graphs. Additionally, we found the description of large task graphs to be cumbersome with the common approach of providing in-, out- and inout-dependencies. We introduce event-based task parallelism to solve the performance and programmability issues of algorithms that exhibit fine-grained task parallelism and contain repetitive task patterns. With user-defined event lists, the approach provides a more convenient and compact way to describe large task graphs. Furthermore, we show how these event lists are processed by a task engine that reuses user-defined, algorithmic data structures. As a use case, we describe the implementation of a fast multipole method for molecular dynamics with event-based task parallelism. The performance analysis reveals that the event-based implementation is 52% faster than a classical loop-parallel implementation with OpenMP.
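The central idea of the abstract, describing a task graph through user-defined event lists rather than explicit in-, out- and inout-dependencies, can be illustrated with a toy sketch. Note this is a hypothetical illustration of the concept, not Eventify's actual API: the `EventEngine` class, its methods, and the FMM-style `p2m`/`m2m` event names are invented for this example.

```python
class EventEngine:
    """Toy event-driven task engine (illustration only, not Eventify's API).

    A task is registered with the list of events it waits for; once all of
    them have fired, the task runs. Compare this with dependency-based
    tasking, where each task must spell out its in/out data dependencies.
    """

    def __init__(self):
        self.waiting = []  # list of [remaining_events, action] pairs
        self.log = []      # records the order in which work completed

    def on(self, events, action):
        """Register a task that runs once every event in `events` has fired."""
        self.waiting.append([set(events), action])

    def fire(self, event):
        """Fire an event; run every task whose event list is now satisfied."""
        ready = []
        for entry in self.waiting:
            entry[0].discard(event)   # no-op if the task does not wait on it
            if not entry[0]:
                ready.append(entry)
        for entry in ready:
            self.waiting.remove(entry)
            entry[1](self)            # execute the now-ready task

# A small FMM-like pattern: both leaf expansions (p2m) must finish
# before the parent translation (m2m) may run.
engine = EventEngine()
engine.on({"p2m:0", "p2m:1"}, lambda e: e.log.append("m2m:parent"))
for leaf in (0, 1):
    engine.log.append(f"p2m:{leaf}")   # "compute" the leaf expansion
    engine.fire(f"p2m:{leaf}")         # announce its completion

print(engine.log)  # → ['p2m:0', 'p2m:1', 'm2m:parent']
```

For repetitive task patterns such as the FMM tree traversal, such event lists can be generated programmatically per tree level, which is what makes the description of large task graphs compact.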


Published in

PASC '20: Proceedings of the Platform for Advanced Scientific Computing Conference
June 2020, 169 pages
ISBN: 9781450379939
DOI: 10.1145/3394277

Copyright © 2020 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States

Qualifiers

• research-article
• Refereed limited

Acceptance Rates

PASC '20 paper acceptance rate: 16 of 36 submissions, 44%. Overall acceptance rate: 109 of 221 submissions, 49%.
