ABSTRACT
Today's processors become fatter, not faster. However, exploiting these massively parallel compute resources remains a challenge for many traditional HPC applications with regard to scalability, portability and programmability. To tackle this challenge, several parallel programming approaches such as loop parallelism and task parallelism are being researched in the form of languages, libraries and frameworks. Task parallelism, as provided by OpenMP, HPX, StarPU, Charm++ and Kokkos, is the most promising approach to overcome the challenges of ever-increasing parallelism. These technologies enable scalability for a broad range of algorithms with coarse-grained tasks, e.g. in linear algebra and classical N-body simulation. However, they do not fully address the performance bottlenecks of algorithms with fine-grained tasks and the resulting large task graphs. Additionally, we found the description of large task graphs to be cumbersome with the common approach of providing in-, out- and inout-dependencies. We introduce event-based task parallelism to solve the performance and programmability issues of algorithms that exhibit fine-grained task parallelism and contain repetitive task patterns. With user-defined event lists, the approach provides a more convenient and compact way to describe large task graphs. Furthermore, we show how these event lists are processed by a task engine that reuses user-defined, algorithmic data structures. As a use case, we describe the implementation of a fast multipole method for molecular dynamics with event-based task parallelism. The performance analysis reveals that the event-based implementation is 52% faster than a classical loop-parallel implementation with OpenMP.
Eventify: Event-Based Task Parallelism for Strong Scaling