
Using Dynamic Broadcasts to Improve Task-Based Runtime Performances

  • Conference paper

Published in Euro-Par 2020: Parallel Processing (Euro-Par 2020), part of the book series Lecture Notes in Computer Science (LNTCS, volume 12247)

Abstract

Task-based runtimes have emerged in the HPC world to exploit the computation power of heterogeneous supercomputers and to achieve scalability. One of the main bottlenecks for scalability is the communication layer. Some task-based algorithms need to send the same data to multiple nodes. To optimize this communication pattern, communication libraries propose dedicated routines, such as MPI_Bcast. However, the requirements of MPI_Bcast do not fit well with the constraints of task-based runtime systems: it must be performed simultaneously by all involved nodes, and these nodes must know each other, which is not possible when each node runs a task scheduler that is not synchronized with the others. In this paper, we propose a new approach, called dynamic broadcasts, to overcome these constraints. The broadcast communication pattern required by the task-based algorithm is detected automatically; the broadcasting algorithm then relies on active messages and source routing, so that participating nodes do not need to know each other and do not need to synchronize. The receiver obtains the data the same way as it receives a point-to-point communication, without having to know that it arrived through a broadcast. We have implemented this algorithm in the StarPU runtime system using the NewMadeleine communication library. We performed benchmarks with the Cholesky factorization, which is known to use broadcasts, and observed up to a 30% improvement of its total execution time.


Notes

  1. It is important to note that the improvement is measured on the total performance and not on the communication part only.


Acknowledgements

This work is supported by the Agence Nationale de la Recherche, under grant ANR-19-CE46-0009.

This work is supported by the Région Nouvelle-Aquitaine, under grant 2018-1R50119 HPC scalable ecosystem.

Experiments presented in this paper were carried out using the PlaFRIM experimental testbed, supported by Inria, CNRS (LABRI and IMB), Université de Bordeaux, Bordeaux INP and Conseil Régional d’Aquitaine (see https://www.plafrim.fr/).

This work was granted access to the HPC resources of CINES under the allocation 2019- A0060601567 attributed by GENCI (Grand Equipement National de Calcul Intensif).

The authors furthermore thank Olivier Aumage and Nathalie Furmento for their help and advice regarding this work.

Author information

Correspondence to Philippe Swartvagher.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Denis, A., Jeannot, E., Swartvagher, P., Thibault, S. (2020). Using Dynamic Broadcasts to Improve Task-Based Runtime Performances. In: Malawski, M., Rzadca, K. (eds) Euro-Par 2020: Parallel Processing. Euro-Par 2020. Lecture Notes in Computer Science, vol 12247. Springer, Cham. https://doi.org/10.1007/978-3-030-57675-2_28


  • DOI: https://doi.org/10.1007/978-3-030-57675-2_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-57674-5

  • Online ISBN: 978-3-030-57675-2

