A Comparison of the Scalability of OpenMP Implementations

  • Conference paper
  • First Online:
Euro-Par 2020: Parallel Processing (Euro-Par 2020)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 12247))

Abstract

OpenMP implementations must exploit current and upcoming hardware for performance. Overhead must be controlled and kept to a minimum to avoid low performance at scale. Previous work has shown that overheads do not scale favourably in commonly used OpenMP implementations. Focusing on synchronization overhead, this work analyses the overhead of core OpenMP runtime library components for the GNU and LLVM compilers, reflecting on the implementations' source code and algorithms. In addition, this work investigates the implementations' capability to handle the CPU-internal NUMA structure observed in recent Intel CPUs. Using a custom benchmark designed to expose the synchronization overhead of OpenMP regardless of user code, substantial differences between the two implementations are observed. In summary, the LLVM implementation can be considered more scalable than the GNU implementation, but the GNU implementation yields lower overhead for lower thread counts on some occasions. Neither implementation reacts to the system architecture, although the effects of the internal NUMA structure on the overhead can be observed.

Notes

  1. For example, each node has to respond to the SLURM controller from time to time.

  2. Such as which function this task should call or the pointers to the shared variables.

References

  1. Al-Khalissi, H., Shah, S.A.A., Berekovic, M.: An efficient barrier implementation for OpenMP-like parallelism on the Intel SCC. In: 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, pp. 76–83. IEEE (2014). https://doi.org/10.1109/pdp.2014.25

  2. Bari, M.A.S., et al.: Arcs: adaptive runtime configuration selection for power-constrained OpenMP applications. In: 2016 IEEE International Conference on Cluster Computing (CLUSTER), pp. 461–470. IEEE (2016). https://doi.org/10.1109/cluster.2016.39

  3. Brightwell, R.: A comparison of three MPI implementations for red storm. In: Di Martino, B., Kranzlmüller, D., Dongarra, J. (eds.) EuroPVM/MPI 2005. LNCS, vol. 3666, pp. 425–432. Springer, Heidelberg (2005). https://doi.org/10.1007/11557265_54

  4. Bull, J.M.: Measuring synchronisation and scheduling overheads in OpenMP. In: Proceedings of the First European Workshop on OpenMP, vol. 8, p. 49 (1999)

  5. Bull, J.M., O’Neill, D.: A microbenchmark suite for OpenMP 2.0. ACM SIGARCH Comput. Arch. News 29, 41–48 (2001). https://doi.org/10.1145/563647.563656

  6. Clet-Ortega, J., Carribault, P., Pérache, M.: Evaluation of OpenMP task scheduling algorithms for large NUMA architectures. In: Silva, F., Dutra, I., Santos Costa, V. (eds.) Euro-Par 2014. LNCS, vol. 8632, pp. 596–607. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-09873-9_50

  7. Diaz, J.M., et al.: Analysis of OpenMP 4.5 offloading in implementations: correctness and overhead. Parallel Comput. 89, 102546 (2019). https://doi.org/10.1016/j.parco.2019.102546

  8. Gabriel, E., et al.: Open MPI: goals, concept, and design of a next generation MPI implementation. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J. (eds.) EuroPVM/MPI 2004. LNCS, vol. 3241, pp. 97–104. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30218-6_19

  9. Gupta, R., Hill, C.R.: A scalable implementation of barrier synchronization using an adaptive combining tree. Int. J. Parallel Program. 18(3), 161–180 (1989). https://doi.org/10.1007/bf01407897

  10. Hoefler, T., Schneider, T., Lumsdaine, A.: Accurately measuring collective operations at massive scale. In: 2008 IEEE International Symposium on Parallel and Distributed Processing, pp. 1–8. IEEE (2008). https://doi.org/10.1109/ipdps.2008.4536494

  11. Iwainsky, C., et al.: How many threads will be too many? On the scalability of OpenMP implementations. In: Träff, J.L., Hunold, S., Versaci, F. (eds.) Euro-Par 2015. LNCS, vol. 9233, pp. 451–463. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-48096-0_35

  12. Jammer, T., Iwainsky, C., Bischof, C.: Artifact and instructions to generate experimental results for EuroPar 2020 paper: A Comparison of the Scalability of OpenMP Implementations (2020). https://doi.org/10.6084/m9.figshare.12555263, https://springernature.figshare.com/articles/dataset/Artifact_and_instructions_to_generate_experimental_results_for_EuroPar_2020_paper_A_Comparison_of_the_Scalability_of_OpenMP_Implementations_/12555263/1

  13. Kang, S.J., Lee, S.Y., Lee, K.M.: Performance comparison of OpenMP, MPI, and MapReduce in practical problems. Adv. Multimedia 2015 (2015). https://doi.org/10.1155/2015/575687

  14. Krawezik, G.: Performance comparison of MPI and three OpenMP programming styles on shared memory multiprocessors. In: Proceedings of the Fifteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 118–127 (2003). https://doi.org/10.1145/777412.777433

  15. Krawezik, G., Cappello, F.: Performance comparison of MPI and OpenMP on shared memory multiprocessors. Concurrency Comput. Prac. Experience 18(1), 29–61 (2006). https://doi.org/10.1002/cpe.905

  16. Kuhn, B., Petersen, P., O’Toole, E.: OpenMP versus threading in C/C++. Concurrency Prac. Experience 12(12), 1165–1176 (2000). https://doi.org/10.1002/1096-9128(200010)12:12<1165::aid-cpe529>3.0.co;2-l

  17. Libgomp: GNU offloading and multi processing runtime library: The GNU OpenMP and OpenACC implementation. Tech. rep., GNU libgomp (2018). https://gcc.gnu.org/onlinedocs/gcc-8.3.0/libgomp.pdf

  18. Liu, J., et al.: Performance comparison of MPI implementations over InfiniBand, Myrinet and Quadrics. In: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, p. 58 (2003). https://doi.org/10.1145/1048935.1050208

  19. LLVM: LLVM OpenMP runtime library. Tech. rep., the LLVM Project (2015). http://openmp.llvm.org/Reference.pdf

  20. Mills, D.L.: Internet time synchronization: the network time protocol. IEEE Trans. Commun. 39(10), 1482–1493 (1991). https://doi.org/10.1109/26.103043

  21. Muddukrishna, A., et al.: Locality-aware task scheduling and data distribution on NUMA systems. In: Rendell, A.P., Chapman, B.M., Müller, M.S. (eds.) IWOMP 2013. LNCS, vol. 8122, pp. 156–170. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40698-0_12

  22. Nanjegowda, R., et al.: Scalability evaluation of barrier algorithms for OpenMP. In: Müller, M.S., de Supinski, B.R., Chapman, B.M. (eds.) IWOMP 2009. LNCS, vol. 5568, pp. 42–52. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02303-3_4

  23. Nethercote, N.: Cachegrind: a cache profiler. Tech. rep., Valgrind Developers (2019). https://valgrind.org/docs/manual/cg-manual.html

  24. Rodchenko, A., et al.: Effective barrier synchronization on Intel Xeon Phi coprocessor. In: Träff, J.L., Hunold, S., Versaci, F. (eds.) Euro-Par 2015. LNCS, vol. 9233, pp. 588–600. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-48096-0_45

  25. Terboven, C., et al.: Assessing OpenMP tasking implementations on NUMA architectures. In: Chapman, B.M., Massaioli, F., Müller, M.S., Rorro, M. (eds.) IWOMP 2012. LNCS, vol. 7312, pp. 182–195. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30961-8_14

Acknowledgments and Data Availability Statement

Measurements for this work were conducted on the Lichtenberg high-performance computer of the TU Darmstadt. This work was supported by the Hessian Ministry for Higher Education, Research and the Arts through the Hessian Competence Center for High-Performance Computing and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 265191195 – SFB 1194.

The datasets and code generated during and/or analysed during the current study are available in the Figshare repository: https://doi.org/10.6084/m9.figshare.12555263 [12].

Author information

Corresponding author

Correspondence to Tim Jammer.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Jammer, T., Iwainsky, C., Bischof, C. (2020). A Comparison of the Scalability of OpenMP Implementations. In: Malawski, M., Rzadca, K. (eds) Euro-Par 2020: Parallel Processing. Euro-Par 2020. Lecture Notes in Computer Science(), vol 12247. Springer, Cham. https://doi.org/10.1007/978-3-030-57675-2_6

  • DOI: https://doi.org/10.1007/978-3-030-57675-2_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-57674-5

  • Online ISBN: 978-3-030-57675-2

  • eBook Packages: Computer Science (R0)
