Abstract
OpenMP implementations must exploit current and upcoming hardware to deliver performance. Their overhead must be controlled and kept to a minimum to avoid poor performance at scale. Previous work has shown that overheads do not scale favourably in commonly used OpenMP implementations. Focusing on synchronization overhead, this work analyses the overhead of core OpenMP runtime library components for the GNU and LLVM compilers, reflecting on the implementations' source code and algorithms. In addition, this work investigates the implementations' capability to handle the CPU-internal NUMA structure observed in recent Intel CPUs. Using a custom benchmark designed to expose the synchronization overhead of OpenMP independently of user code, this work observes substantial differences between the two implementations. In summary, the LLVM implementation can be considered more scalable than the GNU implementation, but the GNU implementation yields lower overhead at lower thread counts in some cases. Neither implementation adapts to the system architecture, although the effects of the internal NUMA structure on the overhead can be observed.
Notes
1. For example each node has to respond to the SLURM controller from time to time.
2. Such as which function this task should call or the pointers to the shared variables.
References
Al-Khalissi, H., Shah, S.A.A., Berekovic, M.: An efficient barrier implementation for OpenMP-like parallelism on the Intel SCC. In: 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, pp. 76–83. IEEE (2014). https://doi.org/10.1109/pdp.2014.25
Bari, M.A.S., et al.: Arcs: adaptive runtime configuration selection for power-constrained OpenMP applications. In: 2016 IEEE International Conference on Cluster Computing (CLUSTER), pp. 461–470. IEEE (2016). https://doi.org/10.1109/cluster.2016.39
Brightwell, R.: A comparison of three MPI implementations for red storm. In: Di Martino, B., Kranzlmüller, D., Dongarra, J. (eds.) EuroPVM/MPI 2005. LNCS, vol. 3666, pp. 425–432. Springer, Heidelberg (2005). https://doi.org/10.1007/11557265_54
Bull, J.M.: Measuring synchronisation and scheduling overheads in OpenMP. In: Proceedings of First European Workshop on OpenMP. vol. 8, p. 49 (1999)
Bull, J.M., O’Neill, D.: A microbenchmark suite for OpenMP 2.0. ACM SIGARCH Comput. Arch. News 29, 41–48 (2001). https://doi.org/10.1145/563647.563656
Clet-Ortega, J., Carribault, P., Pérache, M.: Evaluation of OpenMP task scheduling algorithms for large NUMA architectures. In: Silva, F., Dutra, I., Santos Costa, V. (eds.) Euro-Par 2014. LNCS, vol. 8632, pp. 596–607. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-09873-9_50
Diaz, J.M., et al.: Analysis of OpenMP 4.5 offloading in implementations: correctness and overhead. Parallel Comput. 89, 102546 (2019). https://doi.org/10.1016/j.parco.2019.102546
Gabriel, E., et al.: Open MPI: goals, concept, and design of a next generation MPI implementation. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J. (eds.) EuroPVM/MPI 2004. LNCS, vol. 3241, pp. 97–104. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30218-6_19
Gupta, R., Hill, C.R.: A scalable implementation of barrier synchronization using an adaptive combining tree. Int. J. Parallel Program. 18(3), 161–180 (1989). https://doi.org/10.1007/bf01407897
Hoefler, T., Schneider, T., Lumsdaine, A.: Accurately measuring collective operations at massive scale. In: 2008 IEEE International Symposium on Parallel and Distributed Processing, pp. 1–8. IEEE (2008). https://doi.org/10.1109/ipdps.2008.4536494
Iwainsky, C., et al.: How many threads will be too many? On the scalability of OpenMP implementations. In: Träff, J.L., Hunold, S., Versaci, F. (eds.) Euro-Par 2015. LNCS, vol. 9233, pp. 451–463. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-48096-0_35
Jammer, T., Iwainsky, C., Bischof, C.: Artifact and instructions to generate experimental results for EuroPar 2020 paper: A Comparison of the Scalability of OpenMP Implementations (Jul 2020). https://doi.org/10.6084/m9.figshare.12555263, https://springernature.figshare.com/articles/dataset/Artifact_and_instructions_to_generate_experimental_results_for_EuroPar_2020_paper_A_Comparison_of_the_Scalability_of_OpenMP_Implementations_/12555263/1
Kang, S.J., Lee, S.Y., Lee, K.M.: Performance comparison of OpenMP, MPI, and MapReduce in practical problems. Adv. Multi. 2015 (2015). https://doi.org/10.1155/2015/575687
Krawezik, G.: Performance comparison of MPI and three OpenMP programming styles on shared memory multiprocessors. In: Proceedings of the Fifteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 118–127 (2003). https://doi.org/10.1145/777412.777433
Krawezik, G., Cappello, F.: Performance comparison of MPI and OpenMP on shared memory multiprocessors. Concurrency Comput. Prac. Experience 18(1), 29–61 (2006). https://doi.org/10.1002/cpe.905
Kuhn, B., Petersen, P., O’Toole, E.: OpenMP versus threading in C/C++. Concurrency Prac. Experience 12(12), 1165–1176 (2000). https://doi.org/10.1002/1096-9128(200010)12:12<1165::aid-cpe529>3.0.co;2-l
Libgomp: GNU offloading and multi processing runtime library: The GNU OpenMP and OpenACC implementation. Tech. rep., GNU libgomp (2018). https://gcc.gnu.org/onlinedocs/gcc-8.3.0/libgomp.pdf
Liu, J., et al.: Performance comparison of MPI implementations over InfiniBand, Myrinet and Quadrics. In: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, p. 58 (2003). https://doi.org/10.1145/1048935.1050208
LLVM: LLVM OpenMP runtime library. Tech. rep., the LLVM Project (2015). http://openmp.llvm.org/Reference.pdf
Mills, D.L.: Internet time synchronization: the network time protocol. IEEE Trans. Commun. 39(10), 1482–1493 (1991). https://doi.org/10.1109/26.103043
Muddukrishna, A., et al.: Locality-aware task scheduling and data distribution on NUMA systems. In: Rendell, A.P., Chapman, B.M., Müller, M.S. (eds.) IWOMP 2013. LNCS, vol. 8122, pp. 156–170. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40698-0_12
Nanjegowda, R., et al.: Scalability evaluation of barrier algorithms for OpenMP. In: Müller, M.S., de Supinski, B.R., Chapman, B.M. (eds.) IWOMP 2009. LNCS, vol. 5568, pp. 42–52. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02303-3_4
Nethercote, N.: Cachegrind: a cache profiler. Tech. rep., Valgrind Developers (2019). https://valgrind.org/docs/manual/cg-manual.html
Rodchenko, A., et al.: Effective barrier synchronization on Intel Xeon Phi coprocessor. In: Träff, J.L., Hunold, S., Versaci, F. (eds.) Euro-Par 2015. LNCS, vol. 9233, pp. 588–600. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-48096-0_45
Terboven, C., et al.: Assessing OpenMP tasking implementations on NUMA architectures. In: Chapman, B.M., Massaioli, F., Müller, M.S., Rorro, M. (eds.) IWOMP 2012. LNCS, vol. 7312, pp. 182–195. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30961-8_14
Acknowledgments and Data Availability Statement
Measurements for this work were conducted on the Lichtenberg high-performance computer of the TU Darmstadt. This work was supported by the Hessian Ministry for Higher Education, Research and the Arts through the Hessian Competence Center for High-Performance Computing and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 265191195 – SFB 1194.
The datasets and code generated during and/or analysed during the current study are available in the Figshare repository: https://doi.org/10.6084/m9.figshare.12555263 [12].
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Jammer, T., Iwainsky, C., Bischof, C. (2020). A Comparison of the Scalability of OpenMP Implementations. In: Malawski, M., Rzadca, K. (eds) Euro-Par 2020: Parallel Processing. Euro-Par 2020. Lecture Notes in Computer Science, vol 12247. Springer, Cham. https://doi.org/10.1007/978-3-030-57675-2_6
DOI: https://doi.org/10.1007/978-3-030-57675-2_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-57674-5
Online ISBN: 978-3-030-57675-2
eBook Packages: Computer Science, Computer Science (R0)