Abstract
SMP/NUMA systems are powerful HPC platforms that can be applied to a wide range of real-life applications. These systems provide a large capacity of shared memory and allow the shared-variable programming model to be used, taking advantage of shared memory for inter-process communication and synchronization. However, because data can be physically dispersed over many nodes, access times to different data items may differ significantly. In this paper, we face the challenge of harnessing the heterogeneous nature of SMP/NUMA communications for a complex scientific application that implements the Multidimensional Positive Definite Advection Transport Algorithm (MPDATA), which consists of a set of heterogeneous stencil computations.
Our earlier method of MPDATA workload distribution was successfully applied to small-scale shared-memory systems with several CPUs and/or accelerators, but it suffers significant performance losses on larger SMP/NUMA systems, such as the SGI UV 2000 server used in this work. To overcome this shortcoming, we propose a new islands-of-cores approach. It exposes a correlation between computation and communication for heterogeneous stencils, and enables efficient management of the trade-off between computation and communication costs in accordance with the features of SMP/NUMA systems. As a consequence, when using the maximum configuration with 112 cores of 14 Intel Xeon E5-4627v2 3.3 GHz processors, the proposed approach is more than 10 times faster than the previous method, achieving about 390 Gflop/s, or approximately 30% of the theoretical peak performance.
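The trade-off between computation and communication can be illustrated with a toy example. The sketch below is not the MPDATA code or the authors' implementation of the islands-of-cores approach. It assumes a hypothetical two-stage 1D stencil, and the names `reference`, `split`, `island_compute`, and `run_islands` are illustrative only. Each island (a group of cores, simulated sequentially here) recomputes a few boundary values of the intermediate stage from the shared input, so it never has to read intermediate results produced by a neighbouring island.

```python
def reference(u):
    """Two-stage stencil computed over the whole domain at once."""
    n = len(u)
    v = [0.0] * n
    for i in range(1, n - 1):            # stage 1: 3-point average
        v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    w = [0.0] * n
    for i in range(2, n - 2):            # stage 2: centred difference of stage 1
        w[i] = v[i + 1] - v[i - 1]
    return w


def split(lo, hi, parts):
    """Split the index range [lo, hi) into `parts` contiguous sub-ranges."""
    size = hi - lo
    bounds = [lo + (size * p) // parts for p in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def island_compute(u, lo, hi):
    """Compute w[lo:hi] from the input u only.

    Stage 1 is evaluated on [lo - 1, hi], i.e. one extra point on each
    side, so stage 2 needs no intermediate values from other islands.
    """
    v_lo = lo - 1
    v_local = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(v_lo, hi + 1)]
    w_local = [v_local[i + 1 - v_lo] - v_local[i - 1 - v_lo] for i in range(lo, hi)]
    return w_local, len(v_local)


def run_islands(u, islands):
    """Run every island in turn and count redundant stage-1 evaluations."""
    n = len(u)
    w = [0.0] * n
    stage1_evals = 0
    for lo, hi in split(2, n - 2, islands):
        w[lo:hi], evals = island_compute(u, lo, hi)
        stage1_evals += evals
    extra = stage1_evals - (n - 2)       # the reference needs n - 2 evaluations
    return w, extra
```

In this toy setting the result is identical to the global computation, and the price of removing inter-island data exchange is `2 * (islands - 1)` additional stage-1 evaluations. On a real SMP/NUMA machine the balance between such extra computation and the avoided remote memory traffic would depend on the stencil shapes and on the hardware, which is the kind of trade-off the abstract refers to.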
Acknowledgments
This work was supported by the National Science Centre (Poland) under grant UMO-2015/17/D/ST6/04059, and was partially supported by the Ministry of Education, Youth and Sports of the Czech Republic through the project “IT4Innovations National Supercomputing Center LM2015070”, and by the EU under the COST Program Action IC1305 “Network for Sustainable Ultrascale Computing (NESUS)” and its Czech supporting project LD15105 “Ultrascale Computing in Geosciences”.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Szustak, L., Wyrzykowski, R., Jakl, O. (2017). Islands-of-Cores Approach for Harnessing SMP/NUMA Architectures in Heterogeneous Stencil Computations. In: Malyshkin, V. (ed.) Parallel Computing Technologies. PaCT 2017. Lecture Notes in Computer Science, vol 10421. Springer, Cham. https://doi.org/10.1007/978-3-319-62932-2_34
DOI: https://doi.org/10.1007/978-3-319-62932-2_34
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-62931-5
Online ISBN: 978-3-319-62932-2
eBook Packages: Computer Science, Computer Science (R0)