On the Performance Portability of Structured Grid Codes on Many-Core Computer Architectures

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8488)

Abstract

With the advent of many-core computer architectures such as GPGPUs from NVIDIA and AMD, and more recently Intel’s Xeon Phi, ensuring performance portability of HPC codes is potentially becoming more complex. In this work we have focused on one important application area, structured grid codes, and investigated techniques for ensuring performance portability across a diverse range of high-end many-core architectures. We chose three codes to investigate: a 3D lattice Boltzmann code (D3Q19 BGK), the CloverLeaf hydrodynamics mini-application from Sandia’s Mantevo benchmark suite, and ROTORSIM, a production-quality structured grid, multiblock, compressible finite-volume CFD code. We developed OpenCL versions of these codes in order to provide cross-platform functional portability, and compared the performance of the OpenCL versions to optimized versions on each platform, including hybrid OpenMP/MPI/AVX versions on CPUs and Xeon Phi, and CUDA versions on NVIDIA GPUs. Our results show that, contrary to conventional wisdom, with OpenCL it is possible to achieve a high degree of performance portability, at least for structured grid applications, using a set of straightforward techniques. The performance-portable OpenCL code is also highly competitive with the best performance obtained using the native parallel programming models on each platform.
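To illustrate the class of kernel the abstract refers to, the sketch below shows a minimal structured grid update written in OpenCL C: a 3D 7-point stencil with one work-item per interior grid cell and a flat, halo-padded array layout. This is an illustrative sketch only, not code from the paper; the kernel name, argument list and the simple averaging update are assumptions, and the codes studied (the D3Q19 BGK lattice Boltzmann code, CloverLeaf and ROTORSIM) apply their own physics in place of the averaging step.

/* Illustrative OpenCL kernel (not from the paper): a 3D 7-point
 * structured grid stencil, one work-item per interior cell.
 * The grid is stored as a flat array with a one-cell halo in each
 * direction; nx, ny, nz are the padded grid dimensions. */
__kernel void stencil3d(__global const float *in,
                        __global float *out,
                        const int nx, const int ny, const int nz)
{
    /* Map the 3D NDRange onto interior cells, skipping the halo.
     * The bounds check guards against a rounded-up global size. */
    const int i = get_global_id(0) + 1;
    const int j = get_global_id(1) + 1;
    const int k = get_global_id(2) + 1;
    if (i >= nx - 1 || j >= ny - 1 || k >= nz - 1) return;

    const int idx = i + nx * (j + ny * k);

    /* Simple 7-point averaging update; a real structured grid code
     * would apply its own physics (LBM collision and streaming,
     * hydrodynamics fluxes, etc.) at this point. */
    out[idx] = (1.0f / 7.0f) * (in[idx]
             + in[idx - 1]       + in[idx + 1]
             + in[idx - nx]      + in[idx + nx]
             + in[idx - nx * ny] + in[idx + nx * ny]);
}

Enqueued over a 3D NDRange covering the interior cells, the same kernel source runs unchanged on NVIDIA and AMD GPUs, Intel’s Xeon Phi and multi-core CPUs through each vendor’s OpenCL runtime; this is the cross-platform functional portability the abstract describes, while the paper’s results concern how well such kernels then perform on each device.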


Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

McIntosh-Smith, S., Boulton, M., Curran, D., Price, J. (2014). On the Performance Portability of Structured Grid Codes on Many-Core Computer Architectures. In: Kunkel, J.M., Ludwig, T., Meuer, H.W. (eds) Supercomputing. ISC 2014. Lecture Notes in Computer Science, vol 8488. Springer, Cham. https://doi.org/10.1007/978-3-319-07518-1_4

  • DOI: https://doi.org/10.1007/978-3-319-07518-1_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-07517-4

  • Online ISBN: 978-3-319-07518-1

  • eBook Packages: Computer Science, Computer Science (R0)
