On the Performance Portability of Structured Grid Codes on Many-Core Computer Architectures

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8488)

Abstract

With the advent of many-core computer architectures such as GPGPUs from NVIDIA and AMD, and more recently Intel’s Xeon Phi, ensuring performance portability of HPC codes is potentially becoming more complex. In this work we have focused on one important application area, structured grid codes, and investigated techniques for ensuring performance portability across a diverse range of high-end many-core architectures. We chose three codes to investigate: a 3D lattice Boltzmann code (D3Q19 BGK), the CloverLeaf hydrodynamics mini-application from Sandia’s Mantevo benchmark suite, and ROTORSIM, a production-quality structured grid, multiblock, compressible finite-volume CFD code. We developed OpenCL versions of these codes in order to provide cross-platform functional portability, and compared the performance of the OpenCL versions to optimized versions on each platform, including hybrid OpenMP/MPI/AVX versions on CPUs and Xeon Phi, and CUDA versions on NVIDIA GPUs. Our results show that, contrary to conventional wisdom, with OpenCL it is possible to achieve a high degree of performance portability, at least for structured grid applications, using a set of straightforward techniques. The performance-portable OpenCL code is also highly competitive with the best performance obtained using the native parallel programming models on each platform.
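To illustrate the class of kernel the abstract refers to, the sketch below shows a minimal structured grid update written in OpenCL C: a 3D 7-point stencil with one work-item per interior grid cell and a flat, halo-padded array layout. This is an illustrative sketch only, not code from the paper; the kernel name, argument list and the simple averaging update are assumptions, and the codes studied (the D3Q19 BGK lattice Boltzmann code, CloverLeaf and ROTORSIM) apply their own physics in place of the averaging step.

/* Illustrative OpenCL kernel (not from the paper): a 3D 7-point
 * structured grid stencil, one work-item per interior cell.
 * The grid is stored as a flat array with a one-cell halo in each
 * direction; nx, ny, nz are the padded grid dimensions. */
__kernel void stencil3d(__global const float *in,
                        __global float *out,
                        const int nx, const int ny, const int nz)
{
    /* Map the 3D NDRange onto interior cells, skipping the halo.
     * The bounds check guards against a rounded-up global size. */
    const int i = get_global_id(0) + 1;
    const int j = get_global_id(1) + 1;
    const int k = get_global_id(2) + 1;
    if (i >= nx - 1 || j >= ny - 1 || k >= nz - 1) return;

    const int idx = i + nx * (j + ny * k);

    /* Simple 7-point averaging update; a real structured grid code
     * would apply its own physics (LBM collision and streaming,
     * hydrodynamics fluxes, etc.) at this point. */
    out[idx] = (1.0f / 7.0f) * (in[idx]
             + in[idx - 1]       + in[idx + 1]
             + in[idx - nx]      + in[idx + nx]
             + in[idx - nx * ny] + in[idx + nx * ny]);
}

Enqueued over a 3D NDRange covering the interior cells, the same kernel source runs unchanged on NVIDIA and AMD GPUs, Intel’s Xeon Phi and multi-core CPUs through each vendor’s OpenCL runtime; this is the cross-platform functional portability the abstract describes, while the paper’s results concern how well such kernels then perform on each device.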


Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

McIntosh-Smith, S., Boulton, M., Curran, D., Price, J. (2014). On the Performance Portability of Structured Grid Codes on Many-Core Computer Architectures. In: Kunkel, J.M., Ludwig, T., Meuer, H.W. (eds) Supercomputing. ISC 2014. Lecture Notes in Computer Science, vol 8488. Springer, Cham. https://doi.org/10.1007/978-3-319-07518-1_4

  • DOI: https://doi.org/10.1007/978-3-319-07518-1_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-07517-4

  • Online ISBN: 978-3-319-07518-1

  • eBook Packages: Computer Science, Computer Science (R0)
