
Techniques, Tricks, and Algorithms for Efficient GPU-Based Processing of Higher Order Hyperbolic PDEs

  • Original Paper
  • Communications on Applied Mathematics and Computation

Abstract

GPU computing is expected to play an integral part in all modern Exascale supercomputers. It is also expected that higher order Godunov schemes will make up a significant fraction of the application mix on such supercomputers. It is, therefore, very important to prepare the community of users of higher order schemes for hyperbolic PDEs for this emerging opportunity.


Not every algorithm that is used in the space-time update of the solution of hyperbolic PDEs will take well to GPUs. However, we identify a small core of algorithms that take exceptionally well to GPU computing. Based on an analysis of available options, we have been able to identify weighted essentially non-oscillatory (WENO) algorithms for spatial reconstruction, arbitrary derivative (ADER) algorithms for time extension, and a corrector step as the winning three-part algorithmic combination. Even when a winning subset of algorithms has been identified, it is not clear that they will port seamlessly to GPUs. The low data throughput between the CPU and the GPU, as well as the very small cache sizes on modern GPUs, implies that we have to think through all aspects of the task of porting an application to GPUs. For that reason, this paper identifies the techniques and tricks needed for making a successful port of this very useful class of higher order algorithms to GPUs.

Application codes face a further challenge: the GPU results need to be practically indistinguishable from the CPU results, so that the legacy knowledge bases embedded in these application codes are preserved during the port to GPUs. This requirement often makes a complete code rewrite impossible. For that reason, it is safest to use an approach based on OpenACC directives, so that most of the code remains intact (as long as it was originally well-written). This paper is intended to be a one-stop shop for anyone seeking to make an OpenACC-based port of a higher order Godunov scheme to GPUs.

We focus on three broad and high-impact areas where higher order Godunov schemes are used. The first area is computational fluid dynamics (CFD). The second is computational magnetohydrodynamics (MHD), which has an involution constraint that has to be mimetically preserved. The third is computational electrodynamics (CED), which has involution constraints and also extremely stiff source terms. Together, these three diverse uses of higher order Godunov methodology cover many of the most important application areas. In all three cases, we show that the optimal use of algorithms, techniques, and tricks, along with the use of OpenACC, yields superlative speedups on GPUs. As a bonus, we find a most remarkable and desirable result: some higher order schemes, with their larger operation counts per zone, show better speedup than lower order schemes on GPUs. In other words, the GPU is an optimal stratagem for overcoming the higher computational complexity of higher order schemes. Several avenues for future improvement have also been identified. A scalability study is presented for a real-world application using GPUs and comparable numbers of high-end multicore CPUs. It is found that GPUs offer a substantial performance benefit over a comparable number of CPUs, especially when all the methods designed in this paper are used.


Data Availability

All data generated or analysed during this study are included in this published article.


Acknowledgements

DSB acknowledges support via the NSF grants NSF-19-04774, NSF-AST-2009776, NASA-2020-1241 and the NASA grant 80NSSC22K0628. DSB and HK acknowledge support from a Vajra award.

Funding

DSB acknowledges support via the NSF grants NSF-19-04774, NSF-AST-2009776, NASA-2020-1241 and the NASA grant 80NSSC22K0628.

Author information

Corresponding author

Correspondence to Dinshaw S. Balsara.

Ethics declarations

Conflict of Interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Ethical approval

This manuscript complies with all ethical standards for scientific publishing.

Informed Consent

Not applicable.

Appendix A

The OpenACC implementation of a HOG scheme in the C/C++ language is described here. The overall structure is identical to that of the Fortran code structure described in Sect. 2.2. The reader who is familiar with C/C++ should be able to pick up all the OpenACC insights from the figures shown in this Appendix. In this Appendix, we also highlight the differences between Fortran and C/C++ syntax and use of the OpenACC directives.

Figure A1 shows the header file containing the preprocessor definitions of the domain sizes "nx, ny, nz" and the function declarations. The data type and the array sizes are specified in the function interfaces, as required by C/C++ syntax; this differs from Fortran. Note that the variable "dt_next" in the function "Update()" is passed by reference as a pointer, since it is a scalar output variable. No such special care is needed in a Fortran subroutine, where all variables are passed by reference by default. As in the Fortran code, the function "U_skinny_to_U()" moves the minimal data from the CPU to the GPU at the beginning of every timestep.

Fig. A1 Header file "hydro.h", containing the definitions of the domain sizes "nx, ny, and nz" in the x-, y-, and z-directions, respectively, together with the function declarations, which are essential for a hydrodynamic simulation. This header file is common to all the C/C++ functions and the main function
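Since the figure itself is not reproduced here, the listing below gives a minimal sketch of such a header. The numerical values of "nx, ny, nz", the parameters "nvar" and "nmode", and the exact list of declared functions are illustrative assumptions made for this sketch, not the contents of the authors' file.

/* hydro.h -- minimal sketch of the header described above (assumed contents). */
#ifndef HYDRO_H
#define HYDRO_H

/* Domain sizes in the x-, y-, and z-directions (illustrative values). */
#define nx 64
#define ny 64
#define nz 64
/* The next two sizes are assumptions made for this sketch. */
#define nvar  5   /* conserved fluid variables: density, three momenta, energy */
#define nmode 5   /* zone average, x/y/z slopes, and the time rate of change   */

void U_skinny_to_U(double U_skinny[nvar][nz][ny][nx],
                   double U[nmode][nvar][nz][ny][nx]);
void Limiter_X(double U[nmode][nvar][nz][ny][nx]);
void Predictor(double U[nmode][nvar][nz][ny][nx]);
void Make_Flux_X(double U[nmode][nvar][nz][ny][nx],
                 double flux_x[nvar][nz][ny][nx + 1]);
void Make_dU_dt(double flux_x[nvar][nz][ny][nx + 1],
                double dU_dt[nvar][nz][ny][nx]);
/* The scalar output dt_next is returned through a pointer. */
void Update_U_Timestep(double U[nmode][nvar][nz][ny][nx],
                       double U_skinny[nvar][nz][ny][nx],
                       double dU_dt[nvar][nz][ny][nx],
                       double dt, double *dt_next);

#endif /* HYDRO_H */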

Figure A2 shows the overall structure of a hydro code in C/C++, which is analogous to the Fortran structure shown in Fig. 1. The OpenACC directives have a different form in C/C++: they start with "#pragma acc" and they do not require an "!$acc end" statement. Consequently, the analogue of the "!$acc end data" directive in Fig. 1 has no counterpart in Fig. A2.

Fig. A2 Structure of a C/C++ version of the hydro code (in pseudocode format) at second order with OpenACC extensions to allow it to perform well on a GPU. The overall structure is identical to that of the Fortran version. The exceptions are that array indices start from "0" in C and that the zone indexing is reversed from "(i, j, k)" in Fortran to "[k][j][i]" in C, to operate efficiently on the row-major array storage of C. The OpenACC pragmas are identical to those of the Fortran version, except that the C/C++ version does not need an ending "acc end" directive
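A corresponding sketch of the C/C++ driver is given below. The time loop, the problem parameters, and any routine names not mentioned in the text are assumptions; the point of the sketch is the structured "#pragma acc data" region, which is closed by the braces themselves rather than by an "end data" directive.

/* main.c -- sketch of the driver structure described above (assumed details). */
#include <stdio.h>
#include "hydro.h"

/* Large arrays are file-scope statics so that they do not overflow the stack. */
static double U[nmode][nvar][nz][ny][nx];
static double U_skinny[nvar][nz][ny][nx];
static double dU_dt[nvar][nz][ny][nx];
static double flux_x[nvar][nz][ny][nx + 1];

int main(void)
{
    double t = 0.0, t_final = 0.2, dt = 1.0e-4, dt_next = dt;

    /* ... set initial conditions in U_skinny on the host ... */

    /* The big arrays stay resident on the GPU for the whole run; only U_skinny
       crosses the CPU-GPU bus every timestep.  In C the braces close the data
       region, so no counterpart of Fortran's "!$acc end data" is needed. */
    #pragma acc data copyin(U_skinny) create(U, dU_dt, flux_x)
    {
        while (t < t_final) {
            U_skinny_to_U(U_skinny, U);   /* minimal CPU-to-GPU data movement    */
            Limiter_X(U);                 /* ... and likewise in y and z ...      */
            Predictor(U);
            Make_Flux_X(U, flux_x);       /* ... and Make_Flux_Y, Make_Flux_Z ... */
            Make_dU_dt(flux_x, dU_dt);
            Update_U_Timestep(U, U_skinny, dU_dt, dt, &dt_next);
            t += dt;
            dt = dt_next;
        }
    }

    printf("final time reached: t = %g\n", t);
    return 0;
}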

Figure A3 shows the C/C++ code for the "U_skinny_to_U()" function. This is equivalent to the Fortran code shown in Fig. 2. The OpenACC directives that sit outside the triply nested loop are identical in C/C++ and Fortran. Fortran has the ":" operator for whole-array operations, whereas C/C++ has no such operator. Therefore, another "for" loop is needed inside the triply nested loop. This inner for-loop is executed serially and is marked with the OpenACC directive "#pragma acc loop seq".

Fig. A3 Structure of the "U_skinny_to_U" function. The array "U_skinny" contains the minimum data that should be carried from the host to the device. This function simply copies the contents of "U_skinny" into "U(:, :, :, :, 1)". The OpenACC pragmas are identical to those of the Fortran version
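The pattern described above can be sketched as follows. The explicit "update device" directive, the array shapes, and the placement of the copied data in the first mode of "U" are assumptions consistent with the description, not a reproduction of the authors' code.

/* u_skinny_to_u.c -- sketch of the host-to-device copy described above. */
#include "hydro.h"

void U_skinny_to_U(double U_skinny[nvar][nz][ny][nx],
                   double U[nmode][nvar][nz][ny][nx])
{
    /* Refresh the device copy of the minimal "skinny" data; the large array U
       is assumed to be present from the data region opened in the main program. */
    #pragma acc update device(U_skinny[0:nvar][0:nz][0:ny][0:nx])

    #pragma acc parallel loop collapse(3) present(U_skinny, U)
    for (int k = 0; k < nz; k++) {
        for (int j = 0; j < ny; j++) {
            for (int i = 0; i < nx; i++) {
                /* Fortran's whole-array ":" copy becomes an explicit inner loop
                   over the fluid components, executed sequentially. */
                #pragma acc loop seq
                for (int v = 0; v < nvar; v++) {
                    U[0][v][k][j][i] = U_skinny[v][k][j][i];   /* first mode */
                }
            }
        }
    }
}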

Figure A4 shows the overall structure of a "Limiter()" function in C/C++. This is analogous to the Fortran subroutine shown in Fig. 3. The OpenACC directives are identical, except for the "!$acc end" directive, which is not needed in C/C++. Unlike Fortran, C/C++ does not allow a function to be defined inside another function, so the "MC_limiter()" function is defined as an inline function outside "Limiter()". "MC_limiter()" is called serially inside the triply nested loop, so it is marked with "#pragma acc routine seq".

Fig. A4 C/C++ version of the application of a simple monotone-centered limiter in the x-direction to all the fluid variables. The inline function "MC_limiter" accommodates a compression factor, which can be changed based on the flow field being limited. The OpenACC directives are identical to those of the Fortran code, except that the C version does not need an ending "acc end" directive. Also, the inline function must be marked with the "acc routine seq" pragma, since C does not differentiate between a function and a subroutine as Fortran does
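A sketch of this arrangement is given below. The value of the compression factor, the storage of the limited x-slope in the second mode of "U", and the exact form of "MC_limiter()" are assumptions; the essential points are the "acc routine seq" marking of the inline function and the sequential inner loop over the fluid components.

/* limiter.c -- sketch of the x-direction monotone-centered limiter. */
#include <math.h>
#include "hydro.h"

/* The inline limiter must carry "routine seq" so that the OpenACC compiler
   generates a device-callable, sequential version of it. */
#pragma acc routine seq
static inline double MC_limiter(double dL, double dR, double beta)
{
    if (dL * dR <= 0.0) return 0.0;               /* extremum: zero slope */
    double s = (dL > 0.0) ? 1.0 : -1.0;
    return s * fmin(0.5 * fabs(dL + dR), beta * fmin(fabs(dL), fabs(dR)));
}

void Limiter_X(double U[nmode][nvar][nz][ny][nx])
{
    const double beta = 1.5;                      /* assumed compression factor */

    #pragma acc parallel loop collapse(3) present(U)
    for (int k = 0; k < nz; k++) {
        for (int j = 0; j < ny; j++) {
            for (int i = 1; i < nx - 1; i++) {
                #pragma acc loop seq
                for (int v = 0; v < nvar; v++) {
                    double dL = U[0][v][k][j][i]     - U[0][v][k][j][i - 1];
                    double dR = U[0][v][k][j][i + 1] - U[0][v][k][j][i];
                    /* Mode 1 is assumed to hold the limited x-slope. */
                    U[1][v][k][j][i] = MC_limiter(dL, dR, beta);
                }
            }
        }
    }
}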

Figure A5 shows the overall structure of a "Predictor()" function in C/C++. This is equivalent to the Fortran subroutine shown in Fig. 4. The overall structure and the OpenACC directives are identical to the Fortran code. As mentioned before, C/C++ lacks the ":" operator for array operations, so "for" loops are used for these operations inside the triply nested loop. They are marked with "#pragma acc loop seq" to indicate their serial execution.

Fig. A5 C/C++ version of the predictor step for building the fifth mode, i.e., the mode that contains the time rate of change. The OpenACC directives are identical to those of the Fortran code, except that in the C version the inner for-loops need to be marked with the "acc loop seq" pragma to indicate their sequential execution. This is not needed in the Fortran version, where the ":" operator is used for such operations
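The following sketch illustrates the structure. The physics shown (linear advection with assumed unit speeds, and slopes assumed to sit in modes 1-3) is only a stand-in so that the listing is self-contained; the real predictor evaluates the governing equations from the limited slopes.

/* predictor.c -- sketch of the predictor step with stand-in physics. */
#include "hydro.h"

void Predictor(double U[nmode][nvar][nz][ny][nx])
{
    const double ax = 1.0, ay = 1.0, az = 1.0;    /* assumed advection speeds */
    const double dx = 1.0 / nx, dy = 1.0 / ny, dz = 1.0 / nz;

    #pragma acc parallel loop collapse(3) present(U)
    for (int k = 0; k < nz; k++) {
        for (int j = 0; j < ny; j++) {
            for (int i = 0; i < nx; i++) {
                /* Fortran's ":" array syntax becomes a sequential inner loop. */
                #pragma acc loop seq
                for (int v = 0; v < nvar; v++) {
                    /* Fifth mode (index 4): the time rate of change, built here
                       from the x-, y-, and z-slopes assumed to sit in modes 1-3. */
                    U[4][v][k][j][i] = -(ax * U[1][v][k][j][i] / dx
                                       + ay * U[2][v][k][j][i] / dy
                                       + az * U[3][v][k][j][i] / dz);
                }
            }
        }
    }
}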

Figure A6 shows the C/C++ code for the “Make_Flux()” function. This is equivalent to the Fortran code shown in Fig. 5. The OpenACC directives are identical to the Fortran code. An additional “#pragma acc loop seq” is used for the inner loop, which performs the operations on each fluid component.

Fig. A6 C/C++ version of the pseudocode for the function that builds the flux along the x-faces. The "U_L" variable stores the value of "U" obtained from the zone to the left of the face and the "U_R" variable stores the value of "U" obtained from the zone to the right of the face. Thus, the facial values of the conserved variable "U" are stored in "U_L, U_R". These are then fed into the Riemann solver function to build the flux for the corresponding face. The OpenACC directives are identical to those of the Fortran code, except that in the C version the inner for-loops need to be marked with the "acc loop seq" pragma to indicate their sequential execution. This is not needed in the Fortran version, where the ":" operator is used for such operations. In addition, note that the indices of small arrays such as "U_L, U_R, flux_x_ptwise" start at "0" instead of "1"
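A sketch of this structure is shown below. The facial reconstruction, the Rusanov flux for linear advection used as a stand-in Riemann solver, and the periodic wrap-around at the boundaries are assumptions made so that the listing is self-contained; the paper's codes use multidimensional HLL-type Riemann solvers and a proper boundary treatment.

/* make_flux_x.c -- sketch of the x-face flux construction described above. */
#include <math.h>
#include "hydro.h"

/* Stand-in Riemann solver: Rusanov flux for linear advection with speed "a".
   "routine seq" makes it callable inside the GPU parallel loop. */
#pragma acc routine seq
static void Riemann_X(const double U_L[nvar], const double U_R[nvar],
                      double flux_x_ptwise[nvar])
{
    const double a = 1.0;
    for (int v = 0; v < nvar; v++) {
        flux_x_ptwise[v] = 0.5 * a * (U_L[v] + U_R[v])
                         - 0.5 * fabs(a) * (U_R[v] - U_L[v]);
    }
}

void Make_Flux_X(double U[nmode][nvar][nz][ny][nx],
                 double flux_x[nvar][nz][ny][nx + 1])
{
    #pragma acc parallel loop collapse(3) present(U, flux_x)
    for (int k = 0; k < nz; k++) {
        for (int j = 0; j < ny; j++) {
            for (int i = 0; i <= nx; i++) {     /* nx + 1 x-faces per row       */
                /* Periodic wrap-around stands in for the real boundary treatment. */
                int iL = (i - 1 + nx) % nx;     /* zone to the left of face i   */
                int iR = i % nx;                /* zone to the right of face i  */
                double U_L[nvar], U_R[nvar], flux_x_ptwise[nvar];

                /* Facial values: zone average plus/minus half the limited x-slope. */
                #pragma acc loop seq
                for (int v = 0; v < nvar; v++) {
                    U_L[v] = U[0][v][k][j][iL] + 0.5 * U[1][v][k][j][iL];
                    U_R[v] = U[0][v][k][j][iR] - 0.5 * U[1][v][k][j][iR];
                }

                Riemann_X(U_L, U_R, flux_x_ptwise);

                #pragma acc loop seq
                for (int v = 0; v < nvar; v++) {
                    flux_x[v][k][j][i] = flux_x_ptwise[v];
                }
            }
        }
    }
}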

Figure A7 shows the equivalent C/C++ code for the "Make_dU_dt()" function. This is analogous to the Fortran code shown in Fig. 6. The overall structure and the OpenACC directives are identical in C/C++ and Fortran.

Fig. A7 C/C++ version of the function that builds the "dU_dt" update from the fluxes obtained from the "Make_Flux_X, Make_Flux_Y, and Make_Flux_Z" functions. The Fortran ":" operator is replaced with a sequential inner for-loop
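A sketch with only the x-flux contribution is given below; the y- and z-contributions are handled analogously, and the assumed uniform zone size "dx" is illustrative.

/* make_du_dt.c -- sketch of the flux-difference step, x-contribution only. */
#include "hydro.h"

void Make_dU_dt(double flux_x[nvar][nz][ny][nx + 1],
                double dU_dt[nvar][nz][ny][nx])
{
    const double dx = 1.0 / nx;                   /* assumed uniform zone size */

    #pragma acc parallel loop collapse(3) present(flux_x, dU_dt)
    for (int k = 0; k < nz; k++) {
        for (int j = 0; j < ny; j++) {
            for (int i = 0; i < nx; i++) {
                /* Fortran's ":" operator is replaced by a sequential inner loop. */
                #pragma acc loop seq
                for (int v = 0; v < nvar; v++) {
                    dU_dt[v][k][j][i] =
                        -(flux_x[v][k][j][i + 1] - flux_x[v][k][j][i]) / dx;
                }
            }
        }
    }
}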

Figure A8 shows the C/C++ code for the "Update_U_Timestep()" function. Here we perform the time update for the conserved variable "U" and estimate the next timestep, "dt_next". This is analogous to the Fortran code shown in Fig. 7. In the C/C++ version, the scalar variable "dt_next" is passed by reference as a pointer, and the OpenACC reduction cannot be applied directly to such a dereferenced pointer. Therefore, to keep things simple, an additional temporary variable "dt1" is used for the reduction operation. The variable "dt1" is copied to the GPU at the beginning, using the directive "#pragma acc data copyin(dt1)". At the end of the triply nested loop, the final value stored in "dt1" is copied back to the CPU, using the OpenACC directive "#pragma acc update host(dt1)". It is then copied into the "dt_next" variable.

Fig. A8 C/C++ version of the function that updates "U" from "dU_dt". In addition, the "U_skinny" array is updated here from the new zone-centered values of "U". The OpenACC directives are identical to those of the Fortran code, except that in the C version the inner for-loop needs to be marked with the "acc loop seq" pragma to indicate its sequential execution. Also in this function, the next timestep "dt_next" is estimated using the CFL number, the zone size, and U. The smallest value among all the zones is found using the "fmin()" function. To correctly parallelize this reduction operation, the "reduction(min: dt1)" pragma is used
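The reduction pattern described above can be sketched as follows. The CFL number, the zone size, and the stand-in signal-speed estimate are assumptions; the "copyin(dt1)", "reduction(min: dt1)", and "update host(dt1)" directives follow the description given above.

/* update_u_timestep.c -- sketch of the time update and the dt estimate. */
#include <math.h>
#include "hydro.h"

void Update_U_Timestep(double U[nmode][nvar][nz][ny][nx],
                       double U_skinny[nvar][nz][ny][nx],
                       double dU_dt[nvar][nz][ny][nx],
                       double dt, double *dt_next)
{
    const double cfl = 0.4;                       /* assumed CFL number        */
    const double dx  = 1.0 / nx;                  /* assumed uniform zone size */

    /* The reduction clause needs a plain scalar, not the dereferenced pointer
       dt_next, so the temporary dt1 is used and copied back afterwards. */
    double dt1 = 1.0e30;

    #pragma acc data copyin(dt1)
    {
        #pragma acc parallel loop collapse(3) present(U, U_skinny, dU_dt) reduction(min: dt1)
        for (int k = 0; k < nz; k++) {
            for (int j = 0; j < ny; j++) {
                for (int i = 0; i < nx; i++) {
                    #pragma acc loop seq
                    for (int v = 0; v < nvar; v++) {
                        U[0][v][k][j][i] += dt * dU_dt[v][k][j][i];
                        U_skinny[v][k][j][i] = U[0][v][k][j][i];  /* refresh skinny data */
                    }
                    /* Stand-in for the fastest local signal speed, |u_x| plus an
                       assumed unit sound speed; the real code uses the physics. */
                    double rho  = U[0][0][k][j][i];
                    double momx = U[0][1][k][j][i];
                    double smax = fabs(momx / (rho + 1.0e-30)) + 1.0;
                    dt1 = fmin(dt1, cfl * dx / smax);
                }
            }
        }
        #pragma acc update host(dt1)   /* bring the reduced minimum back to the CPU */
    }
    *dt_next = dt1;
}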

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Subramanian, S., Balsara, D.S., Bhoriya, D. et al. Techniques, Tricks, and Algorithms for Efficient GPU-Based Processing of Higher Order Hyperbolic PDEs. Commun. Appl. Math. Comput. (2023). https://doi.org/10.1007/s42967-022-00235-9
