
A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations

Published in: International Journal of Parallel Programming

Abstract

Iterative stencil loops (ISLs) are used in many applications and tiling is a well-known technique to localize their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially true in the case of graphics processors (GPUs), which do not preserve the state of the per-core L1 storage across global synchronizations. To reduce these overheads, ghost zones can be created to replicate stencil operations, reducing communication and synchronization costs at the expense of redundantly computing some values on multiple PEs. However, the selection of the optimal ghost zone size depends on the characteristics of both the architecture and the application, and it has only been studied for message-passing systems in distributed environments. To automate this process on shared memory systems, we establish a performance model using NVIDIA’s Tesla architecture as a case study and propose a framework that uses the performance model to automatically select the ghost zone size that performs best and generate appropriate code. The modeling is validated by four diverse ISL applications, for which the predicted ghost zone configurations are able to achieve a speedup no less than 95% of the optimal speedup.
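To make the ghost-zone idea concrete, the following is a minimal NumPy sketch (not the paper's CUDA implementation): a tile is enlarged by one cell per fused time step, several stencil iterations run on the enlarged tile with no halo exchange, and only the tile's original interior is trusted at the end. The function names and the Jacobi-style 4-point stencil are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def stencil_step(a):
    """One Jacobi-style 4-point averaging step; boundary cells stay fixed."""
    out = a.copy()
    out[1:-1, 1:-1] = 0.25 * (a[:-2, 1:-1] + a[2:, 1:-1] +
                              a[1:-1, :-2] + a[1:-1, 2:])
    return out

def tile_with_ghost_zone(grid, y0, y1, x0, x1, steps):
    """Run `steps` stencil iterations on one tile without exchanging halos.

    The tile is enlarged by `steps` cells on every side (the ghost zone),
    clamped at the domain boundary. Each iteration shrinks the region of
    valid values by one cell, so after `steps` iterations the original
    tile interior [y0:y1, x0:x1] is still correct.
    """
    g = steps  # ghost zone width equals the number of fused time steps
    ny, nx = grid.shape
    ys, ye = max(y0 - g, 0), min(y1 + g, ny)
    xs, xe = max(x0 - g, 0), min(x1 + g, nx)
    local = grid[ys:ye, xs:xe].copy()  # redundant work happens here
    for _ in range(steps):
        local = stencil_step(local)
    # return only the trusted interior of the enlarged tile
    return local[(y0 - ys):(y0 - ys) + (y1 - y0),
                 (x0 - xs):(x0 - xs) + (x1 - x0)]
```

Each tile redundantly recomputes the cells of its ghost zone that neighboring tiles also compute; in exchange, no communication or global synchronization is needed between the fused time steps. Picking `steps` (and hence the ghost zone width) is exactly the architecture- and application-dependent trade-off the paper's performance model automates.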



Author information

Correspondence to Jiayuan Meng.


Cite this article

Meng, J., Skadron, K. A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations. Int J Parallel Prog 39, 115–142 (2011). https://doi.org/10.1007/s10766-010-0142-5
