
A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations

Published in: International Journal of Parallel Programming

Abstract

Iterative stencil loops (ISLs) are used in many applications and tiling is a well-known technique to localize their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially true in the case of graphics processors (GPUs), which do not preserve the state of the per-core L1 storage across global synchronizations. To reduce these overheads, ghost zones can be created to replicate stencil operations, reducing communication and synchronization costs at the expense of redundantly computing some values on multiple PEs. However, the selection of the optimal ghost zone size depends on the characteristics of both the architecture and the application, and it has only been studied for message-passing systems in distributed environments. To automate this process on shared memory systems, we establish a performance model using NVIDIA’s Tesla architecture as a case study and propose a framework that uses the performance model to automatically select the ghost zone size that performs best and generate appropriate code. The modeling is validated by four diverse ISL applications, for which the predicted ghost zone configurations are able to achieve a speedup no less than 95% of the optimal speedup.
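To make the ghost-zone idea concrete, the following is a minimal NumPy sketch (not the paper's CUDA implementation): a tile is enlarged by one cell per fused time step, several stencil iterations run on the enlarged tile with no halo exchange, and only the tile's original interior is trusted at the end. The function names and the Jacobi-style 4-point stencil are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def stencil_step(a):
    """One Jacobi-style 4-point averaging step; boundary cells stay fixed."""
    out = a.copy()
    out[1:-1, 1:-1] = 0.25 * (a[:-2, 1:-1] + a[2:, 1:-1] +
                              a[1:-1, :-2] + a[1:-1, 2:])
    return out

def tile_with_ghost_zone(grid, y0, y1, x0, x1, steps):
    """Run `steps` stencil iterations on one tile without exchanging halos.

    The tile is enlarged by `steps` cells on every side (the ghost zone),
    clamped at the domain boundary. Each iteration shrinks the region of
    valid values by one cell, so after `steps` iterations the original
    tile interior [y0:y1, x0:x1] is still correct.
    """
    g = steps  # ghost zone width equals the number of fused time steps
    ny, nx = grid.shape
    ys, ye = max(y0 - g, 0), min(y1 + g, ny)
    xs, xe = max(x0 - g, 0), min(x1 + g, nx)
    local = grid[ys:ye, xs:xe].copy()  # redundant work happens here
    for _ in range(steps):
        local = stencil_step(local)
    # return only the trusted interior of the enlarged tile
    return local[(y0 - ys):(y0 - ys) + (y1 - y0),
                 (x0 - xs):(x0 - xs) + (x1 - x0)]
```

Each tile redundantly recomputes the cells of its ghost zone that neighboring tiles also compute; in exchange, no communication or global synchronization is needed between the fused time steps. Picking `steps` (and hence the ghost zone width) is exactly the architecture- and application-dependent trade-off the paper's performance model automates.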



Author information

Correspondence to Jiayuan Meng.


Cite this article

Meng, J., Skadron, K. A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations. Int J Parallel Prog 39, 115–142 (2011). https://doi.org/10.1007/s10766-010-0142-5
