ABSTRACT
NVIDIA’s CUDA provides several options for managing host and device memory and for orchestrating the communication between them. In this paper we examine these options, identify the program changes required when switching between them, and observe their effect on application performance.
We present code generation schemes that translate resource-agnostic program specifications, i.e., programs without any explicit notion of memory or GPU kernels, into five CUDA versions that differ only in their use of CUDA’s memory and communication APIs. An implementation of these code generators within the compiler of the functional programming language Single-Assignment C (SaC) shows performance differences between the variants of up to a factor of 3.
Performance analyses reveal that the preferred choice depends on a combination of several factors, including the hardware being used and several aspects of the application itself. A clear choice therefore cannot be made a priori. Instead, the ability to generate different variants from a single source appears essential for achieving performance portability across GPU devices.
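To make the contrast concrete, the following sketch shows two of the memory-management styles that such generated variants can differ in: explicit device allocation with manual host–device copies versus unified (managed) memory with on-demand migration. This is an illustrative example only; the kernel, sizes, and variant names are assumptions, not code taken from the paper or the SaC compiler.

```cuda
// Illustrative sketch of two CUDA memory-management styles (not from the paper).
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void scale(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    /* Style A: explicit device memory; the program issues every transfer. */
    float *h = (float *)malloc(bytes);
    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    free(h);

    /* Style B: unified (managed) memory; the CUDA runtime migrates pages
       between host and device on demand, so no explicit copies appear. */
    float *u;
    cudaMallocManaged(&u, bytes);
    scale<<<(n + 255) / 256, 256>>>(u, n);
    cudaDeviceSynchronize();   /* kernel launches are asynchronous */
    cudaFree(u);
    return 0;
}
```

Further variants (e.g., pinned host memory via `cudaHostAlloc`, or asynchronous copies via `cudaMemcpyAsync` on streams) change only these allocation and transfer calls while leaving the kernel untouched, which is precisely why a compiler can generate all of them from one resource-agnostic source.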