ABSTRACT
NVIDIA’s CUDA provides several options for managing host and device memory and for orchestrating the communication between them. In this paper we examine these options, identify the program changes required when switching between them, and observe their effect on application performance.
We present code generation schemes that translate resource-agnostic program specifications, i.e., programs without any explicit notion of memory or GPU kernels, into five CUDA versions that differ only in their use of CUDA’s memory and communication APIs. An implementation of these code generators within the compiler of the functional programming language Single-Assignment C (SaC) shows performance differences between the variants of up to a factor of 3.
Performance analyses reveal that the preferred choice depends on a combination of several factors, including the hardware being used and several aspects of the application itself. A clear choice therefore cannot be made a priori. Instead, the ability to generate different variants from a single source appears essential for achieving performance portability across GPU devices.
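To make the contrast concrete, the following sketch shows two of the memory-management styles that such generated variants can differ in: explicit device allocation with manual host–device copies versus unified (managed) memory with on-demand migration. This is an illustrative example only; the kernel, sizes, and variant names are assumptions, not code taken from the paper or the SaC compiler.

```cuda
// Illustrative sketch of two CUDA memory-management styles (not from the paper).
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void scale(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    /* Style A: explicit device memory; the program issues every transfer. */
    float *h = (float *)malloc(bytes);
    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    free(h);

    /* Style B: unified (managed) memory; the CUDA runtime migrates pages
       between host and device on demand, so no explicit copies appear. */
    float *u;
    cudaMallocManaged(&u, bytes);
    scale<<<(n + 255) / 256, 256>>>(u, n);
    cudaDeviceSynchronize();   /* kernel launches are asynchronous */
    cudaFree(u);
    return 0;
}
```

Further variants (e.g., pinned host memory via `cudaHostAlloc`, or asynchronous copies via `cudaMemcpyAsync` on streams) change only these allocation and transfer calls while leaving the kernel untouched, which is precisely why a compiler can generate all of them from one resource-agnostic source.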