DOI: 10.1145/3462172.3462199
Research Article · Open Access

Effective Host-GPU Memory Management Through Code Generation

Published: 23 July 2021

ABSTRACT

NVIDIA’s CUDA provides several options to orchestrate the management of host and device memory, as well as the communication between them. In this paper we examine these choices, identify the program changes required when switching between them, and observe their effect on application performance.

We present code generation schemes that translate resource-agnostic program specifications, i.e., programs without any explicit notion of memory or GPU kernels, into five CUDA versions that differ only in their use of CUDA’s memory and communication API. An implementation of these code generators within the compiler of the functional programming language Single-Assignment C (SaC) reveals performance differences between the variants of up to a factor of 3.

Performance analyses reveal that the preferred choice depends on a combination of several factors, including the actual hardware being used and several aspects of the application itself. A clear choice, therefore, cannot be made a priori. Instead, it seems essential that different variants can be generated from a single source in order to achieve performance portability across GPU devices.
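The five generated variants themselves are not reproduced in this abstract. As a rough illustration of the kind of API differences involved, the sketch below contrasts two plausible ends of the spectrum: explicit device allocation with manual transfers versus unified (managed) memory, where the driver migrates pages on demand. The vector-increment kernel and all names are hypothetical, not taken from the paper.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void inc(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1.0f;
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    /* Variant A: explicit device allocation and host<->device copies. */
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h[i] = 0.0f;
    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    inc<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  /* implicit sync */
    cudaFree(d);
    free(h);

    /* Variant B: unified (managed) memory; no explicit copies,
       but the host must synchronize before touching the data again. */
    float *u;
    cudaMallocManaged(&u, bytes);
    for (int i = 0; i < n; i++) u[i] = 0.0f;
    inc<<<(n + 255) / 256, 256>>>(u, n);
    cudaDeviceSynchronize();
    printf("u[0] = %f\n", u[0]);
    cudaFree(u);
    return 0;
}
```

Both variants compute the same result, yet they differ in allocation calls, transfer calls, and synchronization points, which is exactly the kind of source-level divergence that a code generator can hide behind a single resource-agnostic specification.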


Published in

IFL '20: Proceedings of the 32nd Symposium on Implementation and Application of Functional Languages, September 2020, 161 pages
ISBN: 9781450389631
DOI: 10.1145/3462172

Copyright © 2020 Owner/Author. This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher: Association for Computing Machinery, New York, NY, United States

Qualifiers: research-article, refereed limited

Overall acceptance rate: 19 of 36 submissions (53%)
