ABSTRACT
Continuing advances in heterogeneous and parallel computing enable massive performance gains in domains such as AI and HPC. Such gains often involve using hardware accelerators, such as FPGAs and GPUs, to speed up specific workloads. However, making effective use of emerging heterogeneous architectures typically requires manual optimisation by highly skilled developers with an in-depth understanding of the target hardware. The process is tedious, error-prone, and must be repeated for each new application. This paper introduces Design-Flow Patterns, which capture the modular, recurring, application-agnostic elements involved in mapping and optimising application descriptions onto efficient CPU and GPU targets. Our approach is the first to codify and programmatically coordinate these elements into fully automated, customisable, and reusable end-to-end design-flows. We implement key design-flow patterns using the meta-programming tool Artisan, and evaluate the resulting automated design-flows on three sequential C++ applications. Compared to single-threaded implementations, our approach generates multi-threaded OpenMP CPU designs achieving up to 18 times speedup on a 32-thread CPU platform, as well as HIP GPU designs achieving up to 1184 times speedup on an NVIDIA GeForce RTX 2080 Ti GPU.
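To make the kind of transformation these design-flows automate more concrete, below is a minimal hand-written sketch, not actual Artisan output: a sequential loop of the sort found in the evaluated C++ applications, alongside the OpenMP variant a CPU design-flow might emit and the HIP kernel a GPU design-flow might emit. The function names and the AXPY-style workload are illustrative assumptions, and the file is assumed to be compiled with hipcc and -fopenmp.

#include <cstddef>
#include <hip/hip_runtime.h>

// Sequential baseline: an element-wise loop of the kind taken as input.
void axpy_seq(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

// Sketch of a CPU design-flow result: once analysis shows the loop has no
// cross-iteration dependencies, it is annotated for OpenMP.
void axpy_omp(float a, const float* x, float* y, std::size_t n) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

// Sketch of a GPU design-flow result: the loop body becomes a HIP kernel
// with one thread per iteration. Host-side allocation, transfer, and launch
// code (hipMalloc/hipMemcpy/hipLaunchKernelGGL) is elided for brevity.
__global__ void axpy_hip(float a, const float* x, float* y, std::size_t n) {
    std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
}

In the paper's approach, rewrites of this shape are performed programmatically by coordinated design-flow patterns rather than by hand.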