ABSTRACT
Continuing advances in heterogeneous and parallel computing enable massive performance gains in domains such as AI and HPC. Such gains often involve using hardware accelerators, such as FPGAs and GPUs, to speed up specific workloads. However, making effective use of emerging heterogeneous architectures typically requires manual optimisation by highly skilled developers with an in-depth understanding of the target hardware. The process is tedious, error-prone, and must be repeated for each new application. This paper introduces Design-Flow Patterns, which capture the modular, recurring, application-agnostic elements involved in mapping and optimising application descriptions onto efficient CPU and GPU targets. Our approach is the first to codify and programmatically coordinate these elements into fully automated, customisable, and reusable end-to-end design-flows. We implement key design-flow patterns using the meta-programming tool Artisan, and evaluate the resulting automated design-flows on three sequential C++ applications. Compared to single-threaded implementations, our approach generates multi-threaded OpenMP CPU designs achieving up to 18 times speedup on a 32-thread CPU platform, as well as HIP GPU designs achieving up to 1184 times speedup on an NVIDIA GeForce RTX 2080 Ti GPU.
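To make the kind of transformation these design-flows automate more concrete, below is a minimal hand-written sketch, not actual Artisan output: a sequential loop of the sort found in the evaluated C++ applications, alongside the OpenMP variant a CPU design-flow might emit and the HIP kernel a GPU design-flow might emit. The function names and the AXPY-style workload are illustrative assumptions, and the file is assumed to be compiled with hipcc and -fopenmp.

#include <cstddef>
#include <hip/hip_runtime.h>

// Sequential baseline: an element-wise loop of the kind taken as input.
void axpy_seq(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

// Sketch of a CPU design-flow result: once analysis shows the loop has no
// cross-iteration dependencies, it is annotated for OpenMP.
void axpy_omp(float a, const float* x, float* y, std::size_t n) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

// Sketch of a GPU design-flow result: the loop body becomes a HIP kernel
// with one thread per iteration. Host-side allocation, transfer, and launch
// code (hipMalloc/hipMemcpy/hipLaunchKernelGGL) is elided for brevity.
__global__ void axpy_hip(float a, const float* x, float* y, std::size_t n) {
    std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
}

In the paper's approach, rewrites of this shape are performed programmatically by coordinated design-flow patterns rather than by hand.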