ABSTRACT
The state of the art in high-performance deep learning today is driven primarily by libraries that expert programmers develop, optimize, and highly tune by hand using low-level abstractions, at significant effort. This effort is often repeated for similar hardware and must be redone for future generations. In this work, we pursue and evaluate the more modular and reusable approach of using compiler IR infrastructure to generate such libraries, encoding all the required optimizations as a sequence of transformations and customized passes on an IR. We believe that until the recent introduction of MLIR (Multi-Level Intermediate Representation), it had been hard to represent and transform computation at various levels of abstraction within a single IR.
Using the MLIR infrastructure, we build a transformation and lowering pipeline that automatically generates near-peak-performance code for matrix-matrix multiplication (matmul), as well as for matmul fused with simple pointwise operators, targeting tensor cores on NVIDIA GPUs. On a set of problem sizes ranging from 256 to 16384, our evaluation shows that we obtain performance of 0.95× to 1.19× and 0.80× to 1.60× of cuBLAS for FP32 and FP16 accumulation, respectively, on NVIDIA's Ampere-based GeForce RTX 3090. Furthermore, by fusing common pointwise operations with matrix-matrix multiplication, we obtain performance ranging from 0.95× to 1.67× of a cuBLAS-based implementation. We also present matmul-like examples, such as 3-D contraction and batched matmul, which the pipeline handles efficiently while delivering competitive performance. We believe these results motivate further research and engineering on automatic domain-specific library generation using compiler IR infrastructure for similar specialized accelerators.
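To make the generation target concrete, the following is a minimal CUDA sketch, not the paper's generated code, of the warp-level WMMA primitive that tensor-core pipelines like this one ultimately lower to, together with a fused pointwise epilogue of the kind the abstract describes. The 16×16×16 tile shape, row-major layouts, ReLU as the pointwise operator, and the kernel name are assumptions chosen for illustration.

```cuda
// Hypothetical illustration, not the paper's output: one warp computes a
// single 16x16 tile C = relu(A * B) using the CUDA WMMA API that tensor-core
// code generators ultimately target (via PTX wmma/mma instructions).
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_matmul_relu(const half *A, const half *B, float *C) {
  // Fragments for a 16x16x16 FP16 multiply with FP32 accumulation.
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

  wmma::fill_fragment(acc_frag, 0.0f);                 // zero the accumulator
  wmma::load_matrix_sync(a_frag, A, 16);               // leading dimension 16
  wmma::load_matrix_sync(b_frag, B, 16);
  wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // tensor-core MMA

  // Fused pointwise epilogue: apply ReLU to each accumulator element while
  // it is still in registers, avoiding a separate pass over memory.
  for (int i = 0; i < acc_frag.num_elements; ++i)
    acc_frag.x[i] = fmaxf(acc_frag.x[i], 0.0f);

  wmma::store_matrix_sync(C, acc_frag, 16, wmma::mem_row_major);
}
```

Applying the epilogue to the register-resident accumulator, rather than launching a second kernel over C, is what lets fused variants outperform a cuBLAS-based matmul followed by a separate pointwise pass.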