ABSTRACT
The state of the art in high-performance deep learning today is driven primarily by libraries that expert programmers develop, optimize, and highly tune by hand using low-level abstractions, at significant effort. This effort is often repeated for similar hardware and must be redone for future generations. In this work, we pursue and evaluate the more modular and reusable approach of using compiler IR infrastructure to generate such libraries, encoding all the required optimizations as a sequence of transformations and customized passes on an IR. We believe that until the recent introduction of MLIR (Multi-Level Intermediate Representation), it had been hard to represent and transform computation at various levels of abstraction within a single IR.
Using the MLIR infrastructure, we build a transformation and lowering pipeline that automatically generates near-peak-performance code for matrix-matrix multiplication (matmul), as well as for matmul fused with simple pointwise operators, targeting tensor cores on NVIDIA GPUs. On a set of problem sizes ranging from 256 to 16384, our evaluation shows that we obtain performance of 0.95× to 1.19× and 0.80× to 1.60× of cuBLAS for FP32 and FP16 accumulation, respectively, on NVIDIA's Ampere-based GeForce RTX 3090. Furthermore, by fusing common pointwise operations with matrix-matrix multiplication, we obtain performance ranging from 0.95× to 1.67× of a cuBLAS-based implementation. We also present matmul-like examples, such as 3-D contraction and batched matmul, which the pipeline handles efficiently while delivering competitive performance. We believe these results motivate further research and engineering on automatic domain-specific library generation using compiler IR infrastructure for similar specialized accelerators.
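To make the generation target concrete, the following is a minimal CUDA sketch, not the paper's generated code, of the warp-level WMMA primitive that tensor-core pipelines like this one ultimately lower to, together with a fused pointwise epilogue of the kind the abstract describes. The 16×16×16 tile shape, row-major layouts, ReLU as the pointwise operator, and the kernel name are assumptions chosen for illustration.

```cuda
// Hypothetical illustration, not the paper's output: one warp computes a
// single 16x16 tile C = relu(A * B) using the CUDA WMMA API that tensor-core
// code generators ultimately target (via PTX wmma/mma instructions).
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_matmul_relu(const half *A, const half *B, float *C) {
  // Fragments for a 16x16x16 FP16 multiply with FP32 accumulation.
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

  wmma::fill_fragment(acc_frag, 0.0f);                 // zero the accumulator
  wmma::load_matrix_sync(a_frag, A, 16);               // leading dimension 16
  wmma::load_matrix_sync(b_frag, B, 16);
  wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // tensor-core MMA

  // Fused pointwise epilogue: apply ReLU to each accumulator element while
  // it is still in registers, avoiding a separate pass over memory.
  for (int i = 0; i < acc_frag.num_elements; ++i)
    acc_frag.x[i] = fmaxf(acc_frag.x[i], 0.0f);

  wmma::store_matrix_sync(C, acc_frag, 16, wmma::mem_row_major);
}
```

Applying the epilogue to the register-resident accumulator, rather than launching a second kernel over C, is what lets fused variants outperform a cuBLAS-based matmul followed by a separate pointwise pass.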