
MLIR-based code generation for GPU tensor cores

Published: 18 March 2022
DOI: 10.1145/3497776.3517770

ABSTRACT

The state of the art in high-performance deep learning today is driven primarily by manually developed libraries, optimized and highly tuned by expert programmers using low-level abstractions at significant effort. This effort is often repeated for similar hardware, and again for its successors. In this work, we pursue and evaluate a more modular and reusable approach: using compiler IR infrastructure to generate libraries by encoding all the required optimizations as a sequence of transformations and customized passes on an IR. We believe that until the recent introduction of MLIR (Multi-Level Intermediate Representation), it had been hard to represent and transform computation at various levels of abstraction within a single IR.
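
As a rough illustration of this point (our sketch, not an excerpt from the paper; the function name and shapes are made up, and op spellings follow current upstream MLIR), the entire computation can enter such a pipeline as a single high-level op, with tiling, mapping to GPU thread blocks and warps, and lowering to tensor core intrinsics all expressed as successive rewrites of the same IR:

    func.func @matmul(%A: memref<8192x8192xf16>, %B: memref<8192x8192xf16>,
                      %C: memref<8192x8192xf32>) {
      // One op captures the whole matmul; every later optimization is an
      // IR-to-IR rewrite of this function rather than a hand-written kernel.
      linalg.matmul ins(%A, %B : memref<8192x8192xf16>, memref<8192x8192xf16>)
                    outs(%C : memref<8192x8192xf32>)
      return
    }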

Using the MLIR infrastructure, we build a transformation and lowering pipeline that automatically generates near-peak-performance code for matrix-matrix multiplication (matmul), as well as matmul fused with simple pointwise operators, targeting tensor cores on NVIDIA GPUs. On a set of problem sizes ranging from 256 to 16384, our evaluation shows that we obtain 0.95× to 1.19× the performance of cuBLAS with FP32 accumulation and 0.80× to 1.60× with FP16 accumulation on NVIDIA's Ampere-based GeForce RTX 3090. Furthermore, by allowing the fusion of common pointwise operations with matmul, we obtain performance ranging from 0.95× to 1.67× of a cuBLAS-based implementation. Additionally, we present matmul-like examples such as 3-D contraction and batched matmul, which the pipeline handles efficiently while providing competitive performance. We believe these results motivate further research and engineering on automatic domain-specific library generation using compiler IR infrastructure for similar specialized accelerators.
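
To give a flavor of the level such a pipeline bottoms out at, the following sketch (ours, with illustrative 16×16 tile sizes; the op spellings follow upstream MLIR's gpu-dialect WMMA operations, which lower to NVVM/PTX mma intrinsics) shows a single warp-level tensor core multiply-accumulate, D = A·B + C, with FP16 inputs and FP32 accumulation:

    func.func @wmma_tile(%A: memref<16x16xf16>, %B: memref<16x16xf16>,
                         %C: memref<16x16xf32>) {
      // In a real kernel this body would sit inside a gpu.func and be
      // executed cooperatively by one warp.
      %c0 = arith.constant 0 : index
      // Load the operand tiles into per-warp matrix registers.
      %a = gpu.subgroup_mma_load_matrix %A[%c0, %c0] {leadDimension = 16 : index}
             : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
      %b = gpu.subgroup_mma_load_matrix %B[%c0, %c0] {leadDimension = 16 : index}
             : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
      %c = gpu.subgroup_mma_load_matrix %C[%c0, %c0] {leadDimension = 16 : index}
             : memref<16x16xf32> -> !gpu.mma_matrix<16x16xf32, "COp">
      // One tensor core multiply-accumulate over the 16x16x16 tile.
      %d = gpu.subgroup_mma_compute %a, %b, %c
             : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">
             -> !gpu.mma_matrix<16x16xf32, "COp">
      gpu.subgroup_mma_store_matrix %d, %C[%c0, %c0] {leadDimension = 16 : index}
             : !gpu.mma_matrix<16x16xf32, "COp">, memref<16x16xf32>
      return
    }

Surrounding concerns such as tiling for shared memory and fusing pointwise epilogues remain ordinary rewrites above this level, which is what allows the same pipeline to serve the matmul-like operations mentioned in the abstract.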


Published in

CC 2022: Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction
March 2022, 253 pages
ISBN: 978-1-4503-9183-2
DOI: 10.1145/3497776
Copyright © 2022 ACM


Publisher

Association for Computing Machinery, New York, NY, United States

