DOI: 10.1145/3410463.3414654
Research Article | Open Access

SparseRT: Accelerating Unstructured Sparsity on GPUs for Deep Learning Inference

Published: 30 September 2020

ABSTRACT

In recent years, there has been a flurry of research in deep neural network pruning and compression. Early approaches prune weights individually. However, it is difficult to take advantage of the resulting unstructured sparsity patterns on modern hardware like GPUs. As a result, pruning strategies which impose sparsity structures in the weights have become more popular. However, these structured pruning approaches typically lead to higher losses in accuracy than unstructured pruning. In this paper, we present SparseRT, a code generator that leverages unstructured sparsity to accelerate sparse linear algebra operations in deep learning inference on GPUs. For 1x1 convolutions and fully connected layers, we demonstrate geometric mean speedups of 3.4x over the equivalent dense computation at 90% sparsity and 5.4x at 95% sparsity when evaluated on hundreds of test cases in deep learning. For sparse 3x3 convolutions, we show speedups of over 5x on use cases in ResNet-50.
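To make the abstract's claim concrete, the sketch below shows, in plain NumPy/SciPy rather than SparseRT's generated GPU code, how a 1x1 convolution with an unstructured-sparse weight matrix reduces to a sparse-dense matrix multiplication, the operation SparseRT specializes kernels for. All layer shapes and the sparsity level are hypothetical.

    # Illustrative sketch only: a 1x1 convolution with pruned (90% sparse)
    # weights is mathematically a sparse-dense matrix multiply. SparseRT
    # generates specialized GPU kernels for this; here we only show the math
    # on the CPU with SciPy. All dimensions below are hypothetical.
    import numpy as np
    import scipy.sparse as sp

    c_in, c_out, h, w = 256, 512, 14, 14            # hypothetical layer shape
    weights = sp.random(c_out, c_in, density=0.10,  # 90% of weights pruned away
                        format="csr", dtype=np.float32)
    activations = np.random.rand(c_in, h * w).astype(np.float32)

    # The 1x1 convolution: each output pixel is the sparse weight matrix
    # applied to that pixel's input channels.
    output = weights @ activations                  # shape (c_out, h * w)
    print(output.shape)                             # (512, 196)

At 90% sparsity only about a tenth of the weight entries participate in the multiply; the reported speedups (3.4x at 90% sparsity, 5.4x at 95%) come from turning that reduction in arithmetic into actual GPU runtime savings, which, as the abstract notes, is difficult to achieve for unstructured sparsity patterns.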


Published in

PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques
September 2020, 505 pages
ISBN: 9781450380751
DOI: 10.1145/3410463
General Chair: Vivek Sarkar
Program Chair: Hyesoon Kim

Copyright © 2020 Owner/Author. This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher: Association for Computing Machinery, New York, NY, United States



          Acceptance Rates

Overall Acceptance Rate: 121 of 471 submissions, 26%

