Abstract
Tensor algebra is a powerful tool with applications in machine learning, data analytics, engineering and the physical sciences. Tensors are often sparse and compound operations must frequently be computed in a single kernel for performance and to save memory. Programmers are left to write kernels for every operation of interest, with different mixes of dense and sparse tensors in different formats. The combinations are infinite, which makes it impossible to manually implement and optimize them all. This paper introduces the first compiler technique to automatically generate kernels for any compound tensor algebra operation on dense and sparse tensors. The technique is implemented in a C++ library called taco. Its performance is competitive with best-in-class hand-optimized kernels in popular libraries, while supporting far more tensor operations.
Supplemental Material
Available for Download
- Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-scale Machine Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI’16). USENIX Association, Berkeley, CA, USA, 265–283. http://dl.acm.org/citation.cfm?id=3026877. 3026899Google ScholarDigital Library
- Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. 2014. Tensor Decompositions for Learning Latent Variable Models. J. Mach. Learn. Res. 15, Article 1 (Jan. 2014), 60 pages.Google ScholarDigital Library
- E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. 1999. LAPACK Users’ Guide (third ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA. Google ScholarCross Ref
- Gilad Arnold. 2011. Data-Parallel Language for Correct and Efficient Sparse Matrix Codes. Ph.D. Dissertation. University of California, Berkeley.Google Scholar
- Alexander A. Auer, Gerald Baumgartner, David E. Bernholdt, Alina Bibireata, Venkatesh Choppella, Daniel Cociorva, Xiaoyang Gao, Robert Harrison, Sriram Krishnamoorthy, Sandhya Krishnan, Chi-Chung Lam, Qingda Lu, Marcel Nooijen, Russell Pitzer, J. Ramanujam, P. Sadayappan, and Alexander Sibiryakov. 2006. Automatic code generation for many-body electronic structure methods: the tensor contraction engine. Molecular Physics 104, 2 (2006), 211–228. Google Scholar
- Brett W. Bader, Michael W. Berry, and Murray Browne. 2008. Discussion Tracking in Enron Email Using PARAFAC. Springer London, 147–163.Google Scholar
- Brett W Bader and Tamara G Kolda. 2007. Efficient MATLAB computations with sparse and factored tensors. SIAM Journal on Scientific Computing 30, 1 (2007), 205–231.Google ScholarDigital Library
- Satish Balay, William D Gropp, Lois Curfman McInnes, and Barry F Smith. 1997. Efficient management of parallelism in object-oriented numerical software libraries. In Modern software tools for scientific computing. Springer, Birkhäuser Boston, 163–202.Google Scholar
- Muthu Baskaran, Benoît Meister, Nicolas Vasilache, and Richard Lethin. 2012. Efficient and scalable computations with sparse tensors. In High Performance Extreme Computing (HPEC), 2012 IEEE Conference on. IEEE, 1–6. Google ScholarCross Ref
- James Bennett, Stan Lanning, et al. 2007. The netflix prize. In Proceedings of KDD cup and workshop, Vol. 2007. ACM, New York, 35.Google Scholar
- Jeff Bezanson, Stefan Karpinski, Viral B. Shah, and Alan Edelman. 2012. Julia: A Fast Dynamic Language for Technical Computing. (2012).Google Scholar
- Aart JC Bik and Harry AG Wijshoff. 1993. Compilation techniques for sparse matrix computations. In Proceedings of the 7th international conference on Supercomputing. ACM, 416–424.Google ScholarDigital Library
- Aart JC Bik and Harry AG Wijshoff. 1994. On automatic data structure selection and code generation for sparse computations. In Languages and Compilers for Parallel Computing. Springer, 57–75.Google Scholar
- Aydin Buluc and John R. Gilbert. 2008. On the representation and multiplication of hypersparse matrices. In IEEE International Symposium on Parallel and Distributed Processing, (IPDPS). 1–11. Google ScholarCross Ref
- Aydin Buluç, Jeremy T. Fineman, Matteo Frigo, John R. Gilbert, and Charles E. Leiserson. 2009. Parallel Sparse Matrixvector and Matrix-transpose-vector Multiplication Using Compressed Sparse Blocks. In Proceedings of the Twenty-first Annual Symposium on Parallelism in Algorithms and Architectures (SPAA ’09). ACM, New York, NY, USA, 233–244. Google ScholarDigital Library
- Jong-Ho Byun, Richard Lin, Katherine A Yelick, and James Demmel. 2012. Autotuning sparse matrix-vector multiplication for multicore. EECS, UC Berkeley, Tech. Rep (2012).Google Scholar
- Jonathon Cai, Muthu Baskaran, Benoît Meister, and Richard Lethin. 2015. Optimization of symmetric tensor computations. In High Performance Extreme Computing Conference (HPEC), 2015 IEEE. IEEE, 1–7. Google ScholarCross Ref
- Timothy A. Davis and Yifan Hu. 2011. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw. 38, 1, Article 1 (Dec. 2011). Google ScholarDigital Library
- Albert. Einstein. 1916. The Foundation of the General Theory of Relativity. Annalen der Physik 354 (1916), 769–822.Google ScholarCross Ref
- Evgeny Epifanovsky, Michael Wormit, Tomasz Kuś, Arie Landau, Dmitry Zuev, Kirill Khistyaev, Prashant Manohar, Ilya Kaliman, Andreas Dreuw, and Anna I Krylov. 2013. New implementation of high-level correlated methods using a general block tensor library for high-performance electronic structure calculations. Journal of computational chemistry 34, 26 (2013), 2293–2309. Google ScholarCross Ref
- Richard Feynman, Robert B. Leighton, and Matthew L. Sands. 1963. The Feynman Lectures on Physics. Vol. 3. Addison-Wesley.Google Scholar
- Google. 2017. TensorFlow Sparse Tensors. https://www.tensorflow.org/api_guides/python/sparse_ops . (2017).Google Scholar
- Gaël Guennebaud, Benoît Jacob, et al. 2010. Eigen v3. http://eigen.tuxfamily.org . (2010).Google Scholar
- Fred G. Gustavson. 1978. Two Fast Algorithms for Sparse Matrices: Multiplication and Permuted Transposition. ACM Trans. Math. Softw. 4, 3 (1978). Google ScholarDigital Library
- Intel. 2012. Intel math kernel library reference manual. Technical Report. 630813-051US, 2012. http://software.intel.com/ sites/products/documentation/hpc/mkl/mklman/mklman.pdf .Google Scholar
- Kenneth E. Iverson. 1962. A Programming Language. Wiley. Google ScholarDigital Library
- Oguz Kaya and Bora Uçar. 2015. Scalable sparse tensor decompositions in distributed memory systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 77. Google ScholarDigital Library
- Fredrik Kjolstad, Shoaib Kamil, Jonathan Ragan-Kelley, David Levin, Shinjiro Sueda, Desai Chen, Etienne Vouga, Danny Kaufman, Gurtej Kanwar, Wojciech Matusik, and Saman Amarasinghe. 2016. Simit: A Language for Physical Simulation. ACM Trans. Graphics (2016).Google ScholarDigital Library
- Donald Ervin Knuth. 1973. The art of computer programming: sorting and searching. Vol. 3. Pearson Education.Google Scholar
- Joseph C Kolecki. 2002. An Introduction to Tensors for Students of Physics and Engineering. Unixenguaedu 7, September (2002), 29.Google Scholar
- Vladimir Kotlyar. 1999. Relational Algebraic Techniques for the Synthesis of Sparse Matrix Programs. Ph.D. Dissertation. Cornell.Google Scholar
- Vladimir Kotlyar, Keshav Pingali, and Paul Stodghill. 1997. A relational approach to the compilation of sparse matrix programs. In Euro-Par’97 Parallel Processing. Springer, 318–327. Google ScholarCross Ref
- Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, and Richard Vuduc. 2015. An input-adaptive and in-place approach to dense tensor-times-matrix multiply. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 76. Google ScholarDigital Library
- Jiajia Li, Yuchen Ma, Chenggang Yan, and Richard Vuduc. 2016. Optimizing sparse tensor times matrix on multi-core and many-core architectures. In Proceedings of the Sixth Workshop on Irregular Applications: Architectures and Algorithms. IEEE Press, 26–33. Google ScholarCross Ref
- MATLAB. 2014. version 8.3.0 (R2014a). The MathWorks Inc., Natick, Massachusetts.Google Scholar
- Devin Matthews. 2017. High-Performance Tensor Contraction without Transposition. Technical Report.Google Scholar
- Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM conference on Recommender systems. ACM, 165–172. Google ScholarDigital Library
- Kathryn S McKinley, Steve Carr, and Chau-Wen Tseng. 1996. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems (TOPLAS) 18, 4 (1996), 424–453.Google ScholarDigital Library
- John Michael McNamee. 1971. Algorithm 408: a sparse matrix package (part I)[F4]. Commun. ACM 14, 4 (1971), 265–273. Google ScholarDigital Library
- Lenore Mullin and James Raynolds. 2014. Scalable, Portable, Verifiable Kronecker Products on Multi-scale Computers. Springer International Publishing, Cham, 111–129. Google ScholarCross Ref
- L. M. R. Mullin. 1988. A Mathematics of Arrays. Ph.D. Dissertation. Syracuse University.Google Scholar
- Thomas Nelson, Geoffrey Belter, Jeremy G. Siek, Elizabeth Jessup, and Boyana Norris. 2015. Reliable Generation of High-Performance Matrix Algebra. ACM Trans. Math. Softw. 41, 3, Article 18 (June 2015), 27 pages.Google ScholarDigital Library
- William Pugh and Tatiana Shpeisman. 1999. SIPR: A new framework for generating efficient code for sparse matrix computations. In Languages and Compilers for Parallel Computing. Springer, 213–229. Google ScholarCross Ref
- Yves Renard. 2017. Gmm++. (2017). http://download.gna.org/getfem/html/homepage/gmm/first-step.htmlGoogle Scholar
- Gregorio Ricci-Curbastro and Tullio Levi-Civita. 1901. Méthodes de calcul différentiel absolu et leurs applications. Math. Ann. 54 (1901).Google Scholar
- Hongbo Rong, Jongsoo Park, Lingxiang Xiang, Todd A. Anderson, and Mikhail Smelyanskiy. 2016. Sparso: Context-driven Optimizations of Sparse Linear Algebra. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation. ACM, 247–259. Google ScholarDigital Library
- Conrad Sanderson. 2010. Armadillo: An Open Source C++ Linear Algebra Library for Fast Prototyping and Computationally Intensive Experiments. Technical Report. NICTA.Google Scholar
- Shaden Smith, Jee W. Choi, Jiajia Li, Richard Vuduc, Jongsoo Park, Xing Liu, and George Karypis. 2017. FROSTT: The Formidable Repository of Open Sparse Tensors and Tools. (2017). http://frostt.io/Google Scholar
- Shaden Smith and George Karypis. 2015. Tensor-matrix products with a compressed sparse tensor. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms. ACM, 5. Google ScholarDigital Library
- Shaden Smith, Niranjay Ravindran, Nicholas Sidiropoulos, and George Karypis. 2015. SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication. In 2015 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 61–70. Google ScholarDigital Library
- Edgar Solomonik and Torsten Hoefler. 2015. Sparse Tensor Algebra as a Parallel Programming Model. arXiv preprint arXiv:1512.00066 (2015).Google Scholar
- Edgar Solomonik, Devin Matthews, Jeff R Hammond, John F Stanton, and James Demmel. 2014. A massively parallel tensor contraction framework for coupled-cluster computations. J. Parallel and Distrib. Comput. 74, 12 (2014), 3176–3190. Google ScholarDigital Library
- Daniele G Spampinato and Markus Püschel. 2014. A basic linear algebra compiler. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization. ACM, 23.Google ScholarDigital Library
- Paul Springer and Paolo Bientinesi. 2016. Design of a high-performance GEMM-like Tensor-Tensor Multiplication. arXiv preprint arXiv:1607.00145 (2016).Google Scholar
- Paul Stodghill. 1997. A Relational Approach to the Automatic Generation of Sequential Sparse Matrix Codes. Ph.D. Dissertation. Cornell.Google Scholar
- Scott Thibault, Lenore Mullin, and Matt Insall. 1994. Generating Indexing Functions of Regularly Sparse Arrays for Array Compilers. (1994).Google Scholar
- William F Tinney and John W Walker. 1967. Direct solutions of sparse network equations by optimally ordered triangular factorization. Proc. IEEE 55, 11 (1967), 1801–1809. Google ScholarCross Ref
- Stefan Van Der Walt, S Chris Colbert, and Gael Varoquaux. 2011. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering 13, 2 (2011), 22–30.Google Scholar
- Anand Venkat, Mary Hall, and Michelle Strout. 2015. Loop and Data Transformations for Sparse Matrix Code. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2015). 521–532. Google ScholarDigital Library
- Anand Venkat, Mahdi Soltan Mohammadi, Hongbo Rong, Rajkishore Barik, Jongsoo Park, Michelle Mills Strout, and Mary Hall. 2016. Automating Wavefront Parallelization for Sparse Matrix Computations. In In Supercomputing (SC).Google Scholar
- Bimal Viswanath, Alan Mislove, Meeyoung Cha, and Krishna P. Gummadi. 2009. On the Evolution of User Interaction in Facebook. In Proceedings of the 2nd ACM SIGCOMM Workshop on Social Networks (WOSN’09). Google ScholarDigital Library
- Richard Vuduc, James W. Demmel, and Katherine A. Yelick. 2005. OSKI: A library of automatically tuned sparse matrix kernels. Journal of Physics: Conference Series 16, 1 (2005), 521+. Google ScholarCross Ref
- Joerg Walter and Mathias Koch. 2007. uBLAS. (2007). http://www.boost.org/libs/numeric/ublas/doc/index.htmGoogle Scholar
- R. Clint Whaley and Jack Dongarra. 1998. Automatically Tuned Linear Algebra Software. In SuperComputing 1998: High Performance Networking and Computing. Google ScholarCross Ref
- Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. 2007. Optimization of Sparse Matrix-vector Multiplication on Emerging Multicore Platforms. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC ’07). New York, NY, USA, 38:1–38:12. Google ScholarDigital Library
- Michael E. Wolf and Monica S. Lam. 1991. A Data Locality Optimizing Algorithm. SIGPLAN Not. 26, 6 (May 1991), 30–44. Google ScholarDigital Library
- Michael Joseph Wolfe. 1982. Optimizing Supercompilers for Supercomputers. Ph.D. Dissertation. University of Illinois at Urbana-Champaign, Champaign, IL, USA. AAI8303027.Google ScholarDigital Library
- Huasha Zhao. 2014. High Performance Machine Learning through Codesign and Rooflining. Ph.D. Dissertation. EECS Department, University of California, Berkeley.Google Scholar
Index Terms
- The tensor algebra compiler
Recommendations
Compiler Support for Sparse Tensor Computations in MLIR
Sparse tensors arise in problems in science, engineering, machine learning, and data analytics. Programs that operate on such tensors can exploit sparsity to reduce storage requirements and computational time. Developing and maintaining sparse software by ...
taco: a tool to generate tensor algebra kernels
ASE '17: Proceedings of the 32nd IEEE/ACM International Conference on Automated Software EngineeringTensor algebra is an important computational abstraction that is increasingly used in data analytics, machine learning, engineering, and the physical sciences. However, the number of tensor expressions is unbounded, which makes it hard to develop and ...
Compiling Structured Tensor Algebra
Tensor algebra is essential for data-intensive workloads in various computational domains. Computational scientists face a trade-off between the specialization degree provided by dense tensor algebra and the algorithmic efficiency that leverages the ...
Comments