Accelerating the SVD bi-diagonalization of a batch of small matrices using GPUs☆
Introduction
The emergence of multicore and heterogeneous architectures requires many linear algebra algorithms to be redesigned to take advantage of accelerators, such as GPUs. A particularly challenging class of problems, arising in numerous applications, involves linear algebra operations on many small matrices. Their number can be in the thousands, even millions. For example, billions of 8 × 8 and 32 × 32 eigenvalue problems need to be solved in magnetic resonance imaging. Likewise, thousands of matrix–matrix (GEMM) and matrix–vector (GEMV) products are computed in hydrodynamic simulations with the finite element method [1]; here the matrix size grows with the order of the numerical method and can range from ten to a few hundred. GEMM is at the heart of deep neural network (DNN) computations, where rather than treating convolution as one large GEMM problem, it is much more efficient to view it as many small GEMMs [2]. Thus, Batched BLAS, and Batched GEMM in particular, are a central part of deep learning, and can therefore be used to accelerate frameworks like PaddlePaddle [3], Theano [4], TensorFlow [5], and Torch [6]. In an astrophysics ODE solver [7], multiple zones are simulated, and each zone corresponds to a small linear system solve based on an LU factorization. If the matrix is symmetric and positive definite, the problem reduces to a batched Cholesky factorization [8], [9]. In [10], there is demand to compute one million 25 × 8 SVD problems as well as n × 20 SVD problems, where n ranges from 32 to 4096.
Accelerating these many small linear algebra problems is extremely challenging on current many-core architectures, and on GPUs in particular. To address the wide range of application needs and the associated parallelization challenges, we developed a number of new batched computing techniques and designed batched Basic Linear Algebra Subroutines (BLAS), in particular the Level-2 BLAS GEMV and the Level-3 BLAS GEMM routines. To describe these developments, we start with related work and a summary of our contributions (Section 2), followed by algorithmic background (Section 3), the Householder bi-diagonalization algorithm (Section 4), a performance bound analysis and roofline model that guide our design and optimizations (Section 5), the main batched BLAS design, optimization techniques, and GPU implementation (Section 6), our auto-tuning strategy (Section 7), and performance on NVIDIA GPUs (Section 8) and AMD GPUs (Section 9). Finally, conclusions and future work directions are given in Section 10.
Section snippets
Related work and contributions
The acceleration of many small-sized linear algebra problems has become extremely challenging for current many-core architectures, and in particular GPUs. Standard interfaces have been proposed for some of these problems, called batched problems, so that they can be targeted for optimization and used by applications in a standard way, directly from highly optimized, standard numerical libraries, like (batched) BLAS and LAPACK [11]. Indeed, vendors like NVIDIA and Intel started to provide certain
Background
The SVD problem is to find orthogonal matrices U and V, and a diagonal matrix Σ with nonnegative elements, such that A = UΣV^T, where A is an m × n matrix. The diagonal elements of Σ are the singular values of A, the columns of U are called the left singular vectors of A, and the columns of V are called the right singular vectors of A. The problem is solved by a three-phase process:
1. Reduction phase: orthogonal matrices Q and P are applied on both the left and the right side of A to reduce it to a
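As a concrete illustration of the factorization being computed, the three factors and their defining properties can be checked in a few lines of NumPy (used here purely for exposition; the paper's implementations target GPU kernels, not NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 5
A = rng.standard_normal((m, n))

# Full SVD: U (m x m) and V (n x n) are orthogonal, s holds the singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the m x n diagonal factor Sigma and verify A = U Sigma V^T.
Sigma = np.zeros((m, n))
Sigma[:n, :n] = np.diag(s)
assert np.allclose(A, U @ Sigma @ Vt)

# Singular values are nonnegative and returned in descending order.
assert np.all(s >= 0) and np.all(np.diff(s) <= 0)
```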
Householder bi-diagonalization
The bi-diagonalization factorizes A = UBV^T, where U and V are orthogonal matrices and B is bi-diagonal, with non-zeros only on the diagonal and the first superdiagonal. This is done by the classic, stable Golub–Kahan method, which applies a sequence of Householder transformations [21]. Algorithmically, this corresponds to a sequence of in-place transformations, where A is overwritten by the entries of the bi-diagonal matrix B, as well as by the U and V holding the vectors defining the left and right
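The Golub–Kahan process can be sketched with a textbook, unblocked NumPy implementation. This is for exposition only, assuming m ≥ n; the paper's GPU version is blocked, batched, and works in place without accumulating U and V explicitly:

```python
import numpy as np

def householder(x):
    """Unit vector v such that (I - 2 v v^T) x = -sign(x0) * ||x|| * e1."""
    v = x.astype(float).copy()
    v[0] += np.copysign(np.linalg.norm(x), x[0])
    nv = np.linalg.norm(v)
    return v / nv if nv > 0 else v

def golub_kahan_bidiag(A):
    """Reduce A (m x n, m >= n) to upper bidiagonal B with A = U B V^T."""
    A = A.astype(float).copy()
    m, n = A.shape
    U, V = np.eye(m), np.eye(n)
    for j in range(n):
        # Left reflector: annihilate the subdiagonal entries of column j.
        v = householder(A[j:, j])
        A[j:, :] -= 2.0 * np.outer(v, v @ A[j:, :])
        U[:, j:] -= 2.0 * np.outer(U[:, j:] @ v, v)
        if j < n - 2:
            # Right reflector: annihilate row j beyond the superdiagonal.
            v = householder(A[j, j + 1:])
            A[:, j + 1:] -= 2.0 * np.outer(A[:, j + 1:] @ v, v)
            V[:, j + 1:] -= 2.0 * np.outer(V[:, j + 1:] @ v, v)
    return U, A, V

rng = np.random.default_rng(0)
A0 = rng.standard_normal((6, 4))
U, B, V = golub_kahan_bidiag(A0)
```

Because the reflectors are orthogonal, B has the same singular values as A0, which is what makes this reduction the first phase of the SVD.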
Performance bound analysis and roofline model
To evaluate the performance behavior of the reduction to bi-diagonal form and to analyze whether there are opportunities for improvement, we present a performance bound analysis and the associated roofline model. Similar to the one-sided factorizations (LU, Cholesky, QR), the two-sided factorizations (in particular, the bi-diagonal reduction) are split into a panel factorization and a trailing matrix update. Unlike the one-sided factorizations, the panel factorization requires computing
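A minimal roofline calculation shows why GEMV-dominated phases are limited by memory bandwidth rather than by compute. The hardware figures below are nominal K40c numbers assumed purely for illustration:

```python
# Roofline bound: attainable Gflop/s = min(peak, bandwidth * intensity).
# Nominal K40c figures, used here only as an illustrative assumption.
peak_dp   = 1430.0   # Gflop/s, double-precision peak
bandwidth = 288.0    # GB/s, theoretical memory bandwidth

# DGEMV on an n x n matrix: 2*n^2 flops against ~8*n^2 bytes of matrix
# traffic, so the arithmetic intensity is about 0.25 flop/byte for any n.
intensity = 2.0 / 8.0

bound = min(peak_dp, bandwidth * intensity)
print(bound)   # far below the compute peak, so GEMV is memory-bound
```

Because the intensity does not grow with n, no amount of tuning can push a GEMV-bound panel past this bandwidth ceiling; the gains must come from batching many panels to saturate the memory system.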
Batched BLAS design and implementation for GPUs
In a batched problem that is based on batched BLAS, many small dense matrices must be factorized simultaneously, meaning that a single kernel processes all of the matrices at once.
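The batched semantics can be illustrated with NumPy broadcasting over a leading batch dimension. This only mirrors the interface behavior, one call operating on every matrix in the batch, not the GPU kernel design itself:

```python
import numpy as np

# Fixed-size batched GEMM semantics: one call applies the same operation
# to every matrix in the batch, much as one GPU kernel launch assigns
# each problem in the batch to its own group of thread blocks.
batch, m, n, k = 1000, 16, 16, 16
rng = np.random.default_rng(1)
A = rng.standard_normal((batch, m, k))
B = rng.standard_normal((batch, k, n))

C = np.matmul(A, B)   # all 1000 small products in a single call

# Reference: a naive loop over the batch gives the same result.
C_ref = np.stack([A[i] @ B[i] for i in range(batch)])
assert np.allclose(C, C_ref)
```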
Auto-tuning
Efforts to maximize the performance of BLAS, and of GEMM especially, generally fall into two categories: writing assembly code and source-level code tuning. The vendor libraries (e.g., Intel MKL, AMD ACML and hipBLAS, NVIDIA cuBLAS) supply their own routines for their hardware. To achieve performance, the GEMM routine is implemented in assembly code, like the cuBLAS GEMM on Kepler GPUs. Assembly code usually delivers high performance; a disadvantage is that it is highly architecture-specific.
Performance on NVIDIA GPU
We conducted our experiments on a multicore system with two 8-core Intel Xeon E5-2670 (Sandy Bridge) sockets, with each core running at 2.6 GHz. Each socket has a shared 20 MB L3 cache, and each core has a private 256 KB L2 and a 64 KB L1 cache. The system is equipped with 64 GB of memory, and the theoretical double-precision peak is 20.8 Gflop/s per core, i.e., 332.8 Gflop/s in total for the two sockets. It is also equipped with an NVIDIA K40c GPU with 11.6 GB GDDR memory per card
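The quoted CPU peaks follow directly from the clock rate and the AVX width on Sandy Bridge (4 double-precision additions plus 4 multiplications per cycle):

```python
# Theoretical double-precision peak of the CPU test system.
ghz = 2.6                # clock rate per core, GHz
flops_per_cycle = 8      # Sandy Bridge AVX: 4 DP adds + 4 DP muls per cycle
cores = 2 * 8            # two 8-core sockets

per_core = ghz * flops_per_cycle   # 20.8 Gflop/s per core
total = per_core * cores           # 332.8 Gflop/s for both sockets
```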
Batched GEMV and GEMM performance on AMD GPU
Recently, AMD introduced the Radeon Open Compute platform (ROCm), an open-source platform for GPU-based high-performance computing [27]. The ROCm ecosystem includes Linux kernels, runtimes, the HCC compiler, and high-level mathematical libraries. ROCm introduces the Heterogeneous-Compute Interface for Portability (HIP), a layer allowing users to create portable applications that run on AMD and NVIDIA GPUs from essentially the same source code. Compared to the OpenCL C language, HIP is a C/C++ runtime API and kernel
Conclusions and future work
The use of GPUs to accelerate scientific codes has been studied extensively for large dense and sparse linear algebra problems that feature abundant data parallelism. Solving small linear algebra problems on current high-end many-core systems is more challenging. Still, small problems can be implemented relatively easily on multicore CPUs by taking advantage of the large CPU caches for data reuse. On the other hand, developing small problems on GPUs is not as straightforward. We
Acknowledgements
This work is partially supported by NSF Grant No. SI2:SSE 1740250, and the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, in support of the nation's exascale computing imperative.
References (29)
- et al., A step towards energy efficient computing: redesigning a hydrodynamic application on CPU–GPU, IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS), 2014.
- Accelerate Machine Learning with the cuDNN Deep Neural Network Library, 2015.
- PArallel Distributed Deep LEarning (Paddle), available at https://github.com/PaddlePaddle/Paddle, ...
- Theano Development Team, Theano: A Python Framework for Fast Computation of Mathematical Expressions, arXiv e-prints, ...
- TensorFlow Development Team, TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, software available at ...
- et al., Torch7: a Matlab-like environment for machine learning, BigLearn, NIPS Workshop, 2011.
- et al., Multicore and accelerator development for a leadership-class stellar astrophysics code, Proceedings of PARA 2012: State-of-the-Art in Scientific and Parallel Computing, 2012.
- et al., Poster: A Batched Cholesky Solver for Local RX Anomaly Detection on GPUs, 2013.
- N. Corporation, ...
- Batched SVD, available at https://devtalk.nvidia.com/default/topic/851534/batched-svd-/, ...
- A proposed API for batched basic linear algebra subprograms, MIMS EPrint 2016.25.
- Towards dense linear algebra for hybrid GPU accelerated manycore systems, Parallel Comput. Syst. Appl.
- Accelerating numerical dense linear algebra calculations with GPUs, Numer. Comput. GPUs.
- A framework for batched and GPU-resident factorization algorithms applied to block householder transformations.
☆ This is an extended version of our conference paper [18] that was invited to the JoCS special issue (https://doi.org/10.1016/j.procs.2017.05.237).