Accelerating the SVD bi-diagonalization of a batch of small matrices using GPUs

https://doi.org/10.1016/j.jocs.2018.01.007

Highlights

  • A batched BLAS design based on device functions and a big-tile setting is proposed for the GPU.

  • The batched BLAS approach is used to optimize the batched bi-diagonalization problem.

  • For the first time, batched BLAS implemented in HIP code on the AMD platform is presented and compared against the NVIDIA CUDA platform.

Abstract

The acceleration of many small-sized linear algebra problems has become extremely challenging for current many-core architectures, and in particular GPUs. Standard interfaces have been proposed for some of these problems, called batched problems, so that they get targeted for optimization and used in a standard way in applications, calling them directly from highly optimized, standard numerical libraries, like (batched) BLAS and LAPACK. While most of the developments have been for one-sided factorizations and solvers, many important applications – from big data analytics to information retrieval, low-rank approximations for solvers and preconditioners – require two-sided factorizations, and most notably the SVD factorization. To address these needs and the parallelization challenges related to them, we developed a number of new batched computing techniques and designed batched Basic Linear Algebra Subroutines (BLAS) routines, and in particular the Level-2 BLAS GEMV and the Level-3 BLAS GEMM routines, to solve them. We propose a device functions-based methodology and big-tile setting techniques in our batched BLAS design. The different optimization techniques result in many software versions that must be tuned, for which we adopt an auto-tuning strategy to automatically derive the optimized instances of the routines. We illustrate our batched BLAS approach to optimize batched SVD bi-diagonalization progressively on GPUs. The progression is illustrated on an NVIDIA K40c GPU, and also, ported and presented on AMD Fiji Nano GPU, using AMD's Heterogeneous–Compute Interface for Portability (HIP) C++ runtime API. We demonstrate achieving 80% of the theoretically achievable peak performance for the overall algorithm, and significant acceleration of the Level-2 BLAS GEMV and Level-3 BLAS GEMM needed compared to vendor-optimized libraries on GPUs and multicore CPUs. The optimization techniques in this paper are applicable to the other two-sided factorizations as well.

Introduction

The emergence of multicore and heterogeneous architectures requires many linear algebra algorithms to be redesigned to take advantage of accelerators, such as GPUs. A particularly challenging class of problems, arising in numerous applications, involves the use of linear algebra operations on many small-sized matrices. Their number can be thousands, even millions. For example, billions of 8 × 8 and 32 × 32 eigenvalue problems need to be solved in magnetic resonance imaging. Also, thousands of matrix–matrix (GEMM) and matrix–vector products (GEMV) are computed in hydrodynamic simulations with the finite element method [1]. Here the size of the matrices increases with the order of the numerical methods, and can range from ten to a few hundred. GEMM is at the heart of deep neural network (DNN) computations, where rather than treating convolution as one large GEMM problem, it is much more efficient to view it as many small GEMMs [2]. Thus, Batched BLAS, and Batched GEMM in particular, are a central part of performing deep learning, and therefore can be used to accelerate frameworks like PaddlePaddle [3], Theano [4], TensorFlow [5], and Torch [6]. In an astrophysics ODE solver [7], multiple zones are simulated, and each zone corresponds to a small linear system solve based on an LU factorization [7]. If the matrix is symmetric and positive definite, the problem is reduced to a batched Cholesky factorization [[8], [9]]. In [10], there are demands to compute one million 25 × 8 and n × 20 SVD problems, where n ranges from 32 to 4096.

The acceleration of the aforementioned many small-sized linear algebra problems has become extremely challenging for current many-core architectures, and in particular GPUs. To address the wide range of application needs and the parallelization challenges related to them, we developed a number of new batched computing techniques and designed batched Basic Linear Algebra Subroutines (BLAS) routines, and in particular the Level-2 BLAS GEMV and the Level-3 BLAS GEMM routines, to solve them. To describe these developments, we start with related work and a summary of our contributions (Section 2), followed by algorithmic background (Section 3) and the Householder bi-diagonalization algorithm (Section 4), performance analysis and a roofline model that guides our design and optimizations (Section 5), the main batched BLAS design, optimization techniques, and implementation for GPUs (Section 6), our auto-tuning strategy (Section 7), and performance on NVIDIA GPUs (Section 8) and AMD GPUs (Section 9). Finally, conclusions and future work directions are given in Section 10.

Section snippets

Related work and contributions

The acceleration of many small-sized linear algebra problems has become extremely challenging for current many-core architectures, and in particular GPUs. Standard interfaces have been proposed for some of these problems, called batched problems, so that they get targeted for optimization and used in a standard way in applications directly from highly optimized, standard numerical libraries, like (batched) BLAS and LAPACK [11]. Indeed, vendors like NVIDIA and Intel started to provide certain

Background

The SVD problem is to find orthogonal matrices U and V, and a diagonal matrix Σ with nonnegative elements, such that A = UΣVT, where A is an m × n matrix. The diagonal elements of Σ are the singular values of A, the columns of U are called left singular vectors of A, and the columns of V are called right singular vectors of A. This problem is solved by a three-phase process:

  • 1.

    Reduction phase: orthogonal matrices Q and P are applied on both the left and the right side of A to reduce it to a
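The overall decomposition A = UΣVT can be illustrated with a minimal NumPy sketch; note that numpy.linalg.svd performs all three phases (reduction to bi-diagonal form, bi-diagonal SVD, and back-transformation) internally via LAPACK, so this only demonstrates the properties of the result, not the staged process itself:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))          # an m x n matrix, m >= n

# Thin SVD: U is 8x5, s holds the 5 singular values, Vt is 5x5
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# U and V have orthonormal columns, and the singular values are
# nonnegative and sorted in descending order
assert np.allclose(U.T @ U, np.eye(5))
assert np.allclose(Vt @ Vt.T, np.eye(5))
assert np.all(s >= 0) and np.all(np.diff(s) <= 0)
assert np.allclose(U @ np.diag(s) @ Vt, A)
```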

Householder bi-diagonalization

The bi-diagonalization factorizes A = U BVT, where U and V are orthogonal matrices, and B is bi-diagonal with non-zeros only on the diagonal and upper superdiagonal. This is done by the classic stable Golub–Kahan method that applies a sequence of Householder transformations [21]. Algorithmically, this corresponds to a sequence of in-place transformations, where A is overwritten by the entries of the bi-diagonal matrix B, as well as by the entries of U and V holding the vectors defining the left and right
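A minimal dense, unblocked NumPy sketch of the Golub–Kahan process follows. It forms the Householder reflectors explicitly and accumulates U and V for clarity; the GPU implementation in the paper is blocked, batched, and stores the reflector vectors in place of A instead:

```python
import numpy as np

def householder(x):
    # Return (v, beta) with (I - beta * v v^T) x = alpha * e_1.
    v = np.array(x, dtype=float)
    sigma = np.linalg.norm(v)
    if sigma == 0.0:
        return v, 0.0
    alpha = -np.copysign(sigma, v[0])    # sign choice for stability
    v[0] -= alpha
    return v, 2.0 / (v @ v)

def bidiagonalize(A):
    # Golub-Kahan: B = U^T A V is upper bi-diagonal, with U, V orthogonal.
    A = np.array(A, dtype=float)
    m, n = A.shape
    U, V = np.eye(m), np.eye(n)
    for k in range(n):
        # Left reflector zeroes A[k+1:, k] (below the diagonal)
        v, beta = householder(A[k:, k])
        H = np.eye(m - k) - beta * np.outer(v, v)
        A[k:, :] = H @ A[k:, :]
        U[:, k:] = U[:, k:] @ H
        if k < n - 2:
            # Right reflector zeroes A[k, k+2:] (right of the superdiagonal)
            v, beta = householder(A[k, k + 1:])
            H = np.eye(n - k - 1) - beta * np.outer(v, v)
            A[:, k + 1:] = A[:, k + 1:] @ H
            V[:, k + 1:] = V[:, k + 1:] @ H
    return U, A, V

rng = np.random.default_rng(0)
A0 = rng.standard_normal((6, 4))
U, B, V = bidiagonalize(A0)              # A0 = U @ B @ V.T
```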

Performance bound analysis and roofline model

In order to evaluate the performance behavior of the reduction to bi-diagonal form and to analyze whether there are opportunities for improvement, we present a performance bound analysis and the associated roofline model. Similar to the one-sided factorizations (LU, Cholesky, QR), the two-sided factorizations (in particular, the bi-diagonal reduction) are split into a panel factorization and a trailing matrix update. Unlike the one-sided factorizations, the panel factorization requires computing
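The flavor of such a bound can be sketched as a back-of-the-envelope roofline calculation. The hardware figures below are illustrative stand-ins (roughly K40c-class), not measurements; the key observation is that roughly half of the bi-diagonalization's ~8n³/3 flops are in memory-bound Level-2 BLAS (GEMV), so even if the Level-3 half were free, the overall rate cannot exceed twice the GEMV rate:

```python
# Illustrative hardware figures (assumed, K40c-class), not measured values
peak_gflops = 1430.0      # double-precision peak, Gflop/s
bandwidth_gbs = 288.0     # peak memory bandwidth, GB/s

# Double-precision GEMV: 2n^2 flops while streaming ~8n^2 bytes of A,
# i.e., an arithmetic intensity of ~0.25 flop/byte -> bandwidth bound
gemv_ai = 2.0 / 8.0
gemv_bound = min(peak_gflops, gemv_ai * bandwidth_gbs)   # Gflop/s

# Bi-diagonalization: ~8n^3/3 flops total, ~4n^3/3 of them in GEMV.
# Total time >= time of the GEMV half alone, so:
#   P_max = (8n^3/3) / ((4n^3/3) / P_gemv) = 2 * P_gemv
bidiag_bound = 2.0 * gemv_bound
print(f"GEMV bound: {gemv_bound:.0f} Gflop/s, "
      f"bi-diagonalization bound: {bidiag_bound:.0f} Gflop/s")
```

With these assumed figures the GEMV roof is 72 Gflop/s, capping the whole reduction at 144 Gflop/s, which is why the paper's optimizations target the memory-bound Level-2 BLAS first.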

Batched BLAS design and implementation for GPUs

In a batched problem that is based on batched BLAS, many small dense matrices must be factorized at the same time, meaning that all the matrices are processed simultaneously by the same kernel.
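The semantics of such a batched routine can be sketched in NumPy, with a single vectorized operation standing in for the single GPU kernel (on the GPU, each small problem would typically map to its own thread block):

```python
import numpy as np

rng = np.random.default_rng(1)
batch, m, n = 1000, 32, 32
A = rng.standard_normal((batch, m, n))   # a batch of small matrices
x = rng.standard_normal((batch, n))      # one vector per matrix

# Batched GEMV, y_i = A_i @ x_i for all i: one "kernel" launch
y = np.einsum('bmn,bn->bm', A, x)

# Equivalent to a loop of independent small GEMVs, but without the
# per-problem launch overhead that dominates for small sizes
y_loop = np.stack([A[i] @ x[i] for i in range(batch)])
assert np.allclose(y, y_loop)
```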

Auto-tuning

Efforts to maximize the performance of BLAS, especially GEMM, generally fall into two directions: writing assembly code and source-level code tuning. The vendor libraries (e.g., Intel MKL, AMD ACML and hipBLAS, NVIDIA CUBLAS) supply their own routines for their hardware. To achieve high performance, the GEMM routine is implemented in assembly code, like the CUBLAS GEMM on Kepler GPUs. The assembly code usually delivers high performance. A disadvantage is that it is highly architecture-specific.
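The source-level alternative can be sketched as a tuning loop that times every generated variant and keeps the fastest. Everything here is a simplified stand-in: `block` mimics a tunable tile size, and the kernel body is the same for every candidate, whereas a real auto-tuner would compile and benchmark genuinely different kernel instantiations:

```python
import time
import numpy as np

def batched_gemm(A, B, block):
    # Stand-in "kernel variant": on a real GPU, 'block' would select a
    # differently tiled kernel; here the arithmetic is identical.
    return A @ B

def autotune(candidates, A, B, reps=3):
    # Time each candidate configuration and return the fastest one
    best, best_t = None, float('inf')
    for block in candidates:
        t0 = time.perf_counter()
        for _ in range(reps):
            batched_gemm(A, B, block)
        t = (time.perf_counter() - t0) / reps
        if t < best_t:
            best, best_t = block, t
    return best

rng = np.random.default_rng(2)
A = rng.standard_normal((500, 16, 16))
B = rng.standard_normal((500, 16, 16))
print("selected tile size:", autotune([8, 16, 32], A, B))
```

In practice the search space also covers thread-block shapes, unrolling factors, and prefetching choices, which is why an automated sweep is preferable to hand-tuning each architecture.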

Performance on NVIDIA GPU

We conducted our experiments on a multicore system with two 8-cores socket Intel Xeon E5-2670 (Sandy Bridge) processors with each running at 2.6 GHz. Each socket has a shared 20 MB L3 cache, and each core has a private 256 KB L2 and a 64 KB L1 cache. The system is equipped with 64 GB of memory and the theoretical peak in double precision is 20.8 Gflop/s per core, i.e., 332.8 Gflop/s in total for the two sockets. It is also equipped with an NVIDIA K40c GPU with 11.6 GB GDDR memory per card

Batched GEMV and GEMM performance on AMD GPU

Recently, AMD introduced the Radeon Open Compute Platform (ROCm) open-sourced platform for GPU-based high-performance computing [27]. The ROCm ecosystem includes Linux kernels, runtimes, the HCC compiler, and high level mathematical libraries. ROCm introduces a Heterogeneous–Compute Interface for Portability (HIP) layer allowing users to create portable applications that run on AMD and NVIDIA GPUs with similar source code. Compared to the OpenCL C language, HIP is a C/C++ runtime API and kernel

Conclusions and future work

The use of GPUs to accelerate scientific codes has been observed extensively on large dense and sparse linear algebra problems that feature abundant data parallelism. Solving small linear algebra problems on current high-end many-core systems is more challenging. Still, small problems can be implemented relatively easily for multicore CPUs by taking advantage of the large CPU caches for data reuse. On the other hand, the development of small problems on GPUs is not as straightforward. We

Acknowledgements

This work is partially supported by NSF Grant No. SI2:SSE 1740250, and the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, in support of the nation's exascale computing imperative.

References (29)

  • T. Dong et al.

    A step towards energy efficient computing: redesigning a hydrodynamic application on CPU–GPU

    IEEE 28th International Parallel Distributed Processing Symposium (IPDPS)

    (2014)
  • L. Brown

    Accelerate Machine Learning with the cuDNN Deep Neural Network Library

    (2015)
  • PArallel Distributed Deep LEarning (Paddle), Available at https://github.com/PaddlePaddle/Paddle,...
  • Theano Development Team, Theano: A Python Framework for Fast Computation of Mathematical Expressions, arXiv e-prints...
  • TensorFlow Development Team, TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, Software Available at...
  • R. Collobert et al.

    Torch7: a Matlab-like environment for machine learning

    BigLearn, NIPS Workshop

    (2011)
  • O. Messer et al.

    Multicore and accelerator development for a leadership-class stellar astrophysics code

    Proceedings of PARA 2012: State-of-the-Art in Scientific and Parallel Computing

    (2012)
  • J. Molero et al.

    Poster: A Batched Cholesky Solver for Local RX Anomaly Detection on GPUs

    (2013)
  • N. Corporation,...
  • Batched SVD, Available at https://devtalk.nvidia.com/default/topic/851534/batched-svd-/,...
  • J. Dongarra et al.

    A proposed API for batched basic linear algebra subprograms

    MIMS EPrint 2016.25

    (2016)
  • S. Tomov et al.

    Towards dense linear algebra for hybrid GPU accelerated manycore systems

    Parallel Comput. Syst. Appl.

    (2010)
  • J. Dongarra et al.

    Accelerating numerical dense linear algebra calculations with GPUs

    Numer. Comput. GPUs

    (2014)
  • A. Haidar et al.

    A framework for batched and GPU-resident factorization algorithms applied to block householder transformations


    This is an extended version of our conference paper [18] that was invited to the JoCS special issue (https://doi.org/10.1016/j.procs.2017.05.237).
