Parallel Computing

Volume 81, January 2019, Pages 1-21

Algorithms and optimization techniques for high-performance matrix-matrix multiplications of very small matrices

https://doi.org/10.1016/j.parco.2018.10.003

Abstract

Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices, of sizes smaller than 32, are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and their optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of the optimal for any of the architectures of interest. For example, on a V100 GPU, for square matrices of size 32, we achieve an execution rate of about 1600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA cuBLAS, as well as open-source libraries like OpenBLAS and Eigen.

Introduction

The available parallelism to exploit in today's computer architectures is pervasive, not only in systems ranging from large supercomputers to laptops, but also in small portable devices like smartphones and watches. Along with parallelism, the level of heterogeneity in modern computing systems is also gradually increasing. Multi-core CPUs are combined with discrete high-performance GPUs, or are even integrated as a system-on-chip (SoC), as in the NVIDIA Tegra family of mobile devices. Heterogeneity makes parallel programming for technical computing problems extremely challenging, especially in modern applications that require fast linear algebra on many independent problems that are of size 100 and smaller. According to a recent survey among the Scalable Linear Algebra PACKage (ScaLAPACK) and Matrix Algebra on GPU and Multicore Architectures (MAGMA) [1] users, 40% of the respondents needed this functionality for applications in machine learning, big data analytics, signal processing, batched operations for sparse preconditioners, algebraic multigrid, sparse direct multi-frontal solvers, QR types of factorizations on small problems, astrophysics, and high-order finite element methods (FEM). At some point in their execution, applications like these must perform a computation that is cumulatively very large and often needs to run on large-scale distributed-memory systems, but whose individual parts are very small; when such operations are implemented naively using the typical approaches, they perform poorly. To address these challenges, there are efforts in the community to extend the basic linear algebra subprograms (BLAS) standard to include an API for Hybrid Batched BLAS [2], as well as to develop innovative algorithms [3], data and task abstractions [4], and high-performance implementations based on the standard. Some of these efforts have been released as examples through the MAGMA library since version 2.0 [5], [6]. Fig. 1 illustrates how the need for batched operations and new data types arises in areas like linear algebra (left) and machine learning (right). The computational characteristics in these cases are common to many applications where the overall computation is very large but is made of operations of interest that are generally small. The small operations must be batched for efficiency, and various transformations must be explored to cast them into regular, and therefore efficient to implement, operations like GEMMs. This is the case in a number of applications that are currently of great interest, like data analytics and machine learning, where tensor data structures and APIs are used to represent higher-dimensional multi-linear relations in the data; still, for high performance and efficiency, the computation is flattened to linear algebra on two-dimensional matrix data through Batched GEMMs [4].
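For concreteness, the operation being batched is simply many independent small GEMMs, C_i = αA_iB_i + βC_i, one per problem in the batch. The following minimal, unoptimized C++ sketch only illustrates these semantics; the function name, the fixed square size, and the column-major layout are illustrative assumptions, not the optimized kernels or the batched API discussed later.

```cpp
#include <cstddef>
#include <vector>

// Reference semantics of a fixed-size batched GEMM: for every problem b in the
// batch, C[b] = alpha * A[b] * B[b] + beta * C[b], with column-major n x n matrices.
// This only illustrates what is being batched; the function name is illustrative.
void gemm_batched_ref(int n, double alpha,
                      const std::vector<const double*>& A,
                      const std::vector<const double*>& B,
                      double beta,
                      const std::vector<double*>& C)
{
    for (std::size_t b = 0; b < C.size(); ++b) {          // independent small GEMMs
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i) {
                double acc = 0.0;
                for (int k = 0; k < n; ++k)
                    acc += A[b][i + k * n] * B[b][k + j * n];
                C[b][i + j * n] = alpha * acc + beta * C[b][i + j * n];
            }
    }
}
```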

The Batched GEMMs that we target in this paper, and that are needed in a number of applications, lack sufficient optimization in existing libraries. We show that this is the case through a theoretical performance analysis and a comparison between the results from the techniques introduced in this paper and vendor libraries like cuBLAS for NVIDIA GPUs and MKL for Intel multi-core CPUs, as well as a comparison to the open-source Eigen library [7]. Related work on GEMM and its use for tensor contractions [4] targets only GPUs and only very small sizes (16 and smaller). Batched GEMMs for fixed and variable sizes in the range of 1000 and smaller were developed in [8]. The main target here is Batched GEMMs for multi-core CPU, ARM, Intel Xeon Phi, and GPU architectures on matrices of sizes up to 32. In a preliminary study [9], we investigated this problem and laid out some of the ideas on algorithms and optimization techniques needed to accelerate them on modern architectures. In this paper, we extend these preliminary results by completing the algorithmic work, providing further details and extended functionalities, and generalizing the approach and the portability of the developments. We design a generic framework that incorporates all developments: the framework auto-generates kernels for every new architecture and autotunes them to find the best-performing kernels. While we produce a single tool, the best kernels for different architectures and sizes are different, incorporating different optimization techniques, algorithms, and tuning parameters, which we highlight and analyze in this paper. Performance results are updated and include IBM Power8 processors and newer GPU architectures, e.g., the V100 GPU. We also add results and analysis on the performance differences and a comparison to the Eigen library. A direct motivation for this work came from the need to accelerate large-scale, distributed-memory solvers in applications using high-order finite element discretizations, where tensor contraction computations are cast as Batched GEMMs [4].

Section snippets

Main contributions

The rapid advancements in semiconductor technologies today are dramatically transforming the balance of future computer systems, producing unprecedented changes at every level of the platform pyramid and the software stack. There are three challenges that stand out for numerical libraries and the myriad of applications that depend on them: (1) the need to exploit unprecedented amounts of parallelism; (2) the need to maximize the use of data locality and vectorized operations; and (3) the need

Performance model for batched GEMMs

To evaluate the efficiency of our algorithms, we derive theoretical bounds for the maximum achievable performance P_max = F / T_min, where F is the number of operations needed by the computation, F = 2n^3, and T_min is the fastest time to solution. For simplicity, consider C = αAB + βC on square matrices of size n. In an ideal case, where we assume that there is overlap between computation and communication, T_min becomes

    T_min = max( T_Read(A,B,C) + T_Write(C), T_Compute(C) ).

Let β define the maximum achievable
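As a concrete instance of this model, the sketch below evaluates P_max = F / T_min for a batch of square double-precision GEMMs. The assumed memory traffic (A, B, and C read once and C written once per problem) and the sample bandwidth and peak numbers are illustrative assumptions, not measurements from the paper.

```cpp
#include <algorithm>
#include <cstdio>

// Evaluate the bound P_max = F / T_min for a batch of square double-precision GEMMs.
// Traffic model: read A, B, C once and write C once, i.e. 4*n^2 doubles per problem.
double batched_dgemm_bound_gflops(int n, long batch,
                                  double bandwidth_GBs,   // sustained memory bandwidth
                                  double peak_gflops)     // double-precision compute peak
{
    const double flops  = 2.0 * n * n * n * batch;              // F = 2 n^3 per GEMM
    const double bytes  = 4.0 * n * n * sizeof(double) * batch; // A, B, C reads + C write
    const double t_mem  = bytes / (bandwidth_GBs * 1e9);        // T_Read + T_Write
    const double t_comp = flops / (peak_gflops * 1e9);          // T_Compute
    const double t_min  = std::max(t_mem, t_comp);              // overlap assumed
    return flops / t_min * 1e-9;                                // P_max in GFLOP/s
}

int main() {
    // Illustrative numbers only (roughly V100-class: ~900 GB/s, ~7000 GFLOP/s peak).
    std::printf("P_max ~ %.0f GFLOP/s\n",
                batched_dgemm_bound_gflops(32, 100000, 900.0, 7000.0));
}
```

For n = 32 the ratio is 2 flops per byte, so with these assumed numbers the bound is memory-bandwidth limited, which is consistent with the kind of theoretically derived peak discussed in the paper.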

Experimental hardware

All experiments are done on an Intel multi-core system with two 10-core Intel Xeon E5-2650 v3 (Haswell) CPUs, a 4-core ARM Cortex-A57 CPU, two 10-core IBM Power8 CPUs, a 68-core Intel Xeon Phi 7250 (Knights Landing) CPU, a Pascal-generation Tesla P100 GPU, and the newest Volta V100 GPU. Details about the hardware are illustrated in Fig. 2. We used the GNU Compiler Collection (GCC) 5.3 for our Xeon code (with options -std=c++14 -O3 -mavx2 -mfma), as well as the icc compiler from the Intel suite

The kernel auto-generation and autotuning process

The dgemm kernel is parameterized and implemented using C++ features, including templates and overloaded functions. The kernel design for small matrix sizes is illustrated in Fig. 3a. The matrix C is split into blocks Cij of size BLKM × BLKN that can be computed in parallel. Since C is where the computations are accumulated and the final result is written, it is best to keep as large a part of C as possible in registers during the accumulation of the multiplication. Note that
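To make the blocking concrete, here is a simplified scalar sketch of the idea: one BLK_M × BLK_N block C_ij is accumulated in a small local array intended to stay in registers and is written back only once. The actual generated kernels add vectorization, loop unrolling, and architecture-specific tuning on top of this; the names, the column-major layout, and the assumption that the matrix sizes divide evenly by the block sizes are illustrative.

```cpp
// Simplified sketch of the blocking in Fig. 3a (column-major, leading dimensions
// equal to the matrix sizes). A is M x K, B is K x N, C is M x N.
template <int BLK_M, int BLK_N, int M, int N, int K>
void gemm_block(const double* A, const double* B, double* C,
                double alpha, double beta, int ib, int jb)  // block row/column indices
{
    static_assert(M % BLK_M == 0 && N % BLK_N == 0,
                  "this sketch assumes the block sizes divide the matrix sizes");

    double rC[BLK_M][BLK_N] = {};                  // C_ij accumulator, kept "in registers"
    for (int k = 0; k < K; ++k)
        for (int j = 0; j < BLK_N; ++j)
            for (int i = 0; i < BLK_M; ++i)
                rC[i][j] += A[(ib * BLK_M + i) + k * M] * B[k + (jb * BLK_N + j) * K];

    for (int j = 0; j < BLK_N; ++j)                // scale and write back once at the end
        for (int i = 0; i < BLK_M; ++i) {
            double& c = C[(ib * BLK_M + i) + (jb * BLK_N + j) * M];
            c = alpha * rC[i][j] + beta * c;
        }
}
```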

Programming model, performance analysis, and optimization for CPUs

The overall design fits the general description given in Section 5. However, there are specifics for the CPU and GPU cases. Here we describe in more detail our CPU-specific considerations, design, and optimizations.

To design a framework with better code re-usability and adaptability, our overall design and software construction choices include the use of new C++ features. By using advanced template techniques, we can create high-level interfaces [13] without adding any
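As an illustration of the kind of template-based interface meant here (the class and function names are illustrative, not the library's actual API), matrix sizes can be carried as compile-time template parameters, so each instantiated size compiles to fully unrolled, specialized code with no runtime dispatch overhead:

```cpp
// Sizes as compile-time constants: one generic entry point, size-specialized code.
template <typename T, int M, int N>
struct SmallMatrix {
    T data[M * N];                                   // column-major storage
    T& operator()(int i, int j)       { return data[i + j * M]; }
    T  operator()(int i, int j) const { return data[i + j * M]; }
};

template <typename T, int M, int N, int K>
void gemm(T alpha, const SmallMatrix<T, M, K>& A, const SmallMatrix<T, K, N>& B,
          T beta, SmallMatrix<T, M, N>& C)
{
    // With M, N, K known at compile time, the compiler can unroll and
    // vectorize these loops for each instantiated size.
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < M; ++i) {
            T acc = T(0);
            for (int k = 0; k < K; ++k)
                acc += A(i, k) * B(k, j);
            C(i, j) = alpha * acc + beta * C(i, j);
        }
}
```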

Programming model, performance analysis, and optimization for GPUs

For the GPU development, our goal is a unified code base that can achieve high performance for very small matrices. The design strategy is different from that of the MAGMA batched GEMM kernel for medium and large sizes [8]. The latter uses a hierarchical blocking technique, where different thread blocks (TBs) compute different blocks of the output matrix C. With such a design, a TB reads an entire block row of A and an entire block column of B to compute its share of C. Obviously,
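By contrast, for very small matrices a single thread block can own an entire problem of the batch. The following CUDA sketch illustrates that strategy only in outline (it is not MAGMA's actual kernel): one block per problem, one thread per column of C held in registers, and A staged once in shared memory. The launch configuration shown is likewise an assumption.

```cuda
// Illustrative CUDA sketch: one thread block per batch entry, N <= 32 (square).
template <int N>
__global__ void dgemm_batched_small(const double* const* A, const double* const* B,
                                    double* const* C, double alpha, double beta)
{
    __shared__ double sA[N * N];               // stage A once for reuse by all threads
    const int batch = blockIdx.x;              // one problem per thread block
    const int j     = threadIdx.x;             // this thread computes column j of C

    for (int t = threadIdx.x; t < N * N; t += blockDim.x)
        sA[t] = A[batch][t];
    __syncthreads();

    double rC[N];                              // column j of C, kept in registers
    #pragma unroll
    for (int i = 0; i < N; ++i) rC[i] = 0.0;

    #pragma unroll
    for (int k = 0; k < N; ++k) {
        const double b = B[batch][k + j * N];  // column-major B
        #pragma unroll
        for (int i = 0; i < N; ++i)
            rC[i] += sA[i + k * N] * b;
    }

    for (int i = 0; i < N; ++i)
        C[batch][i + j * N] = alpha * rC[i] + beta * C[batch][i + j * N];
}
// Assumed launch: dgemm_batched_small<32><<<batchCount, 32>>>(dA, dB, dC, alpha, beta);
```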

Conclusions and future directions

This paper presented an extensive study of design and optimization techniques for Batched GEMMs on small matrices. The work is motivated by a large number of applications, ranging from machine learning and big data analytics to high-order finite element methods and more, all of which require fast linear algebra on many independent problems of size 32 and smaller. The use of standard Batched BLAS APIs in applications is essential for their performance portability. However, this

Acknowledgments

This material is based in part upon work supported by the US NSF under Grant No. OAC-1740250, by NVIDIA, and performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

References (26)

  • S. Tomov et al.

    Towards dense linear algebra for hybrid GPU accelerated manycore systems

    Parallel Comput.

    (2010)
  • J. Dongarra et al.

    A Proposed API for Batched Basic Linear Algebra Subprograms

    MIMS EPrint 2016.25

    (2016)
  • A. Haidar et al.

    A framework for batched and GPU-resident factorization algorithms applied to block householder transformations

    High Performance Computing

    (2015)
  • A. Abdelfattah et al.

    High-performance tensor contractions for GPUs

    International Conference on Computational Science (ICCS’16)

    (2016)
  • T. Dong et al.

    LU factorization of small matrices: accelerating Batched DGETRF on the GPU

    Proceedings of the 16th IEEE International Conference on High Performance Computing and Communications

    (2014)
  • A. Haidar et al.

    Batched matrix computations on hardware accelerators based on GPUs

    Int. J. High Perform. Comput. Appl.

    (2015)
  • G. Guennebaud, B. Jacob, et al., Eigen v3, 2010,...
  • A. Abdelfattah et al.

    Performance, design, and autotuning of batched GEMM for GPUs

    The International Supercomputing Conference (ISC High Performance 2016), Frankfurt, Germany

    (2016)
  • I. Masliah et al.

    High-performance matrix-matrix multiplications of very small matrices

    Euro-Par 2016: Parallel Processing – 22nd International Conference on Parallel and Distributed Computing, Grenoble, France, August 24–26, 2016, Proceedings

    (2016)
  • S.H. Fuller et al.

    The Future of Computing Performance: Game Over or Next Level?

    (2011)
  • J.D. McCalpin

    Memory bandwidth and machine balance in current high performance computers

    IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter

    (1995)
  • Intel Math Kernel Library, 2016. Available at...
  • I. Masliah et al.

    Metaprogramming dense linear algebra solvers applications to multi and many-core architectures

    2015 IEEE TrustCom/BigDataSE/ISPA, Helsinki, Finland, August

    (2015)