Algorithms and optimization techniques for high-performance matrix-matrix multiplications of very small matrices
Introduction
The available parallelism to exploit in today’s computer architectures is pervasive—not only in systems from large supercomputers to laptops, but also in small portable devices like smartphones and watches. Along with parallelism, the level of heterogeneity in modern computing systems is also gradually increasing. Multi-core CPUs are combined with discrete high-performance GPUs, or even integrated with them on a system-on-chip (SoC), as in the NVIDIA Tegra mobile family of devices. Heterogeneity makes parallel programming for technical computing extremely challenging, especially in modern applications that require fast linear algebra on many independent problems of size 100 and smaller. According to a recent survey among the Scalable Linear Algebra PACKage (ScaLAPACK) and Matrix Algebra on GPU and Multicore Architectures (MAGMA) [1] users, 40% of the respondents needed this functionality for applications in machine learning, big data analytics, signal processing, batched operations for sparse preconditioners, algebraic multigrid, sparse direct multi-frontal solvers, QR types of factorizations on small problems, astrophysics, and high-order finite element methods (FEM). At some point in their execution, applications like these must perform a computation that is cumulatively very large and that often needs to run on large-scale distributed-memory systems, but whose individual parts are very small; when such operations are implemented naively using the typical approaches, they perform poorly. To address these challenges, there are efforts in the community to extend the basic linear algebra subprograms (BLAS) standard with an API for Hybrid Batched BLAS [2], as well as to develop innovative algorithms [3], data and task abstractions [4], and high-performance implementations based on the standard. Some of these efforts have been released as examples through the MAGMA library since version 2.0 [5], [6]. Fig.
1 illustrates how the need for batched operations and new data types arises in areas like linear algebra (left) and machine learning (right). The computational characteristics in these cases are common to many applications where the overall computation is very large but is made up of operations of interest that are generally small. The small operations must be batched for efficiency, and various transformations must be explored to cast them into regular, and therefore efficient to implement, operations like GEMMs. This is the case in a number of applications that are currently of great interest, like data analytics and machine learning, where tensor data structures and APIs are used to represent higher-dimensional multi-linear relations on data; but still, for high performance and efficiency, the computation is flattened to linear algebra on two-dimensional matrix data through Batched GEMMs [4].
The Batched GEMMs that we target in this paper, and that are needed in a number of applications, currently lack sufficient optimization. We show that this is the case through a theoretical performance analysis and a comparison between the results from the techniques introduced in this paper and vendor libraries like cuBLAS for NVIDIA GPUs and MKL for Intel multi-core CPUs, as well as a comparison to the open-source Eigen library [7]. Related work on GEMM and its use for tensor contractions [4] targets only GPUs and only very small sizes (16 and smaller). Batched GEMM for fixed and variable sizes in the range of 1000 and smaller was developed in [8]. The main target here is Batched GEMMs for multi-core CPU, ARM, Intel Xeon Phi, and GPU architectures on matrices of sizes up to 32. In a preliminary study [9], we investigated this problem and laid out some of the ideas on the algorithms and optimization techniques needed to accelerate them on modern architectures. In this paper, we extend these preliminary results by completing the algorithmic work, providing further details and extended functionality, and generalizing the approach and the portability of the developments. We design a generic framework that incorporates all of these developments: the framework auto-generates kernels for every new architecture and autotunes them to find the best-performing kernels. While we produce a single tool, the best kernels for different architectures and sizes differ, incorporating different optimization techniques, algorithms, and tuning parameters, which we highlight and analyze in this paper. Performance results are updated and include IBM Power8 processors and newer GPU architectures, e.g., the V100 GPU. We also add results and an analysis of the performance differences, including a comparison to the Eigen library.
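To make the targeted operation concrete, the following is a minimal, illustrative baseline for a batched DGEMM: Ci = αAiBi + βCi over a batch of small column-major n × n matrices. The function name and storage layout are our own illustration, not a library API; batched interfaces (e.g., in cuBLAS, MKL, or MAGMA) exist precisely to avoid the per-matrix call and scheduling overhead that such a naive loop incurs.

```cpp
#include <cassert>

// Naive batched DGEMM baseline (illustration only): C_i = alpha*A_i*B_i + beta*C_i
// for a batch of small column-major n x n matrices stored back to back.
void batched_dgemm_naive(int n, int batch, double alpha, const double* A,
                         const double* B, double beta, double* C) {
    const int stride = n * n;             // one matrix occupies n*n doubles
    for (int b = 0; b < batch; ++b) {     // one small GEMM per batch entry
        const double* Ab = A + b * stride;
        const double* Bb = B + b * stride;
        double* Cb = C + b * stride;
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i) {
                double acc = 0.0;
                for (int k = 0; k < n; ++k)
                    acc += Ab[i + k * n] * Bb[k + j * n];
                Cb[i + j * n] = alpha * acc + beta * Cb[i + j * n];
            }
    }
}
```

For matrices of size 32 and smaller, each iteration performs so little work that loop and memory-system overheads dominate, which is why the batched formulation and the optimizations studied in this paper are needed.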
A direct motivation for this work came from the need to accelerate large-scale, distributed-memory solvers in applications using high-order finite element discretizations, where tensor contraction computations are cast as Batched GEMMs [4].
Section snippets
Main contributions
The rapid advancements in semiconductor technologies today are dramatically transforming the balance of future computer systems, producing unprecedented changes at every level of the platform pyramid and the software stack. There are three challenges that stand out for numerical libraries and the myriad of applications that depend on them: (1) the need to exploit unprecedented amounts of parallelism; (2) the need to maximize the use of data locality and vectorized operations; and (3) the need
Performance model for batched GEMMs
To evaluate the efficiency of our algorithms, we derive theoretical bounds for the maximum achievable performance Pmax = F∕Tmin, where F is the number of floating-point operations needed by the computation and Tmin is the fastest possible time to solution. For simplicity, consider square matrices of size n. In an ideal case, where we assume overlap between computation and communication, Tmin becomes the time to move the data between memory and the processor. Let β define the maximum achievable
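A bound of this kind can be sketched numerically. The sketch below is our own illustration under stated assumptions: double precision, 2n³ flops per matrix, 4n² words of traffic per matrix (A and B read, C read and written once), and perfect overlap, so Tmin = max(F∕Ppeak, bytes∕β); the peak and bandwidth arguments are hypothetical machine numbers, not measurements.

```cpp
#include <algorithm>
#include <cassert>

// Roofline-style bound P_max = F / T_min for batched square DGEMM.
// Assumptions (ours): double precision; per matrix, F = 2*n^3 flops and
// 4*n^2 words moved (read A, B; read and write C); computation and
// communication overlap perfectly, so T_min = max(F/P_peak, bytes/beta).
double batched_dgemm_bound_gflops(int n, long batch,
                                  double peak_gflops, double bw_gbs) {
    double flops = 2.0 * n * n * n * batch;               // F
    double bytes = 4.0 * n * n * batch * sizeof(double);  // data traffic
    double t_compute = flops / (peak_gflops * 1e9);
    double t_memory  = bytes / (bw_gbs * 1e9);
    double t_min = std::max(t_compute, t_memory);         // overlap assumed
    return flops / t_min / 1e9;                           // GFLOP/s
}
```

For very small n the memory term dominates, and the bound reduces to roughly (n∕16)·β GFLOP/s in double precision, which shows why these sizes are bandwidth-limited on every architecture we consider.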
Experimental hardware
All experiments are done on an Intel multi-core system with two 10-core Intel Xeon E5-2650 v3 (Haswell) CPUs, a 4-core ARM Cortex-A57 CPU, two 10-core IBM Power8 CPUs, a 68-core Intel Xeon Phi 7250 (Knights Landing) CPU, a Pascal generation Tesla P100 GPU, and the newest Volta V100 GPU. Details about the hardware are given in Fig. 2. We used the GNU Compiler Collection (GCC) compiler 5.3 for our Xeon code (with options -std=c++14 -O3 -mavx2 -mfma), as well as the icc compiler from the Intel suite
The kernel auto-generation and autotuning process
The dgemm kernel is parameterized and implemented using C++ features, including templates and overloaded functions. The kernel design for small matrix sizes is illustrated in Fig. 3a. The matrix C is split into blocks Cij of size BLKM × BLKN that can be computed in parallel. The idea is that since C is where the computations are accumulated and the final result written, it is better to keep as large a part of C as possible in registers during the accumulation of the multiplication. Note that
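The register-blocking idea described above can be sketched as follows. This is our own simplified rendering of the design in Fig. 3a, not the MAGMA kernel itself: a BLKM × BLKN block Cij is accumulated in a local array that the compiler can keep in registers, with BLKM and BLKN as compile-time template parameters as in the paper's parameterized C++ approach.

```cpp
#include <cassert>

// Register-blocked inner kernel sketch: accumulate one BLKM x BLKN block
// Cij of a column-major n x n product in a local array (register-resident
// for small block sizes), then write it back once at the end.
template <int BLKM, int BLKN>
void gemm_block(int n, const double* A, const double* B, double* C,
                int i0, int j0) {            // (i0, j0): top-left of Cij
    double c[BLKM][BLKN] = {};               // accumulator kept in registers
    for (int k = 0; k < n; ++k)              // stream through A and B
        for (int j = 0; j < BLKN; ++j)
            for (int i = 0; i < BLKM; ++i)
                c[i][j] += A[(i0 + i) + k * n] * B[k + (j0 + j) * n];
    for (int j = 0; j < BLKN; ++j)           // single write-back of Cij
        for (int i = 0; i < BLKM; ++i)
            C[(i0 + i) + (j0 + j) * n] += c[i][j];
}
```

Because C is both the accumulation target and the final output, keeping Cij in registers for the whole k loop means each element of C is written to memory exactly once, while A and B are only streamed through.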
Programming model, performance analysis, and optimization for CPUs
The overall design fits the general description given in Section 5. However, there are specifics for the CPU and GPU cases. Here we provide in more detail the specifics for our CPU considerations, design, and optimizations.
In order to design a framework that has better code re-usability and adaptability, our overall designs and software construction choices include the use of new features of C++. By using advanced template techniques we can create high-level interfaces [13] without adding any
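The kind of zero-overhead interface meant here can be sketched with a size-templated kernel behind a thin runtime dispatcher. The names below are hypothetical illustrations of the technique, not the framework's actual API: because the size is a template parameter, the compiler sees constant loop bounds and can fully unroll and vectorize each specialization, while the dispatch itself costs one switch.

```cpp
#include <cassert>

// Size-specialized kernel: N is a compile-time constant, so loop bounds
// are known to the compiler and the loops can be unrolled and vectorized.
template <int N>
inline void gemm_fixed(const double* A, const double* B, double* C) {
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i) {
            double acc = 0.0;
            for (int k = 0; k < N; ++k)
                acc += A[i + k * N] * B[k + j * N];   // column-major
            C[i + j * N] = acc;                        // C = A * B
        }
}

// High-level entry point: runtime size, compile-time specialized kernels.
// The template machinery adds no per-element overhead at run time.
void gemm_small(int n, const double* A, const double* B, double* C) {
    switch (n) {
        case 2: gemm_fixed<2>(A, B, C); break;
        case 3: gemm_fixed<3>(A, B, C); break;
        case 4: gemm_fixed<4>(A, B, C); break;
        default: break;  // a generic fallback kernel would go here
    }
}
```

An auto-generation step can emit and instantiate such specializations for every size of interest, which is the mechanism that lets one code base serve many sizes and architectures without runtime abstraction cost.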
Programming model, performance analysis, and optimization for GPUs
Concerning the development for GPUs, we set a goal to have a unified code base that can achieve high performance for very small matrices. The design strategy is different from the MAGMA batched GEMM kernel for medium and large sizes [8]. The latter uses a hierarchical blocking technique where different thread blocks (TBs) compute different blocks of the output matrix C. With such a design, a TB reads an entire block row of A and an entire block column of B to compute its share of C. Obviously,
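The cost of that hierarchical design can be made explicit with a small host-side sketch. The function below is our own illustration (not MAGMA code): it counts the element reads incurred when each "thread block" (bi, bj) computing a BLK × BLK block of C reads a full block row of A and block column of B, showing the redundancy that makes this design a poor fit for very small matrices.

```cpp
#include <cassert>

// Count element reads for a TB-per-block-of-C hierarchical GEMM design:
// each thread block (bi, bj) reads a blk x n block row of A and an
// n x blk block column of B. Total reads come to 2 * n^2 * (n / blk),
// versus the 2 * n^2 of reading A and B exactly once each.
long hierarchical_gemm_reads(int n, int blk) {
    long reads = 0;
    int nb = n / blk;                 // assume blk divides n evenly
    for (int bi = 0; bi < nb; ++bi)
        for (int bj = 0; bj < nb; ++bj) {
            reads += (long)blk * n;   // block row of A
            reads += (long)n * blk;   // block column of B
        }
    return reads;
}
```

Every element of A and B is thus re-read nb times from global memory; for matrices of size 32 and smaller, where the whole problem fits in fast memory, this redundancy is avoidable, which motivates the different design used for the very-small-size kernels.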
Conclusions and future directions
This paper presented an extensive study of the design and optimization techniques for Batched GEMMs on small matrices. The work is motivated by a large number of applications ranging from machine learning to big data analytics, to high-order finite element methods and more, which all require fast linear algebra on many independent problems that are of size 32 and smaller. The use of standard Batched BLAS APIs in applications is essential for their performance portability. However, this
Acknowledgments
This material is based in part upon work supported by the US NSF under Grants no. OAC-1740250, NVIDIA, and under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
References (26)
- et al., Towards dense linear algebra for hybrid GPU accelerated manycore systems, Parallel Comput. (2010)
- et al., A Proposed API for Batched Basic Linear Algebra Subprograms, MIMS EPrint 2016.25 (2016)
- et al., A framework for batched and GPU-resident factorization algorithms applied to block Householder transformations, High Performance Computing (2015)
- et al., High-performance tensor contractions for GPUs, International Conference on Computational Science (ICCS’16) (2016)
- et al., LU factorization of small matrices: accelerating Batched DGETRF on the GPU, Proceedings of the 16th IEEE International Conference on High Performance and Communications (2014)
- et al., Batched matrix computations on hardware accelerators based on GPUs, Int. J. High Perform. Comput. Appl. (2015)
- G. Guennebaud, B. Jacob, et al., Eigen v3, 2010, ...
- et al., Performance, design, and autotuning of batched GEMM for GPUs, The International Supercomputing Conference (ISC High Performance 2016), Frankfurt, Germany (2016)
- et al., High-performance matrix-matrix multiplications of very small matrices, Euro-Par 2016: Parallel Processing, 22nd International Conference on Parallel and Distributed Computing, Grenoble, France, August 24–26, 2016, Proceedings (2016)
- et al., The Future of Computing Performance: Game Over or Next Level? (2011)
- Memory bandwidth and machine balance in current high performance computers, IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter
- Metaprogramming dense linear algebra solvers: applications to multi and many-core architectures, 2015 IEEE TrustCom/BigDataSE/ISPA, Helsinki, Finland, August 2015