Parallel Computing

Volume 81, January 2019, Pages 1-21

Algorithms and optimization techniques for high-performance matrix-matrix multiplications of very small matrices

https://doi.org/10.1016/j.parco.2018.10.003

Abstract

Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices, of sizes smaller than 32, are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and their optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of the optimal for any of the architectures of interest. For example, on a V100 GPU, for square matrices of size 32, we achieve an execution rate of about 1600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA cuBLAS, as well as open-source libraries like OpenBLAS and Eigen.

Introduction

The available parallelism to exploit in today's computer architectures is pervasive, not only in systems ranging from large supercomputers to laptops, but also in small portable devices like smartphones and watches. Along with parallelism, the level of heterogeneity in modern computing systems is also gradually increasing. Multi-core CPUs are combined with discrete high-performance GPUs, or are even integrated as a system-on-chip (SoC), as in the NVIDIA Tegra family of mobile devices. Heterogeneity makes parallel programming for technical computing problems extremely challenging, especially in modern applications that require fast linear algebra on many independent problems that are of size 100 and smaller. According to a recent survey among the Scalable Linear Algebra PACKage (ScaLAPACK) and Matrix Algebra on GPU and Multicore Architectures (MAGMA) [1] users, 40% of the respondents needed this functionality for applications in machine learning, big data analytics, signal processing, batched operations for sparse preconditioners, algebraic multigrid, sparse direct multi-frontal solvers, QR types of factorizations on small problems, astrophysics, and high-order finite element methods (FEM). At some point in their execution, applications like these must perform a computation that is cumulatively very large and often needs to run on large-scale distributed-memory systems, but whose individual parts are very small; when such operations are implemented naively using the typical approaches, they perform poorly. To address these challenges, there are efforts in the community to extend the basic linear algebra subprograms (BLAS) standard to include an API for Hybrid Batched BLAS [2], as well as to develop innovative algorithms [3], data and task abstractions [4], and high-performance implementations based on the standard. Some of these efforts have been released as examples through the MAGMA library since version 2.0 [5], [6]. Fig. 1 illustrates how the need for batched operations and new data types arises in areas like linear algebra (left) and machine learning (right). The computational characteristics in these cases are common to many applications where the overall computation is very large but is made of operations of interest that are generally small. The small operations must be batched for efficiency, and various transformations must be explored to cast them into regular, and therefore efficient to implement, operations like GEMMs. This is the case in a number of applications that are currently of great interest, like data analytics and machine learning, where tensor data structures and APIs are used to represent higher-dimensional multi-linear relations in the data; still, for high performance and efficiency, the computation is flattened to linear algebra on two-dimensional matrix data through Batched GEMMs [4].
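For concreteness, the operation being batched is simply many independent small GEMMs, C_i = αA_iB_i + βC_i, one per problem in the batch. The following minimal, unoptimized C++ sketch only illustrates these semantics; the function name, the fixed square size, and the column-major layout are illustrative assumptions, not the optimized kernels or the batched API discussed later.

```cpp
#include <cstddef>
#include <vector>

// Reference semantics of a fixed-size batched GEMM: for every problem b in the
// batch, C[b] = alpha * A[b] * B[b] + beta * C[b], with column-major n x n matrices.
// This only illustrates what is being batched; the function name is illustrative.
void gemm_batched_ref(int n, double alpha,
                      const std::vector<const double*>& A,
                      const std::vector<const double*>& B,
                      double beta,
                      const std::vector<double*>& C)
{
    for (std::size_t b = 0; b < C.size(); ++b) {          // independent small GEMMs
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i) {
                double acc = 0.0;
                for (int k = 0; k < n; ++k)
                    acc += A[b][i + k * n] * B[b][k + j * n];
                C[b][i + j * n] = alpha * acc + beta * C[b][i + j * n];
            }
    }
}
```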

The Batched GEMMs that we target in this paper, and that are needed in a number of applications, lack sufficient optimization in existing libraries. We show that this is the case through a theoretical performance analysis and a comparison between the results from the techniques introduced in this paper and vendor libraries like cuBLAS for NVIDIA GPUs and MKL for Intel multi-core CPUs, as well as a comparison to the open-source Eigen library [7]. Related work on GEMM and its use for tensor contractions [4] targets only GPUs and only very small sizes (16 and smaller). Batched GEMMs for fixed and variable sizes in the range of 1000 and smaller were developed in [8]. The main target here is Batched GEMMs for multi-core CPU, ARM, Intel Xeon Phi, and GPU architectures on matrices of sizes up to 32. In a preliminary study [9], we investigated this problem and laid out some of the ideas on algorithms and optimization techniques needed to accelerate them on modern architectures. In this paper, we extend these preliminary results by completing the algorithmic work, providing further details and extended functionalities, and generalizing the approach and the portability of the developments. We design a generic framework that incorporates all developments: the framework auto-generates kernels for every new architecture and autotunes them to find the best-performing kernels. While we produce a single tool, the best kernels for different architectures and sizes are different, incorporating different optimization techniques, algorithms, and tuning parameters, which we highlight and analyze in this paper. Performance results are updated and include IBM Power8 processors and newer GPU architectures, e.g., the V100 GPU. We also add results and analysis on the performance differences and a comparison to the Eigen library. A direct motivation for this work came from the need to accelerate large-scale, distributed-memory solvers in applications using high-order finite element discretizations, where tensor contraction computations are cast as Batched GEMMs [4].

Section snippets

Main contributions

The rapid advancements in semiconductor technologies today are dramatically transforming the balance of future computer systems, producing unprecedented changes at every level of the platform pyramid and the software stack. There are three challenges that stand out for numerical libraries and the myriad of applications that depend on them: (1) the need to exploit unprecedented amounts of parallelism; (2) the need to maximize the use of data locality and vectorized operations; and (3) the need

Performance model for batched GEMMs

To evaluate the efficiency of our algorithms, we derive theoretical bounds for the maximum achievable performance P_max = F / T_min, where F is the number of operations needed by the computation, F = 2n^3, and T_min is the fastest time to solution. For simplicity, consider C = αAB + βC on square matrices of size n. In an ideal case, where we assume that there is overlap between computation and communication, T_min becomes

    T_min = max( T_Read(A,B,C) + T_Write(C), T_Compute(C) ).

Let β define the maximum achievable
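As a concrete instance of this model, the sketch below evaluates P_max = F / T_min for a batch of square double-precision GEMMs. The assumed memory traffic (A, B, and C read once and C written once per problem) and the sample bandwidth and peak numbers are illustrative assumptions, not measurements from the paper.

```cpp
#include <algorithm>
#include <cstdio>

// Evaluate the bound P_max = F / T_min for a batch of square double-precision GEMMs.
// Traffic model: read A, B, C once and write C once, i.e. 4*n^2 doubles per problem.
double batched_dgemm_bound_gflops(int n, long batch,
                                  double bandwidth_GBs,   // sustained memory bandwidth
                                  double peak_gflops)     // double-precision compute peak
{
    const double flops  = 2.0 * n * n * n * batch;              // F = 2 n^3 per GEMM
    const double bytes  = 4.0 * n * n * sizeof(double) * batch; // A, B, C reads + C write
    const double t_mem  = bytes / (bandwidth_GBs * 1e9);        // T_Read + T_Write
    const double t_comp = flops / (peak_gflops * 1e9);          // T_Compute
    const double t_min  = std::max(t_mem, t_comp);              // overlap assumed
    return flops / t_min * 1e-9;                                // P_max in GFLOP/s
}

int main() {
    // Illustrative numbers only (roughly V100-class: ~900 GB/s, ~7000 GFLOP/s peak).
    std::printf("P_max ~ %.0f GFLOP/s\n",
                batched_dgemm_bound_gflops(32, 100000, 900.0, 7000.0));
}
```

For n = 32 the ratio is 2 flops per byte, so with these assumed numbers the bound is memory-bandwidth limited, which is consistent with the kind of theoretically derived peak discussed in the paper.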

Experimental hardware

All experiments are done on an Intel multi-core system with two 10-core Intel Xeon E5-2650 v3 (Haswell) CPUs, a 4-core ARM Cortex-A57 CPU, two 10-core IBM Power8 CPUs, a 68-core Intel Xeon Phi 7250 (Knights Landing) CPU, a Pascal-generation Tesla P100 GPU, and the newest Volta V100 GPU. Details about the hardware are illustrated in Fig. 2. We used the GNU Compiler Collection (GCC) 5.3 for our Xeon code (with options -std=c++14 -O3 -mavx2 -mfma), as well as the icc compiler from the Intel suite

The kernel auto-generation and autotuning process

The dgemm kernel is parameterized and implemented using C++ features, including templates and overloaded functions. The kernel design for small matrix sizes is illustrated in Fig. 3a. The matrix C is split into blocks Cij of size BLKM × BLKN that can be computed in parallel. Since C is where the computations are accumulated and the final result is written, it is best to keep as large a part of C as possible in registers during the accumulation of the multiplication. Note that
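To make the blocking concrete, here is a simplified scalar sketch of the idea: one BLK_M × BLK_N block C_ij is accumulated in a small local array intended to stay in registers and is written back only once. The actual generated kernels add vectorization, loop unrolling, and architecture-specific tuning on top of this; the names, the column-major layout, and the assumption that the matrix sizes divide evenly by the block sizes are illustrative.

```cpp
// Simplified sketch of the blocking in Fig. 3a (column-major, leading dimensions
// equal to the matrix sizes). A is M x K, B is K x N, C is M x N.
template <int BLK_M, int BLK_N, int M, int N, int K>
void gemm_block(const double* A, const double* B, double* C,
                double alpha, double beta, int ib, int jb)  // block row/column indices
{
    static_assert(M % BLK_M == 0 && N % BLK_N == 0,
                  "this sketch assumes the block sizes divide the matrix sizes");

    double rC[BLK_M][BLK_N] = {};                  // C_ij accumulator, kept "in registers"
    for (int k = 0; k < K; ++k)
        for (int j = 0; j < BLK_N; ++j)
            for (int i = 0; i < BLK_M; ++i)
                rC[i][j] += A[(ib * BLK_M + i) + k * M] * B[k + (jb * BLK_N + j) * K];

    for (int j = 0; j < BLK_N; ++j)                // scale and write back once at the end
        for (int i = 0; i < BLK_M; ++i) {
            double& c = C[(ib * BLK_M + i) + (jb * BLK_N + j) * M];
            c = alpha * rC[i][j] + beta * c;
        }
}
```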

Programming model, performance analysis, and optimization for CPUs

The overall design fits the general description given in Section 5. However, there are specifics for the CPU and GPU cases. Here we describe in more detail our CPU-specific considerations, design, and optimizations.

To design a framework with better code re-usability and adaptability, our overall design and software construction choices include the use of new C++ features. By using advanced template techniques, we can create high-level interfaces [13] without adding any
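As an illustration of the kind of template-based interface meant here (the class and function names are illustrative, not the library's actual API), matrix sizes can be carried as compile-time template parameters, so each instantiated size compiles to fully unrolled, specialized code with no runtime dispatch overhead:

```cpp
// Sizes as compile-time constants: one generic entry point, size-specialized code.
template <typename T, int M, int N>
struct SmallMatrix {
    T data[M * N];                                   // column-major storage
    T& operator()(int i, int j)       { return data[i + j * M]; }
    T  operator()(int i, int j) const { return data[i + j * M]; }
};

template <typename T, int M, int N, int K>
void gemm(T alpha, const SmallMatrix<T, M, K>& A, const SmallMatrix<T, K, N>& B,
          T beta, SmallMatrix<T, M, N>& C)
{
    // With M, N, K known at compile time, the compiler can unroll and
    // vectorize these loops for each instantiated size.
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < M; ++i) {
            T acc = T(0);
            for (int k = 0; k < K; ++k)
                acc += A(i, k) * B(k, j);
            C(i, j) = alpha * acc + beta * C(i, j);
        }
}
```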

Programming model, performance analysis, and optimization for GPUs

For the GPU development, our goal is a unified code base that can achieve high performance for very small matrices. The design strategy is different from that of the MAGMA batched GEMM kernel for medium and large sizes [8]. The latter uses a hierarchical blocking technique, where different thread blocks (TBs) compute different blocks of the output matrix C. With such a design, a TB reads an entire block row of A and an entire block column of B to compute its share of C. Obviously,
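By contrast, for very small matrices a single thread block can own an entire problem of the batch. The following CUDA sketch illustrates that strategy only in outline (it is not MAGMA's actual kernel): one block per problem, one thread per column of C held in registers, and A staged once in shared memory. The launch configuration shown is likewise an assumption.

```cuda
// Illustrative CUDA sketch: one thread block per batch entry, N <= 32 (square).
template <int N>
__global__ void dgemm_batched_small(const double* const* A, const double* const* B,
                                    double* const* C, double alpha, double beta)
{
    __shared__ double sA[N * N];               // stage A once for reuse by all threads
    const int batch = blockIdx.x;              // one problem per thread block
    const int j     = threadIdx.x;             // this thread computes column j of C

    for (int t = threadIdx.x; t < N * N; t += blockDim.x)
        sA[t] = A[batch][t];
    __syncthreads();

    double rC[N];                              // column j of C, kept in registers
    #pragma unroll
    for (int i = 0; i < N; ++i) rC[i] = 0.0;

    #pragma unroll
    for (int k = 0; k < N; ++k) {
        const double b = B[batch][k + j * N];  // column-major B
        #pragma unroll
        for (int i = 0; i < N; ++i)
            rC[i] += sA[i + k * N] * b;
    }

    for (int i = 0; i < N; ++i)
        C[batch][i + j * N] = alpha * rC[i] + beta * C[batch][i + j * N];
}
// Assumed launch: dgemm_batched_small<32><<<batchCount, 32>>>(dA, dB, dC, alpha, beta);
```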

Conclusions and future directions

This paper presented an extensive study of design and optimization techniques for Batched GEMMs on small matrices. The work is motivated by a large number of applications, ranging from machine learning and big data analytics to high-order finite element methods and more, all of which require fast linear algebra on many independent problems of size 32 and smaller. The use of standard Batched BLAS APIs in applications is essential for their performance portability. However, this

Acknowledgments

This material is based in part upon work supported by the US NSF under Grant No. OAC-1740250, by NVIDIA, and performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

References (26)

  • S. Tomov et al.

    Towards dense linear algebra for hybrid GPU accelerated manycore systems

    Parallel Comput.

    (2010)
  • J. Dongarra et al.

    A Proposed API for Batched Basic Linear Algebra Subprograms

    MIMS EPrint 2016.25

    (2016)
  • A. Haidar et al.

    A framework for batched and GPU-resident factorization algorithms applied to block householder transformations

    High Performance Computing

    (2015)
  • A. Abdelfattah et al.

    High-performance tensor contractions for GPUs

    International Conference on Computational Science (ICCS’16)

    (2016)
  • T. Dong et al.

    LU factorization of small matrices: accelerating Batched DGETRF on the GPU

    Proceedings of the 16th IEEE International Conference on High Performance Computing and Communications

    (2014)
  • A. Haidar et al.

    Batched matrix computations on hardware accelerators based on GPUs

    Int. J. High Perform. Comput. Appl.

    (2015)
  • G. Guennebaud, B. Jacob, et al., Eigen v3, 2010,...
  • A. Abdelfattah et al.

    Performance, design, and autotuning of batched GEMM for GPUs

    The International Supercomputing Conference (ISC High Performance 2016), Frankfurt, Germany

    (2016)
  • I. Masliah et al.

    High-performance matrix-matrix multiplications of very small matrices

    Euro-Par 2016: Parallel Processing – 22nd International Conference on Parallel and Distributed Computing, Grenoble, France, August 24–26, 2016, Proceedings

    (2016)
  • S.H. Fuller et al.

    The Future of Computing Performance: Game Over or Next Level?

    (2011)
  • J.D. McCalpin

    Memory bandwidth and machine balance in current high performance computers

    IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter

    (1995)
  • Intel Math Kernel Library, 2016. Available at...
  • I. Masliah et al.

    Metaprogramming dense linear algebra solvers applications to multi and many-core architectures

    2015 IEEE TrustCom/BigDataSE/ISPA, Helsinki, Finland, August

    (2015)