Parallel two-stage reduction to Hessenberg form using dynamic scheduling on shared-memory architectures
Highlights
► A dynamically scheduled shared-memory implementation of a two-stage reduction to Hessenberg form is presented and evaluated.
► Experiments show that the intermediate block Hessenberg form needs to have a relatively small number of non-zero sub-diagonals due to a performance trade-off between the two stages.
Introduction
The Hessenberg decomposition A = QHQ^T of a square matrix A, where H is an upper Hessenberg matrix and Q is orthogonal, has many applications in numerical linear algebra [1]. Our focus is on medium- to large-scale applications that require blocked and parallel algorithms. For example, the Hessenberg QR algorithm, which computes a Schur form, takes as input a matrix in Hessenberg form. Efficient implementations of the Hessenberg QR algorithm are described in, e.g., [2], [3]. Other examples include solvers for dense Sylvester-type matrix equations, which typically simplify the coefficient matrices to Schur and/or Hessenberg form [4], [5]. A third application is the solution of linear systems (A − σI)x = b for many shifts σ. By first reducing A to Hessenberg form, each subsequent system can then be solved with only O(n^2) floating point operations (flops).
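The shifted-solve application can be made concrete with a short sketch (using SciPy's one-stage `hessenberg` routine; the matrix size and shift values are arbitrary): A is reduced once, and the factorization is then reused for every shift. A generic dense solver is used below for brevity where a specialized Hessenberg solver would need only O(n^2) flops per shift.

```python
import numpy as np
from scipy.linalg import hessenberg

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# One-time O(n^3) reduction A = Q H Q^T (SciPy uses the one-stage algorithm).
H, Q = hessenberg(A, calc_q=True)

for sigma in (0.5, -1.3, 2.0):
    # (A - sigma*I) x = b  <=>  (H - sigma*I) y = Q^T b,  x = Q y.
    # A specialized Hessenberg solver would exploit the structure of H;
    # the generic dense solver is used here for clarity.
    y = np.linalg.solve(H - sigma * np.eye(n), Q.T @ b)
    x = Q @ y
```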
The aim of this paper is to describe and evaluate a new efficient Hessenberg reduction algorithm for shared-memory systems. Our implementation is primarily aimed at architectures with nearly uniform memory access. Most of the algorithmic techniques are certainly suitable for machines with non-uniform memory access as well, but the implementation will likely need some adjustments and is left for future work. The proposed algorithm consists of two stages [6], [7], whereas the conventional approach goes directly to Hessenberg form [8]. Stage 1 of the two-stage algorithm reduces the dense matrix A to r-Hessenberg form, i.e., block Hessenberg form with lower bandwidth r. Formally, a matrix A is in r-Hessenberg form if i > j + r implies A(i, j) = 0. Stage 2 completes the Hessenberg reduction with a bulge-chasing procedure that annihilates the r − 1 unwanted nonzero sub-diagonals from the r-Hessenberg matrix [9], [6], [10]. Two-stage algorithms are also effective for symmetric band reduction [9], [6] and Hessenberg-triangular reduction [11], [12].
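The definition of r-Hessenberg form translates directly into a bandwidth test (a small illustrative helper, not from the paper):

```python
import numpy as np

def is_r_hessenberg(A, r, tol=0.0):
    """True iff A(i, j) = 0 whenever i > j + r, i.e. A has lower bandwidth r."""
    # np.tril(A, -(r + 1)) keeps exactly the entries with i > j + r.
    return bool(np.all(np.abs(np.tril(A, -(r + 1))) <= tol))
```

Setting r = 1 recovers the usual upper Hessenberg form.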
The rest of the paper is organized as follows. We recall the conventional unblocked and blocked Hessenberg reduction algorithms in Section 2. We discuss some pros and cons of the two-stage approach in Section 3. We describe a new implementation of Stage 1 in Section 4. We outline the authors’ recently published implementation of Stage 2 [10] in Section 5. We evaluate the performance of the two stages, both in isolation and combined, in Section 6. Finally, we summarize our findings and briefly review some related work in Section 7.
Section snippets
One-stage Hessenberg reduction
In both the blocked and the unblocked standard algorithms, the matrix is reduced from left to right using one Householder reflection per column. We recall the unblocked variant in Section 2.1 and the blocked variant in Section 2.2. They form the basis of several high-performance implementations of the Hessenberg reduction, e.g., see [13], [14].
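As a concrete reference point, the unblocked algorithm can be sketched in a few lines of NumPy (a didactic sketch, not the LAPACK implementation): one Householder reflector per column, applied from both the left and the right.

```python
import numpy as np

def house(x):
    """Householder vector v such that (I - 2 v v^T / v^T v) x = -sign(x0) ||x|| e1."""
    v = x.copy()
    v[0] += np.sign(x[0] if x[0] != 0 else 1.0) * np.linalg.norm(x)
    return v

def hessenberg_unblocked(A):
    """Unblocked one-stage reduction. Returns (Q, H) with A = Q @ H @ Q.T."""
    H = np.asarray(A, dtype=float).copy()
    n = H.shape[0]
    Q = np.eye(n)
    for j in range(n - 2):
        v = house(H[j + 1:, j])          # zeros H[j+2:, j]
        nv = v @ v
        if nv == 0.0:
            continue                     # column already reduced
        beta_v = 2.0 * v / nv
        # two-sided application of the reflector P = I - 2 v v^T / (v^T v)
        H[j + 1:, j:] -= np.outer(beta_v, v @ H[j + 1:, j:])
        H[:, j + 1:] -= np.outer(H[:, j + 1:] @ v, beta_v)
        Q[:, j + 1:] -= np.outer(Q[:, j + 1:] @ v, beta_v)
    return Q, H
```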
Two-stage Hessenberg reduction
Recall that the two-stage approach first reduces the matrix to r-Hessenberg form and then to upper Hessenberg form. The parameter r can be chosen to optimize the performance. As is explained in the coming sections, the performance of Stage 1 (measured in flops/s) increases with r, whereas the performance of Stage 2 decreases with r. This trade-off leads to the conclusion that a modest value of r gives the best performance when the two stages are combined.
The one-stage approach requires
Stage 1: reduction from dense to r-Hessenberg form
It is straightforward to modify Algorithm 2 to produce an r-Hessenberg form using two levels of blocking. The inner block size is determined by the number of sub-diagonals, r, and thus affects the performance of Stage 2. As it turns out [10], Stage 2 benefits from having a small value of r and thus it is appropriate to add a second level of blocking to Stage 1. The outer block size b does not influence Stage 2 in any way and can therefore be tuned independently of Stage 2.
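The structure of Stage 1 can be illustrated with a single level of blocking (a simplified sketch, not the paper's two-level, dynamically scheduled implementation): each panel of r columns is zeroed below the r-th sub-diagonal by one QR factorization, whose orthogonal factor is then applied from both sides.

```python
import numpy as np

def block_hessenberg(A, r):
    """Reduce A to r-Hessenberg form. Returns (Q, B) with A = Q @ B @ Q.T.
    One full QR factorization per block column of width r."""
    B = np.asarray(A, dtype=float).copy()
    n = B.shape[0]
    Q = np.eye(n)
    for j in range(0, n - r - 1, r):
        rows = np.s_[j + r:n]
        # Full QR of the panel below the r-th sub-diagonal of block column j.
        Qj, _ = np.linalg.qr(B[rows, j:j + r], mode='complete')
        B[rows, :] = Qj.T @ B[rows, :]   # zero entries below sub-diagonal r
        B[:, rows] = B[:, rows] @ Qj     # similarity: apply Qj on the right
        Q[:, rows] = Q[:, rows] @ Qj     # accumulate the transformation
    return Q, B
```

The right multiplication only touches columns j + r and later, so the zeros created by earlier panels are preserved.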
Stage 2: reduction from r-Hessenberg to Hessenberg form
The input to Stage 2 is an r-Hessenberg matrix. The aim is to complete the reduction to Hessenberg form. Algorithms for symmetric band reduction [20], [21], [22] can be adapted to non-symmetric matrices, e.g., matrices in r-Hessenberg form.
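The idea behind such band-reduction algorithms can be illustrated with scalar Givens rotations (a sequential sketch; the paper's Stage 2 [10] instead chases bulges in parallel with blocked updates): each rotation annihilates one sub-diagonal entry, and the similarity transformation creates a bulge further down that is chased off the bottom of the matrix, one diagonal at a time.

```python
import numpy as np

def band_to_hessenberg(B, r):
    """Reduce an r-Hessenberg matrix to Hessenberg form with Givens
    rotations. Returns (Q, H) with B = Q @ H @ Q.T."""
    H = np.asarray(B, dtype=float).copy()
    n = H.shape[0]
    Q = np.eye(n)
    for d in range(r, 1, -1):              # shrink lower bandwidth d -> d - 1
        for j in range(n - d):
            i, col = j + d, j              # annihilate H[j + d, j] ...
            while i < n:
                b = H[i, col]
                if b == 0.0:
                    break                  # nothing to zero, no bulge created
                a = H[i - 1, col]
                h = np.hypot(a, b)
                c, s = a / h, b / h
                G = np.array([[c, s], [-s, c]])
                H[[i - 1, i], :] = G @ H[[i - 1, i], :]    # zero H[i, col]
                H[:, [i - 1, i]] = H[:, [i - 1, i]] @ G.T  # restore similarity
                Q[:, [i - 1, i]] = Q[:, [i - 1, i]] @ G.T  # accumulate Q
                col, i = i - 1, i + d      # ... then chase the bulge at (i + d, i - 1)
    return Q, H
```

The right rotation on columns (i − 1, i) fills exactly one entry, (i + d, i − 1), which becomes the next annihilation target d rows further down.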
This section covers some aspects of our recently published parallel algorithm [10]. It has a dynamic adaptive load-balancing scheme that partitions the load into coarse tasks and uses time measurements as feedback to further improve the load balance.
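The feedback idea can be shown in miniature (a hypothetical sketch, not the paper's actual scheme): measure each worker's throughput on the previous round and resize the chunks so that the predicted times for the next round equalize.

```python
def rebalance(chunk_sizes, times):
    """Given each worker's last chunk size and measured execution time,
    return new chunk sizes that equalize the predicted next-round times.
    Minimal illustration of measurement-feedback load balancing."""
    rates = [c / t for c, t in zip(chunk_sizes, times)]   # measured items/s
    total = sum(chunk_sizes)                              # keep total work fixed
    return [max(1, round(total * r / sum(rates))) for r in rates]
```

For example, if one worker finished its 100 items three times slower than the other, the next round gives it a third as much work.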
Performance evaluation
Numerical experiments were carried out in double precision (64-bit) arithmetic on one node (eight cores) of the Akka system at HPC2N. Akka consists of 672 interconnected nodes with a total of 5376 processor cores. Each node consists of two Intel Xeon L5420 quad-core processors with 12 MB of L2 cache. The processors run at 2.5 GHz and share 16 GB of RAM. A node’s theoretical double precision peak performance is 8·2.5·2·2 = 80 Gflops/s (one Gflops/s equals 10^9 flops/s). The code was linked to the
Conclusions
The performance of the conventional blocked one-stage Hessenberg reduction algorithm is limited by large matrix–vector multiplications that account for 20% of the total number of flops.
We show that a two-stage approach to Hessenberg reduction can obtain high performance in both stages and thus overcome its 60% increase in the number of flops compared to the one-stage approach.
Experiments on a dual quad-core machine show a peak performance of 22 Gflops/s, or 28% of the theoretical peak, for the
Acknowledgments
We thank the three referees for taking the time to carefully read the manuscript and give constructive comments. This research was conducted using the resources of the High Performance Computing Center North (HPC2N) and was supported by the Swedish Research Council (VR) under grant VR7062571 and by the Swedish Foundation for Strategic Research under grant A3 02:128. The work is also supported via eSSENCE, a strategic collaborative program in eScience funded by VR and the Swedish Government.
References (26)
- et al., Block reduction of matrices to condensed forms for eigenvalue computations, J. Comput. Appl. Math. (1989)
- et al., Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing, Parallel Comput. (2010)
- Using the Hessenberg decomposition in control theory
- et al., The multishift QR algorithm. Part I: Maintaining well-focused shifts and level 3 performance, SIAM J. Matrix Anal. Appl. (2001)
- et al., A novel parallel QR algorithm for hybrid distributed memory HPC systems, SIAM J. Sci. Comput. (2010)
- et al., Parallel solvers for Sylvester-type matrix equations with applications in condition estimation, Part I: Theory and algorithms, ACM Trans. Math. Softw. (2010)
- et al., Algorithm 904: The SCASY library – parallel solvers for Sylvester-type matrix equations with applications in condition estimation, Part II, ACM Trans. Math. Softw. (2010)
- S. Mohanty, I/O Efficient Algorithms for Matrix Computations, Ph.D. thesis, Indian Institute of Technology Guwahati, ...
- et al., A parallel algorithm for the reduction of a nonsymmetric matrix to block upper-Hessenberg form, Parallel Comput. (1995)
- et al., Matrix Computations (1996)
- Blocked algorithms and software for reduction of a regular matrix pair to generalized Schur form, ACM Trans. Math. Softw.