
Parallel Computing

Volume 37, Issue 12, December 2011, Pages 771-782

Parallel two-stage reduction to Hessenberg form using dynamic scheduling on shared-memory architectures

https://doi.org/10.1016/j.parco.2011.05.001

Abstract

We consider the parallel reduction of a real matrix to Hessenberg form using orthogonal transformations. Standard Hessenberg reduction algorithms reduce the columns of the matrix from left to right in either a blocked or unblocked fashion. However, the standard blocked variant still performs 20% of its computations in the form of slow matrix–vector multiplications. We show that a two-stage approach, with an intermediate reduction to block Hessenberg form, speeds up the reduction by avoiding these matrix–vector multiplications. We describe and evaluate a new high-performance implementation of the two-stage approach that attains significant speedups over the one-stage approach. The key components are a dynamically scheduled implementation of Stage 1 and a blocked, adaptively load-balanced implementation of Stage 2.

Highlights

► A dynamically scheduled shared-memory implementation of a two-stage reduction to Hessenberg form is presented and evaluated.
► Experiments show that the intermediate block Hessenberg form needs to have a relatively small number of non-zero sub-diagonals due to a performance trade-off between the two stages.

Introduction

The Hessenberg decomposition $A = QHQ^T$ of a square matrix $A \in \mathbb{R}^{n \times n}$, where $H \in \mathbb{R}^{n \times n}$ is an upper Hessenberg matrix and $Q \in \mathbb{R}^{n \times n}$ is orthogonal, has many applications in numerical linear algebra [1]. Our focus is on medium- to large-scale applications that require blocked and parallel algorithms. For example, the Hessenberg QR algorithm, which computes a Schur form, takes as input a matrix in Hessenberg form. Efficient implementations of the Hessenberg QR algorithm are described in, e.g., [2], [3]. Other examples include solvers for dense Sylvester-type matrix equations, which typically simplify the coefficient matrices to Schur and/or Hessenberg form [4], [5]. A third application is the solution of linear systems $(A - \sigma I)x = b$ for many shifts $\sigma$. By first reducing $A$ to Hessenberg form, each subsequent system can then be solved with only $O(n^2)$ floating point operations (flops).
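To make the shifted-systems application concrete, the following sketch (Python/NumPy, not from the paper) reduces $A$ once with scipy.linalg.hessenberg and then solves each shifted system in $O(n^2)$ flops by exploiting the single sub-diagonal of $H$; the function name solve_shifted and the elimination without pivoting are illustrative simplifications.

```python
import numpy as np
from scipy.linalg import hessenberg

def solve_shifted(H, Q, b, sigma):
    """Solve (A - sigma*I) x = b given A = Q H Q^T, using O(n^2) flops.

    Since H - sigma*I has only one sub-diagonal, Gaussian elimination needs a
    single row operation per column (no pivoting shown in this sketch).
    """
    n = H.shape[0]
    M = H - sigma * np.eye(n)
    rhs = Q.T @ b
    for k in range(n - 1):                       # eliminate the sub-diagonal
        m = M[k + 1, k] / M[k, k]
        M[k + 1, k:] -= m * M[k, k:]
        rhs[k + 1] -= m * rhs[k]
    z = np.zeros(n)
    for k in range(n - 1, -1, -1):               # back substitution
        z[k] = (rhs[k] - M[k, k + 1:] @ z[k + 1:]) / M[k, k]
    return Q @ z                                 # transform back: x = Q z

A = np.random.rand(6, 6)
b = np.random.rand(6)
H, Q = hessenberg(A, calc_q=True)                # one-time O(n^3) reduction
for sigma in (0.1, 0.5, 2.0):                    # each solve costs only O(n^2)
    x = solve_shifted(H, Q, b, sigma)
    assert np.allclose((A - sigma * np.eye(6)) @ x, b)
```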

The aim of this paper is to describe and evaluate a new efficient Hessenberg reduction algorithm for shared-memory systems. Our implementation is primarily aimed at architectures with nearly uniform memory access. Most of the algorithmic techniques are certainly suitable for machines with non-uniform memory access as well, but the implementation will likely need some adjustments and is subject to future work. The proposed algorithm consists of two stages [6], [7], whereas the conventional approach goes directly to Hessenberg form [8]. Stage 1 of the two-stage algorithm reduces the dense matrix $A$ to r-Hessenberg form, i.e., block Hessenberg form with lower bandwidth $r$. Formally, a matrix $A$ is in r-Hessenberg form if $i > j + r$ implies $A(i, j) = 0$. Stage 2 completes the Hessenberg reduction with a bulge-chasing procedure that annihilates the $r - 1$ unwanted nonzero sub-diagonals from the r-Hessenberg matrix [9], [6], [10]. Two-stage algorithms are also effective for symmetric band reduction [9], [6] and Hessenberg-triangular reduction [11], [12].
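As a quick illustration of the definition, the helper below (hypothetical, NumPy-based) tests whether a matrix has lower bandwidth $r$; with $r = 1$ it reduces to the usual upper Hessenberg test.

```python
import numpy as np

def is_r_hessenberg(A, r):
    """True if A(i, j) == 0 whenever i > j + r, i.e. lower bandwidth at most r."""
    i, j = np.indices(A.shape)
    return bool(np.all(A[i > j + r] == 0))

# r = 1 is the ordinary upper Hessenberg form.
H = np.triu(np.random.rand(5, 5), k=-1)
assert is_r_hessenberg(H, 1)

# A generic dense 5x5 matrix is 4-Hessenberg (trivially) but not 2-Hessenberg.
A = np.random.rand(5, 5)
assert is_r_hessenberg(A, 4) and not is_r_hessenberg(A, 2)
```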

The rest of the paper is organized as follows. We recall the conventional unblocked and blocked Hessenberg reduction algorithms in Section 2. We discuss some pros and cons of the two-stage approach in Section 3. We describe a new implementation of Stage 1 in Section 4. We outline the authors’ recently published implementation of Stage 2 [10] in Section 5. We evaluate the performance of the two stages, both in isolation and combined, in Section 6. Finally, we summarize our findings and briefly review some related work in Section 7.

Section snippets

One-stage Hessenberg reduction

In both the blocked and the unblocked standard algorithms, the matrix is reduced from left to right using one Householder reflection per column. We recall the unblocked variant in Section 2.1 and the blocked variant in Section 2.2. They form the basis of several high-performance implementations of the Hessenberg reduction, e.g., see [13], [14].
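For reference, a minimal NumPy sketch of the unblocked variant follows: one Householder reflection per column, applied as a similarity transformation from both sides. It illustrates the left-to-right sweep only and is not the blocked, high-performance formulation discussed later.

```python
import numpy as np

def house(x):
    """Householder vector v and scalar beta such that (I - beta*v*v^T) x is a multiple of e_1."""
    v = x.astype(float).copy()
    norm_x = np.linalg.norm(x)
    if norm_x == 0.0:
        return v, 0.0
    v[0] += np.copysign(norm_x, x[0])        # sign chosen to avoid cancellation
    return v, 2.0 / (v @ v)

def hessenberg_unblocked(A):
    """Reduce A to upper Hessenberg form column by column (about 10/3 n^3 flops)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for j in range(n - 2):
        v, beta = house(A[j + 1:, j])        # reflector for column j, rows j+1..n-1
        A[j + 1:, j:] -= beta * np.outer(v, v @ A[j + 1:, j:])   # apply from the left
        A[:, j + 1:] -= beta * np.outer(A[:, j + 1:] @ v, v)     # and from the right
    return A

A = np.random.rand(8, 8)
H = hessenberg_unblocked(A)
assert np.allclose(np.tril(H, -2), 0)        # zero below the first sub-diagonal
assert np.isclose(np.trace(H), np.trace(A))  # similarity transformations preserve the trace
```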

Two-stage Hessenberg reduction

Recall that the two-stage approach first reduces the matrix to r-Hessenberg form and then to upper Hessenberg form. The parameter r can be chosen to optimize the performance. As is explained in the coming sections, the performance of Stage 1 (measured in flops/s) increases with r, whereas the performance of Stage 2 decreases with r. This trade-off leads to the conclusion that a modest value of r gives the best performance when the two stages are combined.

The one-stage approach requires $\tfrac{10}{3}n^3$ …
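Combining this count with the 60% increase quoted in the conclusions gives a rough comparison (a back-of-the-envelope estimate, not the paper's exact operation counts):

```latex
\text{one-stage: } \tfrac{10}{3}n^{3} \text{ flops}, \qquad
\text{two-stage: } \approx 1.6 \cdot \tfrac{10}{3}n^{3} = \tfrac{16}{3}n^{3} \text{ flops}.
```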

Stage 1: reduction from dense to r-Hessenberg form

It is straightforward to modify Algorithm 2 to produce an r-Hessenberg form using two levels of blocking. The inner block size is determined by the number of sub-diagonals, r, and thus affects the performance of Stage 2. As it turns out [10], Stage 2 benefits from a small value of r, and it is therefore appropriate to add a second level of blocking to Stage 1 so that its performance does not suffer from the small inner block size. The outer block size b does not influence Stage 2 in any way and can therefore be tuned independently of Stage 2.
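The sketch below (Python/NumPy, illustrative only) shows the basic structure of such a reduction with a single level of blocking: each block column is QR-factorized below the r-th sub-diagonal and the orthogonal factor is applied from both sides. The paper's Stage 1 additionally uses the outer blocking with block size b and dynamic scheduling, both omitted here.

```python
import numpy as np

def reduce_to_r_hessenberg(A, r):
    """Reduce A to r-Hessenberg form (lower bandwidth r) with one level of blocking."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for j in range(0, n - r - 1, r):
        rows = slice(j + r, n)                   # rows strictly below sub-diagonal r
        Q, _ = np.linalg.qr(A[rows, j:j + r], mode='complete')
        A[rows, j:] = Q.T @ A[rows, j:]          # annihilate A(i, k) for i > k + r in the panel
        A[:, rows] = A[:, rows] @ Q              # right update keeps the similarity
    return A

A = np.random.rand(12, 12)
B = reduce_to_r_hessenberg(A, r=3)
assert np.allclose(np.tril(B, -4), 0)            # at most 3 non-zero sub-diagonals remain
assert np.isclose(np.trace(B), np.trace(A))      # eigenvalues (and the trace) are preserved
```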

Stage 2: reduction from r-Hessenberg to Hessenberg form

The input to Stage 2 is an r-Hessenberg matrix ARn×n. The aim is to complete the reduction to Hessenberg form. Algorithms for symmetric band reduction [20], [21], [22] can be adapted to non-symmetric matrices, e.g., matrices in r-Hessenberg form.

This section covers some aspects of our recently published parallel algorithm [10]. It has a dynamic adaptive load-balancing scheme that partitions the load into coarse tasks and uses time measurements as feedback to further improve the load balance.
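The following toy loop (not the paper's actual scheme) illustrates the general idea of such time-feedback balancing: each worker's share of the next sweep is scaled by the work rate observed in the previous sweep.

```python
def rebalance(shares, times):
    """Scale each worker's share by its observed speed (work per second) and renormalize."""
    speeds = [share / t for share, t in zip(shares, times)]
    total = sum(speeds)
    return [s / total for s in speeds]

# Three workers start with equal shares; the third turns out to run half as fast.
shares = [1 / 3, 1 / 3, 1 / 3]
times = [1.0, 1.0, 2.0]                 # measured wall-clock times for the last sweep
print(rebalance(shares, times))         # [0.4, 0.4, 0.2]: the slow worker gets less work
```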

Performance evaluation

Numerical experiments were carried out in double precision (64-bit) arithmetic on one node (eight cores) of the Akka system at HPC2N. Akka consists of 672 interconnected nodes with a total of 5376 processor cores. Each node consists of two Intel Xeon L5420 quad-core processors with 12 MB of L2 cache. The processors run at 2.5 GHz and share 16 GB of RAM. A node’s theoretical double precision peak performance is 8·2.5·2·2 = 80 Gflops/s (one Gflops/s equals $10^9$ flops/s). The code was linked to the …
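As a sanity check of the quoted figures (assuming the four factors are: 8 cores, 2.5 GHz clock, 2 doubles per SSE register, and one add plus one multiply per cycle), and relating them to the 22 Gflops/s peak reported in the conclusions:

```python
cores, clock_ghz = 8, 2.5
simd_doubles, flops_per_cycle = 2, 2     # 128-bit SSE; one add and one multiply per cycle
peak = cores * clock_ghz * simd_doubles * flops_per_cycle
observed = 22.0                          # best measured rate (see the conclusions)
print(peak, f"{observed / peak:.0%}")    # 80.0 Gflops/s theoretical peak, ~28% attained
```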

Conclusions

The performance of the conventional blocked one-stage Hessenberg reduction algorithm is limited by large matrix–vector multiplications that account for 20% of the total number of flops.

We show that a two-stage approach to Hessenberg reduction can obtain high performance in both stages and thus overcome its 60% increase in the number of flops compared to the one-stage approach.

Experiments on a dual quad-core machine show a peak performance of 22 Gflops/s, or 28% of the theoretical peak, for the …

Acknowledgments

We thank the three referees for taking the time to carefully read the manuscript and give constructive comments. This research was conducted using the resources of the High Performance Computing Center North (HPC2N) and was supported by the Swedish Research Council (VR) under grant VR7062571 and by the Swedish Foundation for Strategic Research under grant A3 02:128. The work is also supported via eSSENCE, a strategic collaborative program in eScience funded by VR and the Swedish Government …

References (26)

  • J. Dongarra et al., Block reduction of matrices to condensed forms for eigenvalue computations, J. Comput. Appl. Math. (1989)
  • S. Tomov et al., Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing, Parallel Comput. (2010)
  • C.F. Van Loan, Using the Hessenberg decomposition in control theory
  • K. Braman et al., The multishift QR algorithm. Part I: Maintaining well-focused shifts and level 3 performance, SIAM J. Matrix Anal. Appl. (2001)
  • R. Granat et al., A novel parallel QR algorithm for hybrid distributed memory HPC systems, SIAM J. Sci. Comput. (2010)
  • R. Granat et al., Parallel solvers for Sylvester-type matrix equations with applications in condition estimation, Part I: Theory and algorithms, ACM Trans. Math. Softw. (2010)
  • R. Granat et al., Algorithm 904: The SCASY library – parallel solvers for Sylvester-type matrix equations with applications in condition estimation, Part II, ACM Trans. Math. Softw. (2010)
  • S. Mohanty, I/O Efficient Algorithms for Matrix Computations, Ph.D. thesis, Indian Institute of Technology Guwahati, ...
  • M.W. Berry et al., A parallel algorithm for the reduction of a nonsymmetric matrix to block upper-Hessenberg form, Parallel Comput. (1995)
  • G.H. Golub et al., Matrix Computations (1996)
  • C. Bischof, B. Lang, X. Sun, Parallel tridiagonalization through two-step band reduction, in: Scalable High-Performance ...
  • L. Karlsson, B. Kågström, Efficient Reduction from Block Hessenberg Form to Hessenberg Form using Shared Memory, ...
  • K. Dackland et al., Blocked algorithms and software for reduction of a regular matrix pair to generalized Schur form, ACM Trans. Math. Softw. (1999)