A block-asynchronous relaxation method for graphics processing units

https://doi.org/10.1016/j.jpdc.2013.05.008

Highlights

  • Block-asynchronous relaxation on GPU-accelerated systems.

  • Method’s high iteration rate compensates for its lower convergence rate (competitive with CG).

  • GPU thread-block scheduling reduces non-deterministic behavior and stochastic noise.

  • Analysis of different communication strategies for multi-GPU usage.

  • Block-asynchronous relaxation inherently tolerant to hardware error.

Abstract

In this paper, we analyze the potential of asynchronous relaxation methods on Graphics Processing Units (GPUs). We develop asynchronous iteration algorithms in CUDA and compare them with parallel implementations of synchronous relaxation methods on CPU- and GPU-based systems. For a set of test matrices from the University of Florida Matrix Collection (UFMC) we investigate convergence behavior, performance, and tolerance to hardware failure. We observe that even our most basic asynchronous relaxation scheme can efficiently leverage the GPU’s computing power and, despite its lower convergence rate compared to the Gauss–Seidel relaxation, still provides solution approximations of a certain accuracy in considerably shorter time than Gauss–Seidel running on CPUs or Jacobi running on GPUs. Hence, it overcompensates for the slower convergence by exploiting the scalability and the good fit of the asynchronous schemes to the highly parallel GPU architecture. Further, by enhancing the most basic asynchronous approach with hybrid schemes, which use multiple iterations within the “subdomain” handled by a GPU thread block, we manage not only to recover the loss of global convergence but often to accelerate convergence by up to two times, while keeping the execution time of a global iteration practically the same. Combined with the advantageous properties of asynchronous iteration methods with respect to hardware failure, this identifies the high potential of asynchronous methods for Exascale computing.

Introduction

The latest developments in hardware architectures show an enormous increase in the number of processing units (computing cores) that form one processor. The reasons for this range from physical limitations to energy-minimization considerations that are at odds with further scaling of processor frequencies, which was the basic acceleration method used in architecture designs for the last decades [15]. Only by merging multiple processing units into one processor does further acceleration seem possible. One example where this gathering of cores is carried to an extreme is the GPU. The current high-end products of the leading GPU providers consist of 2880 CUDA cores for the NVIDIA Tesla K20 and 3072 stream processors for the Northern Islands generation from ATI. While the original purpose of GPUs was graphics processing, their enormous computing power also suggests their use as accelerators for parallel computations. Yet, the design and characteristics of these devices pose some challenges for their efficient use. In particular, since synchronization between the individual processing units usually triggers considerable overhead and limits performance to the slowest component, it is attractive to employ algorithms that have a high degree of parallelism and only very few synchronization points.

On the other hand, numerical algorithms usually require such synchronization. For example, when solving linear systems of equations with iterative methods like Conjugate Gradient or GMRES, the parallelism is usually limited to the matrix–vector and vector–vector operations (with synchronization required between them) [12], [33], [34]. Also, methods based on component-wise updates like Jacobi or Gauss–Seidel synchronize between iteration steps [26], [13]: no component is updated twice (or more) before all other components are updated. Still, it is possible to ignore these synchronization steps, which results in a chaotic, or asynchronous, iteration process. Although numerical robustness and convergence properties suffer severely from this chaotic behavior, such methods may be interesting for specific applications, since the absence of synchronization points makes them perfect candidates for highly parallel hardware platforms. The result is a trade-off: while the algorithm’s convergence may suffer from the asynchronism, the performance can benefit from the superior scalability. Additionally, the asynchronous nature implies a high tolerance to hardware failure, which is one of the key properties required for numerics on future hardware.
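The idea of dropping the synchronization between sweeps can be sketched in a few lines. The following is a minimal sequential simulation in Python/NumPy, not the paper's CUDA implementation: components are picked in an arbitrary order and each update uses whatever values are currently available (the function name and test data are illustrative assumptions).

```python
import numpy as np

def chaotic_relaxation(A, b, iters=300, seed=0):
    """Sequential simulation of asynchronous (chaotic) relaxation:
    components are updated in an arbitrary order, each update using
    the values currently in x, with no synchronized sweeps."""
    rng = np.random.default_rng(seed)
    n = len(b)
    x = np.zeros(n)
    for _ in range(iters):
        i = rng.integers(n)  # pick an arbitrary component to update
        # relaxation update for component i; in a truly parallel run the
        # other entries of x may be "stale" values from older updates
        x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x
```

For a strictly diagonally dominant matrix this still converges, only the sequence of iterates depends on the (here randomized) update order.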

In this paper, we analyze the potential of employing asynchronous iteration methods on GPUs by examining convergence behavior and time-to-solution when iteratively solving linear systems of equations. The paper is structured as follows. First, we briefly recall the mathematical idea of the Jacobi iteration method and derive the component-wise iteration algorithm. Then the idea of an asynchronous relaxation method is derived, and some basic characteristics concerning its convergence requirements are summarized. The section about the experiment framework first provides information about the linear systems of equations we target; the matrices associated with these systems are taken from the University of Florida matrix collection. Then we describe the asynchronous iteration method we designed for GPUs and multi-GPU systems. In the following section we analyze the experimental results with a focus on convergence behavior, iteration times, and fault tolerance for the different matrix systems. In Section 5 we summarize the results and provide an outlook on future work in this field.

Section snippets

Jacobi method

The Jacobi method is an iterative algorithm for finding the approximate solution of a linear system of equations Ax = b, where A is strictly or irreducibly diagonally dominant. One can rewrite the system as (L + D + U)x = b, where D denotes the diagonal entries of A while L and U denote the lower and upper triangular parts of A, respectively. Using the form Dx = b − (L + U)x, the Jacobi method is derived as an iterative scheme x^(k+1) = D^(−1)(b − (L + U)x^(k)). Denoting the error at iteration k+1 by e^(k+1) := x^(k+1) − x (x may
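The synchronous Jacobi sweep, where every component of the new iterate is computed from the previous iterate, can be sketched as follows. This is an illustrative NumPy version (not the paper's CUDA code); the function name and stopping criterion are assumptions for the sketch.

```python
import numpy as np

def jacobi(A, b, tol=1e-10, max_iter=1000):
    """Synchronous Jacobi iteration: x_new = D^{-1} (b - (L+U) x)."""
    D = np.diag(A)                  # diagonal entries of A
    R = A - np.diagflat(D)          # off-diagonal part, i.e. L + U
    x = np.zeros_like(b, dtype=float)
    for k in range(max_iter):
        x_new = (b - R @ x) / D     # every component uses values of step k
        if np.linalg.norm(x_new - x, np.inf) < tol:
            return x_new, k + 1
        x = x_new                   # synchronization point between sweeps
    return x, max_iter
```

The assignment `x = x_new` is exactly the synchronization the asynchronous variant drops: no component of step k+1 is used before the whole sweep over step k is complete.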

Linear systems of equations

In our experiments, we search for the approximate solutions of linear systems of equations, where the respective matrices are taken from the University of Florida Matrix Collection.1 Due to the convergence properties of the iterative methods considered, the experiment matrices must be properly chosen. While for the Jacobi method a sufficient condition for convergence is clearly ρ(B) = ρ(I − D^(−1)A) < 1 (i.e., the spectral radius of the iteration
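For small test matrices, the convergence condition ρ(B) = ρ(I − D^(−1)A) < 1 can be checked directly by computing the eigenvalues of the iteration matrix. A minimal sketch (the function name is illustrative; for the large UFMC matrices a dense eigenvalue computation like this would of course not be practical):

```python
import numpy as np

def jacobi_spectral_radius(A):
    """Spectral radius of the Jacobi iteration matrix B = I - D^{-1} A.
    The Jacobi method converges for every starting vector iff rho(B) < 1."""
    D_inv = np.diag(1.0 / np.diag(A))
    B = np.eye(A.shape[0]) - D_inv @ A
    return max(abs(np.linalg.eigvals(B)))
```

A strictly diagonally dominant matrix always satisfies the condition, while e.g. a matrix with off-diagonal entries larger than the diagonal can violate it.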

Stochastic impact of chaotic behavior of asynchronous iteration methods

At this point it should be mentioned that only the synchronous Gauss–Seidel and Jacobi methods are deterministic. While synchronous relaxation algorithms always provide the same solution approximations for each solver run, the unpredictable order of the distinct component updates in the asynchronous iteration method generates a sequence of iteration approximations that can usually only be reproduced by choosing exactly the same update order. This also implies that variations in the

Summary and future work

Synchronizations are becoming increasingly hazardous for high-performance computing, especially with the ever-increasing parallelism in today’s computer architectures. In effect, the use of new algorithms and optimization techniques that reduce synchronization is becoming essential for obtaining high efficiency, high performance, and scalability. We demonstrated and quantified this for relaxation methods on current GPUs (and multicore clusters with GPUs). In particular, we developed

Acknowledgments

The authors would like to thank the National Science Foundation, the Department of Energy, NVIDIA, and the MathWorks for supporting this research effort. We would also like to thank the Karlsruhe House of Young Scientists (KHYS) for supporting the research cooperation.

References (39)

  • D. Chazan et al., Chaotic relaxation, Linear Algebra and its Applications (1969)
  • A. Frommer et al., On asynchronous iterations, Journal of Computational and Applied Mathematics (2000)
  • J.C. Strikwerda, A convergence theorem for chaotic asynchronous relaxation, Linear Algebra and its Applications (1997)
  • H. Anzt, S. Tomov, J. Dongarra, V. Heuveline, A block-asynchronous relaxation method for graphics processing units, in:...
  • S. Ashby, The opportunities and challenges of exascale computing
  • R. Bagnara, A unified proof for the convergence of Jacobi and Gauss–Seidel methods, SIAM Review (1995)
  • Z.-Z. Bai et al., Block and asynchronous two-stage methods for mildly nonlinear systems, Numerische Mathematik (1999)
  • A.H. Baker, R.D. Falgout, T. Gamblin, T.V. Kolev, S. Martin, U. Meier Yang, Scaling algebraic multigrid solvers: on the...
  • A.H. Baker, R.D. Falgout, T.V. Kolev, U. Meier Yang, Multigrid smoothers for ultra-parallel computing,...
  • K. Bergman, et al., Exascale computing study: technology challenges in achieving exascale systems,...
  • D.P. Bertsekas, J. Eckstein, Distributed asynchronous relaxation methods for linear network flow problems, in:...
  • P.G. Bridges et al., Cooperative application/OS DRAM fault recovery
  • F. Cappello et al., Toward exascale resilience, International Journal of High Performance Computing Applications (2009)
  • A.T. Chronopoulos, A.B. Kucherov, A parallel Krylov-type method for nonsymmetric linear systems, 2001, pp....
  • J.A.H. Courtecuisse, Parallel dense Gauss–Seidel algorithm on many-core...
  • Intel Corporation, Intel C++ compiler options, document number:...
  • J. Dongarra, The international ExaScale software project roadmap, International Journal of High Performance Computing Applications (2011)
  • P. Du, A. Bouteiller, G. Bosilca, T. Herault, J. Dongarra, Algorithm-based fault tolerance for dense matrix...
  • P. Du et al., Soft error resilient QR factorization for hybrid system with GPGPU, Journal of Computational Science (2013)
Hartwig Anzt received his Ph.D. in mathematics from the Karlsruhe Institute of Technology (KIT) in 2012 and holds a Diploma in a joint program including mathematics, physics, and computer science from the University of Karlsruhe. Anzt’s main topics of research are in scientific high performance computing and include simulation algorithms, hardware-optimized numerics for GPU-accelerated platforms, communication-avoiding and asynchronous methods, and power-aware computing. In June 2013, Anzt will join Jack Dongarra’s Innovative Computing Lab (ICL) at the University of Tennessee, where he had already spent several months during his Ph.D. research collaborating on a number of problems in HPC. Furthermore, he enjoys a fruitful research cooperation with the High Performance Computing & Architectures group at University of Jaume. During his masters Anzt also spent one year at the University of Ottawa, Canada.

    Stanimire Tomov is a Research Director in the Innovative Computing Laboratory (ICL) at the University of Tennessee. Tomov’s research interests include parallel algorithms, numerical analysis, and high-performance scientific computing (HPC). He has been involved in the development of numerical algorithms and software tools in a variety of fields ranging from scientific visualization and data mining to accurate and efficient numerical solution of PDEs. Currently, his work is concentrated on the development of numerical linear algebra libraries for emerging architectures for HPC, such as heterogeneous multicore processors, graphics processing units (GPUs), and Many Integrated Core (MIC) architectures.

    Jack Dongarra holds an appointment at the University of Tennessee, Oak Ridge National Laboratory, and the University of Manchester. He specializes in numerical algorithms in linear algebra, parallel computing, use of advanced-computer architectures, programming methodology, and tools for parallel computers. He was awarded the IEEE Sid Fernbach Award in 2004 for his contributions in the application of high performance computers using innovative approaches; in 2008 he was the recipient of the first IEEE Medal of Excellence in Scalable Computing; in 2010 he was the first recipient of the SIAM Special Interest Group on Supercomputing’s award for Career Achievement; and in 2011 he was the recipient of the IEEE IPDPS 2011 Charles Babbage Award. He is a fellow of the AAAS, ACM, IEEE, and SIAM and a member of the National Academy of Engineering.

    Vincent Heuveline is the Director of the Engineering Mathematics and Computing Lab (EMCL). He is full professor at the Karlsruhe Institute of Technology (KIT) where he is leading the Chair on Numerical Simulation, Optimization and High Performance Computing (NumHPC). His research interests include numerical simulation, high performance computing, software engineering with main application focuses on Energy Research, Meteorology and environment as well as Medical Engineering. He serves as a programme committee member of several international conferences on high performance computing. He is widely consulted for industry with respect to the deployment of numerical simulation and high performance computing in industrial environments.