A block-asynchronous relaxation method for graphics processing units
Introduction
The latest developments in hardware architectures show an enormous increase in the number of processing units (computing cores) that form one processor. The reasons range from physical limitations to energy-minimization considerations, both of which rule out further scaling of processor frequencies, the basic acceleration method used in architecture design for the last decades [15]. Further acceleration now seems possible only by combining multiple processing units into one processor. The GPU is an example where this trend is carried to an extreme. The current high-end products of the leading GPU providers consist of 2880 CUDA cores for the NVIDIA Tesla K20 and 3072 stream processors for the Northern Islands generation from ATI. While the original purpose of GPUs was graphics processing, their enormous computing power also suggests their use as accelerators for general parallel computations. Yet the design and characteristics of these devices pose some challenges for their efficient use. In particular, since synchronization between the individual processing units usually incurs considerable overhead and limits performance to that of the slowest component, it is attractive to employ algorithms that have a high degree of parallelism and only very few synchronization points.
Numerical algorithms, on the other hand, usually require this synchronization. For example, when solving linear systems of equations with iterative methods like Conjugate Gradient or GMRES, the parallelism is usually limited to the matrix–vector and vector–vector operations, with synchronization required between them [12], [33], [34]. Methods based on component-wise updates, like Jacobi or Gauss–Seidel, also synchronize between iteration steps [26], [13]: no component is updated a second time before all other components have been updated. Still, it is possible to ignore these synchronization steps, which results in a chaotic, or asynchronous, iteration process. Although the numerical robustness and convergence properties suffer severely from this chaotic behavior, such methods may be interesting for specific applications, since the absence of synchronization points makes them excellent candidates for highly parallel hardware platforms. The result is a trade-off: while the algorithm's convergence may suffer from the asynchronism, the performance can benefit from the superior scalability. Additionally, the asynchronous nature implies a high tolerance to hardware failure, one of the key properties required of numerical methods on future hardware.
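The idea of dropping the synchronization steps can be sketched in a few lines of pure Python (a hypothetical serial simulation, not the GPU implementation from this paper): each update relaxes one randomly chosen component using whatever values the other components currently hold, with no synchronization between updates.

```python
import random

def async_relaxation(A, b, total_updates=2000, seed=0):
    """Chaotic relaxation sketch: relax one randomly chosen component
    per step, in place, with no synchronization between updates."""
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    for _ in range(total_updates):
        i = rng.randrange(n)
        # Jacobi-style component update using the *current* values of x
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / A[i][i]
    return x
```

For a strictly diagonally dominant system the iterates still converge to the solution of Ax = b, even though two runs with different update orders generally produce different intermediate approximations.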
In this paper, we analyze the potential of employing asynchronous iteration methods on GPUs by examining convergence behavior and time to solution when iteratively solving linear systems of equations. The paper is structured as follows. First, we briefly recall the mathematical idea of the Jacobi iteration method and derive the component-wise iteration algorithm. Then the idea of an asynchronous relaxation method is derived, and some basic characteristics concerning its convergence requirements are summarized. The section on the experiment framework first provides information about the linear systems of equations we target; the matrices associated with these systems are taken from the University of Florida Sparse Matrix Collection. We then describe the asynchronous iteration method we designed for GPUs and multi-GPU systems. In the following section we analyze the experimental results with a focus on convergence behavior, iteration times, and fault tolerance for the different matrix systems. In Section 5 we summarize the results and provide an outlook on future work in this field.
Section snippets
Jacobi method
The Jacobi method is an iterative algorithm for finding the approximate solution of a linear system Ax = b, where A is strictly or irreducibly diagonally dominant. One can rewrite the system as (D + L + U)x = b, where D denotes the diagonal part of A, while L and U denote the strictly lower and upper triangular parts of A, respectively. Using this form, the Jacobi method is derived as the iterative scheme x^(k+1) = D^(-1)(b - (L + U)x^(k)). Denoting the error at iteration k by e^(k) = x^(k) - x, the iteration satisfies e^(k+1) = -D^(-1)(L + U)e^(k), so the method converges if the spectral radius of the iteration matrix B = -D^(-1)(L + U) satisfies rho(B) < 1.
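The component-wise scheme can be illustrated in pure Python (a minimal serial sketch assuming a dense row-wise matrix; the actual experiments use sparse GPU kernels):

```python
def jacobi(A, b, iters=100):
    """Synchronous Jacobi: x_i^(k+1) = (b_i - sum_{j != i} a_ij * x_j^(k)) / a_ii."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        x_new = [0.0] * n
        for i in range(n):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x_new[i] = (b[i] - s) / A[i][i]
        x = x_new  # synchronization point: all components advance together
    return x
```

For example, with A = [[4, 1], [1, 3]] and b = [1, 2] the iterates converge to x = (1/11, 7/11), since A is strictly diagonally dominant.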
Linear systems of equations
In our experiments, we search for approximate solutions of linear systems of equations where the respective matrices are taken from the University of Florida Sparse Matrix Collection. Due to the convergence properties of the iterative methods considered, the experiment matrices must be chosen carefully. For the Jacobi method, a sufficient condition for convergence is rho(D^(-1)(L + U)) < 1, i.e., the spectral radius of the iteration matrix must be smaller than one; strict diagonal dominance of the system matrix guarantees this.
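Strict diagonal dominance (|a_ii| > sum over j != i of |a_ij| for every row) is the easiest sufficient condition to verify when selecting test matrices; a hypothetical pure-Python check might look like:

```python
def is_strictly_diagonally_dominant(A):
    """True if every diagonal entry dominates its row:
    |a_ii| > sum_{j != i} |a_ij| for all i."""
    return all(
        abs(row[i]) > sum(abs(v) for j, v in enumerate(row) if j != i)
        for i, row in enumerate(A)
    )
```

Note that the condition is sufficient but not necessary: matrices failing this test (e.g., irreducibly diagonally dominant ones) may still yield a convergent Jacobi iteration.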
Stochastic impact of chaotic behavior of asynchronous iteration methods
At this point it should be mentioned that only the synchronous Gauss–Seidel and Jacobi methods are deterministic. While synchronous relaxation algorithms provide the same solution approximations for every solver run, the unique ordering of the individual component updates in the asynchronous iteration method generates a sequence of iteration approximations that can usually only be reproduced by choosing exactly the same update order.
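This update-order dependence is easy to reproduce in a serial sketch (hypothetical pure Python, not the paper's GPU code): applying the same in-place component relaxations in two different orders yields different intermediate iterates, even though both runs approach the same fixed point.

```python
def relax_in_order(A, b, order):
    """Relax one component per entry of `order`, in place."""
    n = len(b)
    x = [0.0] * n
    for i in order:
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / A[i][i]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
first = relax_in_order(A, b, [0, 1])   # update x0, then x1
second = relax_in_order(A, b, [1, 0])  # same two updates, opposite order
```

After only one update per component the two orders already disagree, yet long runs of either order converge to the same solution x = (1/11, 7/11).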
Summary and future work
Synchronization is becoming increasingly costly for high performance computing, especially with the ever-increasing parallelism of today's computer architectures. As a result, new algorithms and optimization techniques that reduce synchronization are becoming essential for obtaining high efficiency, high performance, and scalability. We demonstrated and quantified this for relaxation methods on current GPUs (and multicore clusters with GPUs).
Acknowledgments
The authors would like to thank the National Science Foundation, the Department of Energy, NVIDIA, and the MathWorks for supporting this research effort. We would also like to thank the Karlsruhe House of Young Scientists (KHYS) for supporting the research cooperation.
References (39)
- et al., Chaotic relaxation, Linear Algebra and its Applications (1969)
- et al., On asynchronous iterations, Journal of Computational and Applied Mathematics (2000)
- A convergence theorem for chaotic asynchronous relaxation, Linear Algebra and its Applications (1997)
- H. Anzt, S. Tomov, J. Dongarra, V. Heuveline, A block-asynchronous relaxation method for graphics processing units, in: ...
- The opportunities and challenges of exascale computing
- A unified proof for the convergence of Jacobi and Gauss–Seidel methods, SIAM Review (1995)
- et al., Block and asynchronous two-stage methods for mildly nonlinear systems, Numerische Mathematik (1999)
- A.H. Baker, R.D. Falgout, T. Gamblin, T.V. Kolev, S. Martin, U. Meier Yang, Scaling algebraic multigrid solvers: on the ...
- A.H. Baker, R.D. Falgout, T.V. Kolev, U. Meier Yang, Multigrid smoothers for ultra-parallel computing, ...
- K. Bergman, et al., Exascale computing study: technology challenges in achieving exascale systems, ...
- Cooperative application/OS DRAM fault recovery
- Toward exascale resilience, International Journal of High Performance Computing Applications
- The international ExaScale software project roadmap, International Journal of High Performance Computing Applications
- Soft error resilient QR factorization for hybrid system with GPGPU, Journal of Computational Science
Cited by (21)
- Reducing the effect of global synchronization in delayed gradient methods for symmetric linear systems (2020, Advances in Engineering Software)
- Chaotic multigrid methods for the solution of elliptic equations (2019, Computer Physics Communications)
  Citation excerpt: "In order to remove this implicit synchronization from Jacobi-like solvers or smoothers, one can use the concept of 'Chaotic Relaxations' [14], a method in which individual rows of the equation system can be iterated in any chaotic order with guaranteed convergence. This theory has been used to create highly asynchronous Jacobi-like solvers, such as in Anzt et al. [15]; but this theory can be leveraged further. Here, a completely non-deterministic, chaotic solver is developed in which independent threads are used to overlap computational and communication work, completely desynchronizing parallel processes and removing a significant scalability bottleneck."
- Energy-efficient multigrid smoothers and grid transfer operators on multi-core and GPU clusters (2017, Journal of Parallel and Distributed Computing)
  Citation excerpt: "Asynchronous iteration has successfully been used in the context of high-performance computing, see e.g. [11] and references therein. For GPU-accelerated systems, the works reported in [3,4] investigate numerical convergence and performance of block-asynchronous iteration, both as a plain solver and in combination with mixed-precision iterative refinement. An extension of this setup to distributed-memory machines with several host processes sharing one or two devices per node, and an extension to asynchronous CPU-based solvers that also benefit from relaxed synchronization requirements, has been presented in [25], where runtime performance and energy consumption have also been assessed."
- Linear algebra software for large-scale accelerated multicore computing (2016, Acta Numerica)
- A Hierarchical Jacobi Iteration for Structured Matrices on GPUs using Shared Memory (2022, IEEE High Performance Extreme Computing Conference, HPEC 2022)
- An asynchronous incomplete block LU preconditioner for computational fluid dynamics on unstructured grids (2021, SIAM Journal on Scientific Computing)
Hartwig Anzt received his Ph.D. in mathematics from the Karlsruhe Institute of Technology (KIT) in 2012 and holds a Diploma in a joint program including mathematics, physics, and computer science from the University of Karlsruhe. Anzt’s main topics of research are in scientific high performance computing and include simulation algorithms, hardware-optimized numerics for GPU-accelerated platforms, communication-avoiding and asynchronous methods, and power-aware computing. In June 2013, Anzt will join Jack Dongarra’s Innovative Computing Lab (ICL) at the University of Tennessee, where he had already spent several months during his Ph.D. research collaborating on a number of problems in HPC. Furthermore, he enjoys a fruitful research cooperation with the High Performance Computing & Architectures group at University of Jaume. During his masters Anzt also spent one year at the University of Ottawa, Canada.
Stanimire Tomov is a Research Director in the Innovative Computing Laboratory (ICL) at the University of Tennessee. Tomov’s research interests include parallel algorithms, numerical analysis, and high-performance scientific computing (HPC). He has been involved in the development of numerical algorithms and software tools in a variety of fields ranging from scientific visualization and data mining to accurate and efficient numerical solution of PDEs. Currently, his work is concentrated on the development of numerical linear algebra libraries for emerging architectures for HPC, such as heterogeneous multicore processors, graphics processing units (GPUs), and Many Integrated Core (MIC) architectures.
Jack Dongarra holds an appointment at the University of Tennessee, Oak Ridge National Laboratory, and the University of Manchester. He specializes in numerical algorithms in linear algebra, parallel computing, use of advanced-computer architectures, programming methodology, and tools for parallel computers. He was awarded the IEEE Sid Fernbach Award in 2004 for his contributions in the application of high performance computers using innovative approaches; in 2008 he was the recipient of the first IEEE Medal of Excellence in Scalable Computing; in 2010 he was the first recipient of the SIAM Special Interest Group on Supercomputing’s award for Career Achievement; and in 2011 he was the recipient of the IEEE IPDPS 2011 Charles Babbage Award. He is a fellow of the AAAS, ACM, IEEE, and SIAM and a member of the National Academy of Engineering.
Vincent Heuveline is the Director of the Engineering Mathematics and Computing Lab (EMCL). He is full professor at the Karlsruhe Institute of Technology (KIT) where he is leading the Chair on Numerical Simulation, Optimization and High Performance Computing (NumHPC). His research interests include numerical simulation, high performance computing, software engineering with main application focuses on Energy Research, Meteorology and environment as well as Medical Engineering. He serves as a programme committee member of several international conferences on high performance computing. He is widely consulted for industry with respect to the deployment of numerical simulation and high performance computing in industrial environments.