A parallel block LU decomposition method for distributed finite element matrices
Highlights
- Parallel direct linear solver for finite element matrices.
- Parallel solution for dense Schur complements.
- Pre-computation of the block sparsity pattern.
- Evaluation of the algorithm by a challenging test suite.
Introduction
Finite element applications often require a fine mesh, which results in several million or even billion unknowns. Solving systems of this size in reasonable computing time is only feasible on a parallel machine. For this purpose, several applications require parallel direct solvers, since they are more general and more robust than iterative solvers. For example, preconditioned CG methods are restricted to symmetric positive definite systems, and multigrid or domain decomposition preconditioners require an efficient parallel coarse-problem solver. A direct solver often performs well in cases where many iterative solvers fail, e.g., if the problem is indefinite, unsymmetric, or ill-conditioned.
Our concept of a parallel LU decomposition for matrices resulting from finite element problems is based on a nested dissection approach. On P = 2^S processors, the algorithm uses S + 1 steps in which sets of processors are successively combined and the problems coupling these processors are solved, beginning with a single processor per set in the first step. The first step itself is comparable to the preprocessing step of non-overlapping domain decomposition methods [29], where the resulting Schur complement is then solved by a suitably preconditioned iteration. Here, a parallel LU decomposition for this Schur complement problem is introduced. A simple nested dissection method nevertheless fails for finite element applications since, in particular in 3D, the resulting Schur complement problems grow into large dense problems, so that they, too, have to be solved in a distributed and parallel fashion.
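The clustering schedule described above can be sketched as follows; the function name and the convention of grouping consecutive ranks are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of the nested-dissection processor clustering:
# on P = 2**S processors, step s uses 2**(S - s) clusters of 2**s
# consecutive processor ranks each.

def clusters(S, s):
    """Return the processor sets for step s: 2**(S - s) clusters,
    each containing 2**s consecutive processor ranks."""
    size = 2 ** s
    return [list(range(t * size, (t + 1) * size))
            for t in range(2 ** (S - s))]

# With P = 8 = 2**3 processors the algorithm uses S + 1 = 4 steps:
# s = 0 starts from singleton sets, s = S ends with one set of all ranks.
```

In each step two neighboring clusters are merged, so the interface problem between them can be eliminated by the processors of the combined set.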
Block LU factorizations and their analysis have been discussed by various authors, see, e.g., [14], [10], [18], [8]. A general-purpose algorithm for sparse matrices is realized, e.g., in MUMPS [1] for distributed memory and in PARDISO [25], [26] for shared memory machines. Parallel solvers can also be used within hybrid methods for solving subproblems, e.g., in SPIKE [23]. Sequential solvers for sparse matrices such as SuperLU [9] and UMFPACK [28] can be used to eliminate local degrees of freedom. A more general discussion of block LU decompositions and of handling linear systems on parallel computers can be found in [7], [11], [13]. A block LU decomposition method combined with iterative methods for the Schur complement can be found, e.g., in [4].
Our new parallel solver explicitly uses the structure of the finite element matrix. Thus, it is not a “black box” solver: knowledge of the structure of the finite element matrix and of the decomposition of the domain onto the processors is essential. Our new contribution is the efficient and transparent use of this structure for the parallel distribution of the elimination steps. In particular, we can identify a priori those parts of the resulting LU decomposition which remain zero, so that they can be ignored during the algorithm.
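The a priori zero blocks can be illustrated on a toy problem with two subdomains coupled only through an interface; the matrix sizes and random data below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Two subdomains with interior blocks A11, A22 coupled only via an
# interface block D (illustrative toy structure):
#   [A11  0   B1]
#   [ 0  A22  B2]
#   [C1  C2   D ]
# The zero (1,2)/(2,1) coupling blocks stay zero during elimination,
# so a solver aware of this structure never computes or stores them.
n, m = 3, 2                       # interior dofs per subdomain, interface dofs
rng = np.random.default_rng(0)
A11 = 4 * np.eye(n) + rng.random((n, n))
A22 = 4 * np.eye(n) + rng.random((n, n))
B1, B2 = rng.random((n, m)), rng.random((n, m))
C1, C2 = rng.random((m, n)), rng.random((m, n))
D = 4 * np.eye(m) + rng.random((m, m))
Z = np.zeros((n, n))

A = np.block([[A11, Z,   B1],
              [Z,   A22, B2],
              [C1,  C2,  D ]])

# Eliminating both interior blocks updates only the interface block:
S = (D - C1 @ np.linalg.solve(A11, B1)
       - C2 @ np.linalg.solve(A22, B2))
```

Solving the small dense Schur system S and back-substituting into the two decoupled interior problems reproduces the solution of the full system.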
The paper is organized as follows. In Section 2 a general setting for parallel finite elements is introduced which leads to a parallel block structure for the corresponding matrix. The mesh is first distributed on P processors and then refined locally l times, such that each processor handles a local part of the mesh. Then, a suitable block LU decomposition is defined and the parallel realization is discussed in Section 3. In Section 4 we introduce several finite element problems which are used in Section 5 for the evaluation of the parallel performance of our direct solver. In some cases the results are compared with the parallel direct solver MUMPS [1].
Section snippets
Parallel finite elements
Following [30], [31], we define a parallel additive representation of the stiffness matrix which directly corresponds to a parallel domain decomposition. This additive representation is the basis for the LU algorithm discussed in the next section.
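A minimal sketch of such an additive representation, assuming a 1D Laplacian model mesh distributed to two processors (the element data and distribution below are illustrative, not the setting of [30], [31]):

```python
import numpy as np

# Additive representation A = sum over processors p of the locally
# assembled stiffness matrices: each processor assembles only the
# elements it owns, and the global matrix is the sum of these parts.
n = 5                                    # global dofs on a 1D mesh
elem = [(0, 1), (1, 2), (2, 3), (3, 4)]  # 4 elements
owner = [0, 0, 1, 1]                     # element distribution to 2 procs
Ke = np.array([[1.0, -1.0],
               [-1.0, 1.0]])             # 1D Laplacian element matrix

A_local = [np.zeros((n, n)) for _ in range(2)]
for e, (i, j) in enumerate(elem):
    A_local[owner[e]][np.ix_([i, j], [i, j])] += Ke

A = A_local[0] + A_local[1]              # additive stiffness matrix
```

The interface degree of freedom (here dof 2) receives contributions from both processors, which is exactly the coupling the block LU decomposition of the next section has to eliminate.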
A parallel block LU decomposition
Now we introduce a parallel block LU decomposition based on the blocks associated with the processor sets π_k. We will see below that our algorithm is equivalent to a recursive Schur complement reduction method.
We start with a suitable numbering of the processors. The block LU decomposition is based on a recursive definition of combined processor sets in each step s with a cluster number t = 1, …, 2^(S−s), such that . For s = 0 each processor set consists of exactly
Model problems
We define a series of different model problems for our numerical tests. In our notation, u = u(x) is the solution, where x = (x, y)^T and x = (x, y, z)^T are the coordinate vectors in the 2D and 3D case, respectively.
Results
The numerical tests are run on the cluster hc3 of the Steinbuch Centre for Computing (SCC) in Karlsruhe with 332 eight-way compute nodes, where each node has two Intel Xeon quad-core sockets running at 2.53 GHz. The nodes are connected by an InfiniBand 4X QDR interconnect [15]. In the algorithm we use BLAS and LAPACK routines for dense matrices, provided by the Intel Math Kernel Library [19], and MUMPS [1], [21] or SuperLU [9], [27] as a solver for sparse matrices, where MUMPS emerges as the
Acknowledgement
The authors acknowledge the financial support from BMBF Grant 01IH08014A within the joint research project ASIL (Advanced Solvers Integrated Library).
References (31)
- et al., A parallel algorithm for multilevel graph partitioning and sparse matrix ordering, J. Parallel Distrib. Comput. (1998)
- et al., SPIKE: a parallel environment for solving banded linear systems, Comput. Fluids (2007)
- et al., PARDISO: a high-performance serial and parallel sparse linear solver in semiconductor device simulation, Future Gener. Comput. Syst. (2001)
- et al., A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM J. Matrix Anal. Appl. (2001)
- et al., LAPACK: A Portable Linear Algebra Library for High-Performance Computers (1990)
- E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. ...
- et al., A parallel linear system solver for circuit simulation problems, Numer. Linear Algebra Appl. (2000)
- Finite Elements: Theory, Fast Solvers, and Applications in Solid Mechanics (1997)
- CST – Computer Simulation Technology, Darmstadt. Available from: ...
- Krister Dackland, Erik Elmroth, Bo Kågström, Charles Van Loan. Design and Evaluation of Parallel Block Algorithms: LU ...
- Stability of block LU factorization, Numer. Linear Algebra Appl.
- A supernodal approach to sparse partial pivoting, SIAM J. Matrix Anal. Appl.
- Stability of block algorithms with fast level-3 BLAS, ACM Trans. Math. Softw.
- Finite Elements and Fast Iterative Solvers with Applications in Incompressible Fluid Dynamics