
Parallel Computing

Volume 37, Issue 12, December 2011, Pages 742-758

A parallel block LU decomposition method for distributed finite element matrices

https://doi.org/10.1016/j.parco.2011.05.007

Abstract

In this work we present a new parallel direct linear solver for matrices resulting from finite element problems. The algorithm follows the nested dissection approach, where the resulting Schur complements are also distributed in parallel. The sparsity structure of the finite element matrices is used to pre-compute an efficient block structure for the LU factors. We demonstrate the performance and the parallel scaling behavior on several test examples.

Highlights

  • Parallel direct linear solver for finite element matrices.
  • Parallel solution for dense Schur complements.
  • Pre-computation of the block sparsity pattern.
  • Evaluation of the algorithm by a challenging test suite.

Introduction

Finite element applications often require a fine mesh, which results in several million or even billion unknowns. Solving such systems in a reasonable computing time is only possible on a parallel machine. For this purpose several applications require parallel direct solvers, since they are more general and more robust than iterative solvers. For example, preconditioned CG methods are restricted to symmetric positive definite systems, and multigrid or domain decomposition preconditioners require an efficient parallel coarse problem solver. A direct solver often performs well in cases where many iterative solvers fail, e.g., if the problem is indefinite, unsymmetric, or ill-conditioned.

Our concept of the parallel LU decomposition for matrices resulting from finite element problems is based on a nested dissection approach. On P = 2^S processors, the algorithm uses S + 1 steps in which sets of processors are successively combined and the problems between these processor sets are solved, beginning with a single processor per set in the first step. The first step itself is comparable to the preprocessing step for non-overlapping domain decomposition methods [29], where the resulting Schur complement is then solved by a suitable preconditioned iteration. Here, a parallel LU decomposition for this Schur complement problem is introduced. A simple nested dissection method fails for finite element applications since, in particular in 3D, the resulting Schur complement problems grow into large dense problems, so that they too have to be solved distributed and in parallel.
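To make the elementary step of this recursion concrete, the following minimal Python/NumPy sketch performs a single block elimination and forms the Schur complement that is handed to the next step. The function name schur_reduction and the dense factorization are our own simplification for illustration; in the paper the interior blocks are eliminated with a sparse direct solver such as MUMPS or SuperLU.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def schur_reduction(A11, A12, A21, A22):
    """One step of a 2x2 block LU decomposition.

    Factors the interior block A11 and returns the Schur complement
    S = A22 - A21 @ inv(A11) @ A12, i.e. the interface problem that
    is handed on to the next nested dissection step.
    """
    lu, piv = lu_factor(A11)      # dense LU here; the paper eliminates
                                  # this local block with a sparse solver
    X = lu_solve((lu, piv), A12)  # X = A11^{-1} A12
    S = A22 - A21 @ X
    return (lu, piv), X, S

# Toy example: 1D Laplacian with 4 interior and 2 interface unknowns.
n, m = 6, 4
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
factors, X, S = schur_reduction(A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:])
```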

Block LU factorizations and their analysis have been discussed by various authors, see e.g., [14], [10], [18], [8]. A general purpose algorithm for sparse matrices is realized, e.g., in MUMPS [1] for distributed memory and in PARDISO [25], [26] for shared memory machines. Parallel solvers can also be used in hybrid methods for solving subproblems, e.g., in SPIKE [23]. Sequential solvers for sparse matrices such as SuperLU [9] and UMFPACK [28] can be used to eliminate local degrees of freedom. A more general discussion of block LU decomposition and of solving linear systems on parallel computers can be found in [7], [11], [13]. A block LU decomposition method with iterative methods for the Schur complement can be found, e.g., in [4].

Our new parallel solver explicitly uses the structure of the finite element matrix. Thus, it is not a "black box" solver: knowledge of the structure of the finite element matrix and of the decomposition of the domain onto the processors is essential. Our new contribution is the efficient and transparent use of this structure for the parallel distribution of the elimination steps. In particular, we can identify a priori those parts of the resulting LU factors which remain zero, so that they can be ignored during the algorithm.
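As an illustration of such an a priori identification of zero blocks, the following sketch (our own, not the paper's data structure, which works on the finite element coupling pattern) performs a symbolic block elimination: starting from the block coupling pattern, it predicts which blocks of the LU factors can receive fill, so that all remaining blocks can be skipped during the numerical factorization.

```python
def symbolic_block_fill(nonzero, n):
    """Symbolic block elimination on an n x n block matrix.

    nonzero: set of (i, j) block indices whose blocks are nonzero.
    Returns the predicted pattern of the LU factors; every block
    index absent from the result stays zero and can be ignored.
    """
    pattern = set(nonzero) | {(i, i) for i in range(n)}
    for k in range(n):                   # eliminate block column k
        rows = [i for i in range(k + 1, n) if (i, k) in pattern]
        cols = [j for j in range(k + 1, n) if (k, j) in pattern]
        for i in rows:
            for j in cols:
                pattern.add((i, j))      # fill block from L[i,k] * U[k,j]
    return pattern

# Arrow-shaped 3 x 3 block pattern: blocks (0,1) and (1,0) never fill.
fill = symbolic_block_fill({(0, 2), (2, 0), (1, 2), (2, 1)}, 3)
assert (0, 1) not in fill and (1, 0) not in fill
```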

The paper is organized as follows. In Section 2 a general setting for parallel finite elements is introduced which leads to a parallel block structure for the corresponding matrix. The mesh is first distributed to P processors and then refined locally l times, such that each processor handles a local part of the mesh. Then, a suitable block LU decomposition is defined and its parallel realization is discussed in Section 3. In Section 4 we introduce several finite element problems which are used in Section 5 for the evaluation of the parallel performance of our direct solver. In some cases the results are compared with the parallel direct solver MUMPS [1].


Parallel finite elements

Following [30], [31], we define a parallel additive representation of the stiffness matrix which directly corresponds to a parallel domain decomposition. This additive representation is the basis for the LU algorithm discussed in the next section.
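The following minimal serial sketch shows what such an additive representation means, assuming a hypothetical helper assemble_additive and a 1D toy problem: the global stiffness matrix is the sum of subdomain contributions A = Σ_p R_p^T A_p R_p, which the parallel algorithm never forms explicitly.

```python
import numpy as np

def assemble_additive(local_matrices, local_to_global, n_global):
    """Sum up subdomain contributions A = sum_p R_p^T A_p R_p.

    In the parallel algorithm this global sum is never formed; each
    processor keeps only its own A_p.  Here it is built explicitly
    to exhibit the additive structure on a serial toy problem.
    """
    A = np.zeros((n_global, n_global))
    for A_p, gmap in zip(local_matrices, local_to_global):
        for i, gi in enumerate(gmap):
            for j, gj in enumerate(gmap):
                A[gi, gj] += A_p[i, j]
    return A

# Two 1D linear elements on two "processors", sharing node 1.
Ke = np.array([[1.0, -1.0], [-1.0, 1.0]])   # element stiffness matrix
A = assemble_additive([Ke, Ke], [[0, 1], [1, 2]], 3)
# The interface entry A[1, 1] = 2 receives a contribution from both sides.
```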

A parallel block LU decomposition

Now we introduce a parallel block LU decomposition based on the blocks associated with the processor sets π_k. We will see below that our algorithm is equivalent to a recursive Schur complement reduction method.

We start with a suitable numbering of the processors, $\mathcal{P} = \{p_1, p_2, \ldots, p_{2^S}\}$. The block LU decomposition is based on a recursive definition of combined processor sets $P_{s,t}$ in each step $s$ with a cluster number $t = 1, \ldots, 2^{S-s}$, such that $\bigcup_{t=1}^{2^{S-s}} P_{s,t} = \mathcal{P}$. For $s = 0$ each processor set $P_{0,t}$ consists of exactly …
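A few lines suffice to illustrate this recursive combination of processor sets. The sketch below is our own illustration of the bookkeeping under the numbering above (with processors indexed from 0); the function processor_sets is a hypothetical helper, not part of the paper's implementation.

```python
def processor_sets(S):
    """Build the combined processor sets P_{s,t} for P = 2**S processors.

    Step s = 0 starts with singleton sets P_{0,t} = {p_t}; every further
    step merges neighbouring pairs, so step s has 2**(S - s) clusters
    and the final step s = S contains all processors in one set.
    """
    sets = {(0, t): [t - 1] for t in range(1, 2**S + 1)}
    for s in range(1, S + 1):
        for t in range(1, 2**(S - s) + 1):
            sets[(s, t)] = sets[(s - 1, 2*t - 1)] + sets[(s - 1, 2*t)]
    return sets

sets = processor_sets(3)                       # P = 8 processors, steps s = 0..3
assert sorted(sets[(3, 1)]) == list(range(8))  # the last set is all of P
```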

Model problems

We define a series of different model problems for our numerical tests. In our notation, u = u(x) is the solution, where x = (X, Y)^T and x = (X, Y, Z)^T are the coordinates in the 2D and 3D case, respectively.

Results

The numerical tests are realized on the cluster hc3 of the Steinbuch Centre for Computing (SCC) in Karlsruhe with 332 eight-way compute nodes, where each node has two quad-core Intel Xeon sockets running at 2.53 GHz. The nodes are connected by an InfiniBand 4X QDR interconnect [15]. In the algorithm we use BLAS and LAPACK routines for dense matrices, provided by the Intel Math Kernel Library [19], and MUMPS [1], [21] or SuperLU [9], [27] as a solver for sparse matrices, where MUMPS emerges as the …

Acknowledgement

The authors acknowledge the financial support from BMBF Grant 01IH08014A within the joint research project ASIL (Advanced Solvers Integrated Library).

References (31)

  • J.W. Demmel et al., Stability of block LU factorization, Numer. Linear Algebr. Appl. (1995)

  • J.W. Demmel et al., A supernodal approach to sparse partial pivoting, SIAM J. Matrix Anal. Appl. (1999)

  • J.W. Demmel et al., Stability of block algorithms with fast level-3 BLAS, ACM Trans. Math. Softw. (1992)

  • J.J. Dongarra, I.S. Duff, D.C. Sorensen, H.A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM (1991)

  • H. Elman et al., Finite Elements and Fast Iterative Solvers with Applications in Incompressible Fluid Dynamics (2005)