
Parallel Computing

Volume 35, Issue 5, May 2009, Pages 284-304

Parallel performance of the coarse space linear vertex solver and low energy basis preconditioner for spectral/hp elements

https://doi.org/10.1016/j.parco.2008.12.002

Abstract

A major bottleneck in scaling PDE-based codes to petaflop computing is the scalability of effective preconditioners. We have developed and implemented an effective and scalable low energy basis preconditioner (LEBP) for elliptic solvers, leading to computational savings of an order of magnitude with respect to other preconditioners. The efficiency of the LEBP relies on the implementation of the parallel matrix-vector multiplication required by the coarse solver to handle the h-scaling. We provide details on the optimization, parallel performance and implementation of the coarse-grain solver and show the scalability of the LEBP on the IBM Blue Gene and the Cray XT3.

Introduction

The scalability of highly effective preconditioners on thousands of processors is an open problem that has no satisfactory solution for many matrix problems arising from the discretisation of partial differential equations. As the effectiveness of a preconditioner increases, so do the global interdependencies, reflecting the multi-scale nature of the problem, and hence scaling deteriorates. To the best of our knowledge, there are currently no effective preconditioners for the new generation of modal spectral/hp element discretisations that scale well on more than one thousand processors.

Several effective preconditioners for spectral element Navier–Stokes solvers can be found in the literature, but very few of them are also scalable. Tufo and Fischer [1] presented a scalable parallel solver for the incompressible Navier–Stokes equations for the first (nodal) generation of spectral elements. The performance of the solver was evaluated on the Intel ASCI Red, Cray T3E-600 and SGI ASCI Blue computers with 333 MHz processors. Good scalability was observed for simulations with relatively high-order polynomial approximation. Bergen et al. [2] used hierarchical hybrid grids (HHG) to solve large-scale linear finite element problems efficiently. The HHG approach is essentially a geometric multigrid method. Good scalability and good solution times were achieved for the solution of a large problem on up to 1024 SGI Altix 3700 CPUs (1.6 GHz Itanium 2 processors). Lottes and Fischer [3] studied the performance of variants of the multigrid method applied to nodal spectral element discretizations of the Helmholtz equation. Several methods considered in their work achieved convergence rates comparable to those of regular (Cartesian) grid multigrid methods. Although effective for non-deformed spectral elements, geometric multigrid is essentially a sequential approach and cannot lead to high parallel efficiency on petaflop computers. Pavarino et al. [4] developed overlapping Schwarz methods for nodal triangular and quadrilateral spectral elements. Their results show that it is possible to obtain a convergence rate independent of the order of the polynomial expansion and of the number of elements.

A low energy basis preconditioner (LEBP) for elliptic substructured solvers was proposed by Bica [5] and later implemented by Sherwin and Casarin [6] for modal spectral/hp elements. In this work we present a parallel implementation of the LEBP appropriate for a large number of processors and investigate its performance. Specifically, we discuss in detail the implementation of a parallel coarse space linear vertex solver. Our results show that the LEBP is very effective in simulations on more than one thousand processors while exhibiting scalability similar to that of a diagonal preconditioner. The goal of this project was to develop a scalable and efficient coarse space linear vertex solver required by the LEBP for elliptic substructured solvers. Guided by information obtained with profiling tools on the IBM Blue Gene/L and the DataStar (Power4+/Federation) computer at SDSC, we were able to optimize the code by redesigning the communication in several bottleneck routines. Profiling the code simultaneously on several computational platforms allowed us to perform differential optimization. As a result, the code, equipped with the low energy basis preconditioner and the optimized coarse space linear vertex solver, now scales well on thousands of processors.

The spectral/hp element code NEKTAR is employed in our studies [7]. The computational domain consists of structured or unstructured grids, or a combination of both, similar to those employed in standard finite element and finite volume methods. In each element the solution is approximated in terms of hierarchical, mixed-order Jacobi polynomial expansions. This provides a way of hierarchically refining the numerical solution by increasing the order of the expansion (p-refinement) within every element without the need to regenerate the mesh, thus avoiding the potentially significant overhead associated with remeshing. Domain partitioning, required by the parallel solver, is performed with Metis [10]. NEKTAR employs a semi-implicit time integration scheme of up to third-order accuracy. A high-order splitting scheme [8] is adopted that decouples the velocity and pressure fields, requiring only the inversion of three Helmholtz operators for the velocity components (in three dimensions) and one Poisson operator for the pressure; for details see Appendix A. Due to the matrix sparsity arising from the symmetric linear operators, iterative solutions based on the preconditioned conjugate gradient (PCG) method are typically preferred. The effectiveness of a preconditioner is normally estimated by the reduction in iteration count. However, the parallel efficiency of a given preconditioner is also strongly affected by the additional computational cost associated with the preconditioning step and by the volume of communication involved. In this paper we measure the effectiveness of the preconditioner by monitoring the reduction in the average CPU time required for one time step in large 3D simulations of flow in arterial geometries.
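
As background for the cost trade-off just described, the listing below is a minimal, generic sketch of one preconditioned conjugate gradient solve of the kind used for such Helmholtz/Poisson systems; the callback names apply_A and apply_Minv, the tolerance and the dense-vector types are illustrative assumptions and do not reflect NEKTAR's actual interfaces.

// Minimal serial PCG sketch; apply_A and apply_Minv stand in for the
// Helmholtz/Poisson operator and the preconditioner (diagonal or LEBP).
// Illustrative assumption, not NEKTAR code.
#include <cmath>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using Op  = std::function<void(const Vec&, Vec&)>;   // writes result into a pre-sized vector

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Solves A x = b starting from the initial guess in x; returns the iteration count.
int pcg(const Op& apply_A, const Op& apply_Minv, const Vec& b, Vec& x,
        double tol = 1e-8, int max_iter = 1000) {
    const std::size_t n = b.size();
    Vec r(n), z(n), p(n), Ap(n);
    apply_A(x, Ap);
    for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
    apply_Minv(r, z);
    p = z;
    double rz = dot(r, z);
    for (int k = 0; k < max_iter; ++k) {
        if (std::sqrt(dot(r, r)) < tol) return k;          // converged
        apply_A(p, Ap);
        const double alpha = rz / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        apply_Minv(r, z);                                   // preconditioning step
        const double rz_new = dot(r, z);
        const double beta = rz_new / rz;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
        rz = rz_new;
    }
    return max_iter;
}

In this setting the preconditioner enters only through apply_Minv, so a more effective preconditioner pays off whenever its extra per-iteration cost and communication are outweighed by the reduction in iteration count, which is exactly the trade-off measured here via CPU time per time step.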

The following computational platforms were used for the benchmarking and code developing:

  • IBM Blue Gene at the San Diego Supercomputer Center (SDSC): This computer is housed in three racks with 3072 compute nodes. Each node has two PowerPC processors that run at 700 MHz and share 512 MB of memory. All compute nodes are connected by two high-speed networks: a 3D torus for point-to-point message passing and a global tree for collective message passing.

  • IBM Power4+/Federation of SDSC: This computer has eight-way P655+ 1.5 GHz and 32-way P690 1.7 GHz compute nodes with 32 and 128 GB of memory, respectively. In our study we used the P655 compute nodes.

  • Cray XT3 MPP system of the Pittsburgh Supercomputing Center (PSC) with 2068 compute nodes linked by a custom-designed interconnect. Each compute node has two 2.6 GHz AMD Opteron processors, each with its own cache, sharing 2 GB of memory and the network connection. The nodes are connected in a three-dimensional torus using a HyperTransport link to a dedicated Cray SeaStar communications engine.

  • Cray XT3 MPP of US Army Engineer Research and Development Center (ERDC): The computer has 4160 nodes, each containing one 2.6-GHz AMD Opteron 64-bit dual-core processor and 4 GB of shared memory. The nodes are connected in a three-dimensional torus using a HyperTransport link to a dedicated Cray SeaStar communications engine.

The paper is organized as follows: in Section 2 we first review the hierarchical basis implemented in NEKTAR. In Section 3 we overview the low energy preconditioner, extend the formulation of [6], and introduce a performance improvement of the LEBP for hybrid meshes with prismatic elements. In Section 4 we provide details on the construction and implementation of the parallel coarse space solver – the primary bottleneck in the scaling of the low energy preconditioner. In Section 5 we present results on the parallel efficiency of the LEBP. In Section 6 we conclude with a brief summary. In Appendix A (Numerical scheme in NEKTAR), Appendix B (Construction of the elemental transformation matrix) and Appendix C (Parallel matrix-vector multiplication in NEKTAR) we provide additional information on the time-stepping scheme employed in NEKTAR, the construction of the LEBP, and the communication patterns implemented for the parallel matrix-vector product in NEKTAR.

Section snippets

Expansion basis in NEKTAR

The computational domain in NEKTAR consists of tetrahedra, hexahedra, prisms, pyramids, or a combination of these. Within each element the solution is approximated in terms of hierarchical, mixed-order, semi-orthogonal Jacobi polynomial expansions [7]. The basis is hierarchical in the sense that the modes are separated into vertex (linear), edge, face and bubble (high-order) modes. In Fig. 1 we provide an illustration of the domain decomposition and polynomial basis employed in NEKTAR.
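
To make the vertex/edge/face/bubble split concrete, the short listing below counts the modes of each type for a tetrahedral element at uniform polynomial order P, following the standard hierarchical tetrahedral expansion described in [7]; the struct and function names are ours, chosen for illustration only.

// Mode counts for the hierarchical tetrahedral expansion of order P
// (4 vertices, 6 edges, 4 triangular faces); names are illustrative.
#include <cstdio>

struct TetModeCount {
    int vertex, edge, face, interior;
    int total() const { return vertex + edge + face + interior; }
};

TetModeCount tet_modes(int P) {
    TetModeCount m;
    m.vertex   = 4;                                // linear vertex modes
    m.edge     = 6 * (P - 1);                      // P-1 modes per edge
    m.face     = 4 * (P - 1) * (P - 2) / 2;        // interior modes of each face
    m.interior = (P - 1) * (P - 2) * (P - 3) / 6;  // bubble (high-order interior) modes
    return m;                                      // total = (P+1)(P+2)(P+3)/6
}

int main() {
    for (int P = 2; P <= 8; ++P) {
        TetModeCount m = tet_modes(P);
        std::printf("P=%d: vertex=%d edge=%d face=%d bubble=%d total=%d\n",
                    P, m.vertex, m.edge, m.face, m.interior, m.total());
    }
}

The four linear vertex modes per element are the ones gathered into the global coarse space treated by the parallel coarse vertex solver discussed in Section 4.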


Formulation

Consider the elliptic boundary value problem $-\nabla^{2} u(\mathbf{x}) + \lambda u(\mathbf{x}) = f(\mathbf{x})$, $\lambda \geq 0$, defined on a computational domain $\Omega$ which is discretized into $N_{el}$ non-overlapping spectral elements; the computational sub-domain associated with a particular element is denoted by $\Omega^{e}$. A standard spectral/hp element [7] is defined on a system of coordinates $\boldsymbol{\xi}$, local to that element, which can be mapped to the global coordinate system $\mathbf{x} = \mathbf{x}^{e}(\boldsymbol{\xi})$. Then the standard spectral/hp element [7] spatial approximation of $u$ is given by $u^{\delta}(\mathbf{x}) = \sum_{i=\ldots}$
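
The preview truncates the expansion; written in generic spectral/hp notation consistent with [7] (the symbols below are our own shorthand rather than a verbatim reconstruction of the missing equation), the discrete problem takes the form

$$u^{\delta}(\mathbf{x}) \;=\; \sum_{i=1}^{N_{\mathrm{dof}}} \hat{u}_{i}\,\Phi_{i}(\mathbf{x}), \qquad \Phi_{i} \in \{\text{vertex, edge, face, bubble modes}\},$$

and the Galerkin projection of $-\nabla^{2}u + \lambda u = f$ onto this basis yields the symmetric system

$$\left(\mathbf{L} + \lambda\,\mathbf{M}\right)\hat{\mathbf{u}} \;=\; \hat{\mathbf{f}},$$

where $\mathbf{L}$ and $\mathbf{M}$ denote the assembled Laplacian (stiffness) and mass matrices. Static condensation of the bubble modes leaves a Schur complement acting on the boundary (vertex, edge, face) modes; its linear vertex block is what the parallel coarse solver of Section 4 operates on.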

Parallel coarse space linear vertex solver

In this section we first discuss the partitioning of the global linear vertex degrees of freedom and the formulation of the Schur complement for the coarse vertex solve. Second, we compare several numerical approaches for the solution of the linear system required by the coarse vertex solve. Third, we provide details on the parallel construction of the Schur complement for the coarse vertex solve during the preprocessing stage. Finally, we discuss different algorithms for the parallel matrix-vector multiplication.
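
Since the snippet above is cut off, the listing below sketches one simple pattern for the distributed coarse matrix-vector product: each rank multiplies the Schur-complement rows assembled from its own elements against a replicated coarse vertex vector, and contributions at vertices shared between partitions are summed with MPI_Allreduce. The data layout and names are our assumptions for illustration; the communication patterns actually implemented in NEKTAR are described in Appendix C.

// Sketch: y = S_c * x for the coarse vertex Schur complement S_c, with the
// rows assembled from this rank's elements stored locally in CSR form.
// Rows at partition-boundary vertices appear on several ranks, so their
// partial results are summed by the Allreduce. Illustrative assumption,
// not NEKTAR's implementation.
#include <mpi.h>
#include <vector>

struct LocalCSR {
    std::vector<int>    row_ptr;   // size = number of locally stored rows + 1
    std::vector<int>    col;       // global coarse-vertex column indices
    std::vector<double> val;
};

// x: coarse vertex vector replicated on every rank (the coarse system is
// small, so replication is affordable in this naive variant).
// row_vertex[r]: global coarse-vertex index of local row r.
void coarse_matvec(const LocalCSR& S, const std::vector<int>& row_vertex,
                   const std::vector<double>& x, std::vector<double>& y,
                   MPI_Comm comm) {
    std::vector<double> y_partial(x.size(), 0.0);
    for (std::size_t r = 0; r < row_vertex.size(); ++r) {
        double sum = 0.0;
        for (int k = S.row_ptr[r]; k < S.row_ptr[r + 1]; ++k)
            sum += S.val[k] * x[S.col[k]];
        y_partial[row_vertex[r]] += sum;
    }
    // Sum the per-rank partial rows into the replicated result vector.
    y.assign(x.size(), 0.0);
    MPI_Allreduce(y_partial.data(), y.data(), static_cast<int>(x.size()),
                  MPI_DOUBLE, MPI_SUM, comm);
}

The Allreduce over the full replicated coarse vector is the simplest variant; at large processor counts the cost of such a coarse product tends to be dominated by the collective communication rather than by the local flops, which is why the choice of communication pattern (Appendix C) matters.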

Parallel performance results

In this section we investigate the performance of the low energy basis and diagonal preconditioners used in NEKTAR. We compare the scalability and effectiveness of the two preconditioners. The Schur complements of the mass and stiffness operators constructed with the polynomial basis used in NEKTAR are diagonally dominant, and their diagonal values vary, which makes diagonal preconditioning quite effective, unlike in other methods. In Table 4 we compare the number of iterations required by the two preconditioners.
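
For reference, the diagonal preconditioner used as the baseline amounts to the trivial apply step sketched below (a generic Jacobi apply, not NEKTAR's code), which is why its per-iteration cost and communication overhead are negligible; it can be plugged into the apply_Minv slot of the PCG sketch given earlier.

// Generic Jacobi (diagonal) preconditioner apply: z = D^{-1} r, where diag
// holds the diagonal of the (statically condensed) operator. Illustrative
// sketch only, not NEKTAR's implementation.
#include <vector>

void apply_diagonal_preconditioner(const std::vector<double>& diag,
                                   const std::vector<double>& r,
                                   std::vector<double>& z) {
    z.resize(r.size());
    for (std::size_t i = 0; i < r.size(); ++i)
        z[i] = r[i] / diag[i];
}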

Summary

We developed a scalable and efficient preconditioner for the solution of elliptic equations using the spectral/hp element method. The preconditioner was successfully applied to simulations of unsteady bioflows in domains with complex geometry, where the elliptic solves are the dominant cost. The number of iterations was reduced by a factor of 10–50, while the overall computational cost of the simulation was reduced by an order of magnitude when a standard diagonal preconditioner was replaced by the LEBP.

Acknowledgements

This work was supported by the National Science Foundation’s CI-TEAM program under Grant OCI-0636336. Supercomputer resources were provided by the NSF Large Allocations Resource Committee at the San Diego Supercomputer Center, DOD’s ERDC and the Pittsburgh Supercomputing Center. We thank the San Diego Supercomputer Center for their support of this project through the Strategic Applications Collaboration program. We also thank Dr. David O’Neal of PSC for his support. We want to

References (15)

  • H.M. Tufo, P.F. Fischer, Terascale spectral element algorithms and implementations, Gordon Bell Prize paper, in:...
  • B. Bergen, F. Hulsemann, U. Rude, Is 1.7 × 10¹⁰ unknowns the largest finite element system that can be solved today?,...
  • J.W. Lottes et al.

    Hybrid multigrid/Schwarz algorithms for the spectral element method

    Journal of Scientific Computing

    (2005)
  • L.F. Pavarino et al.

    Overlapping Schwarz methods for Fekete and Gauss–Lobatto spectral elements

    SIAM Journal on Scientific Computing

    (2007)
  • I. Bica, Iterative substructuring algorithm for p-version finite element method for elliptic problems. Ph.D. thesis,...
  • S. Sherwin et al.

    Low-energy basis preconditioning for elliptic substructured solvers based on unstructured spectral/hp element discretization

    Journal of Computational Physics

    (2001)
  • G.E. Karniadakis et al.

    Spectral/hp element methods for CFD

    (2005)
