Parallel performance of the coarse space linear vertex solver and low energy basis preconditioner for spectral/hp elements
Introduction
Scalability of highly effective preconditioners on thousands of processors is an open problem that had no satisfactory solution for many matrix problem arising from the discretisation of partial differential equations. As the effectiveness of the preconditioner increases so does the global interdependencies reflecting in a sense the multi-scale nature of the problem and hence the poor scaling. To the best of our knowledge, there are currently no effective preconditioners for the new generation of modal spectral/hp element discretisations that scale well on more than one thousand processors.
Several effective preconditioners for spectral element Navier–Stokes solvers can be found in the literature, but very few of these preconditioners are also scalable. Tufo and Fischer [1] presented a scalable parallel solver for the incompressible Navier–Stokes equation, for the first (nodal) generation of spectral elements. Performance of the solver was evaluated on the Intel ASCI-Red, CRAY T3E-600 and SGI ASCI-Blue computers with 333 MHz processors. Good scalability was observed for simulations with relatively high order polynomial approximation. Bergen et al. [2] used hierarchical hybrid grid (HHG) to solve efficiently large scale linear finite element problem. The HHG approach is essentially a geometric multigrid method. Good scalability and also good solution time was achieved for solution of a large problem on up to 1024 SGI Atlix™ 3700 CPUs (1.6 GHz Itanium 2 processors). Lottes and Fischer [3] studied performance of the variations of multigrid method applied to spectral element nodal discretizations of the Helmholtz equation. Several methods considered in their work resulted in convergence rates comparable to regular (Cartesian) grid based multigrid methods. Although effective for non-deformed spectral elements, geometric multigrid is essentially a sequential approach and can not lead to a high parallel efficiency on petaflop computers. Pavarino et al. [4] developed overlapping Shwarz methods for nodal triangular and quadrilateral spectral elements. Their results show that it is possible to obtain convergence rate independent of the order of polynomial expansion and number of elements.
A low energy basis preconditioner (LEBP) for elliptic substructured solvers was proposed by Bica [5] and later implemented by Sherwin and Casarin [6] for modal spectral/hp elements. In this work we present a parallel implementation of the LEBP appropriate for a large number of processors and investigate its performance. Specifically, we discuss in detail implementation of a parallel coarse space linear vertex solver. Our results show that LEBP is very effective in simulations on more than one thousand processors while it exhibits similar scalability to a diagonal preconditioner. The goal of this project was to develop a scalable and efficient coarse space linear vertex solver required by a LEBP for elliptic substructured solver. Guided by the information obtained with profiling tools on IBM Blue Gene/L and DataStar (Power4+/Federation) computer at SDSC we were able to make optimizations by redesigning the communication in several bottleneck routines. Profiling code simultaneously on several computational platforms allowed us to perform differential optimization. As a result, the code implemented with low energy basis preconditioner and the optimized coarse space linear vertex solver now scales well on thousands of processors.
The spectral/hp element code NEKTAR is employed in our studies [7]. The computational domain consists of structured or unstructured grids or a combination of both, similar to those employed in standard finite element and finite volume methods. In each element the solution is approximated in terms of hierarchical mixed order Jacobi polynomial expansions. This provides a way of hierarchically refining the numerical solution by increasing the order of the expansion (p-refinement) within every element without the need to regenerate the mesh, thus avoiding a potential significant overhead cost associated with remeshing. Domain partitioning, required by the parallel solver, is done by Metis [10]. NEKTAR employs up to third-order accurate semi-implicit time integration scheme. A high-order splitting scheme [8] is adopted, that decouples the velocity and pressure fields requiring only the inversion of three Helmholtz operators for the velocity components (in three-dimensions) and one Poisson operator for the pressure; for details see Appendix A. Due to the matrix sparsity arising from symmetric linear operators, iterative solutions based on preconditioned conjugate gradient (PCG) are typically preferred. The effectiveness of preconditioner is normally estimated by reduction in iteration count. However, the parallel efficiency of a given preconditioner is also strongly affected by the additional computational cost associated with the preconditioning step and by the volume of communication involved. In this paper we measure the effectiveness of preconditioner by monitoring the reduction in the average cpu-time required for one time step in large 3D simulations of flow in arterial geometries.
The following computational platforms were used for the benchmarking and code developing:
- •
IBM Blue Gene of San Diego Supercomputing Center (SDSC): This computer is housed in three racks with 3072 compute nodes. Each node has two PowerPC processors that run at 700 MHz and share 512 MB of memory. All compute nodes are connected by two high-speed networks: a 3D torus for point-to-point message passing and a global tree for collective message passing.
- •
IBM Power4+/Federation of SDSC: This computer has eight-way P655+ 1.5 GHz and 32-way P690 1.7 GHz compute nodes with 32 and 128 GB of memory, respectively. In our study we used the P655 compute nodes.
- •
Cray XT3 MPP system of Pittsburgh Supercomputing Center (PSC) with 2068 compute nodes linked by a custom-designed interconnect. Each compute node has two 2.6 GHz AMD Opteron processors with its own cache, and shared 2 GB of memory and the network connection. The nodes are connected in a three-dimensional torus using a HyperTransport link to a dedicated Cray SeaStar communications engine.
- •
Cray XT3 MPP of US Army Engineer Research and Development Center (ERDC): The computer has 4160 nodes, each containing one 2.6-GHz AMD Opteron 64-bit dual-core processor and 4 GB of shared memory. The nodes are connected in a three-dimensional torus using a HyperTransport link to a dedicated Cray SeaStar communications engine.
The paper is organized as follows: in Section 2 we first review the hierarchical basis implemented in NEKTAR. In Section 3 we overview the low energy preconditioner, extend the formulation of [6], and introduce an improvement in performance of the LEBP for hybrid meshes with prismatic elements. In Section 4 we provide details on the construction and implementation of parallel coarse space solver – the primary bottleneck in the scaling of the low energy preconditioner. In Section 5 we present results on parallel efficiency of the LEBP. In Section 6 we conclude with a brief summary. In Appendix A Numerical scheme in NEKTAR, Appendix B Construction of the elemental transformation matrix, Appendix C Parallel matrix vector multiplication in NEKTAR we provide additional information on the time-stepping scheme employed in NEKTAR, construction of LEBP and communication patterns implemented for parallel matrix-vector product in NEKTAR.
Section snippets
Expansion basis in NEKTAR
The computational domain in NEKTAR consists of tetrahedra, hexahedra, prisms, pyramids or a combination of these. Within each element the solution is approximated in terms of hierarchical, mixed order, semi-orthogonal Jacobi polynomial expansions [7]. It is hierarchical in a sense that the modes are separated into vertex (linear term), edge, face and bubble (high-order terms) modes. In Fig. 1 we provide an illustration of the domain decomposition and polynomial basis employed in NEKTAR.
The
Formulation
Consider the elliptic boundary value problemdefined on computational domain Ω which is discretized into non-overlapping spectral elements; the computational sub-domain, associated with a particular element, is denoted by . A standard spectral/hp element [7] is defined on a local, to this element, system of coordinates ξ, which can be mapped to the global coordinate system . Then, a standard spectral/hp element [7] spatial approximation of u is given by
Parallel coarse space linear vertex solver
In this section we first discuss the partitioning of global linear vertex degrees of freedom and formulation of the Schur complement for the coarse vertex solve. Second, we compare several numerical approaches to tackle the solution of a linear system required by the coarse vertex solve. Third, we provide details on the parallel construction of the Schur complement for the coarse vertex solve during the preprocessing stage. Finally, we discuss different algorithms for parallel matrix-vertex
Parallel performance results
In this section we investigate the performance of low energy basis and diagonal preconditioner used in NEKTAR. We compare the scalability and effectiveness of the two preconditioners. The Schur complement of the Mass and the Stiffness operators constructed with the polynomial basis used in NEKTAR are diagonally dominant, and the values on their diagonals are varying, and this makes the diagonal preconditioning quite effective, unlike in other methods. In Table 4 we compare the number of
Summary
We developed a scalable and efficient preconditioner for solution of elliptic equations using the spectral/hp element method. The preconditioner was successfully applied to simulations of unsteady bioflows in domains with complex geometry, where the elliptic solves is the dominant cost. The number of iterations was reduced by factor 10–50 while the overall computational cost of the simulation was reduced by an order of magnitude when a standard diagonal preconditioner was substituted by the
Acknowledgements
This work was supported by the National Science Foundation’s CI-TEAM program under Grant OCI-0636336. Supercomputer resources were provided by the NSF Large Allocations Resource Committee at San Diego Supercomputer Center, DOD’s ERDC and Pittsburgh Supercomputing Center. We want to thank the San Diego Supercomputer Center for their support of this project through the Strategic Applications Collaboration program. We also like to thank Dr. David O’Neal of PSC for his support. We want to
References (15)
- H.M. Tufo, P.F. Fischer, Terascale spectral element algorithms and implementations gordon bell prize paper, in:...
- B. Bergen, F. Hulsemann, U. Rude, Is 1.7×1010 unknowns the largest finite element system that can be solved today?,...
- et al.
Hybrid multigrid/Schwarz algorithms for spectral element method
Journal of Scientific Computing
(2005) - et al.
Overlapping Schwartz method for Fekete and gauss-Lobatto spectral elements
SIAM Journal of Scientific Computing
(2007) - I. Bica, Iterative substructuring algorithm for p-version finite element method for elliptic problems. Ph.D. thesis,...
- et al.
Low-energy basis preconditioning for elliptic substructured solvers based on unstructured spectral/hp element discretization
Journal of Computational Physics
(2001) - et al.
Spectral/hp element methods for CFD
(2005)
Cited by (17)
GPGPU implementation of mixed spectral-finite difference computational code for the numerical integration of the three-dimensional time-dependent incompressible Navier-Stokes equations
2014, Computers and FluidsCitation Excerpt :A stabilized finite-element formulation for the three-dimensional unsteady incompressible Navier–Stokes equations was implemented on a distributed-memory parallel computer by Behara and Mittal [26]. Grinberg et al. [27] developed and tested an effective and scalable low-energy-basis preconditioner for Navier–Stokes solvers. Though, and since a few years, cluster performance appears to have reached an upper limit for many technical codes.
Sub-iteration leads to accuracy and stability enhancements of semi-implicit schemes for the Navier-Stokes equations
2011, Journal of Computational PhysicsA new domain decomposition method with overlapping patches for ultrascale simulations: Application to biological flows
2010, Journal of Computational PhysicsAn unconditionally stable rotational velocity-correction scheme for incompressible flows
2010, Journal of Computational Physics