1 Introduction and Motivation

Geodynamics

Mantle convection is a critical component of the Earth system. Although composed of rocks, the Earth's mantle behaves like a highly viscous fluid on geological time-scales. Its motion is driven by internal heating due to radioactive decay and by substantial heating from below, which stems from the release of primordial heat stored in the core since the time of the planet's accretion. Mantle convection currents are largely responsible for many of Earth's surface tectonic features, forming mountain chains and oceanic trenches through plate tectonics, and contributing significantly to the accumulation of stresses released in inter-plate earthquakes. A thorough quantitative understanding of mantle convection is therefore indispensable for gaining further insight into these processes. Beyond such fundamental questions, mantle convection also has a direct influence on societal and commercial issues. Viscous stresses caused by up- and downwellings in the mantle induce dynamic topography, i.e. they lead to elevation or depression of parts of Earth's surface. Reconstructing dynamic topography and the associated sea levels back in time is crucial for locating oil reservoirs and for determining future sea-level rise caused by climate change.

Although the basic equations describing mantle convection are not in question, resulting from the balance between viscous and buoyancy forces and the conservation of mass and energy, key system parameters, such as buoyancies and viscosities, remain poorly known. In particular the rheology of the mantle, a fundamental input parameter for geodynamic models, is uncertain. Studies based on modeling the geoid, e.g. [87], the convective planform, e.g. [21], glacial isostatic adjustment, e.g. [72, 80], true polar wander, e.g. [82, 86, 89], and plate motion changes, e.g. [57], consistently point to the need for a significant viscosity increase between the upper and the lower mantle, but the precise form of the viscosity profile remains uncertain. Commonly the proposed profiles display a peak in the mid-lower mantle [81, 85], involve a rheology contrast located around 1000 km depth [93], or favor an asthenosphere with a strong viscosity reduction to achieve high flow velocities and stress amplification in the sublithospheric mantle [51, 102]. Geodynamic arguments on the uncertainties and trade-offs in the viscosity profile of the upper mantle have recently been summarized by Richards and Lenardic [88].

Since the mantle is not directly accessible to observations and laboratory experiments can hardly reproduce the relevant temperatures, pressures, and time-scales, computer simulation is vital for studying mantle convection. An example is given in Fig. 1. One of the challenges is the enormous range of spatial scales that must be resolved: while the mantle has a thickness of roughly 3000 km, important features, such as subducting slabs, may only have a width of a few kilometers. Additionally, realistic simulations require a very large number of time-steps. Furthermore, fully realistic models of the mantle must deal with many complexities, such as non-linear viscosity models, phase transitions, or the treatment of the non-constant hydrostatic background density of the mantle, see e.g. [50].

Fig. 1

Velocity flowlines under the Himalaya mountain range; snapshot from a global convection model simulated with hierarchical hybrid grids

TerraNeo

TerraNeo aims to design and realize a new community software framework for extreme-scale Earth mantle simulations. The project builds on a special constellation in which groups from geophysics, numerical mathematics, and high performance computing collaborate in a unique co-design effort. The first 3-year funding period has already been summarized in [7]. Initially, the team focused on fundamental mathematical questions together with general scalability and performance issues, as documented in [38, 40,41,42,43]. In particular, it could be shown that even on peta-scale class machines, computations are possible that resolve the global Earth mantle with about 1 km resolution, requiring the solution of indefinite sparse linear systems with more than \(10^{12}\) degrees of freedom. These very large computations and scaling experiments were performed with prototype software implemented on top of the pre-existing hierarchical hybrid grids (HHG) multigrid library [12, 13]. This step was necessary to gain experience with parallel multigrid solvers for indefinite problems and the specific difficulties arising in Earth mantle models. While this approach led to quick first results, the prototype character of HHG implied several fundamental limitations. Therefore, the second funding period is being used for a fundamental redesign of HHG under the new acronym HyTeG, for Hybrid Tetrahedral Grids. The goal is to leverage the lessons learned with HHG for the design of a new, extremely fast, but also flexible, extensible, and sustainable geophysics software framework.

Additionally, research on several exascale topics was conducted in both funding periods. These include resilience, inverse problems, and uncertainty quantification. Detailed results on these topics, as well as on the new HyTeG software architecture, will be reported in the following sections of this report.

Multigrid Methods

Multigrid methods are among the fastest solvers for the sparse linear systems that arise from the discretization of PDEs and are thus of central importance for exascale research. Their invention and popularization, e.g. in [17], constitute one of the major breakthroughs in numerical mathematics and computational science. To illustrate their relevance for exascale computing, we summarize here a thought experiment from [90].

Assume that a system with \(N = 10^{12}\), i.e. a trillion degrees of freedom (DoF), were solved by classical Gaussian elimination. This would require the astronomical number of \(2/3\,N^3 = 2/3 \times 10^{36}\) operations. Better elimination-based algorithms can exploit the sparse matrix structure; a typical method of this class, such as nested dissection, would still require around \(10^{24}\) operations. For such a work load, a modern, fast PC with a performance of 100 GFlops would need more than 100,000 years of compute time. Note that solving one such system accounts for only a single time step out of many. This line of thought may be seen as just another argument why Earth mantle convection is a grand challenge problem and why e.g. the authors of [23] write:
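For reference, the quoted compute time follows from a simple estimate, assuming the full \(10^{24}\) operations and a sustained rate of \(10^{11}\) operations per second:

$$\displaystyle \begin{aligned} \frac{10^{24}\ \text{operations}}{10^{11}\ \text{operations/s}} = 10^{13}\ \mathrm{s} \approx 3 \times 10^{5}\ \text{years}. \end{aligned}$$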

A uniform discretization of the mantle at for instance 1 km resolution would result in meshes with nearly a trillion elements, which is far beyond the capacity of the largest available supercomputers.

It is important to realize that parallel computing alone does not solve the problem. Even the fastest German supercomputer, currently SuperMUC-NG, would still need more than a year of compute time (if it did not run out of memory first) to execute \(10^{24}\) operations.

In the TerraNeo project we have demonstrated, as we will report below, that a well-designed massively parallel multigrid method is indeed capable of solving such systems with a trillion unknowns in compute times on the order of seconds. This is fundamentally due to the fast convergence of multigrid for many types of problems, which, when combined with nested iteration in the form of a so-called full multigrid method [17, 41], leads to solvers of asymptotically optimal complexity. Against this background it comes as little surprise that multigrid methods are employed with excellent success for a wide variety of applications in several of the SPPEXA projects, e.g. [3, 4, 24, 69, 75].

In the multigrid literature of the past two decades, much attention has been given to so-called algebraic multigrid (AMG) methods, which do not operate on a mesh structure but attempt to construct the necessary hierarchy by analyzing the sparse system matrix. These methods can exhibit excellent parallel scalability [3] and are often used as components within other parallel solvers, such as domain decomposition methods [62]. AMG methods have several advantages, most importantly that they can be interfaced to an application via classical sparse matrix technology. They can also relieve the application developer of worrying about the solution process, and many numerical analysts who design novel discretizations have taken this perspective. However, AMG methods work well only for certain types of systems, and consequently users of such new discretization techniques now find themselves confronted with linear systems for which no efficient parallel algorithms are available yet. Few methods devised in the applied mathematics community have been demonstrated even for systems with \(10^{9}\) degrees of freedom, three to four orders of magnitude behind what we can demonstrate here.

Additionally, the convenience of algebraic multigrid comes at the price of a loss in performance. Algebraic multigrid methods have a limited scope of applicability, and they often lose an order of magnitude in performance in their expensive setup phase. They are also inherently not matrix-free. As we will discuss below, matrix-free methods are essential to reach maximal performance. Geometric multigrid methods can be realized as matrix-free algorithms, potentially leading to another order of magnitude in performance improvement. For this reason we have invested heavily in new matrix-free methods [8, 9, 11], similar to other SPPEXA projects [5, 67].

The price of using geometric multigrid lies in the more complex algorithms, which often have to be tailored carefully to each specific application, and in the significantly more complex interfaces that are needed between the solver and the other software components. This in turn creates the need for new software technologies such as those being developed in the HyTeG framework.

2 Basic Ideas and Concepts

As mentioned above, the mathematical-physical model of mantle convection is based on the balance between viscous and buoyancy forces and the conservation of mass and energy. The resulting general system of equations is given e.g. in [7, 105]. Expressing it in terms of the central quantities of interest, i.e. velocity, pressure, and temperature, results in a coupled system composed of a stationary Stokes component and a time-dependent equation for the temperature. The former constitutes the most expensive part of these simulations, and we will therefore mostly concentrate on this aspect in the following. In the case of the Boussinesq approximation, the Stokes part is given in non-dimensional form by

$$\displaystyle \begin{aligned} -{\mathrm{div}}\big(\eta (\nabla \mathbf{u} + (\nabla \mathbf{u})^{\top})\big) + \nabla p&= -Ra \,T\,{\mathbf{e}}_r {} \end{aligned} $$
(1)
$$\displaystyle \begin{aligned} {\mathrm{div}}(\mathbf{u})&= 0 {} \end{aligned} $$
(2)

with dimensionless velocity u, pressure p, and the unit vector in radial direction \({\mathbf{e}}_r\). The Rayleigh number Ra describes the vigor of the convection and is given by

$$\displaystyle \begin{aligned} Ra = \frac{\alpha g \rho_0 \Delta T R_0^3}{\kappa \eta_r}\enspace. \end{aligned}$$

A description of the physical quantities represented by the symbols is given in Table 1. The finite element discretization that we consider for the stationary equations is based on the concept of HHG. Initially, the domain is partitioned into an unstructured mesh of tetrahedra. In a second step, each tetrahedron is repeatedly and uniformly refined such that a block-structured tetrahedral mesh is created [12, 13, 40]. This refinement strategy results in a grid hierarchy \(\mathcal {T} \mathrel{\mathop:}= \{ \mathcal {T}_\ell , \ell = 0, 1, \dots , L \}\) that allows for a canonical implementation of geometric multigrid methods.
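To give a feeling for the resulting problem sizes, the following small Python sketch counts the vertices of a single uniformly refined macro tetrahedron via the standard closed-form expression for structured simplicial refinement and multiplies by a hypothetical number of macro elements; the macro mesh size used here is purely illustrative and not a TerraNeo configuration.

    # Sketch: scalar DoFs (vertices) generated by uniform refinement of macro tetrahedra.
    # A tetrahedron whose edges are split into n = 2**level segments carries
    # (n+1)(n+2)(n+3)/6 vertices; vertices shared between macro elements are
    # counted several times here, so the estimate is a slight upper bound.
    def vertices_per_macro_tet(level: int) -> int:
        n = 2 ** level
        return (n + 1) * (n + 2) * (n + 3) // 6

    n_macro_tets = 100_000  # hypothetical unstructured coarse mesh (illustrative)
    for level in range(5, 10):
        print(f"level {level}: ~{n_macro_tets * vertices_per_macro_tet(level):.2e} vertices")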

Table 1 Physical quantities and their representing symbols

The computational domain is distributed to the parallel processes by means of the unstructured coarse grid elements. To realize efficient communication, the mesh is enhanced by so-called interface primitives: for each shared face, edge, and vertex an additional macro-primitive is allocated in the data structure. Together with the macro-tetrahedra, each primitive is assigned to one parallel process. Starting from a twice refined mesh, each macro-primitive contains at least one inner vertex.

We employ conforming finite elements on the domain Ω and define the function spaces of globally continuous, piecewise polynomial functions of polynomial degree d on level \(\ell\) as

$$\displaystyle \begin{aligned} S^d_\ell(\Omega) \mathrel{\mathop:}= \{ v \in \mathcal{C}(\overline{\Omega}) : v|{}_T \in P_d(T), \forall T \in \mathcal{T}_\ell\}. \end{aligned} $$

For the majority of our experiments and simulations we implement a stabilized P1-P1 discretization for the Stokes equation, using the velocity and pressure finite element spaces

$$\displaystyle \begin{aligned} {\mathbf{V}}_\ell \times Q_\ell \mathrel{\mathop:}= \left[ S^1_\ell(\Omega) \cap H^1_0(\Omega) \right]^3 \times S^1_\ell(\Omega) \cap L^2_0(\Omega). \end{aligned} $$

Here, \(L^2_0(\Omega ) \mathrel{\mathop:}= \{q \in L^2(\Omega ) : \langle q, 1 \rangle _\Omega = 0\}\) with \(\langle \cdot, \cdot \rangle_\Omega\) being the inner product in \(L^2(\Omega)\). The P1-P1 element pairing does not fulfill the LBB condition, and we therefore employ a PSPG stabilization [19, 33]. The standard weak formulation of the discretized Stokes problem then reads: find \(({\mathbf{u}}_\ell , p_\ell) \in {\mathbf{V}}_\ell \times Q_\ell\) such that

$$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle a({\mathbf{u}}_\ell,{\mathbf{v}}_\ell) + b({\mathbf{v}}_\ell,p_\ell) = f({\mathbf{v}}_\ell) \qquad \quad \forall\,{\mathbf{v}}_\ell \in {\mathbf{V}}_\ell,\\ &\displaystyle &\displaystyle b({\mathbf{u}}_\ell,q_\ell) - c_\ell(q_\ell,p_\ell) = g_\ell(q_\ell) \qquad \quad \forall\,q_\ell \in Q_\ell, {} \end{array} \end{aligned} $$
(3)

where

$$\displaystyle \begin{aligned} a(\mathbf{u},\mathbf{v}) &\mathrel{\mathop:}= 2\langle \eta D(\mathbf{u}), D(\mathbf{v}) \rangle_{\Omega} \\ b(\mathbf{u},q) &\mathrel{\mathop:}= -\langle \mathrm{div}\ \mathbf{u}, q \rangle_{\Omega} \\ f(\mathbf{v}) &\mathrel{\mathop:}= \langle \mathbf{f}, \mathbf{v} \rangle_{\Omega}, \end{aligned} $$

for all \(\mathbf {u}, \mathbf {v} \in H^1_0(\Omega )^3\) and \(q \in L^2_0(\Omega )\). The stabilization terms are defined as

$$\displaystyle \begin{aligned} c_\ell(q_\ell,p_\ell) &\mathrel{\mathop:}= \sum_{T \in \mathcal{T}_\ell} \frac{1}{12} (\int_T dx)^{1/3} \,\langle \nabla p_\ell, \nabla q_\ell \rangle_T \\ g_\ell(q_\ell) &\mathrel{\mathop:}= -\sum_{T \in \mathcal{T}_\ell} \frac{1}{12} (\int_T dx)^{1/3} \, \langle \mathbf{f}, \nabla q_\ell \rangle_T. \end{aligned} $$

The discrete formulation of (3) can then be expressed as the system of linear equations

$$\displaystyle \begin{aligned} \mathcal{K} \left( \begin{array}{c} \mathbf{u} \\ p \end{array} \right) \mathrel{\mathop:}= \left( \begin{array}{cc} A & B^{\top} \\ B & -C \end{array} \right) \left( \begin{array}{c} \mathbf{u} \\ p \end{array} \right) = \left( \begin{array}{c} \mathbf{f} \\ \mathbf{g} \end{array} \right), \end{aligned} $$
(4)

where A, B, and C denote the matrices associated with the bilinear forms a(⋅,⋅), b(⋅,⋅), and c_ℓ(⋅,⋅), and f and g the corresponding load vectors.

A description of the general form of the temperature equation and our work on its discretization and time-stepping can be found in [7].

3 Summary of Project Results

3.1 Efficiency of Solvers and Software

A rigorous quantitative performance analysis lies at the heart of systematic research in large-scale scientific computing and is central to the research agenda of computational science [91], since it quantifies how successful computational models are. In this sense, performance analysis means more than measuring the speedup of a parallel solver or studying its weak and strong scaling properties. Beyond this, modern scientific computing must quantify numerical cost in metrics that are independent of the run-time of an arbitrary reference implementation. Most importantly, the numerical cost must be set in relation to what is numerically achieved, i.e. the accuracy delivered by a specific computation.

As one step in this direction, we extend Achi Brandt's notion of textbook multigrid efficiency (TME) to massively parallel algorithms in [41]. Using the finite element based prototype multigrid implementation of the HHG library, we have applied the TME paradigm to scalar linear equations with constant and varying coefficients as well as to linear systems with saddle-point structure. A core step is to extend the idea of TME to the parallel setting in a way that is adapted to the parallel architecture under consideration. To this end, we develop a new, architecture-aware characterization of a work unit (WU) that takes modern performance modeling techniques into account, in particular the standard roofline model and the more advanced ECM model. The newly introduced parallel TME measure is studied in [41] with large-scale computations, solving problems with up to 200 billion unknowns.
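As a minimal illustration of the architecture-aware reasoning behind such a work unit, the sketch below evaluates the roofline bound for one matrix-free operator application; the per-DoF operation and traffic counts as well as the machine parameters are illustrative assumptions, not the measured values from [41].

    # Sketch: roofline bound for one work unit (one fine-grid operator application).
    # All numbers below are assumptions for illustration only.
    def roofline_time(flops, bytes_moved, peak_flops, peak_bandwidth):
        """Roofline model: execution time is bounded below by the larger of the
        pure compute time and the pure memory-transfer time."""
        return max(flops / peak_flops, bytes_moved / peak_bandwidth)

    dofs = 1.0e9                      # unknowns handled by one node (assumed)
    flops_per_dof = 2 * 15            # one 15-point stencil: 15 multiply-adds
    bytes_per_dof = 3 * 8             # e.g. load input/output vectors, store result

    t_compute = dofs * flops_per_dof / 1.0e12      # assuming 1 TFlop/s per node
    t_memory = dofs * bytes_per_dof / 100.0e9      # assuming 100 GB/s per node
    t_wu = roofline_time(dofs * flops_per_dof, dofs * bytes_per_dof, 1.0e12, 100.0e9)
    print(f"work unit: {t_wu:.3f} s ({'memory' if t_memory > t_compute else 'compute'} bound)")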

3.2 Reducing Complexity in Models and Algorithms

When striving for optimal efficiency, it is essential to choose models and discretization schemes whose computational cost is as low as possible. Within TerraNeo we have therefore analyzed and improved the discretization schemes under study. A restricting factor was initially that only simple conforming discretizations are feasible in the prototype HHG library; discontinuous Galerkin discretizations, as in [6, 69], for example, were not available. Buoyancy-driven flow models, however, demand a careful treatment of the mass-balance equation to avoid spurious source and sink terms in the non-linear coupling between flow and transport. In the context of finite elements, it is therefore commonly proposed to employ sufficiently rich pressure spaces containing piecewise constant shape functions to obtain local or even strong mass conservation. In three-dimensional computations this usually requires nonconforming approaches, special meshes, or higher-order velocities, which make these schemes prohibitively expensive for some applications and complicate the implementation into legacy code. In [39], we propose and analyze a lean and conservatively coupled scheme based on standard stabilized linear equal-order finite elements for the Stokes part and vertex-centered finite volumes for the energy equation. We show that, even with only a weak mass balance, exact conservation properties can be recovered by a local flux correction which can be computed efficiently on the control volume boundaries of the transport mesh. Furthermore, we discuss implementation aspects and demonstrate the effectiveness of the flux correction for different two- and three-dimensional examples motivated by the geophysical applications in [101].

In the special case of constant viscosity the Stokes system can be cast into different formulations by exploiting the incompressibility constraint. For instance the strain in the weak formulation can be replaced by the gradient to decouple the velocity components in the different coordinate directions. Thus the discretization of the simplified problem leads to fewer nonzero entries in the stiffness matrix. This is of particular interest in large scale simulations where a reduced memory footprint and accordingly reduced bandwidth requirement can help to significantly accelerate the computations. In the case of a piecewise constant viscosity, as it typically arises in multi-phase flows, or when the boundary conditions involve traction, the situation is more complex, and the cross derivatives in the original Stokes system must be treated with care. A naive application of the standard vectorial Laplacian results in a physically incorrect solution, while formulations based on the strain increase the computational effort everywhere, even when the inconsistencies arise only from an incorrect treatment in a small fraction of the computational domain. In [55], we present a new approach that is consistent with the strain-based formulation and preserves the decoupling advantages of the gradient-based formulation in iso-viscous subdomains. The modification is equivalent to locally changing the discretization stencils, hence the more expensive discretization is restricted to a lower dimensional interface, making the additional computational cost asymptotically negligible. We demonstrate the consistency and convergence properties of the new method and show that in a massively parallel setup, the multigrid solution of the resulting discrete systems is faster than for the classical strain-based formulation.
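The decoupling argument can be made explicit: for constant viscosity η and a divergence-free velocity field, using \({\mathrm{div}}\big((\nabla\mathbf{u})^{\top}\big) = \nabla({\mathrm{div}}\,\mathbf{u})\), the viscous term in (1) simplifies to

$$\displaystyle \begin{aligned} -{\mathrm{div}}\big(\eta (\nabla \mathbf{u} + (\nabla \mathbf{u})^{\top})\big) = -\eta\, \Delta \mathbf{u} - \eta\, \nabla({\mathrm{div}}\,\mathbf{u}) = -\eta\, \Delta \mathbf{u}, \end{aligned}$$

so that each velocity component satisfies a scalar equation and the components couple only through the pressure and the incompressibility constraint.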

3.3 Stokes Solvers and Performance

3.3.1 Multigrid Approaches for the Stokes System

In this subsection we briefly review different classes of solvers for Stokes-type systems that involve a multigrid component. One characteristic feature of these systems is their indefinite structure, and thus special care is required in the design of the solver. Numerical experiments and performance studies were originally performed with the HHG software and are thus mostly limited to stabilized conforming P1-P1 discretizations. We also point to the possibility of generating such parallel solvers automatically while achieving similarly good scalability and efficiency, as demonstrated in the ExaStencils project [70, 75]. A summary of the supercomputers used for the simulations conducted in this project is given in Table 2.

Table 2 Characteristics of the different supercomputers used for simulations presented in this publication

In TerraNeo we consider three classes of solvers: a preconditioned Krylov method for the indefinite system, a preconditioned Krylov method for the positive definite pressure-based Schur complement, and a monolithic (sometimes also called all-at-once) multigrid solver for the indefinite system. While in the first two approaches the multigrid solver is only applied to the velocity component, the monolithic multigrid scheme works simultaneously on the velocity and the pressure. A systematic quantitative performance study of all three approaches on modern peta-scale systems is given in [43]. More than 750,000 parallel threads are used in the largest simulation runs. Our main finding is that the all-at-once multigrid scheme outperforms the two alternatives: it not only has the smallest memory footprint, but also results in the shortest compute time. Systems with more than 10 trillion (\(>10^{13}\)) unknowns have been solved on a Blue Gene/Q system, see Table 3.

Table 3 Exploring the limits of the monolithic multigrid solver: weak scaling on JUQUEEN

For such huge systems a sophisticated matrix-free code design with minimized memory traffic is a must. Although GMRES-type Krylov methods for the indefinite system are quite popular due to their robustness, they are not an option for us: the non-optimal memory requirement and the growing compute time per iteration before the next restart slow down the overall performance drastically and restrict the feasible system size. To test the flexibility of the solver, we consider not only a box geometry but also a thick spherical shell motivated by our application, a pipe filled with spherical obstacles of different diameters, and a simplified artery, see Fig. 2. The physical systems under consideration range from a thermo-mechanically coupled system to a non-linear viscosity model of Carreau type.

Fig. 2

Left to right: concentration in a lid driven cavity type setting; streamlines and speed in a benchmark problem; streamlines for a non-linear dynamic viscosity model and stationary flow in a pipe

3.3.2 Smoothers for Indefinite Systems

There are different strategies to construct smoothers for the coupled system. Among the popular ones are Braess–Sarazin, Vanka, and Uzawa type approaches. While the first two consider a saddle point problem either in a global or a local form as starting point, the Uzawa type scheme is based on a factorization of the saddle point system into lower and upper triangular matrices, see [16, 77, 98, 99, 104]. The advantages of the Uzawa version are a smaller FLOP count and reduced communication traffic compared to the two alternatives. Moreover, in contrast to the Vanka smoother, where local patches have to be considered, it can be applied node-wise. Let us assume that A is symmetric and positive definite and that \(\hat A\) and \(\hat S\) satisfy

$$\displaystyle \begin{aligned} \hat A \geq A, \quad \hat S \geq S := C + B A^{-1} B^\top \end{aligned}$$

then the following smoothing property holds

$$\displaystyle \begin{aligned} \| \mathcal{K} \mathcal{M}^\nu \| \leq \sqrt{2}\, \eta(\nu- 1)\, \| \mathcal{D} \|\enspace, \end{aligned}$$

where ν is the number of smoothing steps, ∥⋅∥ a suitable operator norm and

$$\displaystyle \begin{aligned} \mathcal{M} := \text{Id} - \hat{\mathcal{K}}^{-1} \mathcal{K}, \quad \hat{\mathcal{K}} := \left( \begin{array}{cc} \hat A & 0 \\ B & -\hat S \end{array} \right), \quad \mathcal{D} := \left( \begin{array}{cc} \hat A & 0 \\ 0 & \hat S \end{array} \right), \quad \eta(\nu) := \frac{1}{2^\nu} \left ( \begin{array}{c} \nu \\ \lfloor (\nu-1)/2 \rfloor \end{array} \right) . \end{aligned}$$

A mathematically rigorous proof is given in [29]. Following the abstract framework provided by Larin and Reusken [73], the following identity turns out to be of crucial importance

$$\displaystyle \begin{aligned} \mathcal{K} \mathcal{M}^\nu = M_1 \mathcal{K} \mathcal{M}_s^{\nu -1 } M_2 \end{aligned}$$

with suitable operators M_1, M_2 and \(\mathcal {M}_s \) being the error propagation operator of the standard Uzawa method. We note that, in comparison to earlier theoretical results [95, 106, 107], we do not require symmetry of the Uzawa type smoother, and thus the computational cost is reduced from two applications of the velocity smoother per iteration to one. The achieved upper bound for the smoothing property is sufficient to guarantee level-independent \(\mathcal W\)-cycle or variable \(\mathcal V\)-cycle convergence. In large scale applications, however, \(\mathcal V\)-cycle schemes are much more attractive than \(\mathcal W\)-cycle based multigrid methods: the number of coarse solves per iteration is one in case of the \(\mathcal V\)-cycle, but it grows exponentially with the depth of the multigrid hierarchy in case of a \(\mathcal W\)-cycle. Unfortunately, no \(\mathcal V\)-cycle theory is available for this all-at-once multigrid method, and numerical experiments show that the convergence rate deteriorates for a fixed, level-independent number of smoothing steps per level when the number of levels increases. As a compromise, a mildly variable \(\mathcal V\)-cycle turns out to perform best. The HHG data structures and the matrix-free concept allow us to test the solver for systems with more than 100 million unknowns even on a standard workstation. The computations reported in Table 4 are performed on an Intel Xeon CPU E3-1226 v3, 3.30 GHz with 32 GB memory. For this first test only a single core is used, and thus the timing does not reflect the parallel performance of the solver. For the pressure we use a damped Gauss–Seidel iteration applied to the operator C, where the damping parameter is determined by a power iteration on a coarse level and depends not on the mesh size but only on the mesh regularity.

Table 4 Number of iterations to reduce the residual by a factor of \(10^{-8}\) (Intel Xeon)
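To illustrate the structure of such an Uzawa-type smoothing step for the saddle point system (4), the following NumPy toy sketch performs the forward sweep (velocity update followed by a pressure correction with the updated velocity). The dense random blocks and the diagonal choice for \(\hat A\) are assumptions made for the sketch; the HHG implementation operates matrix-free on stencils instead.

    import numpy as np

    def uzawa_sweep(A, B, C, f, g, u, p, A_hat_inv, S_hat_inv):
        """One inexact Uzawa smoothing step for  A u + B^T p = f,  B u - C p = g."""
        u = u + A_hat_inv @ (f - A @ u - B.T @ p)      # velocity update
        p = p + S_hat_inv @ (B @ u - C @ p - g)        # pressure update (uses new u)
        return u, p

    # Tiny random saddle point toy problem (illustration only).
    rng = np.random.default_rng(0)
    n, m = 12, 4
    A = np.eye(n) * 4 + 0.1 * rng.standard_normal((n, n))
    A = 0.5 * (A + A.T)                                # symmetric, diagonally dominant
    B = rng.standard_normal((m, n))
    C = 1e-2 * np.eye(m)                               # stabilization block
    f, g = rng.standard_normal(n), rng.standard_normal(m)

    A_hat_inv = np.diag(1.0 / np.diag(A))              # Jacobi-type choice (assumed)
    S_hat_inv = np.linalg.inv(C + B @ A_hat_inv @ B.T) # approximate Schur complement
    u, p = np.zeros(n), np.zeros(m)
    for _ in range(25):
        u, p = uzawa_sweep(A, B, C, f, g, u, p, A_hat_inv, S_hat_inv)
    print(np.linalg.norm(f - A @ u - B.T @ p), np.linalg.norm(g - B @ u + C @ p))

Within the multigrid method only a few such sweeps are applied per level; they are not iterated to convergence as in this toy loop.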

While the all-at-once multigrid solver requires a special smoother, the Schur complement formulation allows for a natural application of multigrid to the velocity component as part of an inner iteration within a preconditioned conjugate gradient method for the pressure. In the case of isoviscous flow the Schur complement is spectrally equivalent to the pressure mass matrix, and thus the condition number does not deteriorate with an increasing number of degrees of freedom. Due to its simple structure this iterative solver can easily be implemented by reusing standard software components and is thus well suited for a rigorous performance analysis. We do not report further details here, but an innovative performance analysis can be found in [40, 41]. Good scalability on more than half a million parallel threads is demonstrated in [42]; together with excellent node-level performance, this is essential for achieving a high overall performance.
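A compact sketch of this Schur complement approach is given below: a hand-rolled preconditioned CG iterates on the pressure, and every application of the Schur complement involves an inner velocity solve (here a dense solve standing in for the inner multigrid iteration). Operators, sizes, and the diagonal mass-matrix stand-in are assumptions for the sketch.

    import numpy as np

    def pcg(apply_S, b, apply_M_inv, tol=1e-10, max_iter=200):
        """Preconditioned conjugate gradients for the SPD system S p = b."""
        p = np.zeros_like(b)
        r = b - apply_S(p)
        z = apply_M_inv(r)
        d, rz = z.copy(), r @ z
        for _ in range(max_iter):
            Sd = apply_S(d)
            alpha = rz / (d @ Sd)
            p, r = p + alpha * d, r - alpha * Sd
            if np.linalg.norm(r) < tol:
                break
            z = apply_M_inv(r)
            rz_new = r @ z
            d, rz = z + (rz_new / rz) * d, rz_new
        return p

    rng = np.random.default_rng(1)
    n, m = 30, 10
    A = np.eye(n) * 4 + 0.1 * rng.standard_normal((n, n)); A = 0.5 * (A + A.T)
    B = rng.standard_normal((m, n))
    mass_diag = np.full(m, 1.0 / m)                      # stand-in for a lumped mass matrix

    apply_S = lambda q: B @ np.linalg.solve(A, B.T @ q)  # inner velocity solve per application
    apply_M_inv = lambda r: r / mass_diag                # mass-matrix preconditioner

    b = rng.standard_normal(m)
    p = pcg(apply_S, b, apply_M_inv)
    print("Schur residual:", np.linalg.norm(apply_S(p) - b))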

3.3.3 Multigrid Coarse Grid Solvers

Finally, we report on results for agglomeration techniques. In the context of the matrix-free HHG implementation it is natural to use a preconditioned Krylov method as coarse grid solver. For very large systems, however, this becomes a bottleneck, since this coarse solver is non-optimal and will thus ultimately limit scalability. To improve the parallel efficiency for systems beyond a billion unknowns, we proceed in two steps. In a first preprocessing step, the coarse grid problem is assembled in a classical fashion, storing matrix entries in standard compressed row storage format. The system can then be solved by methods from the PETSc library [79]; more specifically, a block-preconditioned MINRES iteration is applied, using an AMG-preconditioned conjugate gradient (CG) iteration for the velocity component and a lumped mass-matrix approach for the pressure. At this stage the full set of parallel processes is still employed, so that only small coarse grid subproblems are assigned to each process. In the second step, we therefore propose to use agglomeration to reduce the number of parallel processes on the coarsest level and thereby the communication overhead in each iteration. This becomes necessary when the ratio between communication and computation becomes unbalanced and the overall coarse solve time is severely dominated by communication. In Table 5, we report the parallel efficiency for a weak scaling study on JUQUEEN (IBM Blue Gene/Q system, 28 racks, 28,672 nodes, 458,752 cores). The number of unknowns ranges from approximately 80 million to over 100,000 million, and the number of parallel threads is increased by a factor of more than 15,000. Over this large range, the parallel efficiency shows only a moderate deterioration and remains above 90% even in the largest run. This results mainly from the fact that for the largest run we reduce the number of processes on the coarsest level by a factor of eight. Using a master-slave agglomeration technique has the advantage of a short collection and distribution phase.

Table 5 Parallel efficiency for master-slave agglomeration technique and up to more than 50,000 parallel threads on JUQUEEN
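The agglomeration step itself can be sketched with mpi4py communicator splitting: groups of processes gather their coarse grid contributions onto one master each, the masters form a smaller communicator for the coarse solve, and the result is scattered back. The agglomeration factor and the payload are placeholders; the actual implementation gathers assembled matrix and vector blocks and hands them to PETSc.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    AGG_FACTOR = 8                           # agglomerate 8 processes onto one master (assumed)
    group = rank // AGG_FACTOR
    group_comm = comm.Split(group, rank)

    local_coarse_data = {"rank": rank, "rows": []}          # placeholder payload
    gathered = group_comm.gather(local_coarse_data, root=0)

    is_master = group_comm.Get_rank() == 0
    master_comm = comm.Split(0 if is_master else MPI.UNDEFINED, rank)

    if is_master:
        # ... assemble the agglomerated coarse grid problem from `gathered` and
        #     solve it on `master_comm`, e.g. with a block-preconditioned MINRES ...
        parts = [None] * group_comm.Get_size()              # per-process solution pieces
    else:
        parts = None

    local_solution_piece = group_comm.scatter(parts, root=0)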

In Fig. 3 we compare the naive non-optimal coarse solver with the PETSc-based agglomeration strategy for the largest run, i.e. np = 61,440. With the naive approach, roughly half of the compute time is spent in the coarse solve, although the coarse level contains only a small fraction of the total number of unknowns. The situation is drastically different for the agglomeration technique in combination with the PETSc solver: here only roughly 5% of the actual time to solution is spent in the coarse solver routines. A closer look at the ratio between MPI communication and operator computations shows that without the agglomeration strategy a significant amount of time is lost in communication, whereas with agglomeration the communication overhead is reduced to less than 5%. We note that for smaller system sizes and a moderate number of parallel threads, the difference in time to solution between the two approaches is negligible. For np = 30 the naive non-optimal coarse solver takes less than 10% of the total run-time and the communication load is less than 5% compared to the FLOP count; moreover, the savings in run-time with a PETSc coarse solver are limited by its setup phase. In conclusion, and somewhat contrary to conventional wisdom, we find that for up to ten million unknowns a more sophisticated coarse grid solver does not pay off.

Fig. 3

Ratio of relative time to solution spent in the different multigrid components: a non-optimal Krylov based solver without agglomeration (left) and an AMG preconditioned Krylov based solver with agglomeration (right)

To show its generality, we also test the strategy on a different peta-scale system (Hazel Hen) and for a problem with a strongly varying, discontinuous viscosity profile, as is relevant for geodynamic simulations, see Table 6. Here, we additionally exploit a block low-rank (BLR) coarse grid solver, which uses low-rank compression within an elimination method to obtain an approximate solver [1]. The reduced parallel efficiency, compared to the previous setting, results neither from the different architecture nor from the use of BLR, but mainly from the fact that one additional multigrid iteration is required.

Table 6 Parallel efficiency in a weak scaling experiment for a geodynamically relevant setting on Hazel Hen

3.4 Multi-Level Monte Carlo

Due to errors in measurements, geophysical data are inevitably stochastic. It is therefore desirable to quantify these uncertainties in geophysical simulations. A typical approach is to use ensemble simulations, i.e. to run multiple computations with perturbed input data and to evaluate the quantities of interest via Monte Carlo sampling. A naive sampling-based uncertainty quantification for 3D partial differential equations results in an extremely large computational complexity. More sophisticated approaches, such as multilevel Monte Carlo (MLMC), can reduce this complexity significantly. The performance can be further enhanced when the Monte Carlo sampling over several levels of mesh refinement is combined with a fast multigrid solver. In a parallel environment, however, sophisticated scheduling strategies are needed to exploit MLMC based on multigrid solvers. In [28], we explored the concurrent execution across the three layers of the MLMC method, namely parallelization across levels, across samples, and across the spatial grid of each sample.
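The telescoping structure that underlies MLMC can be written down in a few lines of Python. The toy quantity of interest below replaces the PDE solve by a cheap analytic model whose level-dependent "discretization error" decays geometrically; sample counts and the model itself are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(42)

    def qoi(level, xi):
        """Toy quantity of interest for random input xi on a given level:
        an 'exact' functional plus a level-dependent discretization error."""
        return np.exp(0.1 * xi) + 0.1 * 2.0 ** (-(level + 1))

    def mlmc_estimate(samples_per_level):
        """MLMC estimator: E[Q_L] ~ sum_l mean(Q_l - Q_{l-1}), with the level
        difference evaluated on the *same* random inputs (coupled samples)."""
        estimate = 0.0
        for level, n_samples in enumerate(samples_per_level):
            xis = rng.normal(size=n_samples)
            q_fine = np.array([qoi(level, x) for x in xis])
            q_coarse = np.array([qoi(level - 1, x) for x in xis]) if level > 0 else 0.0
            estimate += np.mean(q_fine - q_coarse)
        return estimate

    # Many cheap coarse samples, few expensive fine samples (illustrative counts).
    print("MLMC estimate:", mlmc_estimate([2000, 400, 50]))

In the actual application each evaluation of the quantity of interest is a full PDE solve with the multigrid solver described above, which is what makes the scheduling question discussed next relevant.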

An alternative way to classify these different parallel execution models is to consider the time-processor diagram, as illustrated in Fig. 4, where the scheduling of each sample \(Y^{i}_{\ell }\), with index \(1 \leq i \leq N_\ell\) on level \(0 \leq \ell \leq L\), is represented by a rectangular box whose height expresses the number of processors used. We call a parallel execution model homogeneous bulk synchronous if at any time in the processor diagram all tasks execute on the same level with the same number of processors; otherwise we call it heterogeneous bulk synchronous.

Fig. 4

From left to right: illustration of homogeneous (one-layer and two-layer) and of heterogeneous bulk synchronous strategies (two-layer and three-layer)

The one-layer homogeneous strategy, as shown in Fig. 4 (left), offers no flexibility: the theoretical run-time is simply the sum of the times of all samples. Even though this method guarantees perfect load balancing, it does not lead to optimal efficiency, since on the coarser levels the scalability of the solver is typically significantly worse than on the finer levels; on the coarsest level there may even be fewer grid points than processors. Thus, we discard this approach from our list of possible options.

We instead consider the following two variants: firstly, sample synchronous homogeneous (SaSyHom), where a synchronization step is imposed after each sequential step (see Fig. 5, left); here statistical quantities can be updated after each synchronization point. Secondly, level synchronous homogeneous (LeSyHom), where each block of processors executes all of its samples without any synchronization (see Fig. 5, center).

Fig. 5

Illustration of different homogeneous scheduling strategies. Left: sample synchronous homogeneous (SaSyHom); centre: level synchronous homogeneous (LeSyHom); right: dynamic level synchronous homogeneous (DyLeSyHom)

In the case of large run-time variations between samples, we additionally consider dynamic variants, where samples are assigned to processor blocks dynamically. Figure 5 (right) illustrates the DyLeSyHom strategy. Here it is essential that not all processor blocks execute the same number of sequential steps.
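The effect of dynamic assignment can be mimicked by a small serial simulation: a greedy scheduler hands the next pending sample to the processor block that becomes idle first and is compared against a fixed round-robin assignment. The run-time distribution and block count are illustrative; this is not the scheduler used in [28].

    import heapq
    import random

    def dynamic_makespan(runtimes, n_blocks):
        """Greedy dynamic scheduling: each sample goes to the earliest idle block."""
        heap = [(0.0, b) for b in range(n_blocks)]
        heapq.heapify(heap)
        for t in runtimes:
            idle_at, block = heapq.heappop(heap)
            heapq.heappush(heap, (idle_at + t, block))
        return max(finish for finish, _ in heap)

    def static_makespan(runtimes, n_blocks):
        """Static round-robin assignment fixed before execution."""
        finish = [0.0] * n_blocks
        for i, t in enumerate(runtimes):
            finish[i % n_blocks] += t
        return max(finish)

    random.seed(0)
    runtimes = [random.lognormvariate(0.0, 1.0) for _ in range(500)]   # large variation
    print("static :", static_makespan(runtimes, 16))
    print("dynamic:", dynamic_makespan(runtimes, 16))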

We compare the static strategies with their dynamic variants in the case of large run-time variations on three MLMC levels, i.e. L = 2, with a fine grid resolution of about \(1.6 \cdot 10^7\) mesh nodes. We assume a lognormal random coefficient but, in order to increase the run-time variation, choose a larger variance \(\sigma^2 = 4.0\) and λ = 0.5. As quantity of interest we employ the PDE solution u evaluated at the point x = (0.5, 0.5, 0.5). All samples are computed with an FMG-2V(4,4) cycle, followed by up to a hundred additional V(4,4) cycles until a relative residual of \(10^{-10}\) is reached. Pre-computed variance estimates lead to an a priori strategy with \((N_\ell)_{\ell=0,1,2} = (27{,}151,\ 10{,}765,\ 3792)\) samples per level. In this scenario, 8192 processors were available to the scheduler. In Table 7 we compare the required time of the LeSyHom strategy in the static and dynamic variants and see that the dynamic strategy is 6% faster than its static counterpart.

Table 7 Comparison of static and dynamic scheduling strategies

The largest uncertainty quantification scenario we considered involved finest grids with almost 70 billion degrees of freedom and a total number of samples beyond 10,000, most of which are computed on coarser levels. This computation was executed on the JUQUEEN supercomputer using more than 132,000 processor cores, with an overall compute time of about 1.5 h.

3.5 Inverse Problem and Adjoint Computations

The adjoint method is a powerful technique to compute sensitivities (Fréchet derivatives) with respect to model parameters. It enables the solution of inverse problems for which analytical solutions are not available and for which repeatedly solving the associated forward problem would be prohibitively expensive. In geodynamics it has been applied to the inverse problem of mantle convection, i.e. to restoring past mantle flow, e.g. [22], where it finds an optimal initial flow state that evolves into the present-day state of Earth's mantle. In doing so, the adjoint method has the potential to link together diverse observations and theoretical expectations from seismology, geology, mineral physics, paleomagnetism, and fluid dynamics, greatly enhancing our understanding of the solid Earth system.

Adjoint equations for mantle flow restoration have been derived for incompressible [22, 52, 59], compressible [35], and thermo-chemical mantle flow [36], and the uniqueness properties of the inverse problem have been related explicitly to the tangential component of the surface velocity field of a mantle convection model [25]. Knowledge of the latter is thus essential to assure convergence [100] and to obtain a small null space for the restored flow history [52]. To reduce the computational cost, we use a two-scale time step size predictor-corrector strategy to couple the Stokes and temperature equations, similar to the one discussed in [46].

The framework was used to implement the adjoint technique. Specifically, we minimized the misfit functional

$$\displaystyle \begin{aligned} \chi(T) \mathrel{\mathop:}= \frac{1}{2}\int \limits_{\Omega} \int \limits_{[t_0,t_1]}\Big(T(T_0,\mathbf{x}, t) - T_E(\mathbf{x})\Big)^2 \delta (t-t_1) \, \mathrm{d}t \, \mathrm{d}x \end{aligned} $$
(5)

which measures the squared difference between the geodynamic model temperature and the thermal heterogeneity distribution \(T_E\) at the final state \(t = t_1\). The latter can be inferred, e.g., from seismology and considerations of mantle mineralogy. We used a first-order optimization approach to update the initial temperature \(T_0\) in the descent direction ϕ, as described by

$$\displaystyle \begin{aligned} T_0^{i+1}(\mathbf{x}) = T_0^{i}(\mathbf{x})-\beta \phi(\mathbf{x}, t_0) \quad \text{for }i=0,1,\dots\,, \end{aligned}$$

where β controls the correction in the descent direction and has to be sufficiently small [84]. To reduce the computational cost of determining the step length, we found that fixing β = 0.4 yields sufficiently accurate results in our experiments, in agreement with [22]. The descent direction ϕ(x, t_0) is obtained by solving the adjoint equations given by Bunge et al. [22]. Solving the adjoint equations requires velocity and temperature states from the forward problem; we checkpointed these states in the forward run such that they could be read during the adjoint iteration. In our exploration of the inverse problem, we used a setup similar to the one presented in [23]. We considered a 45° × 45° section of a spherical shell domain and discretized it by a tetrahedral mesh. After four uniform refinement steps, the mesh has a resolution of \(2.7 \cdot 10^5\) grid points. The initial temperature field is given by

$$\displaystyle \begin{aligned} T(\mathbf{x}) = T_0 + \exp\left(-\frac{1}{\sigma^2} \|\mathbf{x}-{\mathbf{x}}_0\|{}^2\right), \end{aligned}$$

where σ = 1/20 determines the extent of the anomaly and \({\mathbf{x}}_0\) denotes its center. Note that in our setup the latter is situated 2/9 D below the core-mantle boundary, where D denotes the relative mantle thickness, and thus lower than in [23].
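The structure of the resulting inversion loop is sketched below with a linear toy forward model, for which the adjoint is simply the transposed operator; the operator, its size, and the convergence behavior are illustrative assumptions, while the fixed step length β = 0.4 and the choice of the first guess mirror the setup described here and in the twin experiment below.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 50
    # Linear toy forward model T_final = M @ T0; the real forward model is the
    # non-linear, time-dependent mantle convection code with checkpointing.
    M = np.eye(n) + 0.3 * rng.standard_normal((n, n)) / np.sqrt(n)

    def forward_solve(T0):
        return M @ T0

    def adjoint_solve(residual):
        # For the linear toy model the adjoint run reduces to the transpose.
        return M.T @ residual

    T0_true = rng.standard_normal(n)
    T_E = forward_solve(T0_true)             # terminal state driving the inversion
    T0 = T_E.copy()                          # first guess: the final reference state

    beta = 0.4                               # fixed step length, cf. the discussion above
    for i in range(11):
        T_final = forward_solve(T0)
        misfit = 0.5 * np.linalg.norm(T_final - T_E) ** 2
        print(f"iteration {i}: misfit = {misfit:.3e}")
        phi = adjoint_solve(T_final - T_E)   # descent direction phi(x, t0)
        T0 = T0 - beta * phi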

3.5.1 Twin Experiment

To verify our implementation, we considered a so-called twin experiment. In this setup a reference flow history is generated from the forward model; the final state of this reference flow is used as the terminal state that drives the geodynamic inverse problem. The reconstructed flow history is then compared against the known reference history to assess the quality of the inversion. Thus, while in a realistic geophysical setting only the final state T_E is known, in the twin experiment both the true initial state T_I and the final state T_E are given by the reference twin. This allows us to study the convergence of the initial and final states. In addition to the misfit error (5) at the final time, we define the misfit error at the initial time by

$$\displaystyle \begin{aligned} m_I(T) \mathrel{\mathop:}= \frac{1}{2} \int \limits_{\Omega} \int \limits_{[t_0,t_1]} \Big( T(\mathbf{x}, t)-T_I(\mathbf{x})\Big)^2 \delta (t-t_0) \, \mathrm{d}t \, \mathrm{d}x \end{aligned}$$

In Fig. 6, we present the initial and final temperature (left and right) of the reference twin. Figure 7 shows isosurfaces of the reconstructed initial (top row) and final (bottom row) temperature after adjoint iterations i = 0, 5, 10. For simplicity, the first guess for the unknown initial temperature is taken from the final state of the reference twin; therefore, the reconstructed final temperature for i = 0 is a plume that has evolved much farther than the final reference state. After solving the inverse problem with 10 iterations, the error at the final state is reduced by more than one order of magnitude, the error at the initial state is reduced by a factor of 3, and we observe good agreement between reconstructed and reference temperatures.

Fig. 6

Left: initial temperature state, right: final temperature state of the reference twin

Fig. 7

From left to right: reconstruction of the initial (top row) and final (bottom row) state after i = 0, 5, 10 iterations; far right: reference states

3.6 Matrix-Free Algorithms

Matrix-free approaches are an essential component of ultra-high resolution finite element simulations, for two reasons. First, the classical finite element workflow of assembling the global system matrix from the local element contributions and then solving the resulting linear system with algorithms implemented in a matrix-vector fashion leads to enormous data traffic between main memory and the computational units (CPU cores), which constitutes a severe performance bottleneck. Second, the cost of holding the matrix in main memory becomes prohibitive. Note that for our simulations with \(10^{13}\) DoFs, see [43] and Sect. 3.3, the solution vector alone requires 80 TByte of memory. Thus, matrix-free methods, which do not assemble the global matrix but only provide the means to evaluate the associated matrix-vector product, receive increasing attention, see e.g. [5, 20, 66,67,68, 78, 92]. An overview of the history and the different approaches can be found e.g. in [9].

We note that the concept of hierarchical hybrid grids was, from its inception, designed to be matrix-free. Based on its original premise that the macro mesh resolves both geometry and material parameters, only a single discretization stencil had to be computed and stored for each macro primitive, as opposed to one for every fine mesh node, which would correspond to one row of the global matrix [13]. For the case of locally varying material parameters, such as the viscosity in mantle convection, this could be extended using classical local element matrix based approaches, see [9, 14, 37]. However, this approach does not carry over to the case of non-polyhedral domains and/or higher order elements. Thus, we developed alternative and also more efficient approaches.

3.6.1 Matrix-Free Approaches Based on Surrogate Polynomials

In our HHG and HyTeG frameworks the coupling between DoFs is expressed in the form of a stencil, as in classical finite differences. Due to the local regularity of the refined block-structured mesh, the stencil pattern is invariant across an individual macro primitive. For example, in the case of P1 elements and a scalar equation in 3D we obtain a 15-point stencil for all nodes within a volume primitive or on a face primitive. However, in the case of a curved domain and/or locally variable material parameters the stencil weights are non-constant over a macro primitive. These weights can be computed on-the-fly by standard quadrature whenever they are needed to apply the stencil locally, e.g. during a smoothing step. Note that, at least for P1 elements, this is theoretically still faster than the more sophisticated approach of fusing quadrature and application of the local element matrix, see [66]. However, it is significantly slower than the original one-stencil-per-primitive scenario. Thus, in [8] we introduce a new two-scale approach that employs surrogate polynomials to approximate the stencil weights. The idea, in a nutshell, is to consider the individual stencil weights as functions on each primitive. These functions are sampled, in a setup phase, at a certain number of sampling points, typically all nodes of the primitive on a coarser level of the mesh hierarchy. Then an approximating polynomial is constructed for each weight by determining the polynomial coefficients through a standard least-squares fit. Whenever a stencil weight is needed, the associated polynomial is evaluated. Performance-wise this raises two questions: how much memory is required for storing the polynomial coefficients, and what is the cost of evaluating the polynomial compared to on-the-fly quadrature? Concerning the first question, the memory footprint even in 3D is not too large for polynomials of moderate degree. Consider as an example a trivariate polynomial of degree five; it has 56 coefficients. Representing a 15-point stencil by such polynomials for all nodes of a volume primitive thus requires only 6720 bytes. Note also that the number of polynomials needed can be reduced by symmetry arguments and the zero row-sum property of consistent weak differential operators, see [8, 30]. Thanks to the logical structure of the mesh inside our volume and face primitives, evaluation of a polynomial can be performed lexicographically along lines. It thus reduces to a 1D problem, which can be executed efficiently using incremental updates based on the concept of divided differences, see [8, 30].
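A one-dimensional toy version of the surrogate idea is sketched below: a synthetic smooth stencil-weight function is sampled on a coarse level, fitted by a low-degree polynomial via least squares, and the polynomial is evaluated in place of quadrature on the fine level. The function, degree, and sampling points are illustrative assumptions.

    import numpy as np

    def stencil_weight(x):
        """Synthetic smooth stencil weight as a function of the node position;
        in HHG/HyTeG the weights vary over a macro primitive due to the
        geometry blending and the material parameters."""
        return 2.0 + 0.3 * np.sin(2.0 * x) + 0.1 * x ** 2

    # Setup phase: sample on a coarse level and fit the surrogate polynomial.
    degree = 2
    x_coarse = np.linspace(0.0, 1.0, 9)
    coeffs = np.polynomial.polynomial.polyfit(x_coarse, stencil_weight(x_coarse), degree)

    # Application phase: evaluate the surrogate wherever the weight is needed.
    x_fine = np.linspace(0.0, 1.0, 129)
    surrogate = np.polynomial.polynomial.polyval(x_fine, coeffs)
    print("max surrogate error:", np.max(np.abs(surrogate - stencil_weight(x_fine))))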

In [8] we conducted an extensive performance study of the new approach for the 3D Poisson problem on different curved domains, which we relate to a polyhedral domain by means of a mapping function. Thus, we are in fact solving a diffusion equation with a non-constant symmetric positive definite material tensor. As an example, Fig. 8 shows results for a weak scaling experiment on SuperMUC. Here we consider a V(3,3) cycle and assign 16 macro elements to one core; on the finest grid there are \(3.3 \times 10^5\) DoFs per volume primitive. We see that the surrogate operator approach (using quadratic polynomials) clearly outperforms the on-the-fly quadrature and is only about a factor of two more expensive than the original one-stencil-per-primitive approach in HHG, which cannot capture the domain curvature. Additionally, we study the influence of the approximation on the discretization error and on the multigrid convergence rate. The former is influenced by the ratio between the mesh size of the macro mesh and the solution level, by the sampling level, and by the polynomial degree, while the influence on the multigrid convergence rate is essentially negligible, see Table 8. In [30] we present a theoretical analysis of the consistency error of the surrogate operator approach and extend our numerical experiments to more challenging differential operators using problems from linear elasticity and non-linear p-Laplacian diffusion.

Fig. 8

Weak scaling behavior of a V(3, 3) cycle for different approaches: on-the-fly quadrature (IFEM), surrogate polynomials (LSQP), surrogate polynomials combined with double discretization (LSQP DD) and surrogate polynomials after node-level performance optimization (LSQP OPT); for comparison (CONS) gives times for the one-stencil-per-primitive approach that neglects curvature

Table 8 Discretization error in the \(L^2\)-norm for IFEM and LSQP (with different polynomial degrees q) and ratio of the asymptotic multigrid convergence rates for the curved pipe domain (level ℓ = 0 refers to an already twice refined mesh)

In mantle convection models the viscosity may vary significantly on a local scale, i.e. within a volume primitive. In this case, an approximation of the stencil weights by low-order polynomials as in [8] is no longer sufficient. One might try higher-order polynomials, as implemented in HyTeG for the tests presented in [30], although this is currently only available in 2D. Alternatively, we devised a variant of our approach, see [10, 11], denoted in the following as LSQPe. Here, instead of directly approximating the stencil weights, we replace the entries of the local element matrices by surrogate polynomials. The stencil weights are then computed from these surrogate element matrices in the standard fashion for the P1 case. This variant was used to perform the dynamic topography simulations of Sect. 3.10. As an example, Table 9 shows weak scaling results on SuperMUC Phase 1 for a mantle convection model with a temperature-dependent viscosity and a global resolution of 1.7 km for the largest scenario. Run-times are given for a V-cycle with an Uzawa smoother; the stopping criterion is a residual reduction of five orders of magnitude. For complete details we refer to [11]. Note the parallel efficiency of 83% for the largest run, which could be further improved by using a more sophisticated coarse grid solver, see Sect. 3.3, than the one available at the time of the experiment.

Table 9 Weak scaling of the LSQPe approach using ca. \(2.3 \times 10^7\) DoFs per core. Shown are the average run-time with and without coarse grid solver (c.g.) for one UMG cycle and the number of UMG iterations; values in brackets give the number of c.g. iterations (preconditioner/MINRES); the parallel efficiency w.r.t. one UMG cycle is given for timings with and without c.g.; additionally the average time for a single residual application on the finest level is given

3.6.2 A Stencil Scaling Approach for Accelerating Matrix-Free Finite Element Implementations

In [9] we present a novel approach to fast on-the-fly low-order finite element assembly for scalar elliptic partial differential equations of Darcy type with variable coefficients, optimized for matrix-free implementations. In this approach, we introduce a new operator that is obtained by scaling the reference operator, i.e. the stencil obtained for the constant-coefficient case. Assuming sufficient regularity, an a priori analysis shows that the solutions obtained by this approach are unique and converge with asymptotically optimal order in the \(H^1\)-norm and the \(L^2\)-norm on hierarchical hybrid grids. These considerations motivate the scaling approach as a way to recompute the stencil entries for a matrix-free solver at significantly reduced cost.
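The flavor of the scaling idea can be conveyed by a one-dimensional toy: the off-diagonal entries of the constant-coefficient reference stencil are scaled with the mean of the nodal coefficient values of the two coupled nodes, and the diagonal is recovered from the zero row-sum property. In this 1D setting the scaled stencil happens to coincide with the assembled P1 stencil; the sketch is an illustration of the principle, not the 3D construction of [9].

    import numpy as np

    # 1D toy: -d/dx( a(x) du/dx ) with P1 elements on a uniform grid of size h.
    n, h = 17, 1.0 / 16
    x = np.linspace(0.0, 1.0, n)
    a = 1.0 + x                                  # nodal values of the coefficient
    ref_off = -1.0 / h                           # off-diagonal of the a=1 reference stencil

    def scaled_stencil(i):
        """Off-diagonals scaled by nodal coefficient means; diagonal from zero row sum."""
        left = 0.5 * (a[i - 1] + a[i]) * ref_off
        right = 0.5 * (a[i] + a[i + 1]) * ref_off
        return left, -(left + right), right

    def assembled_stencil(i):
        """Exactly assembled P1 stencil (element-wise averaged coefficient)."""
        a_left, a_right = 0.5 * (a[i - 1] + a[i]), 0.5 * (a[i] + a[i + 1])
        return -a_left / h, (a_left + a_right) / h, -a_right / h

    for i in (1, n // 2, n - 2):
        print(i, np.allclose(scaled_stencil(i), assembled_stencil(i)))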

To demonstrate the advantages of our novel scaling approach we consider, among other examples, a Poisson problem on a domain defined through a blending function. Specifically, we consider a half cylinder mantle with inner radius \(r_1 = 0.8\), outer radius \(r_2 = 1.0\), height \(z_1 = 4.0\), and an angular coordinate between 0 and π as our physical domain \(\Omega_{\mathrm{phy}}\). The cylinder mantle is additionally warped inwards by \(w(z) = 0.2\sin \left (z\pi /z_1\right )\) in axial direction. The blending map \(\Phi : \Omega_{\mathrm{phy}} \to \Omega\) relates the physical domain to the reference domain \(\Omega = (r_1, r_2) \times (0, \pi) \times (0, z_1)\), and the resulting mapping tensor K of the Poisson problem is symmetric and positive definite. In addition to the geometry blending, we use a variable material parameter a(x, y, z) = 1 + z. This yields the following PDE on the reference domain Ω: \(-\mathrm {div} \left ( a K \nabla u \right )= f\). The manufactured solution is chosen as

$$\displaystyle \begin{aligned} u(x,y,z) = \sin\left(\frac{x-r_1}{r_2-r_1} \pi \right) \cos\left(4y\right) \exp\left({z}/{2}\right)\enspace. \end{aligned}$$

For our numerical experiments, we employ a macro mesh composed of 9540 hexahedral blocks, where each block is further split into six tetrahedral elements. The resulting system is solved on SuperMUC Phase 2 using 14,310 compute cores, i.e. four macro elements are assigned per core. The largest linear system solved corresponds to the finest refinement level reported in Table 10. We employ a multigrid solver with a V(3,3) cycle, and the iterations are stopped when the residual has been reduced by a factor of \(10^{-8}\). In Table 10, we report the resulting discretization error and its estimated order of convergence (eoc), the asymptotic multigrid convergence rate ρ, and the time-to-solution for different refinement levels ℓ. Note that we restricted the scaling approach in this test to volume and face primitives, which contain the vast majority of DoFs.

Table 10 Results for the large-scale 3D application with errors measured in the discrete \(L^2\)-norm

These results demonstrate that the new scaling approach reproduces the discretization error, as expected from our variational crime analysis [8, 9, 30]. Additionally, the multigrid convergence rate is not affected. For larger refinement levels ℓ, the run-time is only about 30% of that required by the nodal integration approach.

3.6.3 Stencil Scaling for Vector-Valued PDEs with Applications to Generalized Newtonian Fluids

Our target problem is vector-valued; we therefore extended the scalar stencil scaling idea from Sect. 3.6.2 and developed a similar matrix-free approach for vector-valued PDEs [31]. The construction is again based on hierarchical hybrid grids, the conceptual basis of the HHG and HyTeG [63] frameworks. Vector-valued second-order elliptic PDEs play an important role in mathematical modeling and arise e.g. in elastostatics and fluid dynamics. Numerical experiments indicated that the scalar stencil scaling (denoted below as unphysical scaling) cannot be applied directly to these equations, because the standard finite element solution is not reproduced, even in the case of linear coefficients. Thus, a modified stencil scaling method (denoted below as physical scaling) is needed that is also suited for matrix-free finite element implementations on hierarchical hybrid grids. It turns out that this vector-valued scaling requires the computation of an additional correction term. While this makes it more complicated and expensive than the scalar stencil scaling, it reproduces the standard finite element solution while requiring only a fraction of the time to obtain it. In the best scenario, we observed a speedup of about 122% compared to standard on-the-fly integration. Our largest example involved solving a Stokes problem on 12,288 compute cores.

One of the examples studied is a non-linear incompressible Stokes problem where the fluid is assumed to be of generalized Newtonian type, modeled by a shear-thinning Carreau viscosity law of the standard form

$$\displaystyle \begin{aligned} \eta(\dot\gamma) = \eta_\infty + (\eta_0 - \eta_\infty)\big(1 + (\lambda \dot\gamma)^2\big)^{\frac{n-1}{2}}, \end{aligned}$$

where \(\dot\gamma\) denotes the shear rate. The parameters employed are given in Table 11 in dimensionless form. The values stem from experimental results, cf. [34, Chapter II].

Table 11 Dimensionless parameters for the Carreau viscosity model

The computational domain Ω is depicted in Fig. 9. The channel has a length of 5 and a depth and height of 1; the kink starts at position x = 1.5. The domain is discretized by 14,208 tetrahedra on the coarsest level. The boundary \(\partial\Omega\) is decomposed into Dirichlet and Neumann parts, i.e. \(\partial\Omega = \Gamma_D \cup \Gamma_N\) with \(\Gamma_D = \{(x, y, z) \in \partial\Omega \mid x < 5\}\) and \(\Gamma_N = \partial\Omega \setminus \Gamma_D\). The volume force term f, the external forces \(\hat {\boldsymbol {t}}\), and the Dirichlet boundary term g are set to

$$\displaystyle \begin{aligned} \boldsymbol{f} = (0,0,100)^\top, \; \hat{\boldsymbol{t}} = (0,0,0)^\top, \; \text{and} \; \boldsymbol{g} = 16 \, y \, (1 - y) \, z \, (1 - z)\cdot(1, 0, 1)^\top. \end{aligned} $$

We solve this non-linear system by an inexact fixed-point iteration in which the underlying linear systems are solved only approximately to prevent over-solving.
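A minimal sketch of such an iteration is given below; the names are illustrative and not part of the HHG/HyTeG API, and coupling the inner tolerance to the current non-linear residual is only one possible strategy to avoid over-solving.

#include <functional>

// u is an abstract handle to the discrete velocity/pressure solution.
template <typename State>
void inexactPicard(State& u,
                   std::function<double(const State&)> nonlinearResidual,
                   std::function<void(State&, double)> solveLinearized,
                   double outerTol, int maxIt)
{
    for (int k = 0; k < maxIt; ++k)
    {
        const double res = nonlinearResidual(u);   // ||F(u_k)|| of the non-linear Stokes system
        if (res < outerTol)
            return;                                // non-linear convergence reached

        // Freeze the Carreau viscosity at u_k and solve the resulting linear
        // Stokes system only approximately: the inner tolerance follows the
        // current non-linear residual, so early outer iterations are not
        // over-solved.
        solveLinearized(u, 0.1 * res);
    }
}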

Fig. 9
figure 9

Velocity streamlines for the non-linear generalized Newtonian Stokes problem

In order to solve the linearized systems, we employ the inexact Uzawa solver presented in [29] with variable V(3,3) cycles, in which two smoothing steps are added on each coarser refinement level; this ensures convergence of the method.
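The resulting smoothing schedule can be expressed in a few lines; this is a direct sketch of the rule just described, with illustrative names only.

#include <cstdio>

// Number of pre-/post-smoothing steps on a given level: three on the finest
// level, plus two additional steps for every coarser level.
int smoothingSteps(int finestLevel, int level, int nuFinest = 3, int increment = 2)
{
    return nuFinest + increment * (finestLevel - level);
}

int main()
{
    const int finestLevel = 6;
    for (int level = finestLevel; level >= 0; --level)
        std::printf("level %d: %d pre- and post-smoothing steps\n",
                    level, smoothingSteps(finestLevel, level));
    return 0;
}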

In Table 12, we present the relative errors of the physical and unphysical scaling approaches along the line θ = [0, 5] × {0.5} × {0.42} in the supremum norm, computed on the four times refined grid. It can clearly be seen that the physical scaling yields significantly better results, with errors smaller by three orders of magnitude. Further experiments indicate that the unphysical scaling does not converge to the standard finite element solution even after additional mesh refinements. We therefore conclude that the unphysical scaling is an inconsistent discretization for vector-valued problems.

Table 12 Relative errors along the line θ in the supremum norm for different scaling approaches

Table 13 presents the relative time-to-solution for the nodal integration and the physical scaling, measured on SuperMUC Phase 2. On the finest level, after 6 refinements, the time-to-solution of the physical scaling is about 81% of that of the nodal integration. Figure 9 shows the streamlines of the velocity within the computational domain, computed with the physical scaling.

Table 13 Relative time-to-solution comparison of the nodal integration and physical scaling approach for the non-linear generalized Stokes problem

3.7 Resilience

Extremely concurrent simulations may in the future require a software design that is resilient to faults of individual software and hardware components. The probability of faults in the underlying system increases with the number of parallel processes. Long-running simulations in particular may suffer from a shorter mean time between failures, and restarting applications with run-times of several hours or days consumes vast amounts of resources.

Therefore, fault tolerance techniques have become a research topic in preparation for possibly unreliable future exascale systems. In the TerraNeo project we have so far mainly focused on hard faults. To cope with the failure of a core or node, we distinguish between two categories of methods: one class relies on checkpointing techniques, where snapshots of the simulation are stored at regular intervals so that the state can be loaded upon failure. A second class comprises algorithm-based fault tolerance (ABFT) techniques, where (partially) lost data can be re-computed on-the-fly. For large-scale simulations, checkpointing techniques often suffer from poor serial and parallel performance if data is written to disk. However, there are approaches that rely solely on distributed, in-memory checkpointing, which can be combined with compression techniques [71] that were also explored in a student project [74]. This can lead to a flexible, fast, and scalable resilience method [64].
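As an illustration of the in-memory variant, the following sketch exchanges solution snapshots between partner processes so that each rank holds a copy of its partner's data. The partner mapping, the assumption of an even process count and equal local data sizes, and the omission of compression are simplifications; the code is not part of the TerraNeo software.

#include <mpi.h>
#include <vector>

// Each rank exchanges a snapshot of its solution data with a fixed partner
// rank, so that a copy survives the loss of either process.
void refreshCheckpoint(const std::vector<double>& localData,
                       std::vector<double>&       partnerCopy,
                       MPI_Comm                   comm)
{
    int rank = 0, size = 1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    const int partner = (rank + size / 2) % size;   // simple symmetric partner mapping

    partnerCopy.resize(localData.size());
    MPI_Sendrecv(localData.data(),   static_cast<int>(localData.size()),   MPI_DOUBLE, partner, 0,
                 partnerCopy.data(), static_cast<int>(partnerCopy.size()), MPI_DOUBLE, partner, 0,
                 comm, MPI_STATUS_IGNORE);
}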

In [53, 54] an alternative ABFT technique to provide resilience specifically for multigrid methods is developed. Given a faulty subdomain ΩF, created when a processor (core or node) crashes, the global solution can be recovered by solving a recovery problem on ΩF via a local multigrid iteration. After the recovery subproblem is solved, the global iteration continues.

A priori, it is not clear how many iterations will be required in the faulty subdomain. In [54] a fixed number of V-cycle iterations is employed. This can lead to under- or over-solving in ΩF. While under-solving leads to a poor recovery, over-solving inhibits the global convergence due to wrong values at the interface. The convergence behavior of both scenarios is illustrated in Fig. 10.

Fig. 10
figure 10

Convergence behavior for over- and under-solving in the faulty subdomain

To address this issue, the recovery algorithm is enhanced in [56] by an adaptive control mechanism. Instead of solving the subproblem with a fixed number of iterations, a hierarchically weighted error estimator is introduced to define a stopping criterion for the faulty subdomain. The estimator is designed to be well-suited to extreme-scale parallel multigrid settings as they are employed in applications. It guards the algorithm against over- or under-solving the subproblem and thus against wasting computational resources.
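The control flow of the adaptive recovery can be summarized by the following sketch; all interfaces are hypothetical and only illustrate the structure of the algorithm, not the actual HHG implementation.

// Conceptual sketch of the adaptive local recovery after a hard fault,
// in the spirit of [54, 56].
template <typename Subdomain>
void recoverFaultySubdomain(Subdomain& omegaF,
                            double     recouplingThreshold,   // cf. the threshold discussed in [56]
                            int        maxLocalCycles)
{
    // Re-initialise the lost part of the solution, e.g. from the frozen
    // interface values of the healthy domain and a tentative interior guess.
    omegaF.initializeFromInterface();

    for (int cycle = 0; cycle < maxLocalCycles; ++cycle)
    {
        // Local multigrid iteration, possibly accelerated by the
        // "superman" strategy (extra local parallelism).
        omegaF.localMultigridVCycle();

        // Hierarchically weighted error estimate; stop as soon as the local
        // solution is accurate enough, avoiding under- and over-solving.
        if (omegaF.hierarchicallyWeightedErrorEstimate() < recouplingThreshold)
            break;
    }

    // Resume the global multigrid iteration.
    omegaF.recoupleWithHealthyDomain();
}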

As already suggested in [54], a so-called superman strategy is employed, which increases the computing power locally in the faulty subdomain. The recovery computation is accelerated, e.g., through a further subdivision of ΩF by means of additional shared-memory parallelism. In Fig. 11 the development of the estimated error of a recovered solution is compared to the error of a fault-free scenario and a no-recovery scenario. Locally, the number of processes is increased by a factor of 4.

Fig. 11
figure 11

Comparison of the estimated error using the hierarchically weighted error estimator to the error in a fault-free and a recovery-free scenario. \(\kappa _\sigma ^{\mathrm {LRB}}\) denotes the local recoupling threshold; for details see [56]

We have also conducted studies with asynchronous recovery strategies. Here, the local problem on ΩF and the problem on the healthy domain \(\Omega \setminus \overline {\Omega }_F\) are solved concurrently, i.e., during the recovery process the multigrid iteration is continued in parallel in the healthy domain. Different strategies for choosing suitable boundary conditions for the healthy domain on the interface have been explored in [54]. Once the stopping criterion in the faulty domain is fulfilled, the local recovery iteration is terminated and a recoupling procedure is initiated.

To emphasize the applicability to large-scale scenarios, the adaptive approach is scaled in [56] to a simulation with about 6.9 × 10¹¹ DoFs and 245,766 processes on the JUQUEEN supercomputer. Additionally, the generality of the approach was demonstrated for a scenario in which multiple failures in different regions of the domain were triggered.

3.8 General Performance Issues

Simulations in geophysics require a high spatial resolution, which leads to problems with many degrees of freedom. The success of a geophysics simulation framework thus depends on its scalability and on a minimal memory overhead. In the previous sections we already presented scalability results highlighting the achievements of the TerraNeo project.

However, message-passing scalability is not the only metric that matters in high-performance computing. Even more important is the absolute performance, which depends critically on the performance of a single core or a single node. In the extreme-scale computing community, the relevance of systematic node-level performance engineering is increasingly being recognized, see e.g. [4, 45, 48, 60, 67]. Note that traditional scalability is easier to achieve when the node-level performance is lower. Publishing only scalability results without absolute performance measures is a frequently observed parallel computing cheat, as exposed in [2]. We emphasize that the TerraNeo project has in this sense profited from long-term research efforts in HPC performance engineering. These stem originally from the Dime projectFootnote 4 [65, 97] and have led to the excellent performance features of the HHG library [12, 13]. Thus, node-level performance in TerraNeo did not come as an afterthought imposed on existing code, but has been an a priori design goal with high priority.

Within TerraNeo, these techniques were employed continuously to analyze the performance and to guide the development of new matrix-free methods. This includes rather simple analyses, such as counting the floating-point operations required for a single stencil update, which allows comparing the measured performance to the maximum achievable performance of the given hardware. This type of analysis was performed in [10] and [11].

In [9], this metric was extended by an analysis of the memory traffic required to apply the stencil to a function, which, in terms of linear algebra, corresponds to a matrix-vector multiplication. The analysis provides a lower and an upper bound on the performance with respect to memory: one for the optimistic assumption that all required data resides in the last-level cache, and one for the pessimistic assumption that everything must be loaded from main memory. To evaluate this analysis, the Intel AdvisorFootnote 5 was used as an automatic tool to perform the well-established Roofline analysis [58, 103]. The results are given in Fig. 12.
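The underlying bound is easy to state: the attainable performance is the minimum of the peak floating-point rate and the product of arithmetic intensity and attainable bandwidth. The following minimal sketch evaluates this bound for an optimistic (cache) and a pessimistic (main memory) bandwidth; all numbers are placeholders, not measured values for SuperMUC.

#include <algorithm>
#include <cstdio>

// Attainable performance according to the Roofline model.
double rooflineGflops(double flopsPerUpdate, double bytesPerUpdate,
                      double peakGflops,     double bandwidthGBs)
{
    const double intensity = flopsPerUpdate / bytesPerUpdate;   // flop / byte
    return std::min(peakGflops, intensity * bandwidthGBs);
}

int main()
{
    const double flops = 27.0, bytes = 64.0;   // illustrative cost of one stencil update

    // Optimistic bound: operands served from the last-level cache;
    // pessimistic bound: every operand streamed from main memory.
    std::printf("cache-bound estimate : %.1f GFlop/s\n", rooflineGflops(flops, bytes, 800.0, 300.0));
    std::printf("memory-bound estimate: %.1f GFlop/s\n", rooflineGflops(flops, bytes, 800.0, 100.0));
    return 0;
}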

Fig. 12
figure 12

Roofline model for different approaches presented in [9]

In cooperation with the HPC group of the RRZE in Erlangen, a more sophisticated analysis was performed in [8, 40, 41], where the Execution-Cache-Memory (ECM) model [48] was used to examine the performance.

The Roofline model assumes that a single data location is the bottleneck. This can be the main memory or any of the cache levels, represented by the four different limits in Fig. 12. The ECM model, however, takes into account the main memory, all levels of cache, and finally the registers within the CPU. The model helped to design the new computational kernels by identifying performance bottlenecks and guiding subsequent optimization steps. The end result of this development procedure are kernels for which the analytically predicted optimal duration of a stencil-based update is in good agreement with measurements of the real program code. Here, a deviation of only ≈15% is considered acceptable. Figure 13 shows the occupancy of the execution units within the CPU, as derived from the performance models.

Fig. 13
figure 13

Duration of the different phases involved during eight stencil-based updates as obtained by the ECM model for the newly developed kernel in [8]

3.9 HyTeG

We successfully demonstrated that the combination of a fully unstructured grid with regular refinement and matrix-free algorithms can be used to solve large geodynamical problems. A substantial amount of work in the TerraNeo project was performed using the HHG prototype. Even though the latter shows excellent performance and scalability, certain issues must be faced. Since the codebase was created over a decade ago, its coding standards cannot live up to current requirements. This makes it hard to maintain the framework and to familiarize new users with it, which is essential for a community code. Furthermore, the prototype was never designed for higher-order discretizations or for elements other than nodal ones.

Therefore, a redesign was initiated under a new name: Hybrid Tetrahedral Grids (HyTeG) [63]. In addition to building on the knowledge gathered over the years from developing the HHG framework, we could also utilize ideas and infrastructure from the waLBerla framework [44]. One of the fundamental changes in HyTeG is the strict separation of the data structures that define the macro mesh geometry from the actual simulation data. As in HHG, the tetrahedra of the unstructured macro mesh are split into their geometric primitives, namely volumes, faces, edges, and vertices. The lower-dimensional primitives (faces, edges, and vertices) are used to decouple the volume primitives in terms of communication. Contrary to HHG, the parallel partitioning is not based solely on volume primitives; instead, all primitives are partitioned between processes. This allows, e.g., the computational and memory footprint of face primitives to be taken into account for load balancing, for which sophisticated tools such as ParMetis [61] can be employed. Note that in HyTeG the partitioning does not involve global data structures, an essential aspect when entering the exascale era. As with HHG, C++ was chosen as the primary programming language due to its high performance and its wide adoption in the HPC community. Up to this point, no compute data are associated with the primitives. In a second step, such data can be attached to them: this might be a single value for only some of the primitives, or a full hierarchy of grids for every single primitive. This separation also allows for efficient light-weight static and dynamic load balancing techniques similar to those described in [32, 96].
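To illustrate this separation, the following sketch shows one possible way to organize such data structures; the class and member names are illustrative and do not reflect the actual HyTeG code.

#include <cstdint>
#include <map>
#include <memory>
#include <vector>

enum class PrimitiveType { Vertex, Edge, Face, Volume };

// Topological/geometric description only: no simulation data is stored here.
struct Primitive
{
    PrimitiveType               type;
    std::uint64_t               id;
    int                         owningProcess;        // every primitive type is partitioned
    std::vector<std::uint64_t>  neighborPrimitives;   // lower-dimensional primitives decouple the volumes
};

// Simulation data attached in a second step, e.g. one array of unknowns per
// refinement level of the block-structured grid inside the primitive.
struct GridHierarchyData
{
    std::vector<std::vector<double>> unknownsPerLevel;
};

struct PrimitiveStorage
{
    std::map<std::uint64_t, Primitive>                          primitives;    // macro-mesh structure
    std::map<std::uint64_t, std::shared_ptr<GridHierarchyData>> attachedData;  // added after partitioning
};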

As mentioned above, one major goal in moving from HHG to HyTeG is the realization of different discretizations on hierarchical hybrid grids. In addition to unknowns located at the vertices of each micro-element of the mesh, unknowns on the edges, on the faces, and inside the elements are also supported. These basic types of unknowns are implemented in such a way that different discretizations can be realized by combining them, as illustrated in Fig. 14. Using this technique, a Taylor-Hood discretization for Stokes flow simulations can be realized, as well as finite volume discretizations to simulate energy transport [63].
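Conceptually, the composition can be pictured as follows; the type names are illustrative and need not match the actual HyTeG classes. A quadratic (P2) velocity component combines vertex and edge unknowns, while a linear (P1) pressure uses vertex unknowns only.

// Unknowns located at micro-vertices and micro-edges of the refined macro primitives.
struct VertexDoFFunction { /* ... */ };
struct EdgeDoFFunction   { /* ... */ };

// P1 functions use vertex unknowns only; P2 functions add edge unknowns.
struct P1Function { VertexDoFFunction vertexDoFs; };
struct P2Function { VertexDoFFunction vertexDoFs; EdgeDoFFunction edgeDoFs; };

// A Taylor-Hood pair for Stokes flow: P2 velocity components, P1 pressure.
struct TaylorHoodFunction
{
    P2Function velocity[3];
    P1Function pressure;
};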

Fig. 14
figure 14

Basic types can be combined to create various finite element discretizations

By introducing different types of discretizations, many more compute-intensive kernels need to be taken care of. The solution chosen in HyTeG for this problem is code generation. Inspired by the work in ExaStencils [76, 94] and our joint collaboration in this direction [15], and using experience from pystencils,Footnote 6 this can be realized efficiently. One difference to other projects that use whole-program generation, however, is that only the compute-intensive kernels are generated, not the surrounding data structures. In contrast to HyTeG itself, pystencils uses Python, which offers much more flexibility and metaprogramming capabilities. The generated kernels, however, are in C++, which eliminates the need for a Python environment when running HyTeG on supercomputers.
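To give an impression of the kernel/data-structure split, the following is a strongly simplified, hand-written stand-in for such a generated kernel; the real kernels operate on 15-point stencils of regularly refined tetrahedral blocks and use a different indexing, so this is only an illustration.

// A constant-coefficient stencil applied to the interior of one block; the
// outer data structures, boundary handling, and communication remain in the
// hand-written framework code.
inline int idx(int i, int j, int k, int nx, int ny) { return i + nx * (j + ny * k); }

void applyConstantStencil(const double* src, double* dst, const double* w,
                          int nx, int ny, int nz)
{
    for (int k = 1; k < nz - 1; ++k)
        for (int j = 1; j < ny - 1; ++j)
            for (int i = 1; i < nx - 1; ++i)
                dst[idx(i, j, k, nx, ny)] =
                      w[0] * src[idx(i,     j,     k,     nx, ny)]
                    + w[1] * src[idx(i - 1, j,     k,     nx, ny)]
                    + w[2] * src[idx(i + 1, j,     k,     nx, ny)]
                    + w[3] * src[idx(i,     j - 1, k,     nx, ny)]
                    + w[4] * src[idx(i,     j + 1, k,     nx, ny)]
                    + w[5] * src[idx(i,     j,     k - 1, nx, ny)]
                    + w[6] * src[idx(i,     j,     k + 1, nx, ny)];
}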

Another important feature introduced with the community code in mind is a state-of-the-art continuous integration process. With every change to the codebase, an automated process is started that ensures compatibility with different hardware and software configurations. Additionally, extensive testing is part of the pipeline to guarantee the correctness of all features inside HyTeG. To illustrate the complexity of this process, Fig. 15 shows all possible combinations of configurations.

Fig. 15
figure 15

Each combination of one leaf of every color represents one possible configuration

3.10 Asthenosphere Velocities and Dynamic Topography

In order to demonstrate the flexibility of the different algorithmic components, we have studied selected application questions. In [102] we investigated the question of flow speeds in the asthenosphere. The latter is a mechanically weak layer in the upper mantle right below the lithosphere, the rigid outermost shell of our planet. It plays an important role in convection modelling as it is distinguished from the lower parts of the mantle by a significantly lower viscosity. The precise details of the contrast are, however, unknown. A variety of geologic observations and geodynamic arguments indicates that velocities in the upper mantle may exceed those of tectonic plates by an order of magnitude [49, 51]. The framework was used to simulate high-resolution whole-Earth convection models with asthenospheric channels of varying thickness. Reduction of the asthenospheric thickness is balanced by an associated reduction in its viscosity, following the Cathles parameter [88]. For the tested end-member case, this resulted in a set-up with an asthenosphere channel of only 100 km depth and a significant viscosity contrast of up to four orders of magnitude relative to the deeper mantle. We found a velocity increase by a factor of 10 between a reference case with an upper mantle of 1000 km depth and the very thin channel end-member case, translating into speeds of ≈ 20 cm/a within the narrow asthenosphere. Our suggested and numerically verified Poiseuille flow model predicts that the upper mantle velocity scales with the inverse of the asthenospheric thickness.
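This scaling can be made plausible by a back-of-the-envelope Poiseuille argument (a sketch under the assumption that the driving pressure gradient is roughly independent of the channel thickness; the full channel-flow model is developed and verified in [102]): for a channel of thickness \(d\) and viscosity \(\eta\), the mean horizontal velocity obeys

$$\displaystyle \begin{aligned} \bar{u} \;\propto\; \frac{d^2}{\eta}\,\left|\frac{\partial p}{\partial x}\right|, \qquad \frac{d^3}{\eta} = \text{const} \;\Longrightarrow\; \eta \propto d^3 \;\Longrightarrow\; \bar{u} \propto d^{-1}, \end{aligned} $$

so keeping the Cathles-type parameter \(d^3/\eta\) fixed while thinning the channel increases the velocity inversely with the thickness.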

Note that the prototype implemented in HHG already allows real-world geophysical data to be included. The model presented in Fig. 16, e.g., uses present-day plate velocities, see [83], as surface boundary conditions and a buoyancy term based on present-day temperature and density fields, converted from a seismic tomography model via techniques described in [27]. The viscosity depends on temperature and includes a jump at the bottom of the asthenosphere, chosen at a depth of 660 km.

Fig. 16
figure 16

Mantle convection model computed in HHG with real-world geophysical data: temperature field from seismic tomography for viscosity (left) and resulting flow speeds (right); quantities are non-dimensional

Another geodynamic quantity we studied is dynamic topography. This term, which plays a crucial role in our understanding of long-term sea-level variations, refers to deflections of the Earth’s surface in response to viscous stresses in the mantle, and has been known from geodynamic arguments for a long time [47, 87]. The precise magnitude of dynamic topography is still debated, however, due to uncertainties in measuring it [18]. Dynamic topography occurs both at the surface and at the core-mantle boundary (CMB); see [26] for a review. In [10] we studied the influence of lateral viscosity variations on the amplitude and pattern of dynamic topography. We first considered a standard benchmark with a purely radially varying viscosity profile, see [23, 105], for which a semi-analytic solution exists in terms of the propagator matrix technique [47, 85, 87]. This allowed us to verify the correctness of our LSQPe approach from Sect. 3.6. We then compared models with increasing levels of complexity, see Fig. 17, going up to a profile with lateral viscosity variations due to the varying thickness of the lithosphere and temperature-dependent variations in the lower mantle. Full details and a discussion of the outcomes can be found in [10].

Fig. 17
figure 17

Surface dynamic topography for cases: A—viscosity with only radial variations, B—with the addition of viscosity variations due to varying thickness of the lithosphere, C—with additional temperature-dependent viscosity in the lower mantle (left) and pairwise difference between scenarios (right)

4 Conclusions and Future Work

TerraNeo is a project in Computational Science and Engineering [91]. We successfully developed innovative scientific computing methods for the exascale era. To this end, new algorithms were devised for simulating Earth Mantle convection. We analyzed the accuracy of these methods and their cost on massively parallel computers, showing that our methods exhibit excellent performance. A highly optimized prototype implementation was designed using the HHG multigrid library. Based on HHG, we demonstrated computations with an unprecedented resolution of the Earth’s Mantle using real-world geophysical input data.

In TerraNeo, many new methods for future exascale geodynamic simulations were invented: they include new solver components for monolithic multigrid methods for saddle point problems, specially designed smoothers, and new strategies to solve the coarse grid problems. Two new classes of matrix-free methods were introduced and analyzed, one based on surrogate polynomials, the other based on stencil scaling. New scheduling algorithms for parallel multilevel Monte Carlo methods and uncertainty quantification were studied. Methods for fault tolerance were investigated, including methods based on in-memory checkpointing as well as new methods for the fast algorithmic reconstruction of lost data when hard faults occur. Inverse problems in mantle convection were studied using adjoint techniques.

Careful parallel performance studies complement our work and assess the suitability of the algorithmic components on future exascale computers. Many of the methods were tested on application-oriented problems, such as a geophysics study on the relation between the thickness of the asthenosphere and upper-mantle velocities, and on the influence of viscosity variations on dynamic topography.

We found that conventional C++-based implementation techniques lack expressiveness and flexibility for massively parallel, highly optimized codes, and that achieving performance portability is a major difficulty. Starting from this insight, we invented new programming techniques based on automatic program generation, learning from neighboring SPPEXA projects such as ExaStencils. These ideas have already been realized in HyTeG and waLBerla as new and innovative simulation software architectures for multiphysics exascale computing.

Thus, some aspects of our research in TerraNeo are of a mathematical nature, while others fall into the field of computer science. Our methods have also already been used to perform research in geophysics, the target discipline of TerraNeo. However, we point out that the research contribution of TerraNeo neither falls into the intersection of all these fields, nor can it be understood from the viewpoint of mathematics or computer science alone. Nor is it a project in the geosciences, since its primary goal is not the creation of new geophysical insight. The goal of TerraNeo is the construction and analysis of simulation methods that are the enabling technologies for future exascale research in geodynamics.

This result could not have been reached by any of the classical disciplines alone. The synthesis of knowledge and methods from mathematics, computer science, and geophysics becomes more than the sum of its parts. We emphasize, additionally, that our research in computational science and engineering is not restricted to applications in geophysics. Many of the innovations developed in TerraNeo can be transferred to other target disciplines.

With our new computational methods, we have broken previously existing barriers to computational performance. This is demonstrated, for example, by our solution of indefinite linear systems with in excess of 10¹³ degrees of freedom. To our knowledge, this constitutes the largest finite element computation published to date, even though it was still performed on JUQUEEN, a machine that is meanwhile outdated and has been retired.

TerraNeo funding in the last period was reduced from the desired four positions to only three, so that the new software design and its development could not be completed as originally proposed. A preliminary version of the TerraNeo software will be made public and will contain essential parts of the promised core functionality, but it still lacks most of the application-oriented functionality that would make it fully usable as a geophysics community code. Efforts will be made to continue the development so that the central research goal of TerraNeo can be reached.