
1 Introduction

It is increasingly common to identify simulation as the third pillar of science, together with theory and experimentation. Parallel computers provide the computing power required by the most demanding of these simulations. The complexity and heterogeneity of these architectures do, however, force scientists to write complex code (using vectorization, parallelization, accelerator-specific languages, etc.). These optimizations heavily depend on the target machine, and the whole code has to be adapted whenever it is ported to a new architecture.

As a result, scientists have to become experts in the art of computer optimizations in addition to their own domain of expertise. It is very difficult in practice to maintain a code targeting multiple distinct architectures. One fundamental cause for this situation is the tight interleaving of two distinct concerns imposed by most programming models. On the one hand, the algorithm comes from the expertise of the domain scientist and does not depend on the target architecture. On the other hand, optimization is the expertise of optimization specialists and has to be adapted for every new architecture.

Many approaches have been proposed to improve this situation in the form of libraries or languages. Approaches based on automated optimization processes typically isolate the algorithmic aspects very well but restrict their domain of applicability and/or the range of supported optimizations. Approaches based on optimization tools and libraries enable optimization specialists to express common optimizations very efficiently but leave others mixed with the algorithm.

In this paper, we propose the Independent Kernel Scheduling (\(\textsc {InKS}_{\textsf {}}\)) programming model to separate algorithm from optimization choices in HPC simulation codes. We define the \(\textsc {InKS}_{\textsf {pia}}\) language used to express the algorithm of an application independently of its optimization. Such a program can be optimized with the \(\textsc {InKS}_{\textsf {O}}\) family of domain-specific languages (DSLs) for common optimizations or C++ for less common optimizations.

This paper makes the following contributions: (1) it defines the \(\textsc {InKS}_{\textsf {}}\) programming model and its platform-independent algorithmic language \(\textsc {InKS}_{\textsf {pia}}\); (2) it proposes an implementation of \(\textsc {InKS}_{\textsf {}}\) with two optimization DSLs, \(\textsc {InKS}_{\textsf {O/Loop}}\) and \(\textsc {InKS}_{\textsf {O/XMP}}\); and (3) it evaluates the approach on the synthetic NAS parallel benchmarks [3] and on the solving of the 6D Vlasov-Poisson equation with a semi-Lagrangian method.

The remainder of the paper is organized as follows. Section 2 presents and analyzes related work. Section 3 describes the \(\textsc {InKS}_{\textsf {}}\) programming model and its implementation. Section 4 presents the 6D Vlasov-Poisson problem and its implementation using \(\textsc {InKS}_{\textsf {}}\), while Sect. 5 evaluates the approach. Finally, Sect. 6 concludes the paper and identifies some perspectives.

2 Related Work

Many approaches are used to implement optimized scientific simulations. A first, widely used approach is based on imperative languages such as Fortran, C or C++. Libraries like MPI extend this to distributed memory with message passing. Abstractions very close to the actual execution machine make fine-tuning possible to achieve good performance on any specific architecture. This does, however, require encoding complex optimizations directly in the code. As there is no language support to separate the algorithm from architecture-specific optimizations, tedious efforts have to be applied [10] to support performance portability. Algorithm and optimizations are instead often tightly bound together in codes.

A second approach is thus offered by tools (libraries, frameworks or language extensions) that encode classical optimizations. OpenMP [5] or Kokkos [4] support common shared-memory parallelization patterns. For example, Kokkos offers multidimensional arrays and iterators for which efficient memory mappings and iteration orders are selected independently. Similarly, UPC [8] or XMP [14] support the partitioned global address space paradigm. In XMP, directives describe array distribution and communications between nodes. These tools offer productivity gains when the optimization patterns they offer fit the requirements. The separation of optimizations from the main code base also eases porting between architectures. Even if expressed more compactly, optimizations do however remain mixed with the algorithm and only cover part of the requirements.
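The kind of separation these tools offer can be sketched with an OpenMP directive: the parallelization choice is a one-line annotation next to the loop, but it still lives in the same source file as the algorithm. A minimal hypothetical example (not taken from any of the codes discussed here):

```cpp
#include <cassert>
#include <vector>

// The parallelization pattern (the pragma) is compactly separated from the
// loop body, yet remains physically mixed with the algorithm in the same file.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)  // ignored when OpenMP is disabled
    for (long i = 0; i < (long)a.size(); ++i)
        s += a[i] * b[i];
    return s;
}
```

Without OpenMP enabled, the pragma is ignored and the loop runs sequentially with identical results, which is precisely why such annotations ease porting between architectures.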

A third approach pushes this further with tools that automate the optimization process. For example, PaRSEC [11] or StarPU [1] support the many-tasks paradigm. In StarPU, the user expresses their code as a DAG of tasks with data dependencies that is automatically scheduled at runtime depending on the available resources. Other examples are SkeTo [18] or Lift [16], which offer algorithmic skeletons. Lift offers a limited set of parallel patterns whose combinations are automatically transformed by an optimizing compiler. Automating optimization improves productivity and clearly separates these optimizations, which improves portability. These tools do not, however, cover the whole range of potential optimizations, such as the choice of work granularity inside tasks. The algorithm remains largely interleaved with optimization choices even with this approach.
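The many-tasks idea can be illustrated independently of any specific runtime. In the sketch below (purely illustrative; `Task` and `run` are hypothetical names, not the StarPU or PaRSEC API), tasks declare the data they read and write, and any execution order that respects the resulting DAG is valid:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// A task declares its data dependencies; the scheduler derives the DAG.
struct Task {
    std::vector<std::string> reads, writes;
    std::function<void()> body;
};

// Naive sequential "scheduler": repeatedly runs any task whose inputs are
// ready. A real runtime would instead dispatch ready tasks to cores or nodes.
void run(std::vector<Task>& tasks) {
    std::map<std::string, bool> ready;
    std::vector<bool> done(tasks.size(), false);
    bool progress = true;
    while (progress) {
        progress = false;
        for (std::size_t i = 0; i < tasks.size(); ++i) {
            if (done[i]) continue;
            bool ok = true;
            for (auto& r : tasks[i].reads) ok = ok && ready[r];
            if (!ok) continue;
            tasks[i].body();                          // dependencies satisfied
            for (auto& w : tasks[i].writes) ready[w] = true;
            done[i] = progress = true;
        }
    }
}
```

The declared dependencies, not the declaration order, determine the execution order: a task listed first but reading a value produced by a later task simply waits.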

A last approach is based on DSLs that restrict optimizations, such as Pochoir [17] or PATUS [6], DSLs for stencil problems. In Pochoir, the user only specifies a stencil (computation kernel and access pattern), boundary conditions and a space-time domain while all optimizations are handled by a compiler. DSLs restrict the developer to the expression of the algorithm only, while optimizations are handled independently. This ensures a very good separation of these aspects. The narrower the target domain is, the more efficient domain and architecture-specific optimizations are possible. However, it makes it less likely for the tool to cover the needs of a whole application. Real-world applications can fall at the frontier between the domains of distinct DSLs or not be covered by a single one. Performance then depends on the choice of DSLs to use and the best choice depends on the target architecture leading to new portability issues.
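What the user of such a stencil DSL specifies is essentially a point-wise update rule and a space-time domain; loop structure, memory layout and parallelization are left to the DSL compiler. A hypothetical plain-C++ rendering of that algorithmic content for a 1D periodic heat stencil (this is not Pochoir or PATUS syntax):

```cpp
#include <cassert>
#include <vector>

// The algorithmic content a stencil DSL captures: an update rule (the line
// computing v[i]) applied over a space-time domain with periodic boundaries.
std::vector<double> heat1d(std::vector<double> u, int steps, double alpha) {
    int n = (int)u.size();
    std::vector<double> v(n);
    for (int t = 0; t < steps; ++t) {              // time domain
        for (int i = 0; i < n; ++i) {              // space domain
            int l = (i + n - 1) % n, r = (i + 1) % n;   // periodic neighbors
            v[i] = u[i] + alpha * (u[l] - 2.0 * u[i] + u[r]);  // stencil rule
        }
        u.swap(v);
    }
    return u;
}
```

Everything except the inner update line is exactly the boilerplate a stencil DSL generates and optimizes on the user's behalf.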

To summarize, one can consider a continuum of approaches, from very general ones where the optimization process is manual to more and more domain-specific ones where the optimization process can be automated. The more general approaches support a large range of optimizations and application domains but yield high implementation costs and low separation of concerns and portability. The more automated approaches reduce implementation costs and offer good separation of concerns and portability but restrict the range of supported domains and optimizations. Ideally, one would like to combine all these advantages: (1) the domain generality of imperative languages, (2) the ease of optimization offered by dedicated tools and (3) the separation of concerns and performance portability offered by DSLs. The following section describes the \(\textsc {InKS}_{\textsf {}}\) programming model, which aims to combine these approaches to offer such a solution.

3 The \(\textsc {InKS}_{\textsf {}}\) Programming Model

This section describes the core of our contribution, the \(\textsc {InKS}_{\textsf {}}\) programming model. This approach is based on the use of distinct languages to express the algorithm and the optimization choices, thus enforcing their separation. The algorithm of the simulation consists of the set of values computed, the formula used to produce each of them, as well as the simulation inputs and outputs. We include in optimization choices all that is not the algorithm, such as the choice of a computing unit for each computation, their ordering, the choice of a location in memory for each value, etc. Multiple optimization choices can differ in performance, but the simulation results depend on the algorithm only.

Fig. 1. The \(\textsc {InKS}_{\textsf {}}\) model

The \(\textsc {InKS}_{\textsf {}}\) approach is summarized in Fig. 1. The \(\textsc {InKS}_{\textsf {pia}}\) language is used to express the algorithm with no concern for optimization choices. A compiler generates non-optimized, generic choices automatically from this specification for test purposes. The \(\textsc {InKS}_{\textsf {O}}\) family of DSLs is used to define common optimizations efficiently while C++ is used to describe arbitrarily complex optimizations. Many versions of the optimization choices can be devised to optimize for multiple targets.

The remainder of this section describes \(\textsc {InKS}_{\textsf {pia}}\) and proposes two preliminary \(\textsc {InKS}_{\textsf {O}}\) DSLs. \(\textsc {InKS}_{\textsf {O/XMP}}\) handles domain decomposition and inter-node communications while \(\textsc {InKS}_{\textsf {O/Loop}}\) focuses on efficient loops.

The \(\textsc {InKS}_{\textsf {pia}}\) language. In \(\textsc {InKS}_{\textsf {pia}}\) [2] (illustrated in Listing 1), values are stored in infinite multidimensional arrays based on dynamic single assignment (DSA: each coordinate can only be written once). The memory placement of each coordinate is left unspecified. Computations are specified by kernel procedures that (1) take as parameters data arrays and integer coordinates; (2) specify the coordinates they might read and will write in each array; and (3) define either a C++ or an \(\textsc {InKS}_{\textsf {}}\) implementation. An \(\textsc {InKS}_{\textsf {}}\) implementation defines kernel validity domains: the coordinates where C++ kernels can generate values in arrays. The kernel execution order is left unspecified. The simulation entry point is a kernel marked public. This specifies a parameterized task graph (PTG) [7]. This representation covers a large range of problems but imposes a few limitations. Most notably, all the problem parameters must be known at launch time; for example, it does not support adaptive meshes or adaptive time-steps. It is still possible to express these concerns outside \(\textsc {InKS}_{\textsf {pia}}\) and call the \(\textsc {InKS}_{\textsf {}}\) implementation multiple times with different parameters.
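The DSA discipline can be modeled in a few lines of C++ (an illustrative sketch only; `DsaArray` is a hypothetical name, not part of InKS): each coordinate of the logical array accepts exactly one write, so the value stored at a coordinate is independent of the order in which kernels execute.

```cpp
#include <array>
#include <cassert>
#include <map>

// Model of a dynamic-single-assignment "infinite" N-dimensional array:
// coordinates are allocated lazily and may be written at most once.
template <int N>
class DsaArray {
    std::map<std::array<long, N>, double> data_;
public:
    void write(std::array<long, N> c, double v) {
        assert(data_.count(c) == 0 && "DSA violation: coordinate written twice");
        data_[c] = v;
    }
    double read(std::array<long, N> c) const { return data_.at(c); }
};
```

Because a coordinate can never be overwritten, any schedule that writes values before they are read produces the same contents, which is what lets the optimization choices reorder kernels freely.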

The \(\textsc {InKS}_{\textsf {pia}}\) compiler [2] extracts C++ functions from \(\textsc {InKS}_{\textsf {pia}}\) kernels and generates a function with correct but non-optimized loops and memory allocations to execute them. Arbitrarily complex versions of these can also be written manually in plain C++ and rely on existing optimization tools. These functions can be called from any existing code whose language supports the C calling convention. However, that approach requires information present in \(\textsc {InKS}_{\textsf {pia}}\) to be repeated. The \(\textsc {InKS}_{\textsf {O}}\) DSLs thus interface optimization tools with \(\textsc {InKS}_{\textsf {pia}}\).

Listing 1.

The \(\textsc {InKS}_{\textsf {O/XMP}}\) Optimization Language. \(\textsc {InKS}_{\textsf {O/XMP}}\) (illustrated in Listing 2) handles distributed-memory domain decomposition by combining C with directives based on XMP and adapted for \(\textsc {InKS}_{\textsf {}}\). The compiler replaces these directives by C and XMP code (Listing 3). The inks decompose directive supports static or dynamic allocation of arrays from the algorithm. The domain size is extracted from the \(\textsc {InKS}_{\textsf {pia}}\) source and the user only has to specify its mapping onto memory. As in XMP, \(\textsc {InKS}_{\textsf {O/XMP}}\) supports domain decomposition mapped onto an XMP topology. In addition, it supports dimension reordering and folding, which consists of reusing the same memory address for subsequent indices in a given dimension. This feature is important to reuse memory given the DSA nature of \(\textsc {InKS}_{\textsf {pia}}\) arrays. The exchange directive supports halo exchanges. The required halo size is automatically extracted from the algorithm and the user only has to specify when to execute the communications and in which dimension. While XMP requires halo values to be stored contiguously with the domain, \(\textsc {InKS}_{\textsf {O/XMP}}\) supports a dynamic halo extension where halo values are stored in dedicated, dynamically allocated buffers to reduce the memory footprint.
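Folding, in particular, can be pictured as indexing a dimension modulo a fold factor. The sketch below (hand-written, hypothetical code rather than \(\textsc {InKS}_{\textsf {O/XMP}}\) output) folds the time dimension of a DSA-style computation with factor 2, so only two time slices are ever allocated instead of one per time-step:

```cpp
#include <cassert>
#include <vector>

// Fold factor 2 on the time dimension: logical coordinates (t, i) and
// (t+2, i) share the same storage, which is safe because the value at (t, i)
// is no longer needed once (t+1, i) has been computed.
double fold_smooth_sum(int n, int steps) {
    std::vector<std::vector<double>> u(2, std::vector<double>(n, 0.0));
    for (int i = 0; i < n; ++i) u[0][i] = (double)i;
    for (int t = 0; t < steps; ++t)
        for (int i = 0; i < n; ++i)
            u[(t + 1) % 2][i] =                       // write slice t+1 mod 2
                0.5 * (u[t % 2][(i + n - 1) % n] +    // read slice t mod 2
                       u[t % 2][(i + 1) % n]);
    double s = 0.0;
    for (double v : u[steps % 2]) s += v;
    return s;   // averaging over periodic neighbors conserves the sum
}
```

The memory footprint is 2*n values regardless of the number of time-steps, which is exactly the saving folding brings to DSA arrays.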

Listings 2 and 3.

The \(\textsc {InKS}_{\textsf {O/Loop}}\) Optimization Language. \(\textsc {InKS}_{\textsf {O/Loop}}\) (illustrated in Listing 4) lets the user manually specify loop nests for which the compiler generates plain C++ loops (Listing 5). Plain C++ remains usable alongside \(\textsc {InKS}_{\textsf {O/Loop}}\). The loop keyword introduces a nest optimization with a name, the list of parameters from the algorithm on which the loop bounds depend and a reference to the optimized kernel. Loop bounds can be automatically extracted from \(\textsc {InKS}_{\textsf {pia}}\), but the Set keyword makes it possible to restrict these bounds. The Order keyword specifies the iteration order on the dimensions and the Block keyword enables the user to implement blocking. It takes as parameters the block sizes for the loops, starting from the innermost one. If there are fewer block sizes than loops, the remaining loops are not blocked. The Buffer keyword supports copying data into a local buffer before computation and back afterwards to ensure data contiguity and improve vectorization. The compiler uses data dependencies from the algorithm to check the validity of the loop order and generates vectorization directives where possible.
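As an illustration of what the Block keyword amounts to, the sketch below (hypothetical hand-written C++, not actual \(\textsc {InKS}_{\textsf {O/Loop}}\) output) splits a loop into fixed-size chunks without changing the set of iterations or the results:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Blocked version of "for (i in 0..n) a[i] *= s": the single loop becomes a
// loop over blocks plus a loop inside each block. Iterations are identical;
// only locality changes, which is the point of blocking.
void scale_blocked(std::vector<double>& a, double s, long block) {
    long n = (long)a.size();
    for (long ib = 0; ib < n; ib += block)                    // over blocks
        for (long i = ib; i < std::min(ib + block, n); ++i)   // inside a block
            a[i] *= s;
}
```

Since blocking preserves the iteration set, a compiler with access to the algorithm's data dependencies (as \(\textsc {InKS}_{\textsf {O/Loop}}\) has) can verify such a transformation automatically.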

4 The 6D Vlasov/Poisson Problem

The 6D Vlasov-Poisson equation, presented in (1), describes the movement of particles in a plasma and the resulting electric field. We study its resolution for a single species on a 6D Cartesian mesh with periodic boundary conditions. We solve the Poisson part using a fast Fourier transform (FFT) and rely on a Strang splitting (order 2 in time) for the Vlasov part. This leads to six 1D advections: three in the space dimensions (\(x_1, x_2, x_3\)) and three in the velocity dimensions (\(v_1, v_2, v_3\)). Each 1D advection relies on a Lagrange interpolation of degree 4. In the space dimensions, we use a semi-Lagrangian approach where the stencil is applied not around the destination point but at the foot of the characteristics, a coordinate known at runtime only. This is described in more detail in [15].

$$\begin{aligned} \left\{ \begin{aligned}&\frac{\partial f(t, x, v)}{\partial t}+v\cdot \nabla _xf(t,x,v)-E(t,x)\cdot \nabla _vf(t,x,v)=0 \\&-\varDelta \phi (t, x) = 1 - \rho (t, x) \\&E(t, x) = -\nabla \phi (t, x) \\&\rho (t, x) = \displaystyle {\int } f(t, x, v) \, dv \end{aligned} \right. \end{aligned}$$
(1)

The main unknown is f (f6D in the code), the distribution function of particles in 6D phase space. Due to the Strang splitting, a first half time-step of advections is required after f6D initialization but before the main time-loop. These advections need the electric field E as input. E is obtained through the FFT-based Poisson solver that in turn needs the charge density \(\rho \) as input. \(\rho \) is computed by a reduction of f6D. The main time-loop is composed of 3 steps: advections in space dimensions, computation of the charge density (reduction) and electric field (Poisson solver) and advections in velocity dimensions.
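The control flow just described can be summarized by a skeleton in which every solver component is replaced by a stub that records its name (hypothetical stand-ins only, not the Selalib or \(\textsc {InKS}_{\textsf {}}\) code):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Skeleton of the solver structure: initialization, the pre-loop half
// time-step required by the Strang splitting, then the main time-loop.
std::vector<std::string> run_skeleton(int nsteps) {
    std::vector<std::string> log;
    auto step = [&](const std::string& s) { log.push_back(s); };
    step("init f6D");
    step("rho");                      // reduction of f6D
    step("poisson");                  // FFT-based solver, produces E
    step("half-step advections");     // needs E; runs before the main loop
    for (int t = 0; t < nsteps; ++t) {
        step("x advections");         // 3 split 1D advections in space
        step("rho");
        step("poisson");
        step("v advections");         // 3 split 1D advections in velocity
    }
    return log;
}
```

The log makes the dependency chain explicit: E can only be produced after \(\rho\), and \(\rho\) only after f6D, both before the loop and inside each iteration.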

The 6D nature of f6D requires a lot of memory, but the regularity of the problem means it can be distributed in blocks with good load-balancing. Halos are required to hold the values of neighbors for the advections. Connected halo zones would increase the number of points in all dimensions and consume too much memory. Split advections mean, however, that halos are required in a single dimension at a time. We therefore use dynamic halos composed of two buffers, one for each boundary of the advected dimension (denoted “right” and “left”), as shown in Fig. 2. Listing 2 corresponds to the \(\textsc {InKS}_{\textsf {O/XMP}}\) implementation of this strategy.
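The two-buffer layout can be sketched on a 1D periodic domain split into two blocks (hypothetical code; `Block` and `exchange` are illustrative names, not the \(\textsc {InKS}_{\textsf {O/XMP}}\) directives): instead of enlarging the whole local array, each block fills two small buffers with only the neighbor values needed in the advected dimension.

```cpp
#include <cassert>
#include <vector>

struct Block {
    std::vector<double> data;          // local part of the domain
    std::vector<double> left, right;   // dynamically allocated halo buffers
};

// "Exchange" halos of width h between the two blocks of a periodic 1D domain
// laid out as a.data followed by b.data.
void exchange(Block& a, Block& b, int h) {
    a.right.assign(b.data.begin(), b.data.begin() + h);  // a's right neighbor
    a.left.assign(b.data.end() - h, b.data.end());       // wraps around
    b.right.assign(a.data.begin(), a.data.begin() + h);  // wraps around
    b.left.assign(a.data.end() - h, a.data.end());       // b's left neighbor
}
```

The halo storage is 2*h values per block in the advected dimension only, instead of growing every dimension of the 6D array.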

Fig. 2. Dynamic halo exchange representation on a 2D domain

Advections are the main computational cost of the problem, accounting for 90% of the sequential execution time. Six loops surround the stencil computation of each advection and, in a naive version, the use of a modulo to handle periodicity and the application along non-contiguous dimensions slow down the computation. To enable vectorization and improve cache use, we copy f6D elements into contiguous buffers along with the left and right halos. Advections are applied on these buffers before copying them back into f6D. Blocking further improves performance by copying 16 elements at a time. Listing 5 corresponds to the \(\textsc {InKS}_{\textsf {O/Loop}}\) implementation of these optimizations.
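The copy-in/copy-out strategy can be sketched for a single non-contiguous line of a row-major 2D array (an illustrative simplification of the 6D case, with a degree-2 smoothing stencil standing in for the degree-4 Lagrange interpolation):

```cpp
#include <cassert>
#include <vector>

// Gather a column (stride nx) plus one periodic halo point on each side into
// a contiguous buffer, apply the stencil there, then scatter back. The inner
// stencil loop is contiguous and modulo-free, hence vectorizable.
void smooth_column(std::vector<double>& f, int nx, int ny, int col) {
    std::vector<double> buf(ny + 2);                    // 1-point halos
    for (int j = 0; j < ny; ++j) buf[j + 1] = f[j * nx + col];  // strided gather
    buf[0] = buf[ny];                                   // periodic left halo
    buf[ny + 1] = buf[1];                               // periodic right halo
    std::vector<double> out(ny);
    for (int j = 0; j < ny; ++j)                        // contiguous compute
        out[j] = 0.25 * buf[j] + 0.5 * buf[j + 1] + 0.25 * buf[j + 2];
    for (int j = 0; j < ny; ++j) f[j * nx + col] = out[j];  // scatter back
}
```

In the real code the same idea is applied per block of 16 lines at a time, amortizing the gather/scatter cost while keeping the working set in cache.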

5 Evaluation

This section evaluates the \(\textsc {InKS}_{\textsf {}}\) model on the NAS parallel benchmarks, a simple stencil code and the 6D Vlasov-Poisson problem. Plain C++ is used to optimize the synthetic benchmarks while \(\textsc {InKS}_{\textsf {O/XMP}}\), \(\textsc {InKS}_{\textsf {O/Loop}}\), plain C++ and C with XMP are used for Vlasov-Poisson. All codes are compiled with the Intel 18 compiler (-O3 -xHost) and Intel MPI 5.0.1, and executed on the Poincare cluster (Idris, France) with 32 GB RAM and two Sandy Bridge E5-2670 CPUs per node and a QLogic QDR InfiniBand interconnect.

5.1 Synthetic Benchmarks

We have implemented 4 out of the 5 sequential NAS benchmark kernels (IS, FT, EP and MG) in \(\textsc {InKS}_{\textsf {pia}}\) and optimized them with plain C++. The C++ version [9] is used as reference. We have also implemented a 3D heat equation resolution by finite differences (7-point stencil) with two distinct C++ optimization strategies from a single \(\textsc {InKS}_{\textsf {pia}}\) source. Both strategies come from the reference [12]: one uses double buffering (Heat/Buf) and the other implements a cache-oblivious strategy (Heat/Obl).

Table 1. Execution time of the C++ and \(\textsc {InKS}_{\textsf {}}\) implementations of the sequential NAS benchmarks, class B; time per iteration of the 3D heat equation (7-point stencil), size \(1024^3\); median and standard deviation of 10 executions; GNU complexity score of the implementations.

The NAS CG kernel relies on indirections not expressible in the PTG model of \(\textsc {InKS}_{\textsf {pia}}\). Its implementation would thus have to rely on a large C++ kernel whose optimization would be mixed with the algorithm. \(\textsc {InKS}_{\textsf {pia}}\) can however be used to express all other NAS kernels as well as the 3D heat equation solver. Even if not as expressive as C or Fortran, \(\textsc {InKS}_{\textsf {pia}}\) covers the needs of a wide range of simulation domains and offers abstractions close to the execution machine rather than from a specific simulation domain. Among others, it can express computations such as FFTs or stencils with input coordinates unknown at compile-time.

\(\textsc {InKS}_{\textsf {}}\) separates the specification of algorithm and optimization in distinct files. Multiple optimization strategies can be implemented for a single algorithm, as shown for the 3D heat equation where each relies on a specific memory layout and scheduling. It thus offers a clear separation of algorithm and optimization.

Finding the right metric to evaluate the ease of writing code is a difficult question. As illustrated in Listing 4, however, algorithm expression in \(\textsc {InKS}_{\textsf {pia}}\) is close to the most naive C implementation, where loops are replaced by \(\textsc {InKS}_{\textsf {}}\) validity domains with no concern for optimization. The specification of optimization choices is close to their expression in C++. Table 1 compares the GNU complexity score of \(\textsc {InKS}_{\textsf {}}\) optimizations to the reference code. \(\textsc {InKS}_{\textsf {}}\) scores are slightly better because kernels extracted from the algorithm hide computations and thus part of the complexity. In addition, the use of C++ to write optimizations lets optimization specialists reuse their preexisting knowledge of this language. These considerations should not hide the fact that some information has to be specified in both the \(\textsc {InKS}_{\textsf {pia}}\) and C++ files with this approach, leading to more code overall.

Regarding performance, the \(\textsc {InKS}_{\textsf {}}\) approach makes it possible to express optimizations that do not change the algorithm. Optimizations of the four NAS parallel benchmarks and the 3D heat equation solver in \(\textsc {InKS}_{\textsf {}}\) were trivial to implement and their performance matches or improves upon the reference, as presented in Table 1. Investigation has shown that Intel ICC 18 does not properly vectorize the reference versions of Heat/Obl and NAS/MG. The use of the Intel ivdep directive, as done in the \(\textsc {InKS}_{\textsf {}}\) versions, leads to slightly better performance.

5.2 Vlasov-Poisson

We evaluate \(\textsc {InKS}_{\textsf {O/XMP}}\) and \(\textsc {InKS}_{\textsf {O/Loop}}\) on Vlasov-Poisson separately as they target different optimizations and are not usable together currently. A first experiment focuses on the sequential aspects with the intra-node optimization of the \(v_1\) advection using either \(\textsc {InKS}_{\textsf {O/Loop}}\) or plain C++. A second experiment focuses on the parallel aspects with the charge density computation, the Poisson solver and a halo exchange optimized either with C/XMP or with \(\textsc {InKS}_{\textsf {O/XMP}}\). The reference is the Fortran/MPI implementation from the Selalib library [13].

Distinct files for both concerns in \(\textsc {InKS}_{\textsf {}}\) make it possible to write a unique \(\textsc {InKS}_{\textsf {pia}}\) algorithm and multiple versions of the optimization choices. Four optimization choices are implemented based on one \(\textsc {InKS}_{\textsf {pia}}\) source: (1) \(\textsc {InKS}_{\textsf {O/XMP}}\), (2) C with XMP, (3) \(\textsc {InKS}_{\textsf {O/Loop}}\) and (4) plain C++. This proves, to some extent, that the separation of concerns is respected. As of now, \(\textsc {InKS}_{\textsf {O/XMP}}\) and \(\textsc {InKS}_{\textsf {O/Loop}}\) are not usable together since \(\textsc {InKS}_{\textsf {O/Loop}}\) relies on C++, which XMP does not support. We plan to address this limitation in the future.

For the \(v_1\) advection, both the C++ and \(\textsc {InKS}_{\textsf {O/Loop}}\) optimizations of the \(\textsc {InKS}_{\textsf {}}\) code achieve performance similar to the reference, as shown in Table 2. For the parallel aspects, the \(\textsc {InKS}_{\textsf {O/XMP}}\) optimization offers performance similar to XMP, as shown in Fig. 3. The performance is comparable to MPI on the reduction operation but MPI is faster on the Poisson solver and the halo exchanges. At the moment, it seems that XMP does not optimize local copies, which slows down the Poisson solver. Besides, the XMP directives used for the halo exchanges are based on MPI RMA, which makes the comparison with MPI Send/Receive complex. Still, MPI is much harder to program: more than 350 lines of MPI and Fortran are required to handle the domain decomposition, the remapping for the FFT and the halo exchange in Selalib vs. 50 lines in XMP and 15 in \(\textsc {InKS}_{\textsf {O/XMP}}\).

Table 2. Comparison between the Fortran, C++ and \(\textsc {InKS}_{\textsf {O/Loop}}\) versions of the \(v_1\) advection on a \(32^6\) grid (double precision) on a single E5-2670 core with vectorization. Median of 12 executions.
Fig. 3. Weak and strong scaling for three parts of the Vlasov-Poisson solver up to 64 nodes (1 process/node) on a \(32^6\) grid divided among processes (strong scaling) or a \(16^6\) grid per process (weak scaling). Median of 10 executions.

The \(\textsc {InKS}_{\textsf {O}}\) family of DSLs enables the developer to specify optimization choices only, while algorithmic information is extracted from the \(\textsc {InKS}_{\textsf {pia}}\) code. This is illustrated by Listing 2, presenting the \(\textsc {InKS}_{\textsf {O/XMP}}\) 6D domain decomposition, and Listing 3, presenting the XMP version. Both are equivalent, but \(\textsc {InKS}_{\textsf {O/XMP}}\) expects only the optimization-choice parameters. Hence, one can test another memory layout, such as a different dimension ordering, by changing only a few parameters, while multiple directives must be modified in XMP. Similarly, with \(\textsc {InKS}_{\textsf {O/Loop}}\) (Listing 4), developers can easily test different optimization choices that would be tedious to implement in plain C++. Since \(\textsc {InKS}_{\textsf {O/XMP}}\) and \(\textsc {InKS}_{\textsf {O/Loop}}\) are respectively usable with C and C++, \(\textsc {InKS}_{\textsf {}}\) does not restrict the expressible optimization choices: one can still implement optimizations not handled by \(\textsc {InKS}_{\textsf {O}}\) in C/C++. Moreover, operations such as halo size computation or the detection of vectorization capabilities are automated using the algorithm. In summary, the approach enables optimization specialists to focus on their specialty, which makes development easier.

6 Conclusion and Future Work

In this paper, we have presented the \(\textsc {InKS}_{\textsf {}}\) programming model to separate the algorithm (\(\textsc {InKS}_{\textsf {pia}}\)) from optimization choices (\(\textsc {InKS}_{\textsf {O}}\) & C++) and its implementation supporting two DSLs: \(\textsc {InKS}_{\textsf {O/Loop}}\) for loop optimizations and \(\textsc {InKS}_{\textsf {O/XMP}}\) for domain decomposition. We have evaluated \(\textsc {InKS}_{\textsf {}}\) on synthetic benchmarks and on the solving of the 6D Vlasov-Poisson equation. We have demonstrated its generality and its advantages in terms of separation of concerns to improve maintainability and portability while offering performance on par with existing approaches.

While this paper demonstrates the interest of the \(\textsc {InKS}_{\textsf {}}\) approach, further work is required to fully develop it. We plan to apply the \(\textsc {InKS}_{\textsf {}}\) model to a range of different problems. We will improve the optimization DSLs, base \(\textsc {InKS}_{\textsf {O/Loop}}\) on existing loop optimization tools and ensure good interactions with \(\textsc {InKS}_{\textsf {O/XMP}}\). We also want to target different architectures to demonstrate the portability gains of the \(\textsc {InKS}_{\textsf {}}\) approach.