BALANCING THE LOAD: A VORONOI BASED SCHEME FOR PARALLEL COMPUTATIONS

Elad Steinberg; Almog Yalinewich; Re'em Sari; Paul Duffell

doi:10.1088/0067-0049/216/1/14

1. INTRODUCTION

Large computational tasks require machines with distributed memory since in such systems the number of CPUs and the size of the total available memory is not set by the capacity of a single machine. An important downside to the distributed memory approach is that communications between processors may take a significant fraction of the computation time since information is carried over a network.

Ensuring proper load balance during parallel calculations of partial differential equations (e.g., hydrodynamics) using the distributed memory approach, is essential for achieving superior performance. In this paper we present a general scheme that is applicable to any problem that can be solved with domain decomposition. Since our implementation is for hydrodynamics, the rest of the paper uses nomenclature of hydrodynamics without loss of generality. For uniform fixed mesh codes (e.g., Athena Stone et al. 2008) load balancing is no great challenge. Partitioning the domain into equal sized static partitions is straightforward, results in balanced loading and minimal communications. For static but nonuniform mesh, a tree based partitioning can be used (e.g., Dubinski 1996) to divide the domain among the different CPUs. Smooth particle hydrodynamics (SPH), adaptive mesh refinement (AMR), and moving mesh codes differ from the above since the number of cells/points in a given part of the domain is not constant in time. Pluto (Mignone et al. 2012), an AMR code, achieves load balancing by implementing the Kernighan-Lin algorithm for solving knapsack problems (see Schloegel et al. (2000) for more details). Recently it has become very popular to use space filling curves (e.g., Morton curve, Warren & Salmon 1995; or a Peano–Hilbert curve, Shirokov & Bertschinger 2005) to map the cells/points among the different CPUs and to define their boundaries (Shirokov & Bertschinger 2005; Springel 2005; Teyssier 2002). However, as the points move during the time evolution, the load balance between processors deteriorates and redistribution among CPUs is necessary.

2. A VORONOI BASED PARALLELIZATION SCHEME

The novel method introduced here uses a Voronoi diagram to decompose the domain. This is not to be confused with hydrodynamic Voronoi schemes, like AREPO or TESS (Springel 2010; Duffell & MacFadyen 2011), or own code RICH (Yalinewich et al. 2014), which use Voronoi diagram to determine the computational cells boundaries. The basic building blocks of the method are CPU mesh points. The corresponding Voronoi diagram based on these mesh points defines for each CPU a region within the computational domain. Each CPU retains in its memory only the hydro points that lie inside its Voronoi cell. At the beginning of the calculation, the domain is decomposed via the Voronoi diagram, and each CPU is assigned its relevant hydro points. The Voronoi diagram is built only for the CPU mesh points, while the hydro points (or cells for grid based codes) have their own independent structure that is the same as it would be in the serial version.

At the beginning of each subsequent time step, we check which hydro points moved between different CPU Voronoi cells, and send their information to the new CPU in which they lie now. This can be done extremely easily and efficiently since the Voronoi diagram already contains the information of who are the neighbors of each CPU. Since typically hydro points can not move in a single time step distances that are larger than their size, the number of communications required is limited to about $\mathcal {O}(\sqrt{N})$ for two dimensions (2D) or $\mathcal {O}(N^{2/3})$ for three dimensions (3D), where N is the number of hydro points per CPU. Therefore, for large enough N the communication time is guaranteed to be a small fraction of the computational time even if communications are relatively slow.

Another novelty of this method is that the CPU mesh points can be moved in such a manner as to try and maintain a constant workload per CPU. To determine how the CPU mesh points are to be moved, each CPU is given a merit based on its current workload.

Specifically, we following a scheme which we call "Pressure Balancing Scheme." For each time step (or every few time steps if the problem is "smooth" with regards to its hydro points distribution) we implement the following procedure:

1.
Every CPU constructs the Voronoi diagram of all of the CPU mesh points.
2.
The information of Hydro points that moved outside the CPU's Voronoi cell to which they belonged in the previous time step are sent to their new CPU and new hydro points are received from the neighboring CPUs.
3.
A hydrodynamical time step is preformed, during which ghost hydro points are communicated to/from neighboring CPUs. Since in a 2D Voronoi diagram a point has on average 6 neighbors, each CPU in our implementation communicates on average with 12 other CPUs (its adjacent neighbors and the neighbors mutual neighbors). For codes that hold a tree data structure for the location of the points or grid based codes, this number can be reduced in most cases only to the adjacent neighbors. In the case of Voronoi based moving mesh codes, a Voronoi diagram of all of the local hydro points is constructed taking into account information from the received ghost hydro points. This is preformed in a separate communication than the one in the above bullet.
4.
At the end of the computational time step, each CPU determines its merit, which in our scheme we call "pressure," by counting how many hydro points are in its domain and this data is broadcasted to the neighboring CPUs.
5.
Every CPU determines its new location by
$\begin{eqnarray} \boldsymbol{dx_i}&=& M_{\mathrm{best}}\cdot \sum _{j=0}^{\mathrm{Neighbors}}\left(\boldsymbol {x}_i^n-\boldsymbol {x}_j^n\right)(1/(M_j+1)\nonumber \\ &&-1/(M_i+1)) \end{eqnarray} \tag{ 1 }$

$\begin{equation} \boldsymbol {x}_i^{n+1}=\boldsymbol {x}_i^n+\hat{d\!x_i}\cdot \mathrm{Min}(\alpha R_i,\mathrm{abs}(\boldsymbol {dx_i})), \end{equation} \tag{ 2 }$
where M_best is the total number of hydro points divided by the number of processors, M_i is the number of hydro points in the ith CPU. R_i is the CPU's effective radius calculated as $\pi R_i^2=A_i$ for 2D and $4\pi R_i^3/3=V_i$ for 3D where A_i is the area of the ith CPU Voronoi cell in 2D and V_i is its volume in 3D, n is the temporal index and we set α = 0.04. The worst case scenario for the fractional number of hydro points that are to be communicated in the next time step is of order α.
6.
Optionally, a Lloyd iteration can be performed (e.g., Equation (63) in Springel 2010) in order to ensure that the CPU cells remain rather round.

This scheme allows the domains of CPUs to "drift" with the flow of the points, maintaining a good load balance and diminishing the amount of data exchange between CPUs.

In order to achieve best performance, the initial positions of the CPU mesh points should mimic the distribution of the hydro points. If the initial load balance is poor, or an abrupt shift in the load balance occurs, the number of time steps to regain good load balance can be roughly estimated as the number of time steps it takes a CPU mesh point to travel the distance of the domain size. If A is the area of the domain, then the distance that a badly load balanced CPU mesh point has to travel, Δr, is of order

$\begin{equation} \Delta r\approx \sqrt{A}. \end{equation} \tag{ 3 }$

The distance that a CPU mesh point travels in a single time step, δr, is roughly

$\begin{equation} \delta r\approx \alpha R\approx \alpha \sqrt{A\over N_{{\rm CPU}}}, \end{equation} \tag{ 4 }$

where N_CPU is the number of CPUs. The number of time steps required to find a good load balance should typically be

$\begin{equation} {\Delta r\over \delta r}\approx {\sqrt{{N_{{\rm cpu}}}}\over \alpha }. \end{equation} \tag{ 5 }$

For the 3D case it is

$\begin{equation} {\Delta r\over \delta r}\approx {{N_{{\rm cpu}}}^{1/3}\over \alpha }. \end{equation} \tag{ 6 }$

The Voronoi based parallelization scheme is not limited to a particular implementation of solving the Euler equations, it can be applied to SPH, AMR, and moving mesh codes. Extending the Voronoi based parallelization scheme is trivial for SPH, where each SPH particle is a hydro point in our notation. For AMR (or other grid based codes), each cell can be associated with a point (its location inside the cell is arbitrary but a natural choice would be the cells center of mass), thus enabling the Voronoi based parallelization scheme where the points that are inside the cells are the hydro points in our notation.

3. PERFORMANCE

In the following section we run several benchmarks to test our method. All tests are using the RICH moving mesh hydrodynamics code (Yalinewich et al. 2014), which solves the Euler equations on a moving Voronoi mesh. RICH is made parallel by using MPI and the Voronoi based parallelization scheme described in this paper. All of the tests are run on the Astric cluster at Hebrew University, which hosts 56 Intel E5-2670 Xeon processors for a combined total of 448 cores connected via InfiniBand FDR non-blocking connection. For all of the following tests the load balance is defined xto be the ratio between the MPI process with the highest number of cells to the average zones per MPI process, i.e., max M_i/M_best.

3.1. Load Balancing

We use a simple blast wave problem in order to check the performance of our new load balancing scheme. The initial problem is set up with a random mesh of uniform density, with a total of 10⁴ × 128 hydro points distributed equally among 128 MPI processes. In order to measure the effectiveness of our parallelization scheme, we plot the load balancing max M_i/M_best as a function of time in Figure 1. The test starts with almost perfect load balance, and as the hydro points begin to redistribute themselves, the work is no longer perfectly distributed. However, the average load balance is close to unity (mean value of 1.11), and thus the parallelization is efficient considering that there are no global redistributions of data.

**Figure 1.** Load balance, defined as the ratio between the maximum number of hydro points in an MPI process to the average number of hydro points per MPI processes, as a function of time for the blast wave test problem. The simulation was run on 128 MPI processes with a total number of 10⁴ × 128 hydro points.
Download figure:
Standard image High-resolution image

3.2. Weak Scaling

In a perfect parallelization scheme, increasing the number of MPI processes while keeping the workload per MPI process constant should result with a constant computational time per time step; this is known as weak scaling. However, the need to communicate information between processes and load imbalance prevent this from happening in practice.

In order to check the weak scaling of our scheme, the same blast wave problem as in the previous section is run. First we keep a constant workload of 10⁵ hydro points per MPI process, while increasing the number of MPI processes, and a second time keeping 10⁴ hydro points per MPI processor. Figure 2 shows that for a small number of MPI processes the time per time step increases slightly as the number of MPI processes is increased. There are two main factors for this. The first, is the ability of a core to increase its speed while the number of MPI processes that are being used on its CPU die is low (this is known as Turbo Boost). The second is that our implementation currently uses blocking messages when we send and receive data from the neighboring MPI processes. This gives rise to idle time when the MPI processor is waiting for a message. This inefficiency saturates once there are MPI processes that are completely surrounded by other MPI processes (as opposed to the domain walls). Future versions of RICH will use non-blocking messages, which should improve performance as well as other performance increasing changes. In order to show the effect of Intel's Turbo Boost, we run the single thread version of the code with 4-8 concurrent runs per CPU and measure their run time relative to a single run. We then rescale the results of the run with 10⁵ hydro points per MPI process accordingly and plot them in Figure 2. The ratio between our single thread with 10⁵ hydro points' run time and the rescaled result for 448 MPI process is 0.78. The ratio between the run time of 16 MPI processes and 448 MPI process is 0.93, showing that our code has good scaling once the number of neighbors saturates. For comparison, we run the same blast wave problem with 10⁵ SPH points per MPI process with Gagdet2 (Springel 2005), using a fixed time step for all of the particles and in 3D (which is better optimized than the 2D version). The results are shown in Figure 2. Our code clearly scales better for more than 32 MPI processes, while our scaling for less than 32 MPI processes is worse.

**Figure 2.** Weak scaling for the blast wave test problem. Plotted is the number of seconds per time step while keeping the workload per MPI process constant with 10⁴ hydro points, shown in blue triangles, and 10⁵ hydro points per MPI process, shown in red circles. The red upside triangles are the results for the 10⁵ hydro points but rescaled to take into account Intel's Turbo boost. For comparison, the results of Gadget2 are shown as red crosses.
Download figure:
Standard image High-resolution image

3.3. Strong Scaling

Ideally, for a fixed number of hydro points, the computational time should scale inversely proportional to the number of processors used. It should be equal to the computational time on a single processor divided by the number of processors used. In practice this is not achieved due to imperfect load balancing and time spent on communication between processors. Strong scaling, defined as keeping the total workload constant while the number of CPUs increases, is another way to measure the efficiency of the parallelization scheme.

The same problem as before is ran, once with a total workload of 10⁶ hydro points and once with 10⁷ hydro points. The same qualitative behavior as in the weak scaling can be seen in Figure 3. For the larger workload, the scaling is one over the number of MPI processes. For the smaller workload, using more than 256 MPI processes decreases the efficiency. This is because the workload per MPI process is so low that the overhead in constructing the ghost cells needed for the boundaries (with their communication) dominates the computation. At least for this implementation, running with less than 4000 hydro points per MPI process is counterproductive.

**Figure 3.** Strong scaling for our blast wave test problem. Plotted is the time per time step while keeping the global workload constant. Blue circles are used for the run with a global workload 10⁷ cells, blue triangles for a global workload of 10⁶ cells and the red lines show $1/N_{\mathrm{MPI_Processors}}$ scaling.
Download figure:
Standard image High-resolution image

**Figure 3.** Strong scaling for our blast wave test problem. Plotted is the time per time step while keeping the global workload constant. Blue circles are used for the run with a global workload 10⁷ cells, blue triangles for a global workload of 10⁶ cells and the red lines show $1/N_{\mathrm{MPI_Processors}}$ scaling.
Download figure:
Standard image High-resolution image

3.4. Uniform Distribution

In this section we test the performance of this scheme on a domain which has uniform initial distribution of hydro points. We run the Gresho vortex problem (Gresho & Chan 1990) as set up in Springel (2010) on 64 MPI processes starting with a uniform distribution of cells while each MPI process hosts 10⁴ hydro points. Figure 4 shows the load balancing, as defined above, as a function of time. It is evident that good efficiency is maintained at all times.

**Figure 4.** Load balance as a function of time for our Gresho vortex test problem.
Download figure:
Standard image High-resolution image

3.5. Achieving Load Balance

This scheme can also quickly achieve excellent load balance even when the initial distribution of work is very unbalanced. To show this, we set up a static mesh problem, where the hydro points are distributed in the domain [ − 1, 1]², with random position picked from an exponential distribution that is described such that the probability of having a hydro point between r and r+dr, where r is the radius, is given by λexp (− λr)dr with λ = 10. The initial positions of CPU mesh points is randomly chose from a uniform distribution in the unit square. This initial setup gives an inefficient configuration, and our goal is to show how the processors position move according to our algorithm and advance toward more efficient load balance. The hydro mesh points are kept fixed and, in fact, no hydro is calculated.

The system is then iterated using the load balancing steps described in the previous section, and the load balance is recorded for each iteration. This test is preformed with 96 MPI processes and a total of 9.6 × 10⁵ hydro mesh points. With our initial setup the load balance is 66 and the configuration of the CPUs is shown in Figure 5. The load balance as a function of the iteration number is plotted in Figure 6 while the final configuration of the CPUs is shown in Figure 5. Our scheme improves the load balance which eventually saturates on the average value of 1.4. It takes ∼300 iterations to achieve this load balance, which is comparable to our estimate of $\sqrt{{N_{{\rm cpu}}}}/\alpha =100$ . The mean value of the load balance after 300 iterations is 1.4, a dramatic improvement in efficiency over the initial load balance of 60.

**Figure 5.** Initial (top) and final (bottom) setup of the CPUs for the exponential distribution load balance test.
Download figure:
Standard image High-resolution image

**Figure 6.** Load balance as a function of the iteration number for the exponential distribution load balance test.
Download figure:
Standard image High-resolution image

4. FURTHER EXTENSIONS

Our merit system can easily be extended to deal with more complex scenarios where the computational time per hydro point is not constant. Hydro points whose computational time is longer than other points can simply be taken into account as more than one hydro point when the effective pressure is calculated. An effective implementation of this option is to calculate the time it took each CPU to advance a single time step and give a merit based on that time. This way, the algorithm will tend to even out the computational time among the CPUs even if additional calculations other than hydrodynamics are taking place.

In some cases a large dynamical range of time steps, which are different for different hydro points, is present (Katz et al. 1996). Some codes (e.g., Springel 2010) have an option to advance in time only a subset of hydro points with their relevant time step, thus while one hydro point is advanced one large time step, another hydro point with a smaller time step can be advanced many smaller time steps. Defining a pressure just as the time it takes each CPU cell, will results in good load balance, but could create a very large memory imbalance between CPUs, since some CPUs will have many hydro points with a large time step and some will have a small number of hydro points with small time steps. One way to overcome this issue is to use a MultiGrid approach. A Voronoi diagram of the processors should be created for every time step level, and then our method for load balancing should be applied to each time step level individually while taking into account only hydro points with the corresponding time step. Note that there is no reason for the CPU to have its domains overlap between the different time step levels.

Since our load balancing method is a geometrical one, problems with extreme non-uniformity in the hydro points distribution can be difficult to address without allowing the Voronoi cells of the CPUs to have large aspect ratios. There are two possible improvements to handle these situations. The first is to have each CPU run more than one MPI process, this allows the CPU domains to be smaller at the cost of more communications. The second is to have an additional "global" attraction force that helps contract the CPU cells in areas of extreme hydro points density. As an example of a "global" attraction force, the mesh points of the CPUs can be moved according to Equation (71) in Springel (2010). Specifically we use

$\begin{equation} \boldsymbol {dx_i}=\sum _{j\ne i}\pi \left(\frac{R_i}{|\boldsymbol {x}_i^n-\boldsymbol {x}_j^n|} \right)^3\left(\frac{M_{\mathrm{best}}}{M_j}-1\right) \left(\boldsymbol {x}_i^n-\boldsymbol {x}_j^n\right). \end{equation} \tag{ 7 }$

As an example of the above, we run a problem similar to the one in Section 3.5, but instead of having one center for the exponential distribution, we have 40 different centers distributed in a uniform random manner, with λ = 320. For each center, a total of 2 × 10⁵ hydro points are drawn. The initial locations of the CPUs are drawn uniformly. They are then evolved with our load balancing scheme with the addition of the above mentioned "global" attractive force given by Equation (7). We limit the magnitude of this force to be no larger than αR_i/5. The load balance as a function of the iteration number is presented in Figure 7. The mean load balance after 1000 iterations is 2.05, using 4096 MPI processes. This shows that our parallelization scheme can perform well on an extremely inhomogeneous problems with a large number of MPI processes.

**Figure 7.** Load balance as a function of the iteration number for the exponential distribution load balance test.
Download figure:
Standard image High-resolution image

5. DISCUSSION

We present here a novel method to parallelize a computational code on a distributed memory machine. It is relevant for handling the load balance of partial differential equations that can be solved with domain decompositions. This method, which uses a moving Voronoi diagram to determine the CPU domains, ensures that there is no need for global data redistribution throughout the whole computation but only exchange of points adjacent to the edges of the CPU's domain. The ability of the CPUs to "flow" with the hydro points greatly reduces the need to transfer data between CPUs. Tests run with the RICH moving mesh code, show very good scaling up to 448 MPI processes.

The main caveat for this parallelization scheme is that it is efficient only as long as the number of CPUs is smaller than the number of hydro points per CPU. Since typically the number of hydro points per CPU is much larger than the number of CPUs, this caveat does not impose a strong constraint.

We would like to thank Frederic Bournaud, Romain Teyssier, and Jim Stone for their insightful comments. E.S. is supported by an Ilan Ramon grant from the Israeli Ministry of Science. This research is supported in part by ISF, ISA, and iCORE grants, and a Packard Fellowship.

BALANCING THE LOAD: A VORONOI BASED SCHEME FOR PARALLEL COMPUTATIONS

Article metrics

Permissions

Author e-mails

Author affiliations

ORCID iDs

Dates

ABSTRACT

1. INTRODUCTION

2. A VORONOI BASED PARALLELIZATION SCHEME