One-dimensional partitioning for heterogeneous systems: Theory and practice☆
Introduction
In many applications of parallel computing, load balancing is achieved by mapping a possibly multi-dimensional computational domain down to a one-dimensional (1D) array, and then partitioning this array into parts with equal weights. Space-filling curves are commonly used to map the higher-dimensional domain to a 1D workload array to preserve locality and minimize communication overhead after partitioning [5], [6], [9], [15]. Similarly, processors can be mapped to a 1D array so that communication is relatively faster between close processors in this processor chain [10]. This eases mapping for computational domains and improves efficiency of applications. The load balancing problem for these applications can be modeled as the chain-on-chain partitioning (CCP) problem, where we map a chain of tasks onto a chain of processors. Formally, the objective of the CCP problem is to find a sequence of separators that divides a chain of tasks with associated computational weights into consecutive parts so as to minimize the maximum load among processors.
In our earlier work [17], we studied the CCP problem for homogenous systems, where all processors have identical computational power. We have surveyed the rich literature on this problem, proposed novel methods as well as improvements on existing methods, and studied how these algorithms can be implemented efficiently to be effective in practice. In this work, we investigate how these techniques can be generalized for heterogeneous systems, where processors have varying computational powers. Two distinct problems arise in partitioning chains for heterogeneous systems. The first problem is the CCP problem, where a chain of tasks is to be mapped onto a chain of processors, i.e., the $p$th task subchain in a partition is assigned to the $p$th processor. The second problem is the chain partitioning (CP) problem, where a chain of tasks is to be mapped onto a set, as opposed to a chain, of processors, i.e., processors can be permuted for subchain assignments. For brevity, the CCP problem for homogenous systems and heterogeneous systems will be referred to as the homogenous CCP problem and heterogeneous CCP problem, respectively. The CP problem refers to the chain partitioning problem for heterogeneous systems, since it has no counterpart for homogenous systems.
In this article, we show that the heterogeneous CCP problem can be solved in polynomial time, by enhancing the exact algorithms proposed for the solution of the homogenous CCP problem [17]. We present how these exact algorithms for homogenous systems can be enhanced for heterogeneous systems and implemented efficiently for runtime performance. We also present how the heuristics widely used for the solution of the homogenous CCP problem can be adapted for heterogeneous systems. We present the implementation details and pseudocode for the exact algorithms and heuristics for clarity and reproducibility. Our experiments with workload arrays coming from image-space-parallel volume rendering and row-parallel sparse matrix vector multiplication applications show that our proposed exact algorithms produce substantially better results than the heuristics, while the solution times remain comparable. On average, the load imbalance of optimal solutions is 4.9 and 8.7 times lower than that of the heuristics for 128-way partitionings of volume rendering and sparse matrix datasets, respectively. On average, the time it takes to compute an optimal solution is less than 2.20 times the time it takes to compute an approximation using heuristics for 128 processors, and thus the preprocessing time can easily be compensated by the improved efficiency of the subsequent computation, even for a few iterations.
The CP problem, on the other hand, is NP-complete, as we prove in this paper. Our proof uses a pseudo-polynomial reduction from the 3-Partition problem, which is known to be NP-complete in the strong sense [7]. Our empirical studies showed that processor ordering has a very limited effect on the solution quality, and an optimal CCP solution on a random processor ordering serves as an effective CP heuristic.
The remainder of this paper is organized as follows. Table 1 summarizes important symbols used throughout the paper. Section 2 introduces the heterogeneous CCP problem. In Section 3, we summarize the solution methods for homogenous CCP. In Section 4, we discuss how solution methods for homogenous systems can be enhanced to solve the heterogeneous CCP problem. In Section 5, we discuss the CP problem and prove that it is NP-complete. We present the results of our empirical studies with the proposed methods in Section 6, and finally, we conclude with Section 7.
Section snippets
Chain-on-chain (CCP) problem for heterogeneous systems
In the heterogeneous CCP problem, a computational problem that is decomposed into a chain of tasks with associated positive computational weights is to be mapped onto a chain of processors with associated execution speeds. The execution time of a task on a processor is the task's weight divided by the processor's execution speed. For clarity, we note that there are no precedence constraints among the tasks in the chain.
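The objective above can be illustrated with a minimal sketch that evaluates one given partition. It assumes that the execution time of a subchain on a processor is its total weight divided by that processor's speed; the class name `HeterogeneousBottleneck`, the separator convention, and the sample numbers are hypothetical and are not taken from the paper's implementation.

```java
/** Hypothetical helper: evaluates one chain-on-chain partition on processors of different speeds. */
final class HeterogeneousBottleneck {

    /**
     * Returns the largest execution time over all processors.
     *
     * @param weights    task weights, in chain order
     * @param speeds     processor execution speeds, in chain order
     * @param separators separators[p] is the index of the first task that is NOT given to
     *                   processor p; processor p receives the tasks from the previous
     *                   separator (0 for p == 0) up to separators[p] - 1
     */
    static double bottleneck(double[] weights, double[] speeds, int[] separators) {
        double worst = 0.0;
        int start = 0;
        for (int p = 0; p < speeds.length; p++) {
            double load = 0.0;
            for (int i = start; i < separators[p]; i++) {
                load += weights[i];
            }
            // Assumed cost model: time = subchain weight / processor speed.
            worst = Math.max(worst, load / speeds[p]);
            start = separators[p];
        }
        return worst;
    }

    public static void main(String[] args) {
        double[] weights = {4, 2, 6, 3, 5};
        double[] speeds = {2, 1, 3};
        int[] separators = {2, 3, 5}; // parts: {4, 2}, {6}, {3, 5}
        System.out.println(bottleneck(weights, speeds, separators)); // prints 6.0
    }
}
```

In this example the second processor is the slowest and receives a subchain of weight 6, so it determines the bottleneck even though the third processor carries a heavier subchain.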
A task subchain is defined as a
CCP algorithms for homogenous systems
The homogenous CCP problem can be considered as a special case of the heterogeneous CCP problem, where all processors are assumed to have equal speed. Here, we review the CCP algorithms for homogenous systems. A comprehensive review and presentation of homogenous CCP algorithms are available in [17].
Proposed CCP algorithms for heterogeneous systems
The algorithms we propose in this section extend the techniques for homogenous CCP to heterogeneous CCP. All algorithms discussed in this section require an initial prefix-sum operation on the task-weight array for the efficiency of subsequent subchain-weight computations. The prefix-sum operation replaces the $i$th entry $W[i]$ with the sum of the first $i$ entries ($\sum_{h=1}^{i} w_h$) so that the computational weight of a task subchain from the $i$th to the $j$th task can be efficiently determined as $W[j] - W[i-1]$ in $O(1)$ time. In our
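The prefix-sum step described above can be sketched as follows. This is only an illustration: the class and method names are hypothetical, and the sketch stores the sums in a separate array with a leading zero rather than overwriting the weight array in place, which is a simplifying assumption made here to avoid a special case for subchains that start at the first task.

```java
/** Hypothetical sketch of the prefix-sum preprocessing on a task-weight array. */
final class PrefixSums {

    /** Returns an array where prefix[i] is the total weight of the first i tasks (prefix[0] = 0). */
    static double[] build(double[] weights) {
        double[] prefix = new double[weights.length + 1];
        for (int i = 1; i <= weights.length; i++) {
            prefix[i] = prefix[i - 1] + weights[i - 1];
        }
        return prefix;
    }

    /** Weight of the subchain from task i to task j (1-based, inclusive), in constant time. */
    static double subchainWeight(double[] prefix, int i, int j) {
        return prefix[j] - prefix[i - 1];
    }

    public static void main(String[] args) {
        double[] prefix = build(new double[] {4, 2, 6, 3, 5});
        System.out.println(subchainWeight(prefix, 2, 4)); // 2 + 6 + 3, prints 11.0
    }
}
```

After this linear-time pass, every subchain weight needed by a partitioning algorithm costs one subtraction, regardless of the subchain's length.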
Chain Partitioning (CP) problem for heterogeneous systems
In this section, we study the problem of partitioning a chain of tasks onto a set of processors, as opposed to a chain of processors. The solution to this problem is not only separators on the task chain, but also processor-to-subchain assignments. Thus, we define a mapping as a partition of the given task chain together with a permutation of the given set of processors. According to this mapping, the $p$th task
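To make the role of the permutation concrete, the sketch below evaluates a mapping that consists of separators plus a processor permutation. It is a hedged illustration only: the class name `ChainPartitionMapping`, the 0-based permutation array, the weight-over-speed cost model, and the sample data are assumptions introduced here, not the paper's code.

```java
/** Hypothetical sketch: evaluates a CP mapping (separators plus processor permutation). */
final class ChainPartitionMapping {

    /**
     * @param weights     task weights, in chain order
     * @param speeds      execution speeds of the processors in the set
     * @param separators  separators[k] is the index of the first task NOT in the k-th subchain
     * @param permutation permutation[k] is the index of the processor that runs the k-th subchain
     * @return the largest execution time over all subchains
     */
    static double bottleneck(double[] weights, double[] speeds,
                             int[] separators, int[] permutation) {
        double worst = 0.0;
        int start = 0;
        for (int k = 0; k < separators.length; k++) {
            double load = 0.0;
            for (int i = start; i < separators[k]; i++) {
                load += weights[i];
            }
            // Assumed cost model: time = subchain weight / speed of the assigned processor.
            worst = Math.max(worst, load / speeds[permutation[k]]);
            start = separators[k];
        }
        return worst;
    }

    public static void main(String[] args) {
        double[] weights = {4, 2, 6, 3, 5};
        double[] speeds = {2, 1, 3};
        int[] separators = {1, 3, 5}; // subchain weights: 4, 8, 8
        System.out.println(bottleneck(weights, speeds, separators, new int[] {0, 1, 2})); // prints 8.0
        System.out.println(bottleneck(weights, speeds, separators, new int[] {1, 0, 2})); // prints 4.0
    }
}
```

With the separators fixed, swapping the first two processors halves the bottleneck in this small example, which shows why the CP problem has to search over both separators and processor assignments.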
Experimental setup
The 1D task arrays used in both CCP and CP experiments were derived from two different applications: image-space-parallel direct volume rendering and row-parallel sparse matrix vector multiplication.
Direct volume rendering experiments are performed on three curvilinear datasets from NASA Ames Research Center [13], namely Blunt Fin (blunt), Combustion Chamber (comb), and Oxygen Post (post). These datasets are processed using the tetrahedralization techniques described in [8], [18] to produce
Conclusions
We studied the problem of one-dimensional partitioning of nonuniform workload arrays with optimal load balancing for heterogeneous systems. We investigated two cases: chain-on-chain partitioning, where a chain of tasks is partitioned onto a chain of processors; and chain partitioning, where the task chain is partitioned onto a set of processors (i.e., permutation of the processors is allowed). We showed that chain-on-chain partitioning algorithms for homogenous systems can be revised to solve
Availability
The algorithms proposed in this work are implemented in Java and are publicly available at http://www.cs.bilkent.edu.tr/~tabak/hetccp/.
Acknowledgments
The first author was supported by the Director, Office of Science, Division of Mathematical, Information, and Computational Sciences of the US Department of Energy under contract DE-AC03-76SF00098.
References (18)
- et al., New challenges in dynamic load balancing, Applied Numerical Mathematics (2005)
- Rectilinear partitioning of irregular data parallel computations, Journal of Parallel and Distributed Computing (1994)
- et al., Fast optimal load balancing algorithms for 1D partitioning, Journal of Parallel and Distributed Computing (2004)
- et al., Hypergraph-partitioning-based remapping models for image-space-parallel direct volume rendering of unstructured grids, IEEE Transactions on Parallel and Distributed Systems (2007)
- H.-A. Choi, B. Narahari, Algorithms for mapping and partitioning chain structured parallel computations, in:...
- et al., Introduction to Algorithms (1989)
- T. Davis, University of Florida Sparse Matrix Collection, http://www.cise.ufl.edu/research/sparse/matrices, NA Digest,...
- K.D. Devine, B. Hendrickson, E.G. Boman, M.M.S. John, C. Vaughan, Zoltan: A dynamic load-balancing library for parallel...
- et al., Computers and Intractability; A Guide to the Theory of NP-Completeness (1990)
Ali Pinar is a member of the High Performance Computing Research Department at Lawrence Berkeley National Laboratory. His research is on combinatorial problems arising in algorithms and applications of scientific computing, with emphasis on parallel computing, sparse matrix computations, computational problems in electric power systems, data analysis, and interconnection network design. He received his PhD degree in 2001 in computer science from University of Illinois at Urbana-Champaign, with the option of computational science and engineering, and his B.S. and M.S. degrees in 1994 and 1996, respectively, in Computer Engineering from Bilkent University, Turkey.
He is a member of SIAM and its activity groups in Supercomputing, Computational Science and Engineering, and Optimization; IEEE Computer Society; and ACM. He has also been elected to serve as the secretary of the SIAM activity group in Supercomputing for the 2008–2009 term.
E. Kartal Tabak is a Ph.D. candidate at the Computer Engineering Department of Bilkent University. His research interests include parallel computing and algorithms, high performance Web Search Engines, high performance application servers, and software engineering.
Cevdet Aykanat received the BS and MS degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the PhD degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his PhD studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989, he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing, parallel scientific computing and its combinatorial aspects, parallel computer graphics applications, parallel data mining, graph and hypergraph partitioning, load balancing, neural network algorithms, high performance information retrieval systems, parallel and distributed web crawling, parallel and distributed databases, and grid computing. He has (co)authored about 50 technical papers published in academic journals indexed in SCI. He is the recipient of the 1995 Young Investigator Award of The Scientific and Technological Research Council of Turkey and 2007 Science Award bestowed by the METU Parlar Foundation. He is a member of the ACM and the IEEE Computer Society. He has been recently appointed as a member of IFIP Working Group 10.3 (Concurrent Systems).
☆ This work is partially supported by The Scientific and Technological Research Council of Turkey (TÜBİTAK) under projects EEEAG-105E065 and EEEAG-106E069.