Bandwidth optimal all-reduce algorithms for clusters of workstations

doi:10.1016/j.jpdc.2008.09.002

Journal of Parallel and Distributed Computing

Volume 69, Issue 2, February 2009, Pages 117-124

https://doi.org/10.1016/j.jpdc.2008.09.002

Abstract

We consider an efficient realization of the all-reduce operation with large data sizes in cluster environments, under the assumption that the reduce operator is associative and commutative. We derive a tight lower bound on the amount of data that must be communicated in order to complete this operation and propose a ring-based algorithm that requires only tree connectivity to achieve bandwidth optimality. Unlike the widely used butterfly-like all-reduce algorithm, which incurs network contention in SMP/multi-core clusters, the proposed algorithm can achieve contention-free communication in almost all contemporary clusters, including SMP/multi-core clusters and Ethernet switched clusters with multiple switches. We demonstrate that the proposed algorithm is more efficient than other algorithms on clusters with different nodal architectures and networking technologies when the data size is sufficiently large.

Introduction

The all-reduce operation combines values from all processes and distributes the results to all processes. It is commonly used in parallel computing. In the Message Passing Interface (MPI) standard [18], the routine for this operation is MPI_Allreduce.

We consider an efficient realization of the all-reduce operation with large data sizes in cluster environments, under the assumption that the reduce operator is associative and commutative. We derive a tight lower bound on the amount of data that must be communicated in order to complete the all-reduce operation, and use the lower bound to establish the minimum time required for this operation. We propose a ring-based algorithm for this operation that achieves bandwidth optimality on tree topologies in that (1) each node sends the minimum amount of data required to complete this operation, and (2) all communications are contention free.
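To make the ring-based idea concrete, the following is a minimal sketch of a ring all-reduce simulated in a single Python process. It is an illustration under stated assumptions, not the implementation from this paper: the function name `ring_allreduce`, the chunking rule, and the two-phase schedule (a ring reduce-scatter followed by a ring all-gather) are choices made for this example, and the simulation does not model the network, so it says nothing about contention. It only counts how many items each simulated process sends.

```python
from operator import add


def ring_allreduce(vectors, op=add):
    """Simulate a ring all-reduce (hypothetical helper, for illustration only).

    vectors[p] is the input of process p; all inputs have the same length.
    Returns (results, sent): results[p] is the output held by process p and
    sent[p] is the number of items process p transmitted.
    """
    n = len(vectors)
    x = len(vectors[0])
    # Chunk c covers items bounds[c] .. bounds[c + 1] - 1 (sizes differ by at most 1).
    bounds = [c * x // n for c in range(n + 1)]
    data = [list(v) for v in vectors]
    sent = [0] * n

    def ring_round(chunk_of, combine):
        # All processes send at the same time, so collect messages first.
        outgoing = []
        for p in range(n):
            c = chunk_of(p) % n
            lo, hi = bounds[c], bounds[c + 1]
            outgoing.append((lo, data[p][lo:hi]))
            sent[p] += hi - lo
        # Process p always sends to its right neighbour (p + 1) mod n.
        for p, (lo, payload) in enumerate(outgoing):
            dst = (p + 1) % n
            for i, value in enumerate(payload):
                data[dst][lo + i] = combine(data[dst][lo + i], value)

    # Phase 1, reduce-scatter: after n - 1 rounds, process p holds the fully
    # reduced chunk (p + 1) mod n.
    for s in range(n - 1):
        ring_round(lambda p: p - s, op)
    # Phase 2, all-gather: completed chunks travel once more around the ring.
    for s in range(n - 1):
        ring_round(lambda p: p + 1 - s, lambda old, new: new)
    return data, sent


inputs = [[p + 1] * 8 for p in range(4)]  # process p contributes p + 1 per item
results, sent = ring_allreduce(inputs)
print(results[0])  # [10, 10, 10, 10, 10, 10, 10, 10]
print(sent)        # [12, 12, 12, 12]
```

In this sketch, with N processes and X items, each process sends 2(N − 1) chunks of roughly X/N items over 2(N − 1) rounds, which matches the observation that the number of rounds of a ring scheme grows with the number of processes.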

Currently, the most widely used all-reduce scheme is the butterfly-like algorithm [22], [23], [27], where the all-reduce operation is realized with a recursive halving reduce-scatter followed by a recursive doubling all-gather. When the network can support the butterfly communication pattern without contention, this algorithm is optimal both in the latency term (using the minimum number of communication rounds needed) and in the bandwidth term (each node communicating the minimum amount of data required). The problem with the butterfly-like algorithm is that the butterfly communication pattern can cause network contention in many contemporary clusters, such as the widely deployed SMP/multi-core clusters. In contrast, our ring-based algorithm only requires a tree topology to be bandwidth optimal, and can achieve contention-free communication in almost all contemporary clusters including SMP/multi-core clusters and Ethernet switched clusters with multiple switches. The ring-based algorithm also requires less working memory and can be applied to clusters with non-power-of-two numbers of nodes. One limitation of the proposed algorithm is that it is only optimal in the bandwidth term, but not the latency term: the number of communication rounds is proportional to the number of processes. Another issue in the ring-based algorithm is that the reduction results are computed with different “bracketing”, which may cause problems in the presence of rounding errors.
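The butterfly-like scheme described above can be sketched in the same simulated style. This is a minimal illustration, assuming a power-of-two number of processes and a vector length divisible by that number; the function name `butterfly_allreduce` and the bookkeeping variables are hypothetical, and the simulation ignores the network entirely, so the contention issue discussed above does not appear in it.

```python
from operator import add


def butterfly_allreduce(vectors, op=add):
    """Simulate recursive-halving reduce-scatter + recursive-doubling all-gather.

    Illustrative sketch only. Returns (results, sent, rounds).
    """
    n = len(vectors)
    x = len(vectors[0])
    if n & (n - 1) or x % n:
        raise ValueError("sketch assumes power-of-two n and n dividing x")
    data = [list(v) for v in vectors]
    lo = [0] * n   # process p is currently responsible for items lo[p] .. hi[p] - 1
    hi = [x] * n
    sent = [0] * n
    rounds = 0

    # Recursive halving: exchange with partner p XOR d, keep half of the range.
    d = n // 2
    while d >= 1:
        snapshot = [list(v) for v in data]  # messages are exchanged simultaneously
        for p in range(n):
            q = p ^ d
            mid = (lo[p] + hi[p]) // 2
            keep = (lo[p], mid) if p & d == 0 else (mid, hi[p])
            for i in range(*keep):
                data[p][i] = op(snapshot[p][i], snapshot[q][i])
            sent[p] += (hi[p] - lo[p]) - (keep[1] - keep[0])  # the half given away
            lo[p], hi[p] = keep
        rounds += 1
        d //= 2

    # Recursive doubling: exchange finished ranges, doubling them each round.
    d = 1
    while d < n:
        snapshot = [list(v) for v in data]
        merged = []
        for p in range(n):
            q = p ^ d
            for i in range(lo[q], hi[q]):
                data[p][i] = snapshot[q][i]
            sent[p] += hi[p] - lo[p]
            merged.append((min(lo[p], lo[q]), max(hi[p], hi[q])))
        for p, (a, b) in enumerate(merged):
            lo[p], hi[p] = a, b
        rounds += 1
        d *= 2
    return data, sent, rounds


inputs = [[p + 1] * 8 for p in range(4)]
results, sent, rounds = butterfly_allreduce(inputs)
print(results[0])  # [10, 10, 10, 10, 10, 10, 10, 10]
print(sent)        # [12, 12, 12, 12]
print(rounds)      # 4
```

In this sketch the partner of process p in each round is p XOR d, so the communication distance changes from round to round; this is the butterfly pattern that the text identifies as a potential source of contention on real networks.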

We evaluate the proposed algorithm on various clusters of workstations, including high-end SMP/multi-core clusters with Myrinet and InfiniBand interconnects and low-end Ethernet switched clusters. The results show that the proposed algorithm significantly outperforms other algorithms when the data size is sufficiently large.

The rest of the paper is organized as follows: Section 2 introduces the all-reduce operation and the communication model that we use. In Section 3, we derive the theoretical lower bound on the communication time required for this operation. Section 4 presents the proposed bandwidth optimal all-reduce algorithm. Section 5 reports the results of our experiments. The related work is discussed in Section 6. Section 7 concludes the paper.

Section snippets

All-reduce operation

We will use a generic operator $\oplus$ to denote the reduce operator in the all-reduce operation. MPI_Allreduce requires the reduce operator to be associative, that is, $(a \oplus b) \oplus c = a \oplus (b \oplus c)$ . Moreover, all built-in operations for MPI_Allreduce are also commutative, that is, $a \oplus b = b \oplus a$ . We assume that the reduce operator is both associative and commutative in this paper.
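The two properties can be checked on concrete values. The short sketch below uses hypothetical helper names (`is_associative_on`, `is_commutative_on`) and tests the identities on specific operands only; it also illustrates why different bracketing matters for floating-point addition, which is commutative but not exactly associative once rounding is involved.

```python
from operator import add


def is_associative_on(a, b, c, op=add):
    """Check (a op b) op c == a op (b op c) for these particular operands."""
    return op(op(a, b), c) == op(a, op(b, c))


def is_commutative_on(a, b, op=add):
    """Check a op b == b op a for these particular operands."""
    return op(a, b) == op(b, a)


print(is_associative_on(1, 2, 3))            # True: integer addition
print(is_commutative_on(0.1, 0.2))           # True: float addition commutes
print(is_associative_on(0.1, 0.2, 0.3))      # False: rounding depends on bracketing
print((0.1 + 0.2) + 0.3, 0.1 + (0.2 + 0.3))  # 0.6000000000000001 0.6
print(is_associative_on(0.1, 0.2, 0.3, op=max))  # True: max involves no rounding
```

An algorithm that combines partial results in a different order than a sequential reduction can therefore return slightly different floating-point sums, even though both orders are valid under the associativity and commutativity assumption.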

In terms of operating results, an all-reduce operation is equivalent to a reduction operation that reduces the results to one process,

The lower bounds

In a general-purpose all-reduce operation, data items are independent of each other, and so are the reductions on different items. Hence, the amount of data that must be communicated in order to complete an all-reduce operation on $X$ items is equal to $X$ times the amount for the single-item all-reduce operation. To obtain the lower bound for a general all-reduce operation on $X$ items, we only need to find the lower bound for the operation on a single item.
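The item-by-item independence can be illustrated with a small sketch that builds an $X$-item all-reduce out of $X$ one-item all-reduces. The function names are hypothetical and the code models only the result of the operation, not the communication; it is meant to show why reasoning about a single item suffices.

```python
from functools import reduce
from operator import add


def allreduce_one_item(values, op=add):
    """One-item all-reduce: every process ends up with the combined value."""
    combined = reduce(op, values)
    return [combined] * len(values)


def allreduce(vectors, op=add):
    """X-item all-reduce expressed as X independent one-item all-reduces."""
    n = len(vectors)
    x = len(vectors[0])
    results = [[None] * x for _ in range(n)]
    for i in range(x):
        # Item i depends only on the i-th entry of each process's input.
        column = allreduce_one_item([v[i] for v in vectors], op)
        for p in range(n):
            results[p][i] = column[p]
    return results


print(allreduce([[1, 2], [3, 4], [5, 6]]))          # [[9, 12], [9, 12], [9, 12]]
print(allreduce([[1, 2], [3, 4], [5, 6]], op=max))  # [[5, 6], [5, 6], [5, 6]]
```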

In a one-item

A ring-based bandwidth optimal algorithm

We will describe a ring-based bandwidth optimal all-reduce algorithm for tree topologies, and show how such an algorithm can be applied to high-end clusters. The tree topology by itself is not particularly interesting. We develop bandwidth optimal algorithms for the tree topology only because a tree provides the minimum connectivity and most networks have tree embeddings: our bandwidth optimal all-reduce algorithm for trees can be applied to any topology with a tree embedding. Note that since our

Experiments

Based on the algorithms described in the previous section, we implement all-reduce routines in two forms. For high-end SMP/multi-core clusters, we implement a stand-alone all-reduce routine based on Lemma 6. For clusters with a physical tree topology, we develop a routine generator that takes the tree topology information as input and automatically produces an all-reduce routine that uses the topology specific algorithm for the topology. The routine generator reads a topology description file

Related work

The all-reduce operation has been extensively studied. The one-item all-reduce operation has been studied under different names such as census function [2], global combine [3], [5], [6], [27], and gossip [15]. The lower bound for the communication time under various communication models has been established [2], [3], [5], [15]. In [15], it is shown that to complete a one-item all-reduce operation under the telephone model, at least $2N - 4$ connections must be established when $N > 4$. In [2], [3], [5],

Conclusions

We investigate efficient implementations of the all-reduce operation with large data sizes under the assumption that the reduce operator is both associative and commutative. We derive a theoretical lower bound on the communication time of this operation and develop a bandwidth optimal all-reduce algorithm on tree topologies. This algorithm only requires tree connectivity to achieve bandwidth optimality and can be applied to contemporary clusters. We demonstrate the effectiveness of the proposed

Acknowledgments

This work is supported in part by National Science Foundation (NSF) grants CCF-0342540, CCF-0541096, and CCF-0551555. Experiments were also performed on resources sponsored through an NSF Teragrid grant CCF-050010T.

Pitch Patarasuk received the B.S. degree in civil engineering from Chulalongkorn University, Thailand, in 1999, and the M.S. degree in computer science from Florida State University in 2004. He is currently a Ph.D. student in the Computer Science Department at Florida State University. His research interests include distributed systems, cluster computing, and communication optimizations.

References (28)

A. Bar-Noy et al.
Optimal computation of census functions in the postal model
Discrete Applied Mathematics
(1995)
W. Gropp et al.
A high-performance, portable implementation of the MPI message passing interface standard
Parallel Computing
(1996)
A. Karwande et al.
An MPI prototype for compiled communication on ethernet switched clusters
Journal of Parallel and Distributed Computing
(2005)
W. Knodel
New gossips and telephones
Discrete Mathematics
(1975)
R.G. Lane et al.
An empirical study of reliable multicast protocols over ethernet-connected networks
Performance Evaluation Journal
(2007)
P. Patarasuk et al.
Techniques for pipelined broadcast on ethernet switched clusters
Journal of Parallel and Distributed Computing
(2008)
R. van de Geijn
On global combine operations
Journal of Parallel and Distributed Computing
(1994)
G. Almasi, et al., Optimization of MPI collective communication on BlueGene/L systems, in: International Conference on...
A. Bar-Noy et al.
Computing global combine operations in the multiport postal model
IEEE Transactions on Parallel and Distributed Systems
(1995)
L. Bongo, O. Anshus, J. Bjorndalen, T. Larsen, Extending collective operations with application semantics for improving...

J. Bruck et al.
Efficient global combine operations in multi-port message-passing systems
Parallel Processing Letters
(1993)
J. Bruck et al.
On the design and implementation of broadcast and global combine operations using the postal model
IEEE Transactions on Parallel and Distributed Systems
(1996)
A. Faraj et al.
Bandwidth efficient all-to-all broadcast on switched clusters
International Journal of Parallel Programming
(2008)
A. Faraj et al.
A message scheduling scheme for all-to-all personalized communication on ethernet switched clusters
IEEE Transactions on Parallel and Distributed Systems
(2007)


Xin Yuan received his B.S. and M.S. degrees in Computer Science from Shanghai Jiaotong University in 1989 and 1992, respectively. He obtained his Ph.D. degree in Computer Science from the University of Pittsburgh in 1998. He is currently an associate professor at the Department of Computer Science, Florida State University. His research interests include parallel and distributed systems, compilers, and networking.
