1 Introduction

Performing data mining over data collected by edge devices, most importantly mobile phones, is of very high interest [17]. Collecting such data at a central location has become more and more problematic in recent years due to novel data protection rules [9] and, in general, due to the increasing public awareness of issues related to data handling. For this reason, there is an increasing interest in methods that leave the raw data on the device and process it using distributed aggregation.

Google introduced federated learning to answer this challenge [12, 13]. This approach is very similar to the well-known parameter server architecture for distributed learning [7] where worker nodes store the raw data. The parameter server maintains the current model and regularly distributes it to the workers who in turn calculate a gradient update and send it back to the server. The server then applies all the updates to the central model. This is repeated until the model converges. In federated learning, this framework is optimized so as to minimize communication between the server and the workers. For this reason, the local update calculation is more thorough, and compression techniques can be applied when uploading the updates to the server.

In addition to federated learning, gossip learning has also been proposed to address the same challenge [10, 15]. This approach is fully decentralized: no parameter server is necessary. Nodes exchange and aggregate models directly. The advantages of gossip learning are obvious: since it requires no infrastructure and has no single point of failure, it scales significantly more cheaply and is more robust. The key question, however, is how the two approaches compare in terms of performance. This is the question we address in this work. To be more precise, we compare the two approaches in terms of convergence time and model quality, assuming that both approaches utilize the same amount of communication resources in the same scenarios.

To make the comparison as fair as possible, we make sure that the two approaches differ mainly in their communication patterns, while the computation of the local update is identical in both. Also, we apply subsampling to reduce communication in both approaches, as introduced in [12] for federated learning. Here, we adapt the same technique to gossip learning.

We learn linear models using stochastic gradient descent (SGD) based on the logistic regression loss function. For realistic simulations, we apply smartphone churn traces collected by the application Stunner [2]. We note that both approaches offer mechanisms for explicit privacy protection, apart from the basic feature of not collecting data. In federated learning, Bonawitz et al. [3] describe a secure aggregation protocol, whereas for gossip learning one can apply the methods described in [4]. Here, we are concerned only with the efficiency of the different communication patterns and do not compare security mechanisms.

The result of our comparison is that gossip learning is in general comparable to the centrally coordinated federated learning approach, and in many scenarios gossip learning actually outperforms federated learning. This result is rather counter-intuitive and suggests that decentralized algorithms should be treated as first-class citizens in the area of distributed machine learning overall, considering the additional advantages of decentralization.

The outline of the paper is as follows. Section 2 describes the basics of federated learning and gossip learning. Section 3 describes the specific algorithmic details that were applied in our comparative study, in particular, the management of the learning rate parameter and the subsampling compression techniques. Section 4 presents our results.

2 Background

Classification is a fundamental problem in machine learning. Here, a data set \(D=\{(x_1,y_1), \dots ,(x_n,y_n)\}\) of n examples is given, where an example is represented by a feature vector \(x\in R^d\) and the corresponding class label \(y\in C\), where d is the dimension of the problem and C is the set of class labels. The problem of classification is often expressed as finding the parameters w of a function \(f_w:R^d\rightarrow C\) that can correctly classify as many examples as possible in D, as well as outside D (this latter property is called generalization). Expressed formally, the objective function J(w) captures the error of the model parameters w, and we wish to minimize J(w) in w:

$$\begin{aligned} w^*=\arg \min _{w}J(w) = \arg \min _{w} \frac{1}{n}\sum _{i=1}^n \ell (f_w(x_i),y_i) + \frac{\lambda }{2}\Vert w\Vert ^2, \end{aligned}$$
(1)

where \(\ell ()\) is the loss function (the error of the prediction), \(\Vert w\Vert ^2\) is the regularization term, and \(\lambda \) is the regularization coefficient. By keeping the model parameters small, regularization helps in avoiding overfitting to the training set.

Perhaps the simplest algorithm to approximate \(w^*\) is the gradient descent method. Here, we start with a random weight vector \(w_0\). In each iteration, we compute \(w_{t+1}\) based on \(w_t\) by finding the gradient of the objective function at \(w_t\) and taking a step in the direction opposite to the gradient. One such iteration is called a gradient update. Formally,

$$\begin{aligned} w_{t+1} = w_{t} - \eta _{t}(\frac{\partial J}{\partial w}(w_t)) = w_{t} - \eta _{t}(\lambda w_t + \frac{1}{n}\sum _{i=1}^n \frac{\partial \ell (f_w(x_i),y_i)}{\partial w}(w_t)), \end{aligned}$$
(2)

where \(\eta _t\) is the learning rate at iteration t. Stochastic gradient descent (SGD) is similar, only we use a single example \((x_i,y_i)\) instead of the entire database to perform an update:

$$\begin{aligned} w_{t+1} = w_{t} - \eta _{t}(\lambda w_t + \frac{\partial \ell (f_w(x_i),y_i)}{\partial w}(w_t)). \end{aligned}$$
(3)

It is also usual to apply a so-called minibatch update, in which more than one example is used, but not the entire database.

In this study we use logistic regression as our machine learning model, where the specific form of the objective function is given by

$$\begin{aligned} J(w) = -\frac{1}{n}\sum _{i=1}^n \ln P(y_i|x_i,w) +\frac{\lambda }{2}\Vert w\Vert ^2, \end{aligned}$$
(4)

where \(y_i\in \{0,1\}\), \(P(0|x_i,w) = (1+\exp (w^Tx_i))^{-1}\) and \(P(1|x_i,w) = 1 - P(0|x_i,w)\).
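For completeness, we note (this step is not spelled out above but follows directly from the definitions in (3) and (4)) that the per-example gradient has the familiar closed form

$$\begin{aligned} \frac{\partial \ell (f_w(x_i),y_i)}{\partial w}(w_t) = \frac{\partial }{\partial w}\bigl (-\ln P(y_i|x_i,w)\bigr )\Big |_{w_t} = \bigl (P(1|x_i,w_t)-y_i\bigr )x_i, \end{aligned}$$

so the SGD step (3) for logistic regression reads \(w_{t+1} = w_t - \eta _t\bigl (\lambda w_t + (P(1|x_i,w_t)-y_i)x_i\bigr )\).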

2.1 Federated Learning

The pseudocode of the federated learning algorithm [12, 13] is shown in Algorithm 1 (master) and Algorithm 2 (worker). The master periodically sends the current model w to all the workers asynchronously in parallel and collects the answers from the workers. Any answer from a worker arriving with a delay larger than \(\varDelta _f\) is simply discarded. After \(\varDelta _f\) time units have elapsed, the master aggregates the received gradients and updates the model. We also send and maintain the model age t (based on the average number of examples used for training) in a similar fashion, to enable the use of dynamic learning rates in the local learning. These algorithms are very generic; the key characteristics of federated learning lie in the details of the update method (line 2 of Algorithm 2) and the compression mechanism (line 4 of Algorithm 2 and line 10 of Algorithm 1). The update method is typically implemented through a minibatch gradient descent algorithm that operates on the local data, initialized with the received model w. The details of our implementation of the update method and compression are presented in Sect. 3.
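As an illustration only, the following is a minimal, synchronous Python sketch of one such round (our own rendering, not the paper's code: in reality the master contacts the workers asynchronously and discards answers arriving later than \(\varDelta _f\), and compression is omitted here; local_update stands for the update method of Sect. 3, and all names are ours).

```python
import numpy as np

def master_round(w, t, workers):
    """One round of the federated learning master (cf. Algorithm 1, simplified):
    send (w, t) to every worker, average the returned model differences and age
    increments, and apply them to the central model."""
    diffs, ages = [], []
    for worker in workers:                       # asynchronous and time-limited in reality
        h, n = worker.on_receive_model(w.copy(), t)
        diffs.append(h)
        ages.append(n)
    return w + np.mean(diffs, axis=0), t + np.mean(ages)

class Worker:
    """Federated learning worker (cf. Algorithm 2, without compression)."""

    def __init__(self, data, labels, local_update):
        self.data, self.labels = data, labels
        self.local_update = local_update         # the minibatch update of Sect. 3

    def on_receive_model(self, w, t):
        w_new, t_new = self.local_update(w, t, self.data, self.labels)
        return w_new - w, t_new - t              # model difference and age increment
```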

[Algorithm 1: federated learning, master]
[Algorithm 2: federated learning, worker]
[Algorithm 3: gossip learning]

2.2 Gossip Learning

Gossip Learning is a method for learning models from fully distributed data without central control. Each node k runs Algorithm 3. First, the node initializes a local model \(w_k\) (and its age \(t_k\)). This model is then periodically sent to another node in the network. (Note that these cycles are not synchronized.) The node selection is supported by a so-called sampling service [11, 16]. Upon receiving a model \(w_r\), the node merges it with the local model and updates the merged model using the local data set \(D_k\). Merging is typically achieved by averaging the model parameters; see Sect. 3 for specific implementations. In the simplest case, the received model merely overwrites the local model. This mechanism results in the models taking random walks in the network and being updated when visiting a node. The possible update methods are the same as in the case of federated learning, and compression can be applied as well.
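A similarly minimal sketch of the gossip learning node loop (again our own illustrative rendering of Algorithm 3, not the paper's code; local_update and merge stand for the methods of Sect. 3, the sampling service is abstracted as a list of neighbors, and all names are ours):

```python
import random
import numpy as np

class GossipNode:
    """Gossip learning node (cf. Algorithm 3, simplified)."""

    def __init__(self, d, data, labels, local_update, merge):
        self.w, self.t = np.zeros(d), 0          # init the local model and its age
        self.data, self.labels = data, labels
        self.local_update, self.merge = local_update, merge

    def on_cycle(self, neighbors):
        """Called every delta_g time units (cycles are not synchronized across nodes)."""
        peer = random.choice(neighbors)          # node selection via the sampling service
        peer.on_receive_model(self.w.copy(), self.t)

    def on_receive_model(self, w_r, t_r):
        """Merge the received model into the local one, then train on the local data."""
        self.w, self.t = self.merge(self.w, self.t, w_r, t_r)
        self.w, self.t = self.local_update(self.w, self.t, self.data, self.labels)
```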

3 Algorithms

In this section we describe the details of the update, init, compress, aggregate, and merge methods. Methods update, init and compress are shared between federated learning and gossip learning. In all cases we used the implementations in Algorithms 4 and 5. In the minibatch update we compute the sum instead of the average to give an equal weight to all the examples irrespective of batch size. (Note that even if the minibatch size is fixed, actual sizes will vary because the number of examples at a given node is normally not divisible by the nominal batch size.) We used the dynamic learning rate \(\eta _t=\eta /t\), where t is the number of instances the model was trained on.
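A minimal sketch of such an update for the logistic regression model (our own illustrative code, not Algorithm 4 or 5: for simplicity it draws one random minibatch instead of sweeping the local data, applies the regularization term once per batch, and all names are ours):

```python
import numpy as np

def update_minibatch(w, t, X, Y, eta, lam, batch_size=10):
    """Minibatch update: gradients are summed (not averaged) over the batch,
    and the dynamic learning rate eta / t is used, where t counts the number
    of examples the model has been trained on so far."""
    idx = np.random.choice(len(X), size=min(batch_size, len(X)), replace=False)
    grad_sum = np.zeros_like(w)
    for i in idx:
        p = 1.0 / (1.0 + np.exp(-np.dot(w, X[i])))   # P(y=1 | x_i, w)
        grad_sum += (p - Y[i]) * X[i]                # gradient of -ln P(y_i | x_i, w)
    t += len(idx)
    w = w - (eta / t) * (lam * w + grad_sum)
    return w, t
```

With eta and lam fixed (for example via functools.partial), this function plays the role of local_update in the earlier sketches.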

[Algorithm 4]
[Algorithm 5]
[Algorithm 6: aggregation methods]
[Algorithm 7: compression methods]
[Algorithm 8: merge methods]

Method aggregate is used in Algorithm 1. Its function is to decompress and aggregate the received gradients encoded with compress. When there is no actual compression (compressNone in Algorithm 7), the gradients are simply averaged (aggregateDefault in Algorithm 6). The compression technique we employed is subsampling [13]. When using subsampling, workers do not send all of the model parameters back to the master, but only random subsets of a given size (see compressSubsampling). Note that the indices need not be sent; instead, we can send the random seed used to select them. The missing values are treated as zero. Due to this, the gradient average needs to be scaled as shown in aggregateSubsampled to create an unbiased estimator of the original gradient. We introduce a slight improvement to this scaling method in aggregateSubsampledImproved. Here, instead of scaling based on the theoretical probability of including a parameter, we calculate the actual average for each parameter separately, based on the number of gradients that contain the given parameter.
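The following is a minimal sketch of subsampling and of the two aggregation variants (our own illustrative code, not the pseudocode of Algorithms 6 and 7: for clarity the index set is transmitted explicitly rather than as a random seed, s denotes the subsampling probability, and all names are ours):

```python
import numpy as np

def compress_subsampling(h, s):
    """Keep a random fraction s of the parameters of the update h;
    the receiver treats the missing parameters as zero."""
    d = len(h)
    idx = np.random.choice(d, size=max(1, int(s * d)), replace=False)
    return idx, h[idx]

def aggregate_subsampled(compressed, d, s):
    """Unbiased decompression: average the reconstructed sparse updates and
    scale by 1/s, the inverse of the inclusion probability."""
    total = np.zeros(d)
    for idx, values in compressed:
        sparse = np.zeros(d)
        sparse[idx] = values
        total += sparse
    return total / (len(compressed) * s)

def aggregate_subsampled_improved(compressed, d):
    """Per-parameter averaging: divide each coordinate by the number of
    received updates that actually contained it."""
    total, counts = np.zeros(d), np.zeros(d)
    for idx, values in compressed:
        total[idx] += values
        counts[idx] += 1
    return np.divide(total, counts, out=np.zeros(d), where=counts > 0)
```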

In gossip learning, merge is used to combine the local model with the incoming one. In the simplest variation, the local model is discarded in favor of the received model (see mergeNone in Algorithm 8). It is usually a better idea to take the average of the parameter vectors [15]. We use average weighted by model age (see mergeAverage). Subsampling can be used with gossip learning as well, in which case mergeSubsampled must be used, which considers only the received parameters.
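The merge variants can be sketched as follows (our own illustrative code; the exact age bookkeeping in Algorithm 8 may differ from this rendering, and all names are ours):

```python
import numpy as np

def merge_average(w, t, w_r, t_r):
    """Average of the local and the received model, weighted by model age
    (cf. mergeAverage)."""
    if t + t_r == 0:
        return w_r.copy(), t_r
    a = t_r / (t + t_r)                          # weight proportional to the received model's age
    return (1 - a) * w + a * w_r, max(t, t_r)

def merge_subsampled(w, t, idx, w_r_values, t_r):
    """Like merge_average, but only over the received (subsampled) parameters
    (cf. mergeSubsampled)."""
    w = w.copy()
    if t + t_r == 0:
        w[idx] = w_r_values
        return w, t_r
    a = t_r / (t + t_r)
    w[idx] = (1 - a) * w[idx] + a * w_r_values
    return w, max(t, t_r)
```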

4 Experiments

4.1 Datasets

We used three datasets from the UCI machine learning repository [8] to test the performance of our algorithms. The first is the Spambase (SPAM E-mail Database) dataset containing a collection of emails. Here, the task is to decide whether an email is spam or not. The emails are represented by high-level features, mostly word or character frequencies. The second dataset is Pendigits (Pen-Based Recognition of Handwritten Digits), which contains images of the digits 0 to 9 downsampled to \(4\times 4\) pixels. The third is the HAR (Human Activity Recognition Using Smartphones) [1] dataset, where human activities (walking, walking upstairs, walking downstairs, sitting, standing and laying) were monitored by smartphone sensors (accelerometer, gyroscope and angular velocity). High-level features were extracted from these measurement series.

The main properties, such as size and number of features, are presented in Table 1. In our experiments we standardized the feature values, that is, shifted and scaled them to have a mean of 0 and a variance of 1. Note that standardization can be approximated locally by the nodes in the network if the approximate feature statistics are fixed and known in advance, which can be ensured in a given application.

Table 1. Data set properties

In our simulation experiments, each example in the training data was assigned to one node when the number of nodes was 100. This means that, for example, with the HAR dataset each node gets 73.5 examples on average. When the network size is 1000, we replicate the examples, that is, each example is assigned to 10 different nodes. As for the distribution of class labels on the nodes, we applied two different setups. The first one is uniform assignment, which means that we assigned the examples to nodes at random independently of class label. The number of samples assigned to each node was the same (to be more precise, it differed by at most one due to the number of samples not being divisible by 100).

The second one is single class assignment, in which every node has examples only from a single class. Here, the different class labels are assigned uniformly to the nodes, and then the examples with a given label are assigned to one of the nodes with the same label, uniformly. These two assignment strategies represent the two extremes in any real application. In a realistic setting the class labels will likely be biased, but much less so than in the single class assignment scenario.
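The two assignment strategies can be summarized by the following sketch (illustrative only; the function names and the round-robin spreading of labels over nodes are our assumptions, not taken from the paper):

```python
import numpy as np

def assign_uniform(labels, num_nodes=100):
    """Uniform assignment: shuffle all examples and deal them out so that node
    sizes differ by at most one, independently of class label."""
    order = np.random.permutation(len(labels))
    return [order[i::num_nodes] for i in range(num_nodes)]

def assign_single_class(labels, num_nodes=100):
    """Single class assignment: each node is given one class label, and every
    example goes to a uniformly chosen node holding its label."""
    classes = np.unique(labels)
    node_class = classes[np.arange(num_nodes) % len(classes)]   # spread labels over nodes
    nodes = [[] for _ in range(num_nodes)]
    for i, y in enumerate(labels):
        candidates = np.flatnonzero(node_class == y)
        nodes[np.random.choice(candidates)].append(i)
    return [np.array(n) for n in nodes]
```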

4.2 System Model

In our simulation experiments, we used a fixed random k-out overlay network with \(k=20\). That is, every node had \(k=20\) fixed random neighbors. Simulations were performed with network sizes of 100 and 1000 nodes. In the churn-free scenario, every node stayed online for the whole experiment. The churn scenario is based on a real trace gathered from smartphones (see Sect. 4.3 below). We assumed that a message is successfully delivered if and only if both the sender and the receiver remain online during the transfer. We also assume that the nodes are able to detect which of their neighbors are online at any given time, with a delay that is negligible compared to the transfer time of a model.

We assumed uniform upload and download bandwidths for the nodes, and infinite bandwidth on the side of the server. Note that the latter assumption favors federated learning, as gossip learning does not use a server. The uniform bandwidth assumption is motivated by the fact that it is likely that in a real application there will be a configured (uniform) bandwidth cap that is significantly lower than the average available bandwidth. The transfer time of a full model was assumed to be 172 s (irrespective of the dataset used) in the long transfer time scenario, and 17.2 s in the short transfer time scenario. This allowed for around 1,000 and 10,000 iterations over the course of 48 h, respectively.

The cycle length parameters \(\varDelta _g\) and \(\varDelta _f\) were set based on the constraint that in both algorithms the nodes should be able to exploit all the available bandwidth. In our setup this also means that the two algorithms transfer the same number of bits overall in the network in the same time window, which allows us to make a fair comparison of the convergence dynamics. The gossip cycle length \(\varDelta _g\) is thus exactly the transfer time of a full model, that is, nodes are assumed to send messages continuously. The cycle length \(\varDelta _f\) of federated learning is the round-trip time, that is, the sum of the upstream and downstream transfer times. When compression is used, the transfer time is proportionally shorter, as defined by the compression rate. Note, however, that in federated learning the master always sends the full model to the workers; only the upstream transfer is compressed.
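As a concrete illustration of this accounting (the 25% figure is our own example, used only for the arithmetic): in the long transfer time scenario with a subsampling probability of 25%, gossip learning compresses every message, so \(\varDelta _g = 0.25\cdot 172 = 43\) s, whereas federated learning compresses only the upstream direction, so \(\varDelta _f = 172 + 0.25\cdot 172 = 215\) s.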

Note that we assume much longer transfer times than would be realistic for the actual models in our simulation. To put it differently, in our simulations we pretend that our models are very large. This is because in the churn scenario, if the transfer times are very short, the network hardly changes during the learning process, so we effectively learn over a static subset of the nodes. Long transfer times, however, make the problem more challenging because many transfers fail, just like in the case of very large machine learning models such as deep neural networks. In the no-churn scenario this issue is irrelevant, since the dynamics of convergence are identical apart from a scaling of time.

4.3 Smartphone Traces

The trace we used was collected by a locally developed, openly available smartphone app called STUNner, as described previously [2]. In a nutshell, the app monitors and collects information about charging status, battery level, bandwidth, and NAT type.

Fig. 1. Online session length distribution (left) and dynamic trace properties (right)

We have traces of varying lengths taken from 1191 different users. We divided these traces into 2-day segments (with a one-day overlap), resulting in 40,658 segments altogether. With the help of these segments, we were able to simulate a virtual 48-h period by assigning a different segment to each simulated node.

To ensure that our algorithm is phone- and user-friendly, we defined a device to be online (available) when it had been on a charger and connected to the internet for at least a minute; hence, we never use battery power at all. In addition, we also treated users with a bandwidth of less than 1 Mbit/s as offline.

Figure 1 illustrates some of the properties of the trace. The plot on the right illustrates churn by showing, for every hour, the percentage of nodes that left or joined the network (at least once), respectively. We can also see that at any given moment about 20% of the nodes are online. The average session length is 81.368 min.

4.4 Hyperparameters and Algorithms

The learning rate \(\eta \) and regularization coefficient \(\lambda \) were optimized using grid search assuming the no-failure scenario, no compression, and uniform assignment. The resulting values are shown in Table 1. These hyperparameters depend only on the dataset; they are robust to the choice of algorithm. Minibatches of size 10 were used in each scenario. We used logistic regression as our learning algorithm, embedded in a one-vs-all meta-classifier.
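To make the one-vs-all construction concrete, here is a minimal sketch (our own illustrative code, not the implementation used in the experiments; all names are ours):

```python
import numpy as np

class OneVsAllLogistic:
    """One binary logistic-regression weight vector per class; prediction picks
    the class whose binary model gives the highest score."""

    def __init__(self, num_classes, d):
        self.W = np.zeros((num_classes, d))      # one parameter vector per class

    def partial_fit(self, x, y, eta_t, lam):
        """One SGD step on example (x, y) for every binary subproblem."""
        for c in range(len(self.W)):
            target = 1.0 if y == c else 0.0      # class c against all the rest
            p = 1.0 / (1.0 + np.exp(-np.dot(self.W[c], x)))
            self.W[c] -= eta_t * (lam * self.W[c] + (p - target) * x)

    def predict(self, x):
        return int(np.argmax(self.W @ x))
```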

Fig. 2. Federated learning, 100 nodes, long transfer time, no failures, different aggregation algorithms and subsampling probabilities.

4.5 Results

We ran the simulations using PeerSim [14]. We measure learning performance with the help of the 0–1 loss, which gives the proportion of the misclassified examples in the test set. In the case of gossip learning the loss is defined as the average loss over the online nodes.
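The evaluation metric can be written down compactly (a sketch only; model.predict and node.model are our hypothetical interfaces, not part of the simulation code):

```python
def zero_one_loss(model, X_test, Y_test):
    """Proportion of misclassified examples in the test set (0-1 loss)."""
    errors = sum(1 for x, y in zip(X_test, Y_test) if model.predict(x) != y)
    return errors / len(Y_test)

def gossip_loss(online_nodes, X_test, Y_test):
    """For gossip learning: the 0-1 loss averaged over the online nodes' local models."""
    losses = [zero_one_loss(node.model, X_test, Y_test) for node in online_nodes]
    return sum(losses) / len(losses)
```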

Fig. 3. Federated learning and gossip learning with 100 (left) and 1000 (right) clients, long transfer time, no failures, with different subsampling probabilities. Minibatch stochastic gradient descent (SGD) is implemented by gossip learning with no merging (using mergeNone).

First, we compare the two aggregation algorithms for subsampled models in Algorithm 6 (Fig. 2) in the no-failure scenario. The results indicate a slight advantage of aggregateSubsampledImproved, although the performance depends on the dataset. In the following we apply aggregateSubsampledImproved as our implementation of method aggregate.

Fig. 4. Federated learning and gossip learning over the smartphone trace (left) and an artificial exponential trace (right), long transfer time, with different subsampling probabilities.

Fig. 5. Federated learning and gossip learning with no churn (left) and over the smartphone trace (right), short transfer time, with different subsampling probabilities.

Fig. 6. Federated learning and gossip learning with no churn (left) and over the smartphone trace (right), long transfer time, single class assignment, with different subsampling probabilities.

The comparison of the different algorithms and subsampling probabilities is shown in Fig. 3. The stochastic gradient descent (SGD) method is also shown, which was implemented by gossip learning with no merging (using mergeNone). Clearly, the parallel methods are all better than SGD. Also, it is very clear that subsampling helps both federated learning and gossip learning. However, gossip learning benefits much more from it. The reason is that in federated learning subsampling is applied only in the worker-to-master direction; the master sends the full model back to the workers [12]. In gossip learning, however, subsampling can be applied to all the messages.

Most importantly, gossip learning clearly outperforms federated learning in the case of high compression rates (low sampling probability) over two of the three datasets, and it is competitive on the remaining dataset as well. This was not expected, as gossip learning is fully decentralized, so aggregation is clearly delayed compared to federated learning. Indeed, with no compression, federated learning performs better. However, with high compression rates, the slower aggregation is compensated for by higher communication efficiency. Figure 3 also illustrates scaling. As we can see, the performance with 100 and 1000 nodes is practically identical for both algorithms.

Figure 4 contains our results with the churn trace. In the first hour, the two algorithms behave just like in the no-churn scenario. Over the longer run, however, federated learning clearly tolerates the churn better. This is because in federated learning nodes always work with the freshest possible models, which they receive from the master, even right after coming back online. In gossip learning, outdated models can temporarily participate in the optimization, albeit with a smaller weight. In this study we did not invest any effort into mitigating this effect, but outdated models could potentially be removed with more aggressive methods as well.

We also include an artificial trace scenario, where online session lengths are exponentially distributed with the same expected length (81 min) as in the smartphone trace. The offline session lengths are set so that, in expectation, 10% of the nodes are online during any given federated learning round, assuming no compression. This reproduces the setup of similar experiments in [13]. The results are similar to those over the smartphone trace, although the noise is larger for gossip learning, because the exponential model results in an unrealistically large variance in session lengths.

Figure 5 shows the convergence dynamics when we assume short transfer times (see Sect. 4.2). Clearly, the scenarios without churn result in the same dynamics (apart from a scaling factor) as the scenarios with long transfer time. The algorithms are somewhat more robust to churn in this case, since the nodes are more stable relative to message transfer time.

Figure 6 contains the results of our experiments with the single class assignment scenario, as described in Sect. 4.1. In this extreme scenario, the learning problem becomes much harder. Still, gossip learning remains competitive in the case of high compression rates.

5 Conclusions

Here, our goal was to compare federated learning and gossip learning in terms of efficiency, and we designed an experimental study to answer this question. We compared the convergence speed of the two approaches under the assumption that both methods fully utilize the available bandwidth, resulting in identical overall bandwidth consumption.

We found that in the case of uniform assignment, gossip learning is not only comparable to the centralized federated learning, but it even outperforms it under the highest compression rate settings. In every scenario we examined, gossip learning is at least comparable to federated learning. We add that this result relies on our experimental assumptions. For example, if one considers the download traffic to be essentially free in terms of bandwidth and time, then federated learning is more favorable. This, however, is not a sound assumption, because it hides the costs on the side of the master node. For this reason, we opted for modeling the download bandwidth as identical to the upload bandwidth, while still assuming infinite bandwidth at the master node.

As for future work, the most promising direction is the design and evaluation of more sophisticated compression techniques [5] for both federated and gossip learning. Also, in both cases, there is a lot of opportunity to optimize the communication pattern by introducing asynchrony to federated learning, or adding flow control to gossip learning [6].