Abstract
Increasingly stringent data privacy regulations limit the development of person re-identification (ReID) because person ReID training requires centralizing an enormous amount of data that contains sensitive personal information. To address this problem, we introduce federated person re-identification (FedReID)—applying federated learning, an emerging distributed training method, to person ReID. FedReID preserves data privacy by aggregating model updates, instead of raw data, from clients to a central server. Furthermore, we optimize the performance of FedReID under statistical heterogeneity via benchmark analysis. We first construct a benchmark with an enhanced algorithm, two architectures, and nine person ReID datasets with large variances to simulate the real-world statistical heterogeneity. The benchmark results present insights and bottlenecks of FedReID under statistical heterogeneity, including challenges in convergence and poor performance on datasets with large volumes. Based on these insights, we propose three optimization approaches: (1) we adopt knowledge distillation to facilitate the convergence of FedReID by better transferring knowledge from clients to the server, (2) we introduce client clustering to improve the performance of large datasets by aggregating clients with similar data distributions, and (3) we propose cosine distance weight to elevate performance by dynamically updating the weights for aggregation depending on how well models are trained in clients. Extensive experiments demonstrate that these approaches achieve satisfactory convergence with much better performance on all datasets. We believe that FedReID will shed light on implementing and optimizing federated learning in more computer vision applications.
1 INTRODUCTION
Person re-identification (ReID) aims to match the same person appearing in disjoint camera views. It has received considerable attention because of its wide applications in business and public security, such as customer trajectory analysis and criminal investigation [22]. Person ReID has achieved outstanding performance [4, 45, 54], attributed to the advances of deep neural networks (DNNs) [13, 21].
However, the increasing concerns of data privacy protection limit the development of person ReID [8]. DNN-based approaches are data-hungry, relying on centralizing a sizable amount of data to achieve high performance [58]. Training images of person ReID contain sensitive personal information, which could reveal the identity and location of individuals. Centralizing these images imposes potential privacy leakage risks. Hence, it is crucial to navigate the development of person ReID under the premise of privacy protection.
Federated learning (FL), an emerging distributed training technique, has empowered many applications with privacy-preserving mechanisms [18], such as healthcare applications [5, 41] and consumer products [36, 37]. FL preserves data privacy by training models collectively with decentralized clients. These clients, instead of transferring raw data, only transfer training updates to a central server. This reduces privacy leakage risks, as raw data are kept locally. Despite the advantages of FL, implementing FL to person ReID and optimizing its performance are largely overlooked; such implementation possibility is only mentioned in the work of Hao et al. [12], but their study does not present dataset or benchmark results.
In this work, we propose Federated Person Re-identification (FedReID), a new person ReID training paradigm that enables multimedia researchers to train models with privacy guaranteed. Besides privacy protection, FedReID possesses other advantages: reducing the communication overhead of uploading large amounts of data [35], adapting models in clients to local scenes, and obtaining a holistic model that generalizes in diverse scenarios. A usage example of FedReID is video surveillance across communities or districts, where multiple entities collaborate to learn a generalized model without revealing their private video surveillance data.
However, implementing FL to person ReID is not trivial—statistical heterogeneity is a major challenge of FedReID in real-world scenarios [25]: (1) data is in non-identical and independent distribution (non-IID) [56] because data collected from different cameras could have significant discrepancies in resolution, illumination, and angles, and (2) data volume is unbalanced with varied pedestrian flow in different locations. Although some studies illustrate that non-IID harms the training convergence and model performance in tasks like image classification [56], the impact of statistical heterogeneity on FedReID has not yet been explored.
This work aims to optimize FedReID under statistical heterogeneity via benchmark analysis. We start by constructing a new benchmark, FedReIDBench, with nine representative ReID datasets and a specially designed algorithm for FedReID (Section 3). In the benchmark, a server coordinates nine clients (each containing a dataset) to conduct training on their local data and aggregates training updates iteratively. We then conduct benchmark analysis (Section 4), revealing that statistical heterogeneity leads to performance degradation and difficulty in convergence. We end by proposing three performance optimization methods: client clustering (CC) (Section 5.1) and dynamic weight adjustment (Section 5.3) to elevate performance, and knowledge distillation (KD) (Section 5.2) to facilitate convergence. Specifically, CC groups clients with similar data distributions and aggregates training updates within each group. KD uses a public dataset to transfer knowledge from clients to the server more effectively. In addition, weight adjustment dynamically updates the weights of clients’ training updates in server aggregation. Extensive experiments demonstrate the effectiveness of the benchmark and the significance of the optimization approaches. We believe that FedReID will shed light on implementing and optimizing FL in more computer vision applications.
In summary, we make the following contributions:
We construct a new benchmark for FedReID, simulating real-world scenarios of statistical heterogeneity with nine representative person ReID datasets.
We provide useful insights and investigate potential bottlenecks of FedReID by analyzing the benchmark results.
We propose three performance optimization methods: KD to facilitate convergence, as well as CC and dynamic weight adjustment to elevate performance.
We extensively evaluate these optimization methods to demonstrate their effectiveness.
2 RELATED WORKS
2.1 Person ReID
The objective of person ReID is to retrieve the identity of interest from disjoint camera views. It is an important computer vision task that is widely applied in public security, such as video surveillance [58]. Advances in DNNs have greatly improved the performance of person ReID by learning better feature representations compared to traditional hand-crafted features [29, 31, 46, 52]. Over the years of development, the community has constructed many person ReID datasets [15, 28, 50, 57, 59]. These datasets are collected from various locations with different camera views. The majority of person ReID studies focus on extracting better feature representations by improving the architecture of DNNs [22, 54]. They rely on the assumption that data, collected from different cameras in various locations, can be centralized to a central server. However, centralizing large numbers of images of individuals raises potential privacy leakage risks. Different from previous approaches, we propose FedReID—a new training paradigm for ReID to learn ReID models from decentralized data. FedReID mitigates potential privacy leakage issues, as data is not transferred to a central server.
2.2 Federated Learning
FL is an emerging distributed training technique that trains models with decentralized clients coordinated by a central server [18].
Benchmarks. To facilitate the development of FL, researchers have published several benchmarks and datasets: LEAF [3] is the first benchmark for FL research, containing federated datasets for image classification and natural language processing tasks; Streets [34] is a real-world image dataset collected from street cameras for object detection; and OARF [17] is a benchmark that aims to facilitate a wide range of FL applications, such as trend prediction, recommendation, and sentiment analysis. However, different from these tasks, person ReID is a retrieval task in which no existing benchmark contains related datasets. In this work, we construct a new FL benchmark that simulates real-world scenarios of FedReID.
Algorithm. The best-known algorithm for FL is Federated Averaging (FedAvg) [35]. It defines an iterative training process in which clients send trained local models to a server and the server sends back the aggregated global model to clients. The benchmarks mentioned previously adopt FedAvg as the standard algorithm. However, FedAvg requires all clients to have identical models. It is not suitable for FedReID because clients could have varied classifiers. Therefore, we propose an enhanced algorithm, Federated Partial Averaging (FedPav).
Statistical heterogeneity. Statistical heterogeneity—non-IID and unbalanced data—is a major challenge of FL [18, 25]. In traditional distributed training [9, 44], data in multiple nodes of cloud clusters are IID. Data in multiple FL clients, however, could be heterogeneous. To address this challenge, some studies focus on optimizing training in clients [1, 19, 24, 26, 55], although recent work [55] requires extra communication by sharing features among clients. Other studies optimize the aggregation process in the server [47, 48, 61, 62]. In addition, several studies share voluntary or public data between the server and clients [53, 56]. These methods are validated on small datasets [3, 7, 20] and thus may not be directly applicable to the challenging scenario of FedReID. In this work, we introduce three optimization methods targeting the statistical heterogeneity of FedReID via in-depth benchmark analysis.
This work is an extension of our previous conference version [63]. The main improvements are as follows: (1) we introduce a new performance optimization method—CC; (2) we integrate CC with the previously proposed weight adjustment method, achieving the best performance; (3) we conduct more performance evaluations for comparison with the benchmark results and the proposed optimization methods; and (4) we provide more comprehensive descriptions for the proposed optimization methods. Although another work [51] also studied FedReID after our conference work [63], it focuses more on adapting to unseen domains, whereas we aim to address the statistical heterogeneity revealed from our benchmark analysis.
3 FEDERATED PERSON REID BENCHMARK
This section introduces a new FL benchmark for person ReID, FedReIDBench. This benchmark comprises nine representative datasets, two possible implementation architectures, one enhanced algorithm, and several performance evaluation metrics.
3.1 Datasets
We construct the benchmark dataset with nine representative person ReID datasets as shown in Table 1. It contains a total of 224,064 images of 17,991 identities. These datasets are collected at multiple locations (or countries) and published by different organizations at different times. They not only vary in the number of images, identities, and camera views but also differ in image resolution, illumination, and scenes.
| Datasets | # Cameras | Train # IDs | Train # Images | Query # IDs | Query # Images | Gallery # IDs | Gallery # Images |
|---|---|---|---|---|---|---|---|
| MSMT17 [50] | 15 | 1,041 | 32,621 | 3,060 | 11,659 | 3,060 | 82,161 |
| DukeMTMC-reID [59] | 8 | 702 | 16,522 | 702 | 2,228 | 1,110 | 17,611 |
| Market-1501 [57] | 6 | 751 | 12,936 | 750 | 3,368 | 751 | 19,732 |
| CUHK03-NP [30] | 2 | 767 | 7,365 | 700 | 1,400 | 700 | 5,332 |
| PRID2011 [15] | 2 | 285 | 3,744 | 100 | 100 | 649 | 649 |
| CUHK01 [28] | 2 | 485 | 1,940 | 486 | 972 | 486 | 972 |
| VIPeR [11] | 2 | 316 | 632 | 316 | 316 | 316 | 316 |
| 3DPeS [2] | 2 | 93 | 450 | 86 | 246 | 100 | 316 |
| iLIDS-VID [49] | 2 | 59 | 248 | 60 | 98 | 60 | 130 |

Note: These datasets have large variances in data volume, decreasing from top to bottom.
The variances in these datasets simulate the statistical heterogeneity in real-world scenarios: the disparity of data volumes represents the unbalanced data problem, and the domain discrepancies among datasets represent the non-IID problem. Unlike centralized training where data is IID, statistical heterogeneity makes training even more challenging.
3.2 Architectures
Figure 1(a) and (b) illustrate two architectures for possible implementation scenarios of FedReID: the edge-cloud architecture and the device-edge-cloud architecture. In both architectures, the cloud represents the central server connecting to multiple edges.
Edge-cloud architecture. In this architecture, cameras are the edges that directly connect with the server to conduct FL. The server coordinates these cameras to train models with locally collected images. This architecture significantly reduces privacy leakage risks, as the data always stays at the edges. However, deployment of this architecture requires cameras to have enough computation power and storage capability. A real-world application of this architecture would be video surveillance for a community with multiple cameras on different streets.
Device-edge-cloud architecture. This is a three-layer hierarchical architecture. Edge servers are in the middle layer. On the one hand, they construct local training datasets by gathering images from multiple camera views, which is similar to how datasets in the benchmark are collected. On the other hand, edge servers collaboratively perform FL with their local datasets under the coordination of the server. A good illustration of this architecture would be multiple communities collaborating to learn person ReID models, where each community has an edge server collecting data from multiple cameras.
3.3 Algorithm
The standard FL algorithm, FedAvg [35], is not suitable for FedReID because it requires identical model structures in all clients. The model structure of the benchmark is ID-discriminative embedding (IDE) [58], a common baseline for DNN-based person ReID. This model structure consists of a backbone and a classifier: the backbone is ResNet-50 [13] in our FedReIDBench; the classifier is a linear layer whose dimension depends on the number of identities in a client. Since the number of identities could vary among clients, their classifiers could also differ. Hence, we adopt an enhanced algorithm for FedReID: FedPav [63].
FedPav allows models in clients to be only partially identical. For FedReID, FedPav enables clients to use the same backbone but different identity classifiers for FL, as shown in Figure 1(c). The training process is similar to FedAvg except that clients only transfer the identical part of models to the central server for aggregation.
Algorithm 1 summarizes FedPav. We aim to obtain a holistic global model and personalized local models for clients at the end of the training. Each training round \(t\) of FedPav contains four steps: (1) distribution: the central server chooses a fraction (\(K\) out of \(N\)) of clients for the current round of training and distributes the global model \(w^t\) to these clients; (2) local training: each client \(k\) initializes the backbone \(w_k^{t}\) using the global model parameters and trains the model with a local dataset for \(E\) local epochs with batch size \(B\); (3) upload: each client \(k\) uploads the trained backbone \(w_k^{t+1}\) to the server; and (4) aggregation: the server generates a new global model \(w^{t+1}\) by aggregating updates from clients with weighted average. The training stops after iterating these four steps for \(T\) rounds. After training, we use the global model \(w\) to evaluate convergence and generalization, and use local models \(w_k\) to evaluate how well models adapt to local scenarios.
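The four steps above can be sketched in plain Python, with model parameters represented as dictionaries. The client interface (`num_samples`, `local_train`) is hypothetical and stands in for real on-device training; only the shared backbone is exchanged, never the client-specific classifiers.

```python
import random

def fedpav_round(global_backbone, clients, k):
    """One FedPav round: distribute, local training, upload, aggregation.

    `clients` is a list of objects with a `.num_samples` attribute and a
    `.local_train(backbone)` method returning an updated backbone
    (a dict of parameter name -> value). This interface is illustrative.
    """
    selected = random.sample(clients, k)                  # (1) distribution
    updates = [(c.num_samples, c.local_train(dict(global_backbone)))
               for c in selected]                         # (2)-(3) train & upload
    total = sum(n for n, _ in updates)
    # (4) aggregation: weighted average over the shared backbone only;
    # the client-specific identity classifiers never leave the clients.
    return {name: sum(n / total * w[name] for n, w in updates)
            for name in global_backbone}
```

With two toy clients holding 1 and 3 samples, the aggregated parameter is the 1:3 weighted average of their updates, mirroring the \(\frac{n_k}{n}\) weights in server aggregation.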
3.4 Performance Evaluation Metrics
We evaluate FedReID in two aspects: accuracy and communication cost.
Accuracy. The cumulative matching characteristics curve and mean Average Precision (mAP) [58] are standard person ReID evaluation metrics. Given an image as a query, person ReID matches it in a gallery of images based on similarity. Cumulative matching characteristics measures the probability that the query identity is in the top-\(k\) most similar matched gallery images. We consider \(k=\lbrace 1, 5, 10\rbrace\) in the benchmark, representing the rank-1 accuracy, rank-5 accuracy, and rank-10 accuracy. In addition, we report the mAP of all queries.
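As an illustration of these metrics, below is a minimal sketch of rank-\(k\) accuracy and per-query average precision (the quantity averaged over queries to obtain mAP), assuming the gallery has already been ranked by similarity to each query.

```python
def rank_k_accuracy(first_match_ranks, ks=(1, 5, 10)):
    """CMC rank-k: fraction of queries whose correct identity appears in
    the top-k ranked gallery images. `first_match_ranks[q]` is the
    1-indexed rank of query q's first correct gallery match."""
    n = len(first_match_ranks)
    return {k: sum(rank <= k for rank in first_match_ranks) / n for k in ks}

def average_precision(hit_flags):
    """AP for one query: `hit_flags[i]` is True if the gallery image at
    rank i+1 shares the query identity. mAP is the mean over queries."""
    hits, precisions = 0, []
    for i, hit in enumerate(hit_flags):
        if hit:
            hits += 1
            precisions.append(hits / (i + 1))
    return sum(precisions) / len(precisions) if precisions else 0.0
```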
Communication cost. Since FL requires iterative communication between a server and multiple clients, we also consider the communication costs. The total communication cost is \(T \times 2 \times M\), where \(T\) is the number of communication rounds and \(M\) is the transmission message size (model size). \(2\, \times \, M\) is the communication cost of each round, considering both uploading and downloading from clients.
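A minimal helper for this cost calculation follows; the ~100 MB message size in the example is an illustrative figure for a ResNet-50-sized model, not a measured value from our experiments.

```python
def total_comm_cost_mb(rounds, model_size_mb):
    """Total FedReID communication cost: T rounds x 2 x M, where the
    factor of 2 accounts for both upload and download per client."""
    return rounds * 2 * model_size_mb

# e.g., 300 rounds with an illustrative ~100 MB backbone:
# 300 * 2 * 100 = 60,000 MB transferred per participating client.
```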
3.5 Reference Implementation
To facilitate ease of use and reproducibility, we open-source a reference implementation on GitHub.1 It includes data preprocessing, the proposed algorithm, and the optimization methods. We plan to integrate it into EasyFL [60] in the future. In addition, we provide the experimental settings as follows.
Learning rate. The initial learning rates differ for the identity classifier and the backbone: 0.05 for the identity classifier and 0.005 for the backbone. Both use the same learning rate scheduler, with step size 40 and gamma 0.1. In addition, the learning rate for server fine-tuning in KD is 0.0005.
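The step schedule described above can be expressed as a simple function; this is a sketch of the standard step-decay rule with the stated hyperparameters, not the exact implementation in our code.

```python
def lr_at_step(base_lr, step, step_size=40, gamma=0.1):
    """Step-decay schedule used in the benchmark: scale the learning
    rate by `gamma` every `step_size` steps (step size 40, gamma 0.1)."""
    return base_lr * gamma ** (step // step_size)

# Classifier starts at 0.05 and backbone at 0.005; after 40 steps,
# both are scaled by 0.1 (0.05 -> 0.005, 0.005 -> 0.0005).
```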
Optimizer. We use Stochastic Gradient Descent (SGD) as the optimizer, with weight decay 5e-4 and momentum 0.9.
FL settings. The default settings of FL algorithms are as follows: batch size \(B = 32\), local epoch \(E = 1\), and total training rounds \(T = 300\).
4 BENCHMARK ANALYSIS
In this section, we present the results of extensive experiments on the benchmark. We investigate the performance of two architectures, the impact of different federated settings, and the impact of statistical heterogeneity.
We initialize the backbone with ResNet-50 [13] parameters pre-trained on ImageNet [10]. For hyperparameters, we use batch size \(B = 32\) and local epoch \(E = 1\) to train \(T = 300\) communication rounds by default.
4.1 Edge-Cloud Architecture
In the edge-cloud architecture, each camera is a client. Since each person ReID dataset contains data from several camera views, we simulate FedReID in this architecture by assigning data of the same camera view to one client. As a dataset is divided into several clients by camera views, we term it the federated-by-camera scenario.
To understand FedReID performance in the federated-by-camera scenario, we compare it with two other settings. The first is the federated-by-identity scenario: we divide one dataset into partitions for multiple clients, where each client includes one partition that contains an equal number of identities. The number of clients equals the number of camera views. The second is centralized training: training with data merged from multiple cameras, which can be considered as the upper bound. For example, the Market-1501 dataset [57] contains six camera views with 751 identities. In the federated-by-identity scenario, we divide it into six clients, where each client includes 125 non-overlapping identities. The centralized training means training with the Market-1501 dataset.
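A minimal sketch of the federated-by-identity partitioning, assuming the dataset is a list of (image, identity) pairs; the round-robin assignment used here is one simple way to give each client an (almost) equal number of non-overlapping identities.

```python
def partition_by_identity(samples, num_clients):
    """Split a dataset into federated-by-identity partitions: each client
    receives a non-overlapping, near-equal share of the identities.
    `samples` is a list of (image, person_id) pairs."""
    ids = sorted({pid for _, pid in samples})
    shard = {pid: i % num_clients for i, pid in enumerate(ids)}  # round-robin
    clients = [[] for _ in range(num_clients)]
    for img, pid in samples:
        clients[shard[pid]].append((img, pid))
    return clients
```

For Market-1501, `partition_by_identity(samples, 6)` would give each of the six clients roughly 125 of the 751 training identities, matching the setup described above.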
Table 2 presents the comparisons of global models of different settings on two datasets: CUHK03-NP [30] and Market-1501 [57]. Compared with the federated-by-identity scenario or centralized training, the federated-by-camera scenario performs much worse. This indicates that learning from only one camera view is infeasible to obtain a generalized model in person ReID, where the evaluation is based on images from multiple camera views. Hence, even though industrial cameras have enough computation and storage capacity to support edge-cloud architecture, the device-edge-cloud architecture could be more adequate for FedReID because each client learns cross-camera knowledge. All other experiments in the article are conducted based on the device-edge-cloud architecture.
| Dataset | # Clients | Settings | Rank-1 | Rank-5 | Rank-10 | mAP |
|---|---|---|---|---|---|---|
| CUHK03-NP | 2 | Federated-by-camera | 11.21 | 19.14 | 25.71 | 11.11 |
| | | Federated-by-identity | 51.71 | 69.50 | 76.79 | 47.39 |
| | | Centralized training | 49.29 | 68.86 | 76.57 | 44.52 |
| Market-1501 | 6 | Federated-by-camera | 61.13 | 74.88 | 80.55 | 36.57 |
| | | Federated-by-identity | 85.69 | 93.44 | 95.81 | 66.36 |
| | | Centralized training | 88.93 | 95.34 | 96.88 | 72.62 |

Note: The federated-by-camera scenario achieves the worst performance, indicating that edge-cloud architecture could be inadequate for FedReID.
4.2 Device-Edge-Cloud Architecture
In the device-edge-cloud architecture, edge servers collect data from multiple cameras and conduct FedReID with a central server. Since each of the benchmark datasets consists of data from multiple camera views, we simulate this scenario with nine clients—each client contains one unique dataset of the benchmark datasets. In all experiments, we choose nine clients to participate in training.
Under this architecture, we consider two types of models produced from FedReID training: (1) local model: the specialized models trained after \(E\) local epochs in clients before uploading to the server in each training round, and (2) global model: the generalized model obtained in the server by aggregating models uploaded from clients.
To understand the performance of FedReID, we compare global and local models with the other two models: (1) standalone training: the models trained in clients with their own dataset (without participating in FL), and (2) centralized training: the model trained using the combination of all benchmark datasets, simulating conventional person ReID training that centralizes datasets. Centralized training can be treated as the upper bound of FedReID, whereas FedReID is meaningful for a client only when the performance of global or local models is better than standalone training.
4.2.1 Impact of Federated Settings.
We first investigate the performance of FedReID (the global model) using the FedPav algorithm under different federated settings, including batch size \(B\) and local epochs \(E\).
Batch size reflects the trade-off between computation power consumption and model accuracy. With the same local epochs, a larger batch size reduces computation time because the training can better take advantage of the parallelism provided by the client hardware. (Computation is fully utilized as long as \(B\) is large enough.) Figure 2(a) compares the rank-1 accuracy of FedPav using different batch sizes \(B = \lbrace 32, 64, 128\rbrace\), under the setting that local epochs \(E = 1\) and communication rounds \(T = 300\). Smaller batch size generally achieves better performance in most datasets but consumes higher computation.
Local epochs reflect the trade-off between the communication cost and model accuracy. The total number of training epochs \(E_{total}\) can be calculated as \(E_{total} = T \times E\), where \(T\) is the number of communication rounds and \(E\) is the number of local epochs. Fixing the total training epochs for a fair comparison, a smaller \(E\) means more communication rounds \(T\), requiring higher communication costs. We compare the rank-1 accuracy of different numbers of local epochs in Figure 2(b). Although \(E = 5\) performs worse than \(E = 10\) in several datasets, smaller numbers of local epochs \(E\) generally result in better performance. The smallest setting, \(E = 1\), achieves much better performance than \(E = 5\) and \(E = 10\) in all datasets, but it requires the highest communication cost, illustrating the trade-off between communication costs and model accuracy.
4.2.2 Impact of Statistical Heterogeneity.
The statistical heterogeneity hinders the convergence and performance of FedReID. Specifically, non-IID causes difficulty in convergence, and both non-IID and unbalanced data limit the performance of FedReID.
Figure 4 (presented later) shows that FedPav does not converge well: the accuracy of the global model fluctuates throughout training. We argue that this is mainly due to the non-IID data of the nine clients. As the datasets in clients have domain discrepancies (e.g., illumination, resolution, scenes), aggregating them simply by weighted average leads to unstable and unpredictable results. Consequently, it is difficult to select a representative global model for other scenarios. We report the accuracy by averaging the three best global models throughout training, evaluated every 10 rounds.
Furthermore, Table 3 compares the performance of the global and local models obtained from FedReID with standalone and centralized training. The results are twofold: on the one hand, standalone training outperforms both the global and local models in large datasets such as DukeMTMC-reID [59] and CUHK03-NP [30]; on the other hand, both the global and local models outperform standalone training in small datasets such as VIPeR [11] and 3DPeS [2], and even outperform centralized training in the iLIDS-VID dataset [49]. These results indicate that although clients with larger datasets do not benefit from FedReID, the ones with smaller datasets gain significant improvement. We interpret the results from two perspectives: (1) for clients with large datasets, they dominate in server aggregation as the weights for aggregation are positively correlated with data volumes, causing less gain from others; (2) for clients with small datasets, they learn from other clients more effectively because their models are not well trained.
Method | MSMT17 | DukeMTMC | Market | CUHK03 | PRID2011 | CUHK01 | VIPeR | 3DPeS | iLIDS-VID |
---|---|---|---|---|---|---|---|---|---|
Centralized training | 54.6 | 84.2 | 91.7 | 64.0 | 80.0 | 89.7 | 65.5 | 82.1 | 80.6 |
Standalone training | 49.6 | 80.1 | 88.9 | 49.3 | 55.0 | 69.0 | 27.5 | 65.4 | 52.0 |
Global model | 41.0 | 74.3 | 83.4 | 31.7 | 37.7 | 73.4 | 48.1 | 69.2 | 79.9 |
Local model | 48.3 | 78.1 | 83.6 | 39.5 | 50.7 | 80.7 | 52.0 | 80.6 | 84.7 |
Another observation from Table 3 is that local models outperform the global model in all datasets. As the global model is produced by aggregating local models, we argue that non-IID data causes performance degradation in the server aggregation. Better aggregation methods can be considered to better transfer knowledge from local models to the global model.
5 PERFORMANCE OPTIMIZATION
In this section, we first propose three methods to address the problems caused by statistical heterogeneity: CC, KD, and dynamic weight adjustment. Then, we present experimental results of these optimization methods compared with standalone training and the benchmark results.
5.1 Client Clustering
To tackle the performance degradation caused by non-IID data in server aggregation over all clients, we propose to aggregate clients with similar data distributions. As discussed in Section 4.2.2, local models outperform the global model in all datasets. Since the global model is obtained by aggregating local models, the performance drop mainly stems from aggregating clients with diverse data distributions. To tackle this problem, we propose CC to split clients into several groups based on their data distributions and aggregate models within each group in the server.
Figure 3(a) depicts the process of CC with the following steps: (1) we extract features \(f_k\) from one batch of data (32 samples) of a public person ReID dataset2 using the trained model \(w_k\) from client \(k\); (2) we adopt a clustering algorithm to cluster these features into multiple groups; (3) we aggregate models of clients within each group, obtaining a global model in each group; and (4) we use the global model of each group to update local models of clients within that group for the next training round. In Figure 3(a), we cluster clients into two groups: one group contains clients {1, 4} and another one contains clients {2, 3, 5}, based on their features \(f\). Then, we aggregate \(w_1\) and \(w_4\) to obtain global model \(w_{c1}\), and we aggregate \(w_2\), \(w_3\), and \(w_5\) to obtain \(w_{c2}\). At the start of the next training round, we update local models of clients {1, 4} with \(w_{c1}\) and local models of clients {2, 3, 5} with \(w_{c2}\). CC obtains multiple global models after training, so we focus on evaluating personalized local models \(w_k\) of each client \(k\).
In this way, we use the features as a proxy to measure the similarity of data distributions among clients. The intuition behind CC is that the clients clustered into the same group share more similar data distributions. The choice of the clustering algorithm is important for the overall performance of this method. We utilize a hierarchical clustering algorithm, FINCH [39], to cluster clients based on similarities of features extracted from their models. Regarding each client as a cluster at the start, we group the clients that are first neighbors; two clients are first neighbors if their features are closest to each other in cosine distance or they share the same first neighbor. FINCH merges first neighbors in each clustering step. In our scenario, since nine clients would be merged into one cluster after two to three clustering steps, we only cluster for one step per communication round. As a result, the server would have two to three clusters, where each cluster contains two to seven clients. FINCH is able to deliver good clustering results without prior knowledge of the targeted number of clusters.
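A simplified sketch of one first-neighbor clustering step in this spirit (not the full FINCH implementation): each client links to its nearest neighbor by cosine similarity, and the connected components of the resulting graph form the clusters. Links between clients sharing the same first neighbor follow transitively through that neighbor.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def first_neighbor_step(features):
    """One first-neighbor clustering step over client feature vectors.
    Returns clusters as lists of client indices (connected components
    of the first-neighbor graph)."""
    n = len(features)
    nn = [max((j for j in range(n) if j != i),
              key=lambda j: cosine_sim(features[i], features[j]))
          for i in range(n)]
    # Union-find over the links i -- nn(i); clients sharing a first
    # neighbor end up in the same component via that neighbor.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        parent[find(i)] = find(nn[i])
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```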
5.2 Knowledge Distillation
Besides CC, we adopt KD to elevate performance and improve the convergence of FedReID. Since local models outperform the global model, this suggests that local models contain more knowledge than the global model—simple server aggregation could not effectively aggregate knowledge from local models. KD is a method proposed by Hinton et al. [14] to transfer knowledge from a teacher model to a student model, where the teacher model contains more knowledge than the student model. We adopt KD to better transfer knowledge from local models to the global model, regarding clients as teachers and the server as the student.
After clients finish local training and upload models, we apply KD with a public shared dataset \(\mathcal {D}_{shared}\) in the server. Figure 3(b) illustrates the additional steps required from KD. In the first step, the server uses each trained model \(w_k\) of client \(k\) to generate soft labels3 \(\ell _k\) using samples of \(\mathcal {D}_{shared}\). These soft labels represent the knowledge of clients’ models. In the second step, apart from model aggregation, the server aggregates these soft labels with \(\ell = \frac{1}{K} \sum _{k \in S_t} \ell _k\). In the third step, the server fine-tunes the global model with \(\mathcal {D}_{shared}\) and corresponding labels \(\ell\) to learn the distilled knowledge.
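The soft-label aggregation in the second step can be sketched as an element-wise average over clients' per-sample logit vectors; the fine-tuning in the third step is then ordinary supervised training on \(\mathcal {D}_{shared}\) with these averaged labels as targets.

```python
def aggregate_soft_labels(client_soft_labels):
    """Server-side step (2): l = (1/K) * sum_k l_k.
    `client_soft_labels[k][s]` is the logit vector produced by client
    k's model for sample s of the shared public dataset."""
    num_clients = len(client_soft_labels)
    num_samples = len(client_soft_labels[0])
    return [
        [sum(client[s][d] for client in client_soft_labels) / num_clients
         for d in range(len(client_soft_labels[0][s]))]
        for s in range(num_samples)
    ]
```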
5.3 Weight Adjustment
In addition to tackling the performance degradation caused by non-IID data, we propose to dynamically update the weights for aggregation to curb the adverse effect of unbalanced data. As discussed in Section 4.2.2, the weights of server aggregation are inappropriate. The formula for server aggregation [35, 63] is \(w^{t+1} = \sum _{k \in S_t} \frac{n_k}{n} w^{t+1}_k\), where \(n\) is the total data volume and \(n_k\) is the data volume of client \(k\). The weights of local models depend on the data volume of clients—larger datasets lead to larger weights. Since data volumes have large discrepancies among datasets, large datasets dominate the server aggregation. For example, the weight of the largest dataset (MSMT17 [50]) is around 40%, whereas the weight of the smallest dataset (iLIDS-VID [49]) is only 0.3%. Models from smaller datasets are almost negligible in aggregation. Such unbalanced data volumes prevent clients with large datasets from effectively learning from others. Hence, we introduce a novel weight adjustment method to obtain more suitable weights for the weighted average in aggregation.
Cosine distance weight. We introduce cosine distance weight (CDW) to substitute for the data-volume weights. CDW adjusts the aggregation weights dynamically in each round based on how well models are trained in clients, measured by the change in the logits extracted from a model before and after local training; the change is quantified by cosine distance. In particular, in each training round, client \(k\) downloads the global model \(w^t_k\) from the server and trains it to obtain a new local model \(w^{t+1}_k\). Figure 3(c) demonstrates how we calculate the new weight from \(w^t_k\) and \(w^{t+1}_k\), with the following steps: (1) client \(k\) extracts logits \(g^t_k\) on a random batch of data \(\mathcal {D}_{batch}\) using \((w^t_k, v^t_k)\)4; (2) client \(k\) obtains a new local model \((w^{t+1}_k, v^{t+1}_k)\) after local training; (3) client \(k\) extracts logits \(g^{t+1}_k\) on \(\mathcal {D}_{batch}\) using \((w^{t+1}_k, v^{t+1}_k)\); and (4) we calculate the cosine distance between the two logits \(g^{t}_k\) and \(g^{t+1}_k\) with the following formula: (1) \(\begin{equation} d^{t+1}_k = 1 - \frac{g^{t}_k \cdot g^{t+1}_k}{\left\Vert g^{t}_k \right\Vert \left\Vert g^{t+1}_k \right\Vert }, \end{equation}\) where the cosine distance \(d^{t+1}_k\) of each client \(k\) is pushed to the server. The server then computes the new weight: (2) \(\begin{equation} p^{t+1}_k = \frac{d^{t+1}_k}{\sum _{k \in S_t} d^{t+1}_k}, \end{equation}\) and uses \(p^{t+1}_k\) to replace \(\frac{n_k}{n}\) in aggregation. The intuition of CDW is that clients whose local training is more effective should contribute more to the aggregation; the cosine distance \(d^{t+1}_k\) measures the scale of the change that local training makes in updating \(w^t_k\) to \(w^{t+1}_k\).
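Equations (1) and (2) reduce to a few lines of code. The sketch below is illustrative only; it assumes the per-client logits have already been extracted and flattened into vectors.

```python
import numpy as np

def cosine_distance(g_old, g_new):
    # Equation (1): d = 1 - cos(g_old, g_new), on flattened logit vectors.
    g_old, g_new = np.ravel(g_old), np.ravel(g_new)
    cos = g_old @ g_new / (np.linalg.norm(g_old) * np.linalg.norm(g_new))
    return 1.0 - cos

def cdw_weights(distances):
    # Equation (2): normalize per-client distances into aggregation weights.
    d = np.asarray(distances, dtype=float)
    return d / d.sum()
```

A client whose logits changed more during local training receives a larger distance \(d_k\) and therefore a larger aggregation weight \(p_k\); the weights always sum to 1.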
5.4 Combinations of Optimization Methods
We can achieve even better performance by combining the three optimization methods: CC, KD, and CDW. We consider only two combinations: CDW with CC, and CDW with KD.
It is not desirable to combine CC and KD because both enhance server aggregation: KD fine-tunes a single global model, whereas CC maintains multiple global models. Moreover, both methods address the non-IID problem, but with different aims: KD further improves the global model, whereas CC elevates the performance of local models. Hence, we do not consider this combination.
Since CDW tackles unbalanced data volume, either the combination of CDW and CC or the combination of CDW and KD addresses statistical heterogeneity with non-IID and unbalanced data problems. To combine CDW with CC or KD, we just need to replace the original weights in the server aggregation process with the new weights. As CC has no single global model, combining it with CDW aims to achieve better local models; as KD further fine-tunes the global model, combining it with CDW aims to achieve a better global model. We summarize these two combinations in Algorithms 2 and 3.
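As described above, plugging CDW into CC or KD only swaps the weights used in the server's weighted average. A minimal sketch, under the assumption that models are represented as dicts of NumPy parameter arrays (the parameter names and helper functions below are illustrative, not the paper's API):

```python
import numpy as np

def aggregate(models, weights):
    """Weighted server aggregation: w^{t+1} = sum_k p_k * w_k^{t+1}.
    `models` is a list of per-client parameter dicts; `weights` are either
    the FedAvg data-volume weights n_k / n or the CDW weights p_k."""
    assert abs(sum(weights) - 1.0) < 1e-9  # weights must form a convex combination
    agg = {}
    for name in models[0]:
        agg[name] = sum(p * m[name] for p, m in zip(weights, models))
    return agg

def fedavg_weights(data_volumes):
    # Original weighting by data volume: n_k / n.
    n = sum(data_volumes)
    return [n_k / n for n_k in data_volumes]
```

Combining CDW with CC then means calling `aggregate` once per cluster with that cluster's CDW weights; combining CDW with KD means aggregating with CDW weights before the server fine-tunes the result on the shared dataset.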
5.5 Evaluation
We present the empirical evaluation of these performance optimization approaches compared with the benchmark and standalone training. By default, we conduct these experiments with batch size \(B = 32\) and local epoch \(E = 1\). For both CC and KD, we adopt an additional unlabeled dataset—CUHK02 [27]. This dataset is regarded as a public dataset that is shareable among clients and the server. The CUHK02 dataset is an extension of the CUHK01 dataset. It includes 7,264 images of 1,816 identities collected from six camera views.
We first evaluate the effectiveness of KD and the combination of CDW and KD by monitoring performance changes of global models as training proceeds. Figure 4 shows the performance changes (either rank-1 accuracy or mAP) of KD, the combination of CDW and KD, and the benchmark results on eight datasets. Compared with the benchmark results, training with KD achieves much better convergence; KD can also lead to higher performance, especially when datasets in clients share similar data distributions with the public shared dataset. For example, we use the CUHK02 dataset as the shared dataset, so the accuracy of the global models on the CUHK03-NP and CUHK01 datasets is better than the benchmark results. Moreover, training with the combination of KD and CDW achieves outstanding performance on almost all datasets—better than the benchmark results or training with KD. These results indicate that the combination of KD and CDW is able to obtain the best generalized global model that is transferable to other scenarios.
Next, we evaluate the effectiveness of CC, CDW, and their combination by comparing the performance of their local models. Table 4 shows the increase in rank-1 accuracy of several methods compared with standalone training on nine datasets. Although FedNova [48] and FedProx [26] slightly improve the performance of the smallest dataset (iLIDS-VID), they are, like our benchmark method, incapable of elevating the performance of large datasets. We further analyze the results from three perspectives. First, CC effectively mitigates the drawback of the benchmark, improving the performance on larger datasets such as MSMT17 [50], because clustering larger and smaller datasets into different groups reduces the dominance of the former in aggregation. Most of the time, CC creates two clusters: one contains clients with the PRID2011, CUHK03-NP, VIPeR, 3DPeS, and iLIDS-VID datasets; the other contains clients with the MSMT17, DukeMTMC-reID, Market-1501, and CUHK01 datasets. Second, CDW outperforms standalone training on all datasets, indicating that CDW effectively addresses the unbalanced data problem such that all clients benefit from participating in FL. Third, the combination of CDW and CC further elevates the performance on most datasets. Although this combination causes slight decreases on smaller datasets compared with CDW alone, it significantly improves the performance of larger datasets, increasing the motivation of clients with larger datasets to participate in FL.
| Methods | MSMT17 | DukeMTMC | Market | CUHK03-NP | PRID2011 | CUHK01 | VIPeR | 3DPeS | iLIDS-VID |
|---|---|---|---|---|---|---|---|---|---|
| Benchmark | –1.3 | –2.0 | –5.4 | –9.8 | –4.3 | +11.6 | +24.5 | +15.2 | +32.7 |
| FedNova [48] | –2.1 | –2.8 | –4.4 | –14.6 | 0.0 | +9.9 | +24.4 | +12.6 | +35.8 |
| FedProx [26] | –0.1 | –1.6 | +1.0 | –6.4 | –1.0 | +12.5 | +24.1 | +7.7 | +34.7 |
| CC | +2.4 | –1.3 | +0.1 | +3.9 | +6.0 | +9.3 | +4.1 | –1.2 | +16.3 |
| CDW | +4.0 | +1.3 | +1.4 | +1.2 | +7.3 | +13.8 | +26.0 | +16.3 | +30.3 |
| CC & CDW | +4.1 | +3.8 | +2.0 | +2.2 | +13.0 | +6.1 | +28.2 | +6.5 | +28.6 |
Note: CC effectively improves the performance on larger datasets, and CDW effectively elevates the performance on all datasets. In addition, the combination of CC and CDW achieves the best overall performance, especially on the larger datasets. These experiments were run with batch size \(B = 32\) and local epoch \(E = 1\).
Last, we demonstrate the generalization ability of our methods by comparing with existing methods on the CAVIAR4REID [6] and GRID [33] datasets. Specifically, we compare with the unsupervised cross-domain fine-tuning methods DSTML [16] and UMDL [38]; the unsupervised generalization methods CrossGrad [40], MLDG [23], SSDAL [43], and DIMN [42]; and recent work [51]. For evaluation on CAVIAR4REID, we follow the work of Liu et al. [32] and Peng et al. [38] and randomly select 36 identities that appear in two camera views. The GRID dataset contains 250 identities from two camera views. For both datasets, we use images of one camera view as the query and the other as the gallery. Table 5 shows that our proposed FedReID with optimizations (CC & CDW and KD & CDW) outperforms all existing methods in rank-1 accuracy on both datasets; KD & CDW achieves especially good performance. Note that we do not fine-tune the trained models on these two evaluation datasets. These results further illustrate the significance of our methods.
| Datasets | DSTML | UMDL | CrossGrad | MLDG | SSDAL | DIMN | Decentralized [51] | CC & CDW | KD & CDW |
|---|---|---|---|---|---|---|---|---|---|
| CAVIAR [6] | 28.2 | 41.6 | – | – | – | – | 45.6 | 46.8 | 53.2 |
| GRID [33] | – | – | 9.0 | 15.8 | 22.4 | 29.3 | 24.2 | 30.0 | 36.8 |

DSTML through Decentralized [51] are existing methods trained without privacy protection (except [51]); CC & CDW and KD & CDW are our methods with privacy protection.
Note: Our trained models outperform the existing methods on both datasets without extra fine-tuning. These results demonstrate the generalization ability of our methods.
6 CONCLUSION
In this article, we presented FedReID, a new paradigm of person ReID training with decentralized data. To investigate the challenges of FedReID, we constructed a new benchmark to simulate real-world scenarios. Based on the results and insights from benchmark analysis, we proposed three optimization approaches to elevate performance: CC and KD to address the non-IID problem and CDW to address the unbalanced data problem. Empirical results demonstrated that, among all methods, the combination of CDW and CC achieves the best local models and the combination of CDW and KD achieves the best global model. In the future, we plan to investigate the system heterogeneity challenges among clients. We also plan to extend FedReID to support unsupervised learning.
Footnotes
1 https://github.com/cap-ntu/FedReID.
2 The public person ReID dataset is shareable among the server and clients. This dataset can be unlabeled.
3 These labels are termed soft labels, as they are the predicted labels, not the actual labels, of the dataset.
4 \((w^t_k, v^t_k)\) is the concatenation of global model \(w^t_k\) and local classifier \(v^t_k\).
References
- [1] 2020. Federated learning based on dynamic regularization. In Proceedings of the International Conference on Learning Representations.
- [2] 2011. 3DPeS: 3D people dataset for surveillance and forensics. In Proceedings of the 2011 Joint ACM Workshop on Human Gesture and Behavior Understanding (J-HGBU'11). ACM, New York, NY, 59–64.
- [3] 2018. LEAF: A benchmark for federated settings. CoRR abs/1812.01097 (2018). http://arxiv.org/abs/1812.01097.
- [4] 2019. ABD-Net: Attentive but diverse person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 8351–8361.
- [5] 2020. FedHealth: A federated transfer learning framework for wearable healthcare. IEEE Intelligent Systems 35 (2020), 83–93.
- [6] 2011. Custom pictorial structures for re-identification. In Proceedings of the British Machine Vision Conference (BMVC'11).
- [7] 2017. EMNIST: Extending MNIST to handwritten letters. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN'17). IEEE, Los Alamitos, CA, 2921–2926.
- [8] 2019. EU Personal Data Protection in Policy and Practice. Springer.
- [9] 2012. Large scale distributed deep networks. In Advances in Neural Information Processing Systems 25. Curran Associates, 1223–1231. http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf.
- [10] 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition.
- [11] 2008. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In Proceedings of the European Conference on Computer Vision. 262–275.
- [12] 2018. Edge AIBench: Towards comprehensive end-to-end edge computing benchmarking. In Proceedings of the 2018 BenchCouncil International Symposium on Benchmarking, Measuring, and Optimizing.
- [13] 2016. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'16). 770–778.
- [14] 2015. Distilling the knowledge in a neural network. In Proceedings of the NIPS Deep Learning and Representation Learning Workshop. http://arxiv.org/abs/1503.02531.
- [15] 2011. Person re-identification by descriptive and discriminative classification. In Proceedings of the Scandinavian Conference on Image Analysis (SCIA'11). 91–102.
- [16] 2015. Deep transfer metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 325–333.
- [17] 2020. The OARF benchmark suite: Characterization and implications for federated learning systems. arXiv preprint arXiv:2006.07856 (2020).
- [18] 2019. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977 (2019).
- [19] 2020. SCAFFOLD: Stochastic controlled averaging for federated learning. In Proceedings of the International Conference on Machine Learning. 5132–5143.
- [20] 2009. Learning Multiple Layers of Features from Tiny Images. Technical Report. University of Toronto, Toronto, Ontario.
- [21] 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (2012), 1097–1105.
- [22] 2019. A survey of open-world person re-identification. IEEE Transactions on Circuits and Systems for Video Technology 30, 4 (2019), 1092–1108.
- [23] 2018. Learning to generalize: Meta-learning for domain generalization. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
- [24] 2021. Model-contrastive federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10713–10722.
- [25] 2020. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine 37 (2020), 50–60.
- [26] 2020. Federated optimization in heterogeneous networks. In Proceedings of the 3rd Machine Learning and Systems Conference (MLSys'20). 429–450.
- [27] 2013. Locally aligned feature transforms across views. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition. 3594–3601.
- [28] 2012. Human reidentification with transferred metric learning. In Computer Vision—ACCV 2012. Lecture Notes in Computer Science, Vol. 7724. Springer, 31–44.
- [29] 2014. DeepReID: Deep filter pairing neural network for person re-identification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'14). 152–159.
- [30] 2014. DeepReID: Deep filter pairing neural network for person re-identification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'14). 152–159.
- [31] 2016. Multi-scale triplet CNN for person re-identification. In Proceedings of the 24th ACM International Conference on Multimedia (MM'16). ACM, New York, NY, 192–196.
- [32] 2014. Semi-supervised coupled dictionary learning for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3550–3557.
- [33] 2013. Person re-identification by manifold ranking. In Proceedings of the 2013 IEEE International Conference on Image Processing. IEEE, Los Alamitos, CA, 3567–3571.
- [34] 2019. Real-world image datasets for federated learning. arXiv:1910.11089 [cs.CV] (2019).
- [35] 2017. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS'17). 1273–1282. http://proceedings.mlr.press/v54/mcmahan17a.html.
- [36] 2020. FedFast: Going beyond average for faster training of federated recommender systems. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1234–1242.
- [37] 2020. Billion-scale federated learning on mobile clients: A submodel design with tunable privacy. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking. 1–14.
- [38] 2016. Unsupervised cross-dataset transfer learning for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1306–1315.
- [39] 2019. Efficient parameter-free clustering using first neighbor relations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8934–8943.
- [40] 2018. Generalizing across domains via cross-gradient training. arXiv preprint arXiv:1804.10745 (2018).
- [41] 2018. Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation. In Proceedings of the International MICCAI Brain Lesion Workshop (BrainLes'18). 92–104.
- [42] 2019. Generalizable person re-identification by domain-invariant mapping network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 719–728.
- [43] 2016. Deep attributes driven multi-camera person re-identification. In Proceedings of the European Conference on Computer Vision. 475–491.
- [44] 2022. GradientFlow: Optimizing network performance for large-scale distributed DNN training. IEEE Transactions on Big Data 8, 2 (2022), 495–507.
- [45] 2018. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV'18). 480–496.
- [46] 2018. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia (MM'18). ACM, New York, NY, 274–282.
- [47] 2020. Federated learning with matched averaging. In Proceedings of the International Conference on Learning Representations. https://openreview.net/forum?id=BkluqlSFDS.
- [48] 2020. Tackling the objective inconsistency problem in heterogeneous federated optimization. arXiv preprint arXiv:2007.07481 (2020).
- [49] 2014. Person re-identification by video ranking. In Computer Vision—ECCV 2014. Springer International Publishing, Cham, Switzerland, 688–703.
- [50] 2018. Person transfer GAN to bridge domain gap for person re-identification. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 79–88.
- [51] 2021. Decentralised learning from independent multi-domain labels for person re-identification. Proceedings of the AAAI Conference on Artificial Intelligence 35, 4 (May 2021), 2898–2906. https://ojs.aaai.org/index.php/AAAI/article/view/16396.
- [52] 2018. Local convolutional neural networks for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia (MM'18). ACM, New York, NY, 1074–1082.
- [53] 2019. Federated learning with unbiased gradient aggregation and controllable meta updating. In Proceedings of the NIPS Federated Learning for Data Privacy and Confidentiality Workshop.
- [54] 2021. Deep learning for person re-identification: A survey and outlook. arXiv:2001.04193 (2021).
- [55] 2021. Federated learning for non-IID data via unified feature learning and optimization objective alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4420–4428.
- [56] 2018. Federated learning with non-IID data. CoRR abs/1806.00582 (2018). http://arxiv.org/abs/1806.00582.
- [57] 2015. Scalable person re-identification: A benchmark. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV'15). 1116–1124.
- [58] 2016. Person re-identification: Past, present and future. CoRR abs/1610.02984 (2016). http://arxiv.org/abs/1610.02984.
- [59] 2017. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision.
- [60] 2022. EasyFL: A low-code federated learning platform for dummies. IEEE Internet of Things Journal. Early access, January 20, 2022.
- [61] 2021. Collaborative unsupervised visual representation learning from decentralized data. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4912–4921.
- [62] 2022. Divergence-aware federated self-supervised learning. In Proceedings of the International Conference on Learning Representations. https://openreview.net/forum?id=oVE1z8NlNe.
- [63] 2020. Performance optimization of federated person re-identification via benchmark analysis. In Proceedings of the 28th ACM International Conference on Multimedia. 955–963.