Adaptive Label Allocation for Unsupervised Person Re-Identification

Song, Yihu; Liu, Shuaishi; Yu, Siyang; Zhou, Siyu

doi:10.3390/electronics11050763

¹ Department of Control Engineering, Changchun University of Technology, Changchun 130012, China

² Department of Digital Media, Changchun University of Technology, Changchun 130012, China

* Author to whom correspondence should be addressed.

Electronics 2022, 11(5), 763; https://doi.org/10.3390/electronics11050763

Submission received: 20 January 2022 / Revised: 23 February 2022 / Accepted: 24 February 2022 / Published: 2 March 2022

(This article belongs to the Section Artificial Intelligence Circuits and Systems (AICAS))


Abstract

Most unsupervised person re-identification (Re-ID) methods obtain pseudo-labels through clustering. However, clustering errors introduce a hard quantization loss that causes the model to produce false pseudo-labels. To solve this problem, an unsupervised model trained with softened labels is proposed. The key idea is to use the correlation among image features to find reliable positive samples and to train on them in a smooth manner. To further exploit the correlation among image features, two modules are designed. The dynamic adaptive label allocation (DALA) method generates pseudo-labels of adaptive size according to the metric relationships among features. The channel attention and transformer architecture (CATA) auxiliary module, combined with a convolutional neural network (CNN), serves as the feature extractor of the model; it aims to capture long-range dependencies and acquire more distinguishable features. The proposed model is evaluated on Market-1501 and DukeMTMC-reID. It achieves 60.8 mAP on Market-1501 and 49.6 mAP on DukeMTMC-reID, outperforming most state-of-the-art models on the fully unsupervised Re-ID task.

Keywords: softened labels; unsupervised person re-identification; transformer; channel attention

1. Introduction

Person re-identification (Re-ID), which aims to match the same person appearing in different cameras, has become a popular application in intelligent security and person tracking. In recent years, neural networks have achieved significant success in many fields [1,2,3,4,5] due to their powerful feature representation capabilities. With the rapid development of CNNs, supervised Re-ID [6,7] has achieved great success. However, because of the data bias between datasets, the performance of a model trained on a source domain degrades significantly when it is transferred directly to a target domain. In addition, supervised Re-ID requires intensive manual labeling, which makes it impractical for real-world applications.

Recently, unsupervised Re-ID has attracted more attention and made good progress. Some works [8,9,10,11,12] focus on unsupervised domain adaptation (UDA), which requires prior knowledge obtained from other labeled person data. In this article, a method that uses no labels at all is proposed for person re-identification. Most current works [13,14,15] use clustering to generate pseudo-labels for unsupervised person Re-ID training. This approach is effective, but different pedestrians who wear similar clothes and appear under the same camera can look very similar, so cluster-based methods tend to attribute them to the same class and assign them the same pseudo-label.

In this article, a softened label-based method is proposed to solve this problem. The proposed method no longer requires clustering; hence, the hard quantization loss is reduced during training. Unlike clustering, which directly assigns similar images the same label, the proposed method mines the relations among unlabeled images and uses them as a gentle constraint, assigning softened labels so that similar images have closer representations. During training, the network can predict not only the ground-truth class but also similar classes. The proposed method has two advantages. First, an unsupervised person Re-ID model based on softened labels does not use hard labels, which reduces the error caused by hard quantization loss. Second, the smoother supervision provided by softened labels gives the model greater potential.

In addition, CNN-based feature extraction has become mainstream, and most person Re-ID works directly use ResNet-50 [16] as the feature extractor. However, convolution lets the model focus only on local information in the image and neglects global structural information. Taking this into consideration, the CATA auxiliary module, which combines the self-attention mechanism of the Transformer with a channel attention mechanism, is proposed. By integrating this module into a convolutional neural network, the model can acquire more distinguishing features and improve its robustness.

The proposed model is evaluated on two mainstream person datasets, and the experimental results verify that the proposed softened label-based method is effective and robust. The contributions of this work can be summarized in three points. First, a softened label-based method is adopted to train an unsupervised person Re-ID model, avoiding the hard quantization loss caused by clustering. Second, DALA is proposed, which adaptively assigns softened labels to the sample feature and its similar features, mining the relations among image features to generate softened labels that fit the model more closely. Third, a CATA auxiliary module is proposed to obtain more distinguishing features. The module can easily be embedded in ResNet-50 as the feature extractor and further improves the robustness of the model.

2. Related Work

2.1. Unsupervised Person Re-Identification

The existing research methods of unsupervised person re-identification can be roughly divided into three types: (a) using domain adaptation to align the feature distribution between source domains and target domains [11], (b) applying generative adversarial network (GAN) to perform image style transfer, meanwhile maintaining the identity annotations on source domains [12], and (c) generating pseudo-labels on target domains for training via assigning similar images with similar labels [13,14,15]. The first two methods define unsupervised person re-identification as a transfer learning task, which only leverages the labeled data on source domains. Wu et al. proposed camera-aware similarity consistency loss to align the pairwise distribution of inter-camera matching and cross-camera matching [10]. Lin et al. used maximum mean discrepancy (MMD) distance to align the distribution of mid-level features in the source and target domains [11]. Deng et al. proposed similarity preserving GAN (SPGAN) [12]. They used cycle GAN to transfer source-domain style images to the target domain, while keeping the image labels unchanged, and they finally performed supervised learning on the generated image.

Unlike the previous two approaches, methods based on generating pseudo-labels use only person images from the target domain. Such a method generally needs a specific rule to generate pseudo-labels and then trains the Re-ID model with them. The quality of the pseudo-labels directly determines the performance of the model. Pure unsupervised algorithms based on clustering are currently the most common way to generate pseudo-labels. Lin et al. proposed a bottom-up clustering method to generate pseudo-labels [15]. It extracts image features through a convolutional neural network, merges a fixed number of clusters at a time according to Euclidean distance, generates pseudo-labels, and fine-tunes the network with them. However, clustering errors give some images wrong labels, and the resulting hard quantization loss decreases the accuracy of the model. Without clustering, Lin et al. proposed a softened similarity learning (SSL) method [17], which treats each person sample as an individual class and gradually narrows the distance between the sample feature and its similar features by assigning softened labels to the similar features, which effectively avoids the hard quantization loss caused by clustering errors. However, SSL assigns the same softened label to all similar features, even when the confidence between the sample feature and each similar feature differs, so the supervision is not smooth enough.

Inspired by the SSL method, an intuitive method called dynamic adaptive label allocation is proposed to generate pseudo-labels. Intuitively, image features with high similarity are more likely to belong to the same class. Based on this view, feature similarity is the only basis for determining the size of the pseudo-label value. Compared with clustering, this method not only makes the supervision smoother but also alleviates the error caused by the hard quantization loss. Compared with SSL, this method focuses on mining the relationship between image features, which shows the great potential of the soft label-based method.

2.2. Attention Mechanism

Attention mechanisms are known to be useful in Re-ID. The purpose of the channel attention mechanism is to make the model pay more attention to important channels and ignore irrelevant channels. Hu et al. introduced a compact module to exploit the inter-channel relationship [18]. In their squeeze-and-excitation (SE) module, they use global average-pooled features to compute channel-wise attention. Woo et al. proposed convolutional block attention module (CBAM) [19]. They recommend using average-pooled features and max-pooled features together to compute channel-wise attention. Wang et al. believe that dimensionality reduction is not conducive to the weight learning of channel attention, and they proposed to use one-dimensional convolution to complete the information interaction between channels [20]. Here, the above three channel attention mechanisms are tested and the most suitable one is found, as discussed later in the experimental part.

As for spatial attention, most of the current selective attention modules learn attention from limited local contexts. In the CBAM, a convolutional layer with a large filter size 7 × 7 is applied over the cross channel pooled spatial features to produce a spatial attention map. Limited by the actual receptive field, none of these methods can effectively capture a wide range of information to determine the spatial attention. Recently, the transformer architecture has been used in vision tasks to capture long range dependencies of feature maps. Li et al. proposed a transformer encoder–decoder architecture which is combined with ResNet-50 to identify the characteristics of the pedestrian part [21]. Inspired by them, the transformer architecture is treated as the spatial attention to improve the model’s ability to integrate spatial information.

In this article, the two kinds of attention mechanisms are combined to obtain more robust features. Experiments verify that using both mechanisms is better than using only channel attention or only spatial attention. Further, extensive ablation studies are conducted to demonstrate the effectiveness of our model in finding discriminative features and suppressing irrelevant features for person Re-ID.

3. Proposed Method

In this article, the unsupervised Re-ID problem is investigated through a softened label-based method. As illustrated in Figure 1, the framework has three components: (1) The proposed CATA module is embedded in ResNet-50 to form the feature extraction network of the model. (2) Without label information, each unlabeled image is treated as an individual class. The classification network is used to initialize the image feature representations, which are then stored in the memory bank, as shown in the purple part of Figure 1. (3) The DALA method re-assigns softened labels to the images, and the model is retrained with the new label information. The features obtained after training update the memory bank, as shown in the orange part of Figure 1.

3.1. Person Re-Identification Based on Softened Similarity

3.1.1. Initialization with Hard Labels

Whether in a supervised or an unsupervised person re-identification setting, the goal is to learn a feature embedding function $Φ(θ : x)$, where the parameters of $Φ$ are collectively denoted as $θ$. Since the identity information of the pedestrian images is not available, for a pedestrian dataset $X = \{x_{1}, x_{2}, \dots, x_{N}\}$ the label of each training image $x_{i}$ is initially assigned by its index, $Y = \{y_{i} = i \mid 1 \leq i \leq N\}$, and the index value entered into the network is used as the pseudo-label of the image. In this way, each training image is assumed to form an individual class by itself. The feature vector $v$ of each image $x$ is $L_{2}$-normalized so that $∥v∥ = 1$, via $v = \frac{Φ(θ : x)}{∥Φ(θ : x)∥}$. The probability that an image $x$ belongs to the $i$-th class is then defined as:

$$p(y_{i} \mid x, V) = \frac{\exp(V_{i}^{T} v / τ)}{\sum_{j=1}^{N} \exp(V_{j}^{T} v / τ)} \tag{1}$$

where $V$ is the memory bank [22] that stores the feature of each class, and $V_{j}$ is the $j$-th column of the memory bank, which holds the feature of the $j$-th class. $N$ is the number of classes, which equals the number of pedestrian images. The temperature $τ$ is set to 0.07 following [22].
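The probability in Equation (1) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names `l2_normalize` and `instance_probabilities`, the toy two-dimensional features, and the max-subtraction used for numerical stability are all assumptions made for the example.

```python
import math

def l2_normalize(vec):
    """Scale a feature vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def instance_probabilities(feature, memory_bank, tau=0.07):
    """Equation (1): softmax over temperature-scaled dot products
    between one feature and every entry of the memory bank."""
    logits = [sum(a * b for a, b in zip(entry, feature)) / tau
              for entry in memory_bank]
    peak = max(logits)  # subtracting the max does not change the softmax
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy memory bank with three "classes" (one per image); values are made up.
memory_bank = [l2_normalize(v) for v in ([1.0, 0.0], [0.8, 0.6], [0.0, 1.0])]
probs = instance_probabilities(l2_normalize([0.9, 0.1]), memory_bank)
```

With a temperature as small as 0.07 the distribution is sharply peaked, so the entry closest to the query feature receives almost all of the probability mass.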

Then, the non-parametric instance classification loss is adopted to train the network. The hard cross-entropy loss function is defined as:

$$L_{h-CE} = - \sum_{j=1}^{N} t(y_{j}) \log(p(y_{j} \mid x_{i}, V)) \tag{2}$$

where $t(y_{j})$ denotes the target label of class $j$: the label of the ground-truth class ($y_{j} = y_{i}$) is set to 1, and the label of every other class is 0.
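As a short worked step, assuming the sum in Equation (2) runs over all N classes with the one-hot target described above (1 for the sample's own class, 0 elsewhere), every term except the ground-truth one vanishes, so the loss reduces to the negative log-probability of the sample's own class:

```latex
L_{h-CE} = -\sum_{j=1}^{N} t(y_j)\,\log p(y_j \mid x_i, V)
         = -\log p(y_i \mid x_i, V)
         = -\log \frac{\exp(V_i^{T} v_i / \tau)}{\sum_{j=1}^{N} \exp(V_j^{T} v_i / \tau)}
```

Here $v_i$ stands for the normalized feature of image $x_i$. Minimizing this quantity raises the similarity to the sample's own memory-bank entry and lowers it for all others.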

The purpose of initializing the network with hard labels is to minimize the Euclidean distance between the sample feature $v_{i}$ and the feature $V_{j = y_{i}}$ in the memory bank, and to maximize the Euclidean distance between $v_{i}$ and the features $V_{j \neq y_{i}}$ in the memory bank. As a result, the network completes the initialization of all unlabeled images and obtains an initial discriminative ability.

3.1.2. Dynamic Adaptive Label Allocation

After initializing the network with hard labels, the model has learned to recognize each unlabeled image, and each training sample has learned to push the other training images away. However, images with the same identity should be close in the feature space. Forcing images of the same person to have clearly different representations has a negative effect on the network. Therefore, a softened label method is proposed to pull images with the same identity closer. First, for each training sample, the model finds the k features in the memory bank that are most similar to the sample feature. Similarity is measured by Euclidean distance: the smaller the distance, the more similar the features. Then pseudo-labels are assigned to the k similar features, which are denoted $X_{i}^{similar} = \{x_{i}^{1}, x_{i}^{2}, \dots, x_{i}^{k}\}$, with corresponding labels $Y_{i}^{similar} = \{y_{i}^{1}, y_{i}^{2}, \dots, y_{i}^{k}\}$. The difference from clustering is that clustering treats similar features as the same class and assigns them the same hard label, whereas this method gives similar features softened labels, so that the model not only predicts each image as its ground-truth class but also accepts predictions of the training image as a similar class.

Equation (3) shows how SSL generates pseudo-labels: a suitable hyperparameter is found through experiments to balance the sample class against the similar classes. For data $x_{i}$, the target label distribution of SSL is

$$t(y_{j}) = \begin{cases} λ, & y_{j} = y_{i} \\ (1 - λ) / k, & y_{j} \in Y_{i}^{similar} \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

where $λ$ is a manually set value. It replaces the one-hot hard label with a softened label by introducing a small manual parameter $λ$ to adjust the probability distribution. However, $λ$ is a fixed value, so every input sample has the same expected probability, and so does every similar class. In fact, different sample features correlate differently with their similar features: the smaller the Euclidean distance between the sample feature and a similar feature in the feature space, the higher the confidence that they belong to the same class.

Based on this observation, the DALA method is proposed to assign pseudo-labels to images. First, the Euclidean distances between the sample feature and the other features in the memory bank are calculated. Then the k nearest features to the sample feature are found, and softened labels are assigned to the sample feature and to those similar features according to Equation (4). For data $x_{i}$, the target label distribution of our method is

$$t(y_{j}) = \begin{cases} l_{g} = 1 - \frac{1}{k} \sum_{m=1}^{k} α (1 - d(x_{i}, x_{i}^{m})), & y_{j} = y_{i} \\ l_{g}^{similar} = α (1 - d(x_{i}, x_{j})), & y_{j} \in Y_{i}^{similar} \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

where $α$ denotes a multiplicative scaling coefficient, which is used to reduce the confidence of similar features, and $d(x_{i}, x_{j})$ denotes the Euclidean distance between feature $x_{i}$ and feature $x_{j}$. In what follows, $l_{g}$ denotes the label of the sample feature and $l_{g}^{similar}$ denotes the labels of the similar features. When $d(x_{i}, x_{j})$ is relatively large, the confidence in the similar feature is lower, and it receives a smaller label. In contrast, when $d(x_{i}, x_{j})$ is relatively small, the confidence in the similar feature is higher, and it receives a larger label.

Compared with SSL's pseudo-label allocation, DALA is more robust. When the model trains on each sample feature, it assigns a pseudo-label whose size depends on the similarity between the sample feature and each similar feature. This gives the model considerable freedom and reduces the loss of accuracy when the similar features include wrong classes.
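The contrast between the fixed allocation of Equation (3) and the distance-dependent allocation of Equation (4) can be sketched as follows. This is an illustrative reading of the two formulas in plain Python, not the authors' code: the function names, the toy memory bank, and the values `lam=0.6` and `alpha=0.5` are hypothetical choices for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbours(i, memory_bank, k):
    """Return (distance, index) pairs for the k bank entries closest to entry i."""
    ranked = sorted(
        (euclidean(memory_bank[i], feat), j)
        for j, feat in enumerate(memory_bank)
        if j != i
    )
    return ranked[:k]

def ssl_labels(i, memory_bank, k, lam):
    """Equation (3): every neighbour receives the same label (1 - lam) / k."""
    targets = [0.0] * len(memory_bank)
    for _, j in nearest_neighbours(i, memory_bank, k):
        targets[j] = (1.0 - lam) / k
    targets[i] = lam
    return targets

def dala_labels(i, memory_bank, k, alpha):
    """Equation (4): each neighbour's label shrinks as its distance grows."""
    neighbours = nearest_neighbours(i, memory_bank, k)
    targets = [0.0] * len(memory_bank)
    for dist, j in neighbours:
        targets[j] = alpha * (1.0 - dist)
    # Sample's own label: one minus the mean of the neighbour labels.
    targets[i] = 1.0 - sum(targets[j] for _, j in neighbours) / k
    return targets

# Toy bank of unit-norm 2-D features; lam and alpha are illustrative only.
bank = [[1.0, 0.0], [0.96, 0.28], [0.6, 0.8], [0.0, 1.0]]
ssl = ssl_labels(0, bank, k=2, lam=0.6)
dala = dala_labels(0, bank, k=2, alpha=0.5)
```

In this toy case SSL gives both neighbours of sample 0 the same label, while the DALA-style allocation gives the closer neighbour (index 1) a larger label than the farther one (index 2), which is the behaviour the text describes.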

3.1.3. Fine-Tuning the Model with Softened Labels

By taking reliable classes into account, the confidence of the sample feature class is reduced and the confidence of the similar feature classes is increased, which guides the network to learn the similarity among images of the same identity smoothly. The network is fine-tuned with the softened cross-entropy loss function (Equation (5)):

$$L_{s-CE} = - l_{g} \log(p(y_{i} \mid x_{i}, V)) - \sum_{j=1}^{k} l_{g}^{similar} \log(p(y_{i}^{j} \mid x_{i}, V)) \tag{5}$$

where each similar class $y_{i}^{j}$ is weighted by its softened label $l_{g}^{similar}$ from Equation (4).
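A minimal sketch of a soft-label cross-entropy of this kind is given below. It assumes that every class term is weighted by its target from Equation (4), as in a standard soft-label cross-entropy; the function name and the numbers are hypothetical and only illustrate the arithmetic.

```python
import math

def softened_cross_entropy(probs, targets):
    """Soft-label cross-entropy: -sum_j t(y_j) * log p(y_j | x_i, V).
    Classes whose target is exactly zero contribute nothing."""
    return -sum(t * math.log(p) for p, t in zip(probs, targets) if t != 0.0)

# Made-up predicted probabilities and targets for a four-image memory bank.
probs = [0.7, 0.2, 0.05, 0.05]
hard_targets = [1.0, 0.0, 0.0, 0.0]    # one-hot initialization labels
soft_targets = [0.8, 0.35, 0.05, 0.0]  # illustrative softened labels

hard_loss = softened_cross_entropy(probs, hard_targets)
soft_loss = softened_cross_entropy(probs, soft_targets)
```

With one-hot targets the function reduces to the hard loss of Equation (2); with softened targets, low predicted probability on a similar class is also penalized, which pulls the sample toward its neighbours.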

The fine-tuned network reduces not only the Euclidean distance between each image feature and its ground-truth feature in the memory bank, but also the distance between the sample feature and its similar features. After each iteration, the model updates the memory bank. The momentum update follows MoCo [23]; the difference is that in DALA the momentum update is applied to the stored representation of each sample, not to the encoder. The memory bank is updated as follows:

$$V_{i} = m V_{i} + (1 - m) x_{i} \tag{6}$$

where $V_{i}$ denotes the feature vector of the $i$-th image in the memory bank, $x_{i}$ denotes the new feature vector of that image, and $m \in [0, 1)$ is a momentum coefficient. When m is 0, the updated feature completely replaces the previous feature in the memory bank. When m is greater than 0 and less than 1, the updated feature retains part of the previous feature of the sample; m is used to stabilize the learning process. Further, m cannot be 1; otherwise, the memory bank would not be updated. In our experiments, a relatively small momentum works better than a larger value, which means that retaining a small part of the previous sample features benefits learning.
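The update rule of Equation (6) is a per-sample exponential moving average and can be sketched as below. The function name is hypothetical, and the default `m=0.01` is taken from the ablation reported in Section 4.2; whether the stored feature is re-normalized after the update is not stated in the text, so no normalization is applied here.

```python
def update_memory(bank_entry, new_feature, m=0.01):
    """Equation (6): blend the stored feature with the new one.
    m = 0 overwrites the entry; m close to 1 barely changes it."""
    return [m * old + (1.0 - m) * new
            for old, new in zip(bank_entry, new_feature)]

stored = [1.0, 0.0]
fresh = [0.0, 1.0]
replaced = update_memory(stored, fresh, m=0.0)  # entry fully replaced
blended = update_memory(stored, fresh, m=0.5)   # equal mix of old and new
```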

Through the softened classification network, the model gradually learns sample features that are close to those of similar images. The learning of the reliable classes is softened and gentle. When a wrong image is included in the reliable set, the DALA method reduces its negative effect. In addition, the relatively weak supervision signal gives the model more freedom and greater potential.

3.2. CATA Model

Most current research emphasizes the algorithm but overlooks the impact of feature quality. Therefore, the CATA module is proposed; it is integrated into the convolutional neural network to extract more distinguishing features. As shown in Figure 2, CATA combines channel attention and transformer architecture modules, which respectively weight the feature channels and capture the long-range dependencies in the feature map.

3.2.1. Channel Attention

Recently, the channel attention mechanism has been shown to offer great potential for improving the performance of deep convolutional neural networks. Channel attention generally enhances or suppresses the information in different channels for different tasks by modeling the importance of each feature channel. In this section, several mainstream channel attention mechanisms are compared. Squeeze-and-excitation (SE), the first proposed channel attention module, achieves promising performance. Specifically, given the input features, the SE block first applies global average pooling to each channel independently; then two fully connected (FC) layers with a non-linearity, followed by a sigmoid function, generate the channel weights. The two FC layers are designed to capture non-linear cross-channel interaction and involve dimensionality reduction to control model complexity. The channel attention in CBAM is basically the same as in SE-Net; the difference is that CBAM employs both average pooling and max pooling to aggregate features. Its authors argue that max pooling gathers another important clue about distinctive object features for inferring finer channel-wise attention. In contrast, efficient channel attention (ECA) holds that dimensionality reduction is unnecessary; it captures local cross-channel interaction by considering each channel and its k neighbors, which shares some similarities with channel local convolutions and channel-wise convolutions. Compared with the SE-Module and CBAM, the ECA-Module reduces both the number of parameters and the complexity of the model while maintaining accuracy.

Some experiments are conducted to test the effects of the three kinds of channel attention mechanisms used in our model. As a result, the ECA-Module is selected as the CA part of the feature extraction network, the details of which will be discussed in Section 4.2.
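The ECA-style computation described above (global average pooling, a one-dimensional convolution across neighboring channels, then a sigmoid) can be sketched in plain Python. This is only an illustration of the data flow: in the real module the kernel weights are learned, whereas the kernel `[0.5, 1.0, 0.5]`, the function names, and the tiny feature map here are hypothetical.

```python
import math

def eca_weights(feature_map, kernel):
    """feature_map: list of channels, each a 2-D list (h x w).
    kernel: odd-length list of 1-D convolution weights over the channel axis."""
    # Squeeze: global average pooling, one scalar per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
              for ch in feature_map]
    # Zero-pad so every channel sees the same number of neighbours.
    pad = len(kernel) // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    # 1-D convolution across channels: local cross-channel interaction.
    mixed = [sum(w * padded[c + t] for t, w in enumerate(kernel))
             for c in range(len(pooled))]
    # Sigmoid maps each response to a channel weight in (0, 1).
    return [1.0 / (1.0 + math.exp(-z)) for z in mixed]

def apply_channel_attention(feature_map, weights):
    """Rescale every value of a channel by that channel's weight."""
    return [[[w * v for v in row] for row in ch]
            for ch, w in zip(feature_map, weights)]

# Three channels of size 1 x 2 with made-up activations.
fmap = [[[1.0, 3.0]], [[0.0, 0.0]], [[2.0, 2.0]]]
weights = eca_weights(fmap, kernel=[0.5, 1.0, 0.5])
attended = apply_channel_attention(fmap, weights)
```

Unlike the SE design, no fully connected layers or dimensionality reduction are involved; the only parameters are the few kernel weights shared across all channels.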

3.2.2. Transformer Architecture

A convolutional neural network relies mainly on convolution to obtain the information in an image, expands the receptive field through down-sampling, and gradually deepens the network to obtain the feature information of the image. In this process, the shallow layers often attend only to local information in the feature map and ignore long-range dependencies, which causes the network to lose much important image information. To address this problem, the multi-head self-attention mechanism of the Transformer encoder is used to assist the convolutional neural network and help the model capture complete feature map information.

Consider an input feature map Z of shape (h, w, c), where h, w, and c refer to its height, width, and number of channels, respectively. First, a 1 × 1 convolution reduces the channel dimension of Z to a smaller dimension d, creating the new feature map $F \in R^{h × w × d}$. Because the Transformer encoder requires a one-dimensional sequence as input, the two-dimensional feature map is then flattened to a matrix $X \in R^{hw × d}$. Lastly, multi-head self-attention (MHSA), as proposed in the Transformer architecture, is performed. The output $S_{h}$ of the self-attention mechanism for a single head can be formulated as:

$$S_{h} = Softmax\left(\frac{(X W_{q}) (X W_{k})^{T}}{\sqrt{d_{k}}}\right) (X W_{v}) \tag{7}$$

where $W_{q}, W_{k} \in R^{d × d_{k}}$ and $W_{v} \in R^{d × d_{v}}$ are learned linear transformations that map the input X to queries $Q = X W_{q}$, keys $K = X W_{k}$, and values $V = X W_{v}$, and $\sqrt{d_{k}}$ is a scaling factor. The outputs of all heads are then concatenated and projected again as follows:

$$MHSA(X) = Concat[S_{1}, \dots, S_{N_{h}}] W_{f} \tag{8}$$

where $W_{f} \in R^{d_{v} × d_{v}}$ is a learned linear projection. MHSA(X) is then reshaped into a tensor of shape (h, w, c) to match the original spatial dimensions.
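The single-head computation of Equation (7) can be sketched with plain Python lists. This is a toy illustration under stated assumptions: `x` is the flattened feature map with one row per spatial position, the projection matrices are fixed identity matrices rather than learned weights, and multi-head concatenation and positional information are left out.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax_rows(matrix):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in matrix:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def single_head_attention(x, w_q, w_k, w_v):
    """Equation (7): Softmax(Q K^T / sqrt(d_k)) V for one head."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(w_k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    attn = softmax_rows(scores)           # one weight per pair of positions
    return matmul(attn, v), attn

# A 2 x 2 feature map with 2 channels, flattened to 4 positions (made-up values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, attn = single_head_attention(x, identity, identity, identity)
```

Each row of `attn` spans all positions of the feature map, which is how the module relates distant locations in a single step, in contrast to the limited receptive field of a convolution.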

In order to make the attention operation position-aware, two-dimensional relative self-attention is implemented by independently adding relative height information and relative width information, following [24]. Some works have observed that relative-distance-aware position embeddings are more suitable for vision tasks. This can be attributed to the fact that attention then considers not only content information but also the relative distance between features at different locations, which effectively associates information across objects with positional awareness. In this work, the 2D relative position self-attention implementation is adopted.

4. Experiments and Results

In this section, experiments are conducted to verify the effectiveness of the proposed method. The experiments are implemented in PyTorch and run on a server with a GTX 2080Ti GPU. First, a CATA ablation experiment is conducted to demonstrate the effectiveness of the CATA module. Then experiments on the hyperparameters of the DALA method and comparative experiments with SSL are conducted to confirm the effectiveness of DALA. Lastly, the method is evaluated on Market-1501 and DukeMTMC-reID and compared with other state-of-the-art methods. To better illustrate the entire training process, some visualizations are also presented.

4.1. Dataset and Evaluation Metrics

Market-1501. Market1501 [25] includes 32,668 images of 1501 persons captured by six cameras. Each person is captured by at least two cameras. Market1501 can be divided into a training set which contains 12,936 images of 751 people and a test set which contains 19,732 images of 750 people.

DukeMTMC-reID. DukeMTMC-reID [26] consists of 36,411 labeled images belonging to 1404 identities, which contains 16,522 images for training, 2228 images for query, and 17,661 images for gallery. Table 1 shows the information of two datasets.

Evaluation Setting. In evaluation, for each query image, the Euclidean distance to all gallery images is calculated and the gallery is sorted by that distance to form the result. The mean average precision (mAP) [25] and the rank-k accuracy are used to evaluate the performance of the model. Rank-k accuracy means that the query image has a match in the top-k list; it is read from the Cumulated Matching Characteristics (CMC) curve, which shows the probability that a query has a match in lists of different sizes. Given a single query, the Average Precision (AP) is computed from its precision-recall curve, and the mAP is the mean of the AP over all queries.
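The two metrics can be illustrated on a single ranked list. The sketch below is a simplified reading, not the official evaluation code: the function names and identity numbers are hypothetical, and the filtering of gallery images taken by the same camera as the query, which the standard Market-1501 protocol applies, is omitted.

```python
def rank_k_hit(ranked_ids, query_id, k):
    """True if a gallery image of the query identity appears in the top k."""
    return query_id in ranked_ids[:k]

def average_precision(ranked_ids, query_id):
    """Mean of the precision values measured at each correct match."""
    hits = 0
    precisions = []
    for position, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / hits if hits else 0.0

# Gallery identities sorted by ascending distance to one query (made-up data).
ranked = [7, 3, 7, 5, 7]
ap_for_7 = average_precision(ranked, query_id=7)
ap_for_3 = average_precision(ranked, query_id=3)
mean_ap = (ap_for_7 + ap_for_3) / 2  # mAP over these two queries
```

Rank-k only asks whether the first correct match appears early, whereas AP rewards ranking every correct match near the top, which is why the two numbers can move differently in the ablation tables.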

Implementation Details. ResNet-50 pretrained on ImageNet is used as the backbone network. The proposed CATA module is added after layer2 and layer3 of the backbone, and the parameters of the CATA part are initialized randomly. The stride of the conv4 block is set to 1 to increase the feature resolution and retain more information about pedestrian images. The layers after the pooling-5 layer are removed and a batch normalization layer is added, which produces a 2048-dim feature. The input image is resized to 256 × 128 for training and testing, and random cropping, random flipping, and random erasing [27] are used for data augmentation. The batch size is set to 16, and an SGD optimizer with GAF [28] is employed with a momentum of 0.9.

4.2. Ablation Study

The comparison of three channel attention mechanisms. Three channel attention mechanisms are tested as the CA part of our CATA module. First, the backbone is defined as ResNet-50 combined with the TA part. Then the SE-Module, CBAM-C (the channel attention part of CBAM), and the ECA-Module are each added to this backbone separately. Table 2 shows that the SE-Module and ECA-Module improve results on both datasets, with the ECA-Module performing relatively better. Although CBAM-C achieves the best results on Market1501, it obtains relatively poor results on the Duke dataset, so CBAM-C is not robust in our model. Finally, the ECA-Module is selected as the CA part of our model.

The impact of CATA model. To verify the validity of the CATA module, ablation experiments are run in different settings. The results are summarized in Table 3. Compared with the backbone (ResNet-50), using only the CA module does not increase the accuracy much on the two datasets, whereas using only the TA module improves the accuracy considerably. When the two modules are used in combination, the model achieves the best performance. On Market1501 and Duke, the CATA module improves over the backbone by 6.6% and 6.4% on mAP, respectively, and by 4.6% and 6.1% on rank-1 accuracy, respectively.

The impact of momentum coefficient m. Table 4 shows the performance on the Market1501 dataset with different update parameters m, where m is used to stabilize the learning process and the memory bank is updated according to Equation (6). Specifically, when m is equal to 0, the memory bank is completely updated after each training iteration. When m is between 0 and 1, the memory bank retains part of the previous features when it is updated. Lastly, when m is equal to 1, the memory bank is not updated during training. From Table 4, it can be seen that the method achieves the best performance when m is 0.01, which shows that keeping a small part of the old sample features is good for model performance.

Hyper-parameter Analysis. Some important hyper-parameters are investigated in this section. Figure 3 shows the effects of parameter k and parameter α in Equation (4). The parameter k denotes the number of selected reliable features. It can be observed that as k increases from 1 to 5, rank-1 accuracy and mAP on Market1501 continue to increase. When k is 0, meaning the model does not learn from other samples, the rank-1 accuracy reaches only 28.7% (data not shown). It can be seen that, as k increases, rank-1 increases by 2%, while mAP increases by 9%. The reason is that when k is relatively small, the learned similarity is not adequate, and the model has difficulty finding enough images of the same identity. When k is 5, the model finds enough images of the same pedestrian and performs best. As k continues to increase beyond this point, the performance of the model decreases, which means that the model erroneously includes images of other pedestrians. Thanks to our DALA method, however, a wrongly selected pedestrian image receives a smaller softened label, which reduces its negative impact. Therefore, even if k continues to increase, the model maintains good performance, which shows that it is sufficiently robust. The scaling coefficient α controls the confidence of similar features. According to Equation (4), when the value of α is relatively small, the confidence of similar image features is small, which causes the model to pay more attention to the sample feature itself and to ignore the importance of similar features. Conversely, when the value of α is large, the model is insufficiently trained on the sample feature itself. Therefore, it is very important to maintain an appropriate balance between the sample feature and similar features.
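The interplay of k and α can be illustrated with a toy allocation rule. Equation (4) is not reproduced in this section, so the rule below is a hypothetical stand-in rather than the authors' DALA formula: the sample's own slot keeps a weight of 1 − α, and the k most similar other entries share α in proportion to their non-negative similarity. It also uses a fixed k, whereas DALA is described as producing pseudo-labels of adaptive size. The function name `softened_labels` and the similarity values are assumptions made for illustration only.

```python
def softened_labels(similarities, self_index, k, alpha):
    """Build a softened target over all memory-bank entries (assumed rule).

    similarities: similarity of the query sample to every entry.
    The query's own slot keeps 1 - alpha; the k most similar other
    entries share alpha in proportion to their non-negative similarity.
    """
    n = len(similarities)
    target = [0.0] * n
    others = [i for i in range(n) if i != self_index]
    neighbours = sorted(others, key=lambda i: similarities[i], reverse=True)[:k]
    weights = [max(similarities[i], 0.0) for i in neighbours]
    total = sum(weights)
    if k == 0 or total == 0.0:
        target[self_index] = 1.0  # no usable neighbours: plain hard label
        return target
    target[self_index] = 1.0 - alpha
    for i, w in zip(neighbours, weights):
        target[i] = alpha * w / total
    return target


# Toy similarities of sample 0 to five memory-bank entries.
sims = [1.0, 0.9, 0.8, 0.3, 0.1]
print(softened_labels(sims, self_index=0, k=3, alpha=0.4))
print(softened_labels(sims, self_index=0, k=0, alpha=0.4))
```

Under this assumed rule, a less similar neighbour (here entry 3) receives a smaller label than closer ones, which mirrors the argument that a wrongly included image does limited harm. Setting k = 0 reduces the target to a hard label on the sample itself.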

Comparison with SSL. Table 5 shows the comparison of DALA and SSL. To make the experiment fair, the two methods are given the same settings, including the CATA module. In the same environment, the results show that DALA has higher accuracy than SSL on both datasets. On Market1501 and Duke, DALA improves over SSL by 5.7 and 2.7 percentage points in mAP, respectively, and by 3.4 percentage points in rank-1 accuracy on both datasets. This shows that the DALA method mines the relationship among features better than SSL.
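The gaps between the two methods can be recomputed directly from the entries of Table 5. The short sketch below does only that arithmetic; the dictionary layout and the function name `gains` are illustrative choices.

```python
# (mAP, rank-1) in percent, copied from Table 5.
results = {
    "SSL": {"Market1501": (55.1, 81.9), "DukeMTMC-reID": (46.9, 66.4)},
    "DALA": {"Market1501": (60.8, 85.3), "DukeMTMC-reID": (49.6, 69.8)},
}


def gains(table):
    """Return DALA minus SSL as (mAP gain, rank-1 gain) in percentage points."""
    out = {}
    for dataset, (dala_map, dala_r1) in table["DALA"].items():
        ssl_map, ssl_r1 = table["SSL"][dataset]
        # Round to one decimal to hide floating-point noise in the subtraction.
        out[dataset] = (round(dala_map - ssl_map, 1), round(dala_r1 - ssl_r1, 1))
    return out


for dataset, (map_gain, r1_gain) in gains(results).items():
    print(f"{dataset}: mAP +{map_gain}, rank-1 +{r1_gain}")
```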

4.3. Comparison with State-of-the-Art Methods

Our method is compared with transfer learning methods and purely unsupervised methods on Market-1501 and DukeMTMC-reID, as shown in Table 6. It is first compared with methods trained with only unlabeled data. The compared methods include hand-crafted feature-based methods (LOMO [29], BOW [25]) and deep learning-based methods (e.g., MMCL [14], BUC [15]). It can be seen from Table 6 that our method achieves the highest mAP and rank-1 accuracy among the purely unsupervised methods on Market-1501 and remains competitive on DukeMTMC-reID, where HCT [13] and JVTC+ [34] report slightly higher mAP. The reasons could be that: (1) the hand-crafted feature-based methods have poor feature extraction ability and thus show lower performance; (2) the deep learning-based methods generate pseudo-labels through clustering, which makes it difficult to avoid hard quantization losses. In addition, the feature extraction network is improved to obtain more robust features, which is also a key factor in the accuracy of our model.

Our method is also compared with unsupervised domain adaptation methods, including GAN-based methods and distribution alignment-based methods. Many transfer learning methods use additional labeled source-domain data for training, which is why cross-domain Re-ID models tend to have higher accuracy than purely unsupervised Re-ID models. Nonetheless, our method still outperforms most of them while using only unlabeled data for training. The performance of our method can be further improved by using the re-ranking method [30], as the rows marked with * in Table 6 show.

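Table 6. Performance (%) comparison with recent methods on Market1501 and DukeMTMC-reID. "Source" is the labeled source dataset used for training (None for purely unsupervised methods). * denotes using the re-ranking similarity in the evaluation phase.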

Methods	Reference	Market1501 Source	Market1501 mAP	Market1501 Rank-1	DukeMTMC-reID Source	DukeMTMC-reID mAP	DukeMTMC-reID Rank-1
PTGAN [31]	CVPR18	Duke	20.5	45.5	Market	16.4	30.0
HHL [8]	ECCV18	Duke	31.4	62.2	Market	27.2	46.9
TJ-AIDL [9]	CVPR18	Duke	26.5	58.2	Market	23.0	44.3
SSG [32]	ICCV19	Duke	58.3	80.0	Market	53.4	73.0
CSCL [10]	ICCV19	Duke	35.6	64.7	Market	30.5	51.5
AD-Cluster [33]	CVPR20	Duke	68.3	86.7	Market	54.1	72.6
LOMO [29]	CVPR15	None	8.0	27.2	None	4.8	12.3
BOW [25]	ICCV15	None	14.8	35.8	None	8.3	17.1
BUC [15]	AAAI19	None	29.6	61.9	None	22.1	40.4
HCT [13]	CVPR20	None	56.4	80.0	None	50.7	69.6
MMCL [14]	CVPR20	None	45.5	80.3	None	40.7	65.2
JVTC+ [34]	ECCV20	None	47.5	79.5	None	50.7	74.6
Proposed method	This article	None	60.8	85.3	None	49.6	69.8
Proposed method *	This article	None	73.4	86.5	None	62.1	72.4

4.4. Visualization with T-SNE

As shown in Figure 4, the T-SNE [35] method is used to visualize each stage of the proposed method. Firstly, a model pre-trained on ImageNet is used to visualize the distribution of the pedestrian data. Since the pre-trained model already has basic recognition capabilities, a small portion of the features of the same pedestrian lie close together in the feature space. However, the pre-trained model is a general-purpose model that is not tailored to pedestrian tasks, so the pedestrian data are still very scattered in the feature space. Then, after the model is initialized, data of the same pedestrian already have a similar distribution in the feature space. Because the model first treats each pedestrian image as a separate category and pushes all other categories farther away during initialization, the data appear very fragmented. Finally, the model is fine-tuned with softened labels, and during fine-tuning similar features gradually converge. As Figure 4 shows, the fine-tuned model pulls pedestrians of the same category closer together than the initialized model does, and thus learns a better distribution.

4.5. Discussion

Methods that rely on hard labels easily fit noisy labels, which limits their accuracy. The method based on softened labels can avoid this problem by introducing softened similarity learning. However, this method still has some limitations. For instance, it can find several reliable images as positive samples and assign softened labels to them, but it has no mechanism for finding negative samples. Specifically, during training, the network can pull reliable images closer, but it has relatively little power to push images of different identities apart. In addition, it can be seen from Figure 4 that the softened-label method does not pull intra-class features together as tightly as clustering-based methods do. Nevertheless, this method achieves impressive results by determining only the positive samples. If a scheme were designed that finds suitable negative samples, penalizes them during training, and makes the features of the same category more compact in the feature space, the model should achieve better performance.

5. Conclusions

In this article, the unsupervised person Re-ID problem is investigated. A new softened label assignment method named DALA is proposed, which is based entirely on the metric relationship between features and does not require clustering. The proposed method adaptively generates suitable pseudo-labels and avoids hard quantization loss, which makes it more flexible. In addition, to make features more robust, an auxiliary attention module called CATA is introduced, which is easy to embed in a convolutional neural network. Experiments on two mainstream Re-ID datasets validate the effectiveness of the proposed method.

Author Contributions

Conceptualization, Y.S. and S.L.; methodology, Y.S. and S.L.; software, Y.S.; validation, Y.S., S.Y. and S.L.; investigation, S.Y.; data curation, S.Z.; writing—original draft preparation, Y.S.; writing—review and editing, S.Y.; visualization, S.Z. and Y.S.; supervision, S.L.; funding acquisition, S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Project of National Natural Science Foundation of China under Grant No. 62106023.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Sun, Z.; Li, F.; Zhang, Y.; Sun, Y.; Jin, L. Different modified zeroing neural dynamics with inherent tolerance to noises for time-varying reciprocal problems: A control-theoretic approach. Neurocomputing 2019, 337, 165–179.
2. Jin, L.; Wei, L.; Li, S. Gradient-Based Differential Neural-Solution to Time-Dependent Nonlinear Optimization. IEEE Trans. Autom. Control 2022, 1.
3. Jin, L.; Li, J.; Sun, Z.; Lu, J.; Wang, F. Neural Dynamics for Computing Perturbed Nonlinear Equations Applied to ACP-based Lower Limb Motion Intention Recognition. IEEE Trans. Syst. Man Cybern. 2021, 1–9.
4. Sun, Z.; Shi, T.; Wei, L.; Sun, Y.; Liu, K.; Jin, L. Noise-suppressing zeroing neural network for online solving time varying nonlinear optimization problem: A control-based approach. Neural Comput. Appl. 2020, 32, 11505–11520.
5. Sun, Z.; Tian, Y.; Li, H.; Wang, J. A superlinear convergence feasible sequential quadratic programming algorithm for bipedal dynamic walking robot via discrete mechanics and optimal control. Optim. Control. Appl. Methods 2016, 37, 1139–1161.
6. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 480–496.
7. Luo, H.; Jiang, W.; Gu, Y.; Liu, F.; Liao, X.; Lai, S.; Gu, J. A strong baseline and batch normalization neck for deep person re-identification. IEEE Trans. Multimed. 2020, 22, 2597–2609.
8. Zhong, Z.; Zheng, L.; Li, S.; Yang, Y. Generalizing a person retrieval model hetero- and homogeneously. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 172–188.
9. Wang, Y.; Zhu, X.; Gong, S.; Li, W. Transferable joint attribute-identity deep learning for unsupervised person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2275–2284.
10. Wu, A.; Zheng, W.; Lai, J. Unsupervised person re-identification by camera-aware similarity consistency learning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 6921–6930.
11. Lin, S.; Li, H.; Li, C.; Kot, A. Multi-task mid-level feature alignment network for unsupervised cross-dataset person re-identification. In Proceedings of the British Machine Vision Conference (BMVC), Newcastle, UK, 2–6 September 2018; pp. 1–13.
12. Deng, W.; Zheng, L.; Ye, Q.; Kang, G.; Yang, Y.; Jiao, J. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 994–1003.
13. Zeng, K.; Ning, M.; Wang, Y.; Guo, Y. Hierarchical clustering with hard-batch triplet loss for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–19 June 2020; pp. 13654–13662.
14. Wang, D.; Zhang, S. Unsupervised person re-identification via multi-label classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–19 June 2020; pp. 1978–1987.
15. Lin, Y.; Dong, X.; Zheng, L.; Yan, Y.; Yang, Y. A bottom-up clustering approach to unsupervised person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27–31 January 2019; pp. 8738–8745.
16. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26–30 June 2016; pp. 770–778.
17. Lin, Y.; Xie, L.; Wu, Y.; Yan, C.; Tian, Q. Unsupervised person re-identification via softened similarity learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–19 June 2020; pp. 3390–3399.
18. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
19. Woo, S.; Park, J.; Lee, J.; Kweon, I. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
20. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–19 June 2020; pp. 11534–11542.
21. Lin, Y.; He, J.; Zhang, T.; Liu, X.; Zhang, Y.; Wu, F. Diverse Part Discovery: Occluded Person Re-identification with Part-Aware Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 2898–2907.
22. Wu, Z.; Xiong, Y.; Yu, S.X.; Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3733–3742.
23. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–19 June 2020; pp. 9726–9735.
24. Bello, I.; Zoph, B.; Vaswani, A.; Shlens, J.; Le, Q. Attention augmented convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 3285–3294.
25. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 13–16 December 2015; pp. 1116–1124.
26. Zheng, Z.; Zheng, L.; Yang, Y. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3774–3782.
27. Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13001–13008.
28. Liu, M.; Chen, X.; Du, X.; Jin, L. Activated Gradients for Deep Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 1–13.
29. Liao, S.; Hu, Y.; Zhu, X.; Li, S. Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 8–10 June 2015; pp. 2197–2206.
30. Zhong, Z.; Zheng, L.; Cao, D.; Li, S. Re-ranking person re-identification with k-Reciprocal encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1318–1327.
31. Wei, L.; Zhang, S.; Gao, W.; Tian, Q. Person transfer GAN to bridge domain gap for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 79–88.
32. Fu, Y.; Wei, C.; Wang, S.; Zhou, Q.; Shi, H.; Huang, F. Self-similarity grouping: A simple unsupervised cross domain adaptation approach for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 6112–6121.
33. Zhai, Y.; Lu, S.; Ye, Q.; Shan, X.; Chen, J.; Ji, R.; Tian, Y. Ad-cluster: Augmented discriminative clustering for domain adaptive person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–19 June 2020; pp. 9018–9027.
34. Li, J.; Zhang, S. Joint Visual and Temporal Consistency for Unsupervised Domain Adaptive Person Re-identification. In Proceedings of the European Conference on Computer Vision (ECCV), Edinburgh, UK, 23–28 July 2020; pp. 483–499.
35. Van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.

Figure 1. Overview of our method. Notably, the procedures with purple arrows are conducted once, while those with orange arrows are conducted iteratively. The arrows of different lengths in the feature space represent the size of the soft label values.

Figure 2. The CATA module. Rel denotes relative position embedding. ⨁ and ⨂ represent element-wise sum and matrix multiplication, respectively.

Figure 3. Parameter analysis on Market-1501. The left image shows the impact of the number of similar images k. The right image shows the impact of the parameter α.

Figure 4. T-SNE visualization of 10 random classes in the Market-1501 test set. The left plot is produced by an untrained model (pre-trained on ImageNet only), the middle one by the initialized model, and the right one by the fine-tuned model.

Table 1. Introduction to the datasets.

Datasets	Train IDs	Train Images	Test IDs	Test Images	Total Images	Cameras
Market-1501	751	12,936	750	19,732	32,668	6
DukeMTMC-reID	702	16,522	702	19,889	36,411	8

Table 2. Performance (%) comparison of three attention mechanisms in our method.

Settings	Market1501 mAP	Market1501 Rank-1	DukeMTMC-reID mAP	DukeMTMC-reID Rank-1
Backbone	56.3	83.7	45.9	67.0
+SE-Module	59.0	85.2	47.3	69.0
+CBAM-C	61.2	85.5	43.5	66.2
+ECA-Module	60.8	85.3	49.6	69.8

Table 3. Ablation study on CATA (performance in %). ResNet-50 is used as the backbone. CA denotes adding only channel attention to ResNet-50. TA denotes adding only the Transformer architecture to ResNet-50. CATA denotes adding both.

Settings	Market1501 mAP	Market1501 Rank-1	DukeMTMC-reID mAP	DukeMTMC-reID Rank-1
Backbone	52.4	80.6	41.3	63.6
+CA	53.8	81.5	43.5	64.9
+TA	56.3	83.7	45.9	67.0
+CATA	60.8	85.3	49.6	69.8

Table 4. Performance (%) comparison under different update parameters m.

Momentum m	1	0.5	0.1	0.01	0.001	0
mAP	fail	58.4	59.3	60.8	59.6	59.4
Rank-1	fail	83.9	85.1	85.3	84.9	84.7

Table 5. Performance (%) comparison with SSL under the same settings.

Methods	Market1501 mAP	Market1501 Rank-1	DukeMTMC-reID mAP	DukeMTMC-reID Rank-1
SSL	55.1	81.9	46.9	66.4
DALA	60.8	85.3	49.6	69.8

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Song, Y.; Liu, S.; Yu, S.; Zhou, S. Adaptive Label Allocation for Unsupervised Person Re-Identification. Electronics 2022, 11, 763. https://doi.org/10.3390/electronics11050763
