HCFN: Hierarchical cross-modal shared feature network for visible-infrared person re-identification☆
Introduction
Person re-identification (Re-ID) is the task of retrieving pedestrians with the same identity across cameras. It is of great significance to big-data search and intelligent security, and also plays an important role in criminal investigations. Therefore, scholars have conducted extensive research on person Re-ID [1], [2], [3], [4], [5], [6]. Traditional visible–visible person re-identification (VV-reID) assumes daytime conditions with sufficient light. Under this assumption, scholars have developed solutions for problems such as background occlusion [7], [8], [9], illumination changes [10], [11], [12] and resolution differences [13], [14], [15], and achieved remarkable results. At night or under weak lighting, however, visible cameras cannot capture clear images, and nighttime is a frequent period for illegal activities such as robbery and terrorism. In this case, most cameras use infrared night vision to compensate for the limitations of visible-light imaging. How to perform re-identification across both modalities thus becomes an urgent problem, giving rise to visible-infrared person re-identification (VI-reID).
Differences between visible and infrared images lead to two important challenges for VI-reID: (1) The inter-modal gap between visible and infrared images. An infrared detector forms an image by measuring the difference in thermal radiation between the target and its background. Compared with visible images, infrared images have blurry visual effects and low contrast. Moreover, there is no linear relationship in grayscale distribution, color or texture between visible and infrared images. As shown in Fig. 1, red, yellow, white and even black clothing can all appear white under infrared cameras. This large modality gap is difficult for traditional feature-based methods to bridge. (2) Intra-modal variations in visible and infrared images. Due to changes in appearance and environmental conditions, pedestrians exhibit large intra-class variations and inter-class similarities. This means the visual similarity between different people may be greater than that between images of the same person. As shown in Fig. 1, the second and fifth persons from left to right in the second row look similar, while the visible and infrared images of the second person look quite different. The heterogeneity of the underlying features of the two modalities therefore prevents us from measuring the similarity between visible and infrared images by direct comparison.

To address these problems, existing methods make two kinds of effort, as shown in Fig. 2: (1) Modality transformation, which converts visible and infrared images into each other [16], [17], [18]. A common approach uses a Generative Adversarial Network (GAN) as a modality converter to translate an infrared/visible image into the corresponding visible/infrared image, turning the cross-modal problem into a single-modal one. However, GANs are cumbersome and costly to train, and their generation errors accumulate with comparison errors, reducing model accuracy. (2) Modality alignment, which is mainly implemented through feature alignment and metric loss optimization [19], [20], [21], [22]. The former designs a one-stream or two-stream network to learn modality-shared features of visible and infrared images, while the latter reduces cross-modal differences by designing metric loss functions. Most such methods focus on learning inter-modal features to reduce the modality gap but ignore similarity learning among intra-modal samples, leaving the model unable to learn discriminative features and limiting its performance.
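To make the modality-alignment idea concrete, the following is a minimal sketch of a two-stream network with modality-specific stems, a shared trunk, and a cross-modal triplet loss, in the spirit of [19], [20], [21], [22]. It is not the implementation of any particular paper; all module names, dimensions, and hyperparameters are assumptions.

```python
# Minimal sketch of modality alignment: modality-specific shallow layers,
# a shared deeper trunk, and a cross-modal triplet loss. Illustrative only.
import torch
import torch.nn as nn
import torchvision.models as models

class TwoStreamNet(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        vis, inf, shared = (models.resnet50(weights=None) for _ in range(3))
        # Modality-specific stems capture each modality's low-level statistics.
        self.vis_stem = nn.Sequential(vis.conv1, vis.bn1, vis.relu, vis.maxpool, vis.layer1)
        self.inf_stem = nn.Sequential(inf.conv1, inf.bn1, inf.relu, inf.maxpool, inf.layer1)
        # Shared trunk forces both modalities into one feature space.
        self.trunk = nn.Sequential(shared.layer2, shared.layer3, shared.layer4,
                                   nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                   nn.Linear(2048, feat_dim))

    def forward(self, x, modality):
        stem = self.vis_stem if modality == "visible" else self.inf_stem
        return self.trunk(stem(x))

# Cross-modal triplet loss: anchor from one modality, positive (same identity)
# and negative (different identity) drawn from the other modality.
net = TwoStreamNet()
criterion = nn.TripletMarginLoss(margin=0.3)
vis_anchor = net(torch.randn(4, 3, 256, 128), "visible")
inf_pos = net(torch.randn(4, 3, 256, 128), "infrared")
inf_neg = net(torch.randn(4, 3, 256, 128), "infrared")
loss = criterion(vis_anchor, inf_pos, inf_neg)
```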
To address the above problems, our approach has two breakthrough points: (1) separating modality-specific features and extracting modality-shared features; (2) reducing set-level differences between cross-modal images. We propose a novel Hierarchical Cross-modal shared Feature Network (HCFN), which mitigates modality differences in the feature space and maintains identity consistency at the set level for VI-reID. HCFN consists of two main modules: an Intra-modal Feature Extraction Module (IFEM) and a Cross-modal Graph Interaction Module (CGIM). IFEM learns representations of visible and infrared images by mining layer-level attention instead of pixel-level attention, reducing the effect of background on pedestrians. In addition, CGIM builds a bridge between visible and infrared images: its graph structure connects pedestrian images across the two modalities to strengthen the network's representation of identity-related features. We exploit the node relationships in CGIM to effectively reduce modality differences and accelerate model training.
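As a rough illustration of the layer-level attention idea behind IFEM, the sketch below pools features from several backbone stages and re-weights them with a learned per-layer score before aggregation. The stage dimensions and module structure are assumptions for illustration, not the paper's exact design.

```python
# Hedged sketch of layer-level (rather than pixel-level) attention: each
# backbone stage contributes one pooled feature, and a learned scalar score
# decides how much each stage contributes to the fused representation.
import torch
import torch.nn as nn

class LayerLevelAttention(nn.Module):
    def __init__(self, stage_dims=(256, 512, 1024, 2048), feat_dim=512):
        super().__init__()
        # Project each stage's pooled feature into a common embedding space.
        self.proj = nn.ModuleList(nn.Linear(d, feat_dim) for d in stage_dims)
        # One scalar attention score per stage (layer-level attention).
        self.score = nn.Linear(feat_dim, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, stage_maps):  # list of (B, C_i, H_i, W_i) feature maps
        feats = [p(self.pool(m).flatten(1)) for p, m in zip(self.proj, stage_maps)]
        stacked = torch.stack(feats, dim=1)               # (B, L, feat_dim)
        attn = torch.softmax(self.score(stacked), dim=1)  # (B, L, 1)
        return (attn * stacked).sum(dim=1)                # (B, feat_dim)

# Usage with dummy ResNet-50-style stage outputs:
maps = [torch.randn(2, c, 64 // 2 ** i, 32 // 2 ** i)
        for i, c in enumerate((256, 512, 1024, 2048))]
fused = LayerLevelAttention()(maps)  # shape (2, 512)
```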
Main contributions of this paper are as follows:
1. We propose a novel cross-modal person re-identification method, the Hierarchical Cross-modal shared Feature Network (HCFN). HCFN mines visible and infrared images layer by layer without complex modality transformations. It not only considers modality-related hidden representations, but also establishes connections between the two modalities.
2. We propose an Intra-modal Feature Extraction Module (IFEM), which enhances the representation of modality-shared features through a proposed Hierarchical Attention Module (HAM). In particular, HAM aggregates the rich layer-level features of the baseline network and reinforces the network's ability to represent cross-modal images by mining their association information.
3. Extensive experiments on the benchmark datasets SYSU-MM01 and RegDB show the superiority of the proposed method. In addition, we conduct ablation studies to verify the effectiveness of each component of the framework.
Related work
Visible–visible person Re-ID (VV-reID) aims to retrieve pedestrian images across camera regions. The challenges of VV-reID come from changes in lighting conditions, camera angles, pedestrian poses, etc. Previous studies focused on handcrafted feature extraction and subspace learning [23], [24], [25], [26]. Liao et al. [26] first enhanced the color information of the image by Retinex transform pre-processing, then extracted HSV and SILTP features in each patch and took the maximum value of each
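To illustrate the handcrafted pipeline of [26], the following is a simplified sketch of patch-wise histogram extraction with per-stripe max pooling. Only HSV histograms are shown; Retinex preprocessing and SILTP features are omitted, and the patch geometry and bin counts are assumptions rather than the exact settings of [26].

```python
# Simplified LOMO-style sketch: per-patch HSV histograms within a horizontal
# stripe, max-pooled per bin across the stripe for viewpoint robustness.
# Assumes OpenCV is available; parameters are illustrative.
import numpy as np
import cv2

def lomo_like_stripe_feature(img_bgr, patch=10, stride=5, bins=8):
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    h, w = hsv.shape[:2]
    feats = []
    for y in range(0, h - patch + 1, stride):        # one horizontal stripe per y
        row_hists = []
        for x in range(0, w - patch + 1, stride):    # patches within the stripe
            p = hsv[y:y + patch, x:x + patch]
            hist, _ = np.histogramdd(p.reshape(-1, 3).astype(float),
                                     bins=(bins, bins, bins),
                                     range=((0, 180), (0, 256), (0, 256)))
            row_hists.append(hist.ravel())
        # Max-pool each histogram bin over all patches in the stripe.
        feats.append(np.max(row_hists, axis=0))
    return np.concatenate(feats)
```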
Proposed approach
In VI-reID, the top priority is handling intra-modal and inter-modal differences. We observe that image features can be divided into modality-irrelevant features and modality-related features: the former include posture, gender, clothing type, etc., while the latter include texture, clothing color, etc. We aim to separate the modality-related information so as to encourage the network to learn modality-irrelevant features, and to project the cross-modal images into a graph space to reduce the impact of
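To illustrate how a graph space can connect the two modalities, the sketch below treats image-level features from both modalities as nodes of a single graph, builds edges from cosine similarity, and performs one step of message passing. This is an illustrative reading of the graph-interaction idea, not the paper's exact formulation of CGIM.

```python
# Hedged sketch of cross-modal graph interaction: visible and infrared
# features share one graph, so each node aggregates neighbors from both
# modalities, pulling cross-modal samples of the same identity together.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalGraphLayer(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.update = nn.Linear(feat_dim, feat_dim)

    def forward(self, vis_feats, inf_feats):  # (Nv, D), (Ni, D)
        nodes = torch.cat([vis_feats, inf_feats], dim=0)   # joint graph nodes
        sim = F.cosine_similarity(nodes.unsqueeze(1), nodes.unsqueeze(0), dim=-1)
        adj = torch.softmax(sim, dim=-1)                   # row-normalized edges
        # Message passing with a residual connection.
        out = F.relu(self.update(adj @ nodes)) + nodes
        return out[:vis_feats.size(0)], out[vis_feats.size(0):]

# Usage with dummy features:
vis, inf = torch.randn(8, 512), torch.randn(8, 512)
vis_out, inf_out = CrossModalGraphLayer()(vis, inf)
```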
Experiments
In this section, we present the experimental setting and implementation details. We validate the superiority of the proposed method by comparing it with other state-of-the-art methods and discuss the effectiveness of its components.
Conclusions
Most current methods use modality transformation to reduce the modality gap, while others focus on modality alignment to learn cross-modal feature representations. In this paper, we aim to address the modality gap problem in person Re-ID by learning discriminative feature representations. First, we propose the Intra-modal Feature Extraction Module (IFEM) to learn modality-irrelevant and modality-related features of images. We build a Hierarchical Attention Module (HAM) to make the model pay
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The work is partially supported by the National Natural Science Foundation of China (Nos. U1836216, 62176144, 62076153), the major fundamental research project of Shandong, China (No. ZR2019ZD03), and the Taishan Scholar Project of Shandong Province, China (No. ts20190924).
References (53)
- et al., A feature disentangling approach for person re-identification via self-supervised data augmentation, Appl. Soft Comput. (2021)
- et al., Person re-identification based on multi-scale feature learning, Knowl.-Based Syst. (2021)
- et al., Depth occlusion perception feature analysis for person re-identification, Pattern Recognit. Lett. (2020)
- et al., Adaptive deep metric embeddings for person re-identification under occlusions, Neurocomputing (2019)
- et al., Modality adversarial neural network for visible-thermal person re-identification, Pattern Recognit. (2020)
- et al., Global-local graph convolutional network for cross-modality person re-identification, Neurocomputing (2021)
- et al., Person re-identification with feature pyramid optimization and gradual background suppression, Neural Netw. (2020)
- et al., Deep-person: Learning discriminative deep features for person re-identification, Pattern Recognit. (2020)
- et al., Camera style transformation with preserved self-similarity and domain-dissimilarity in unsupervised person re-identification, J. Vis. Commun. Image Represent. (2021)
- et al., Dual-modality hard mining triplet-center loss for visible infrared person re-identification, Knowl.-Based Syst. (2021)
- Beyond modality alignment: Learning part-level representation for visible-infrared person re-identification, Image Vis. Comput.
- Long-short temporal–spatial clues excited network for robust person re-identification, Int. J. Comput. Vis.
- Guided attention in CNNs for occluded pedestrian detection and re-identification, Int. J. Comput. Vis.
- Illumination-adaptive person re-identification, IEEE Trans. Multimed.
- Deep high-resolution representation learning for cross-resolution person re-identification, IEEE Trans. Image Process.
- Joint bilateral-resolution identity modeling for cross-resolution person re-identification, Int. J. Comput. Vis.
- Cross-modality paired-images generation and augmentation for RGB-infrared person re-identification, Neural Netw.
- Attend to the difference: Cross-modality person re-identification via contrastive correlation, IEEE Trans. Image Process.
☆ This paper has been recommended for acceptance by Zicheng Liu.