Full length article
HCFN: Hierarchical cross-modal shared feature network for visible-infrared person re-identification

https://doi.org/10.1016/j.jvcir.2022.103689

Abstract

Compared with traditional visible–visible person re-identification, the modality discrepancy between visible and infrared images makes visible–infrared person re-identification more challenging. Existing methods rely on learning transformation mechanisms between paired images to reduce the modality gap, which inevitably introduces noise. To overcome these limitations, we propose a Hierarchical Cross-modal shared Feature Network (HCFN) to mine modality-shared and modality-specific information. Since infrared images lack color and other appearance cues, we construct an Intra-modal Feature Extraction Module (IFEM) to learn content information and reduce the difference between visible and infrared images. To further reduce the heterogeneous gap, we apply a Cross-modal Graph Interaction Module (CGIM) to align inter-modal images and narrow their set-level distance. By jointly learning the two modules, our method achieves 66.44% Rank-1 accuracy on the SYSU-MM01 dataset and 74.81% Rank-1 accuracy on the RegDB dataset, outperforming state-of-the-art methods. In addition, ablation experiments demonstrate that HCFN is at least 4.9% better than the baseline network.

Introduction

Person re-identification (Re-ID) is the task of retrieving pedestrians with the same identity across cameras. It is of great significance to large-scale data search and intelligent security, and also plays an important role in criminal investigations. Therefore, scholars have conducted extensive research on person Re-ID [1], [2], [3], [4], [5], [6]. Traditional visible–visible person re-identification (VV-reID) assumes daytime conditions with sufficient light, and a series of solutions has been proposed for problems such as background occlusion [7], [8], [9], illumination changes [10], [11], [12] and resolution differences [13], [14], [15], achieving remarkable results. However, at night or under weak lighting, visible cameras cannot capture clear images, and nighttime is when illegal activities such as robbery and terrorism frequently occur. In this case, most surveillance systems switch to infrared night-vision imaging to compensate for the limitations of visible-light imaging. How to perform re-identification across the two modalities thus becomes an urgent problem, giving rise to visible–infrared person re-identification (VI-reID).

Differences between visible and infrared images lead to two important challenges for VI-reID: (1) The inter-modal gap between visible and infrared images. An infrared detector forms an image by measuring the difference in thermal radiation between the target and the background. Compared with visible images, infrared images have blurry visual effects and low contrast. Moreover, there is no linear relationship between visible and infrared images in grayscale distribution, color or texture. As shown in Fig. 1, red, yellow, white and even black clothing may all appear white under an infrared camera. This large modality gap is difficult for traditional feature-based methods to bridge. (2) The intra-modal variations within visible and infrared images. Due to changes in appearance and environmental conditions, pedestrians exhibit large intra-class variations and inter-class similarities, meaning that the similarity between images of different people may exceed that between images of the same person. As shown in Fig. 1, the second and fifth persons (from left to right) in the second row are visually similar, while the visible and infrared images of the second person look quite different. The heterogeneity of the low-level features of the two modalities therefore prevents us from measuring the similarity between visible and infrared images by direct comparison.

To address these problems, existing methods follow two directions, as shown in Fig. 2: (1) Modality transformation, which converts visible and infrared images into each other [16], [17], [18]. A common approach is to use a Generative Adversarial Network (GAN) as a modality converter that maps an infrared/visible image to its visible/infrared counterpart, turning the cross-modal problem into a single-modal one. However, GANs are cumbersome and expensive to train, and the generation error accumulates with the matching error, which reduces model accuracy. (2) Modality alignment, which is mainly implemented by feature alignment and metric-loss optimization [19], [20], [21], [22]. The former designs a one-stream or two-stream network to learn modality-shared features of visible and infrared images, while the latter reduces cross-modal differences by designing metric loss functions. Most methods focus on learning inter-modal features to reduce the modality gap while ignoring similarity learning among intra-modal samples, so the model cannot learn discriminative features, which limits its performance.
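To make the modality-alignment idea concrete, the sketch below shows a generic two-stream model of the kind cited above (not the HCFN proposed in this paper): modality-specific shallow layers for visible and infrared inputs feed a weight-shared deep backbone, and the pooled embedding is supervised with an identity classifier, to which a metric loss can be added. The module layout, the number of identities and the feature dimension are illustrative assumptions.

```python
# Illustrative sketch of a generic two-stream modality-alignment model,
# not the authors' HCFN; layout and hyperparameters are assumptions.
import torch.nn as nn
import torchvision.models as models


class TwoStreamReID(nn.Module):
    def __init__(self, num_ids=395, feat_dim=2048):
        super().__init__()
        vis, inf, shared = (models.resnet50(weights=None) for _ in range(3))
        # Modality-specific shallow layers absorb modality-specific statistics.
        self.vis_stem = nn.Sequential(vis.conv1, vis.bn1, vis.relu, vis.maxpool, vis.layer1)
        self.inf_stem = nn.Sequential(inf.conv1, inf.bn1, inf.relu, inf.maxpool, inf.layer1)
        # Deeper layers are weight-shared to learn modality-shared features.
        self.shared = nn.Sequential(shared.layer2, shared.layer3, shared.layer4)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.bnneck = nn.BatchNorm1d(feat_dim)
        self.classifier = nn.Linear(feat_dim, num_ids, bias=False)

    def forward(self, x, modality="visible"):
        stem = self.vis_stem if modality == "visible" else self.inf_stem
        f = self.pool(self.shared(stem(x))).flatten(1)   # embedding for a metric loss
        logits = self.classifier(self.bnneck(f))         # logits for the identity loss
        return f, logits
```

A cross-modal triplet- or center-style loss on the returned embeddings would then pull same-identity features from the two streams together.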

To address the above problems, we focus on two aspects: (1) separating modality-specific features and extracting modality-shared features; (2) reducing the set-level differences between cross-modal images. We propose a novel Hierarchical Cross-modal shared Feature Network (HCFN), which mitigates modality differences in the feature space and maintains identity consistency at the set level for VI-reID. HCFN consists of two main modules: the Intra-modal Feature Extraction Module (IFEM) and the Cross-modal Graph Interaction Module (CGIM). IFEM learns representations of visible and infrared images by mining layer-level attention rather than pixel-level attention, reducing the effect of the background on pedestrians. In addition, CGIM builds a bridge between visible and infrared images: its graph structure connects pedestrian images of the two modalities, enhancing the network's representation of identity-related features. We utilize the node relationships of CGIM to effectively reduce modality differences and speed up model training.

The main contributions of this paper are as follows:

  1. We propose a novel cross-modal person re-identification method, called the Hierarchical Cross-modal shared Feature Network (HCFN). HCFN mines visible and infrared images layer by layer without complex modality transformation. It not only considers modality-related hidden representations, but also establishes connections between the two modalities.

  2. We propose an Intra-modal Feature Extraction Module (IFEM), which enhances the representation of modality-shared features through a proposed Hierarchical Attention Module (HAM). In particular, HAM aggregates the rich layer-level features of the baseline network and reinforces the network's ability to represent cross-modal images by mining their association information (see the sketch after this list).

  3. Extensive experiments on the benchmark datasets SYSU-MM01 and RegDB show the superiority of the proposed method. In addition, we conduct ablation studies to verify the effectiveness of each component of the framework.
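Since the implementation details of HAM are not shown in this snippet view, the following is a minimal sketch of the layer-level attention idea, assuming a ResNet-50 backbone whose four stage outputs are pooled, projected to a common dimension, and fused with learned attention weights. Module names and dimensions are our own illustrative choices, not the paper's code.

```python
# Minimal sketch of hierarchical (layer-level) attention over ResNet stages;
# an illustration of the idea behind HAM, not the paper's exact module.
import torch
import torch.nn as nn
import torchvision.models as models


class LayerLevelAttention(nn.Module):
    def __init__(self, stage_dims=(256, 512, 1024, 2048), embed_dim=2048):
        super().__init__()
        backbone = models.resnet50(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4])
        # Project each stage's pooled feature into a common embedding space.
        self.projs = nn.ModuleList([nn.Linear(d, embed_dim) for d in stage_dims])
        # One scalar attention logit per stage, predicted from its own feature.
        self.scorers = nn.ModuleList([nn.Linear(embed_dim, 1) for _ in stage_dims])
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        feats, logits = [], []
        h = self.stem(x)
        for stage, proj, scorer in zip(self.stages, self.projs, self.scorers):
            h = stage(h)
            f = proj(self.pool(h).flatten(1))      # (B, embed_dim) per stage
            feats.append(f)
            logits.append(scorer(f))               # (B, 1) attention logit
        attn = torch.softmax(torch.cat(logits, dim=1), dim=1)   # (B, 4) weights over stages
        stacked = torch.stack(feats, dim=1)                     # (B, 4, embed_dim)
        return (attn.unsqueeze(-1) * stacked).sum(dim=1)        # attention-weighted fusion
```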

Section snippets

Related work

Visible–visible person Re-ID (VV-reID) aims to retrieve pedestrian images across camera views. Its challenges come from changes in lighting conditions, camera angles, pedestrian poses, etc. Previous studies focused on handcrafted feature extraction and subspace learning [23], [24], [25], [26]. Liao et al. [26] first enhanced the color information of the image by Retinex pre-processing, extracting HSV and SILTP features in each patch and taking the maximum value of each

Proposed approach

In VI-reID, the top priority is handling intra-modal and inter-modal differences. We observe that image features can be divided into modality-irrelevant features and modality-related features. The former include posture, gender, clothing type, etc.; the latter include texture, clothing color, etc. We aim to separate the modality-related information so as to encourage the network to learn modality-irrelevant features, and to project the cross-modal images into a graph space to reduce the impact of
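The snippet above is truncated, so as a rough stand-in for the cross-modal graph-interaction idea, the sketch below treats the visible and infrared embeddings of a batch as graph nodes, builds a soft adjacency from their cosine similarities, and performs one round of message passing; a set-level distance between the modality centers is included as an example regularizer. The module names and the temperature value are our assumptions; this is not the paper's CGIM.

```python
# Simplified sketch of cross-modal graph interaction over a batch of embeddings;
# an illustration of set-level alignment via a graph, not the paper's CGIM.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalGraphLayer(nn.Module):
    def __init__(self, dim=2048, temperature=0.1):
        super().__init__()
        self.update = nn.Linear(dim, dim)
        self.temperature = temperature

    def forward(self, f_vis, f_inf):
        # Nodes are all visible and infrared embeddings in the batch.
        nodes = torch.cat([f_vis, f_inf], dim=0)                  # (2B, dim)
        normed = F.normalize(nodes, dim=1)
        adj = torch.softmax(normed @ normed.t() / self.temperature, dim=1)  # soft adjacency
        # One round of message passing mixes information across modalities.
        nodes = nodes + F.relu(self.update(adj @ nodes))
        return nodes[: f_vis.size(0)], nodes[f_vis.size(0):]


def set_level_distance(f_vis, f_inf):
    """L2 distance between the mean embeddings (set centers) of the two modalities."""
    return (f_vis.mean(dim=0) - f_inf.mean(dim=0)).norm(p=2)
```

In such a setup, the set-level distance can be added to the training objective to explicitly narrow the gap between the two modality sets.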

Experiments

In this section, we present the experimental setting and implementation details. We validate the superiority of the proposed method by comparing it with other state-of-the-art methods and discuss the effectiveness of its components.
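For reference, Rank-1 and mAP in VI-reID are computed by ranking gallery embeddings of one modality against query embeddings of the other. The minimal sketch below assumes L2-normalized feature matrices and NumPy ID arrays, and omits the camera-filtering rules of the official SYSU-MM01 and RegDB protocols.

```python
# Minimal sketch of Rank-1 / mAP computation for cross-modal retrieval;
# official protocols additionally apply camera-based gallery filtering.
import numpy as np


def evaluate(query_feats, query_ids, gallery_feats, gallery_ids):
    """query_feats: (Nq, d), gallery_feats: (Ng, d); rows assumed L2-normalized."""
    sims = query_feats @ gallery_feats.T                  # cosine similarity matrix
    order = np.argsort(-sims, axis=1)                     # gallery indices, best first
    rank1_hits, aps = [], []
    for i in range(len(query_ids)):
        matches = (gallery_ids[order[i]] == query_ids[i]).astype(np.float32)
        rank1_hits.append(matches[0])                     # 1 if the top match is correct
        if matches.sum() == 0:
            continue
        precision_at_hit = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        aps.append((precision_at_hit * matches).sum() / matches.sum())
    return float(np.mean(rank1_hits)), float(np.mean(aps))
```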

Conclusions

Most current methods use modality transformation to reduce the modality gap, while others focus on modality alignment to learn cross-modal feature representations. In this paper, we address the modality gap problem in person Re-ID by learning discriminative feature representations. First, we propose the Intra-modal Feature Extraction Module (IFEM) to learn modality-irrelevant and modality-related features of images. We build a Hierarchical Attention Module (HAM) to make the model pay

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The work is partially supported by the National Natural Science Foundation of China (Nos. U1836216, 62176144, 62076153), the major fundamental research project of Shandong, China (No. ZR2019ZD03), and the Taishan Scholar Project of Shandong Province, China (No. ts20190924).

References (53)

  • Peng Zhang et al., Beyond modality alignment: Learning part-level representation for visible-infrared person re-identification, Image Vis. Comput. (2021)
  • Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, Shengjin Wang, Beyond part models: Person retrieval with refined part pooling...
  • Qize Yang, Hong-Xing Yu, Ancong Wu, Wei-Shi Zheng, Patch-based discriminative feature learning for unsupervised person...
  • Cheng Wang, Qian Zhang, Chang Huang, Wenyu Liu, Xinggang Wang, Mancs: A multi-task attentional network with curriculum...
  • Shuai Li et al., Long-short temporal–spatial clues excited network for robust person re-identification, Int. J. Comput. Vis. (2020)
  • Shanshan Zhang et al., Guided attention in CNNs for occluded pedestrian detection and re-identification, Int. J. Comput. Vis. (2021)
  • Zelong Zeng et al., Illumination-adaptive person re-identification, IEEE Trans. Multimed. (2020)
  • Yukun Huang, Zheng-Jun Zha, Xueyang Fu, Wei Zhang, Illumination-invariant person re-identification, in: Proceedings of...
  • Xuan Zhao, Xin Xu, Multi-granularity and Multi-semantic Model for Person Re-identification in Variable Illumination,...
  • Guoqing Zhang et al., Deep high-resolution representation learning for cross-resolution person re-identification, IEEE Trans. Image Process. (2021)
  • Zhiyi Cheng, Qi Dong, Shaogang Gong, Xiatian Zhu, Inter-task association critic for cross-resolution person...
  • Wei-Shi Zheng et al., Joint bilateral-resolution identity modeling for cross-resolution person re-identification, Int. J. Comput. Vis. (2022)
  • Guan'an Wang, Tianzhu Zhang, Jian Cheng, Si Liu, Yang Yang, Zengguang Hou, Rgb-infrared cross-modality person...
  • Seokeon Choi, Sumin Lee, Youngeun Kim, Taekyung Kim, Changick Kim, Hi-CMD: Hierarchical cross-modality disentanglement...
  • Yang Yang et al., Cross-modality paired-images generation and augmentation for RGB-infrared person re-identification, Neural Netw. (2020)
  • Shizhou Zhang et al., Attend to the difference: Cross-modality person re-identification via contrastive correlation, IEEE Trans. Image Process. (2021)
This paper has been recommended for acceptance by Zicheng Liu.
