Image and Vision Computing
Volume 89, September 2019, Pages 50-56

Discriminative Supervised Hashing for Cross-Modal Similarity Search

https://doi.org/10.1016/j.imavis.2019.06.004

Abstract

With the advantages of low storage cost and high retrieval efficiency, hashing techniques have recently become an emerging topic in cross-modal similarity search. Since data from multiple modalities reflect similar semantic content, many works aim at learning unified binary codes. However, the hashing features learned by these methods are insufficiently discriminative, which lowers accuracy and robustness. We propose a novel hashing learning framework, termed Discriminative Supervised Hashing (DSH), which jointly performs classifier learning, subspace learning, and matrix factorization to preserve class-specific semantic content and to learn discriminative unified binary codes for multi-modal data. In addition, to reduce information loss and preserve the non-linear structure of the data, DSH non-linearly projects the different modalities into a common space in which the similarity among heterogeneous data points can be measured. Extensive experiments conducted on three publicly available datasets demonstrate that the proposed framework outperforms several state-of-the-art methods.

Introduction

Recently, the explosion of multimedia data, such as images, text, video, and audio, has increased the demand for retrieval applications with high efficiency, low storage cost, and strong effectiveness. Hashing has received much attention in information retrieval and related areas because of its high retrieval speed. Among the many hashing methods [[1], [2], [3], [4], [5], [6], [7], [8]], Minimal Loss Hashing (MLH) [2] is a framework based on the latent structural SVM, while Kernel-based Supervised Hashing (KSH) [3], Supervised Discrete Hashing with point-wise labels (SDH) [1], and Scalable Discrete Hashing with pairwise supervision (COSDISH) [7] have been shown to deliver reasonable retrieval performance. However, these methods were designed for the unimodal setting and are not directly applicable to cross-modal retrieval.

Cross-modal retrieval is a very interesting scenario; for example, given an image, it is possible to retrieve semantically relevant texts from the database. However, it is hard to directly measure the similarity between different modalities. To tackle this problem, most existing methods [[9], [10], [11], [12], [13], [14]] focus on finding a common subspace in which the heterogeneous data can be measured. For instance, the main idea of Inter-Media Hashing (IMH) [10] is that two points from the same neighborhood should be as close as possible in the common subspace. Semi-Paired Discrete Hashing (SPDH) [13] explores the common latent subspace by constructing a cross-view similarity graph, and Fusion Similarity Hashing (FSH) [9] learns the hashing function by preserving the fusion similarity. However, the hashing codes learned by these methods have weak discrimination ability. Benefiting from the discriminative information provided by category labels, supervised hashing methods [[15], [16], [17], [18]] often improve retrieval accuracy. Cross-View Hashing (CVH) [17] aims to minimize the Hamming distance between data objects belonging to the same class in a common Hamming space. Semantic Correlation Maximization (SCM) [15] learns discriminative binary codes based on the cosine similarity between semantic label vectors. Supervised Matrix Factorization Hashing (SMFH) [18] integrates graph regularization into the hashing learning framework. However, these methods learn hash codes by preserving inter-modal and intra-modal similarities and cannot ensure that the learned codes are semantically discriminative. In fact, it is very important for cross-modal similarity search that samples with the same label have similar binary codes. Moreover, the computational cost of the inter-modal and intra-modal similarities is relatively high.

To tackle this problem, we propose DSH, a model that integrates classifier learning and label-consistent matrix factorization into a unified hashing learning framework. Furthermore, kernelized hash functions are learned for out-of-sample extension. Fig. 1 illustrates the overall framework of the proposed DSH. Compared with Ref. [19], our framework explores the shared structure of each category. The main contributions of the DSH method are as follows:

  1. To learn more discriminative binary codes, DSH learns unified binary codes by combining classifier learning with label-consistent matrix factorization.

  2. DSH learns hashing functions for each modality by employing the kernel method, which can capture the non-linear structural information of the data (a minimal sketch of this step follows the list).
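As a minimal sketch of the kernelized hashing step in contribution 2: this excerpt does not reproduce DSH's exact objective, so the code below follows the common KSH/SDH-style recipe that the contribution alludes to, mapping one modality onto RBF kernel features over randomly sampled anchor points and then ridge-regressing the learned unified binary codes onto those features. The anchor count, kernel width `sigma`, and regularizer `lam` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def rbf_features(X, anchors, sigma=1.0):
    """phi(x) = [k(x, a_1), ..., k(x, a_q)] with a Gaussian kernel."""
    # Squared Euclidean distance between every sample and every anchor.
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def learn_hash_function(X, B, num_anchors=300, sigma=1.0, lam=1e-3, seed=0):
    """Fit W so that sign(phi(x) @ W) reproduces the unified codes B.

    X : (n, d) features of one modality; B : (n, r) codes in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    anchors = X[rng.choice(len(X), size=num_anchors, replace=False)]
    Phi = rbf_features(X, anchors, sigma)                      # (n, q)
    # Ridge regression from kernel features to the binary codes.
    W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(num_anchors), Phi.T @ B)
    return anchors, W

def out_of_sample_codes(X_query, anchors, W, sigma=1.0):
    """Out-of-sample extension: binarize the kernelized projection."""
    return np.sign(rbf_features(X_query, anchors, sigma) @ W)
```

At query time each modality uses its own (anchors, W) pair, so an image query and a text document are mapped into the same Hamming space and compared by Hamming distance.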

Structurally, the rest of this paper falls into three sections. Our model and optimization algorithm are presented in Section 2. Section 3 reports the experimental results on three publicly available datasets. Finally, conclusions are drawn in Section 4. The source code of the DSH method proposed in this paper is available.


Notation and problem statement

Suppose that $O = [o_1, o_2, \ldots, o_n]$ is a set of $n$ training instances with $m$ modality pairs. $X^{(m)} = [x_1^{(m)}, x_2^{(m)}, \ldots, x_n^{(m)}]$ denotes the $m$-th modality, where $x_i^{(m)} \in \mathbb{R}^{d_m}$ is the $i$-th sample of $X^{(m)}$ with dimension $d_m$. $L = [l_1, l_2, \ldots, l_n] \in \mathbb{R}^{c \times n}$ is a label matrix, where $c$ denotes the number of categories. $l_{ik}$ is the $k$-th element of $l_i$, with $l_{ik} = 1$ if the $i$-th instance belongs to the $k$-th category and $l_{ik} = 0$ otherwise. Here, an instance $o_i$ can be classified into multiple categories. Without loss of generality, the data of each modality are assumed to be zero-centered.
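As a concrete illustration of this notation, the toy snippet below (hypothetical data, not from the paper) builds the label matrix $L \in \mathbb{R}^{c \times n}$ for $n = 4$ instances and $c = 3$ categories, where an instance may carry several labels.

```python
import numpy as np

# Hypothetical toy labels: instance i belongs to the listed categories.
categories = [[0, 2], [1], [2], [0, 1]]

c, n = 3, len(categories)
L = np.zeros((c, n))             # label matrix, shape (c, n) as in the text
for i, cats in enumerate(categories):
    L[cats, i] = 1               # l_ik = 1 iff instance i is in category k

print(L)
# [[1. 0. 0. 1.]
#  [0. 1. 0. 1.]
#  [1. 0. 1. 0.]]
```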

Datasets

Wiki [22] contains 2,866 multimedia documents harvested from Wikipedia. Every document consists of an image paired with a text description, and every pair is classified into one of 10 categories. The pairs are split into a training set and a test set.

The MirFlickr25k dataset [23] is collected from the Flickr website. It consists of 25,000 image-text pairs, and each pair is assigned to some of 20 categories. We keep the 20,015 pairs whose textual tags appear at least 20 times.

Conclusion

In this paper, we propose a new model (DSH) which integrates subspace learning, classifier learning, and basis matrix learning into a joint framework to learn unified hashing features that both retain discrimination ability and preserve class-specific content by using the label matrix. In contrast to previous works, a non-linear method is introduced to learn the common subspace. We adopt the efficient DCC algorithm to optimize the problem under the discrete constraint. We evaluate our method on three publicly available datasets, and the results demonstrate that DSH outperforms several state-of-the-art methods.
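The DCC solver referred to above comes from SDH [1] and updates the binary code matrix one bit (row) at a time, each row having a closed-form optimum while the others are held fixed. Since DSH's full objective is not reproduced in this excerpt, the sketch below illustrates DCC on the canonical SDH-style subproblem $\min_B \|Y - W^{\top}B\|_F^2$ subject to $B \in \{-1, +1\}^{r \times n}$; the variable names and shapes are assumptions for illustration only.

```python
import numpy as np

def dcc_update(B, W, Y, n_sweeps=3):
    """Discrete cyclic coordinate descent for
    min_B ||Y - W^T B||_F^2  s.t.  B in {-1, +1}^{r x n},
    with B : (r, n) codes, W : (r, c), Y : (c, n) targets.
    """
    Q = W @ Y                                # (r, n), the linear term
    r = B.shape[0]
    for _ in range(n_sweeps):
        for l in range(r):                   # update one bit row at a time
            rest = [k for k in range(r) if k != l]
            # Closed-form optimum of row l with all other rows fixed.
            z = Q[l] - (W[rest] @ W[l]) @ B[rest]
            z[z == 0] = 1                    # avoid sign(0) = 0
            B[l] = np.sign(z)
    return B
```

A few sweeps usually suffice, since each row update can never increase the objective.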

Declaration of Competing Interest

The authors declare no conflict of interest.

Acknowledgments

The paper is supported by the National Natural Science Foundation of China (Grant Nos. 61373055, 61672265), UK EPSRC Grant EP/N007743/1, MURI/EPSRC/DSTL Grant EP/R018456/1, and the 111 Project of the Ministry of Education of China (Grant No. B12018).

References (25)

  • F. Shen et al., Supervised discrete hashing
  • M. Norouzi et al., Minimal loss hashing for compact binary codes
  • W. Liu et al., Supervised hashing with kernels
  • W. Liu et al., Discrete graph hashing
  • H. Liu et al., Towards optimal binary code learning via ordinal embedding
  • H. Liu et al., Ordinal constrained binary code learning for nearest neighbor search
  • W.-C. Kang et al., Column sampling based discrete supervised hashing
  • R. Ji et al., Toward optimal manifold hashing via discrete locally linear embedding, IEEE Trans. Image Process. (2017)
  • H. Liu et al., Cross-modality binary code learning via fusion similarity hashing
  • J. Song et al., Inter-media hashing for large-scale retrieval from heterogeneous data sources
  • D. Wang et al., Robust and flexible discrete hashing for cross-modal similarity search, IEEE Trans. Circuits Syst. Video Technol. (2017)
  • X. Shen et al., Robust cross-view hashing for multimedia retrieval, IEEE Signal Process. Lett. (2016)