ABSTRACT
Massive quantities of images and texts are emerging on the Internet, creating demand for effective cross-modal retrieval such as text-to-image and image-to-text search. To eliminate the heterogeneity between the image and text modalities, existing subspace learning methods try to learn a common latent subspace in which cross-modal matching can be performed. However, these methods usually require fully paired samples (images with corresponding texts) and ignore the class label information that accompanies the paired samples. This may prevent them from learning an effective subspace, since the correlations between the two modalities are only implicitly incorporated. The class label information can reduce the semantic gap between different modalities and explicitly guide the subspace learning procedure. In addition, the large quantities of unpaired samples (images or texts) may provide useful side information to enrich the representations learned in the subspace. In this paper we therefore propose a novel model for the cross-modal retrieval problem. It consists of 1) a semi-supervised coupled dictionary learning step that generates homogeneous sparse representations for the different modalities from both paired and unpaired samples; and 2) a coupled feature mapping step that projects the sparse representations of the different modalities into a common subspace defined by the class label information, where cross-modal matching is performed. Experiments on the large-scale web image dataset MIRFlickr-1M, under both fully paired and unpaired settings, show the effectiveness of the proposed model on the cross-modal retrieval task.
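The two-step pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain ISTA sparse coding with a least-squares dictionary update for step 1, and a ridge-regression map from sparse codes to the one-hot label space for step 2 (a common simplification of label-guided feature mapping). All data, dimensions, and function names below are hypothetical toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def ista_sparse_codes(X, D, lam=0.1, n_iter=100):
    """Sparse codes A minimizing ||X - D A||^2 + lam*||A||_1 via ISTA.
    X: (d, n) data; D: (d, k) dictionary with unit-norm atoms."""
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the smooth part
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        A = A - D.T @ (D @ A - X) / L          # gradient step
        A = np.sign(A) * np.maximum(np.abs(A) - lam / L, 0.0)  # soft threshold
    return A

def learn_dictionary(X, k, lam=0.1, n_outer=10):
    """Alternate sparse coding and a least-squares dictionary update."""
    D = rng.standard_normal((X.shape[0], k))
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    for _ in range(n_outer):
        A = ista_sparse_codes(X, D, lam)
        D = X @ A.T @ np.linalg.pinv(A @ A.T + 1e-6 * np.eye(k))
        D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    return D

# Toy paired data: 2 classes; image features (d=20) and text features (d=15)
n, k = 40, 12
labels = np.repeat([0, 1], n // 2)
Y = np.eye(2)[labels].T                       # (2, n) one-hot label matrix
img = rng.standard_normal((20, n)) + 3.0 * np.outer(rng.standard_normal(20), labels)
txt = rng.standard_normal((15, n)) + 3.0 * np.outer(rng.standard_normal(15), labels)

# Step 1: one dictionary per modality -> homogeneous sparse representations
D_img, D_txt = learn_dictionary(img, k), learn_dictionary(txt, k)
A_img, A_txt = ista_sparse_codes(img, D_img), ista_sparse_codes(txt, D_txt)

# Step 2: map each modality's sparse codes into the shared label space
def label_map(A, Y, reg=1e-3):
    """Ridge-regression projection from sparse codes to the label space."""
    return Y @ A.T @ np.linalg.pinv(A @ A.T + reg * np.eye(A.shape[0]))

P_img, P_txt = label_map(A_img, Y), label_map(A_txt, Y)

# Cross-modal matching: cosine similarity in the common label space
def normalize(Z):
    return Z / (np.linalg.norm(Z, axis=0, keepdims=True) + 1e-12)

Z_img, Z_txt = normalize(P_img @ A_img), normalize(P_txt @ A_txt)
sim = Z_img.T @ Z_txt                 # (n, n): image queries vs. text gallery
retrieved = sim.argmax(axis=1)        # best-matching text for each image
accuracy = (labels[retrieved] == labels).mean()
```

On this separable toy data, images should mostly retrieve texts of their own class; the point of the sketch is only the structure (per-modality sparse coding, then a label-defined common space), not retrieval quality.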
Index Terms: Semi-supervised Coupled Dictionary Learning for Cross-modal Retrieval in Internet Images and Texts