DOI: 10.1145/2733373.2806346

Semi-supervised Coupled Dictionary Learning for Cross-modal Retrieval in Internet Images and Texts

Published: 13 October 2015

ABSTRACT

Massive quantities of images and texts are now emerging on the Internet, creating demand for effective cross-modal retrieval such as text-to-image and image-to-text search. To bridge the heterogeneity between the image and text modalities, existing subspace learning methods try to learn a common latent subspace in which cross-modal matching can be performed. However, these methods usually require fully paired samples (images with corresponding texts) and ignore the class label information that accompanies the paired samples. This may prevent them from learning an effective subspace, since the correlations between the two modalities are only implicitly incorporated. Indeed, class label information can reduce the semantic gap between the modalities and explicitly guide the subspace learning procedure. In addition, the large quantities of unpaired samples (images or texts alone) can provide useful side information to enrich the representations in the learned subspace. In this paper we therefore propose a novel model for the cross-modal retrieval problem. It consists of 1) a semi-supervised coupled dictionary learning step that generates homogeneous sparse representations for the different modalities from both paired and unpaired samples; and 2) a coupled feature mapping step that projects the sparse representations of the different modalities into a common subspace defined by the class label information, where cross-modal matching is performed. Experiments on the large-scale web image dataset MIRFlickr-1M, under both fully paired and unpaired settings, show the effectiveness of the proposed model on the cross-modal retrieval task.
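
To make the two-step pipeline concrete, here is a minimal illustrative sketch in Python (NumPy + scikit-learn). It is not the authors' implementation: the semi-supervised coupling terms that tie paired samples together and exploit unpaired ones are omitted, each modality's dictionary is learned independently, and all names and hyper-parameters (n_atoms, alpha, lam, ridge_map, ...) are assumptions made for illustration. It only shows the general shape of the approach: sparse-code each modality over a learned dictionary, then map the codes into a label-defined common subspace where matching is performed.

# Illustrative sketch of the two-step pipeline described in the abstract.
# NOT the authors' method: the coupling and unpaired-sample terms of the
# actual model are omitted; all names and hyper-parameters are assumptions.
import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.default_rng(0)

# Toy paired data: n samples with image features (d_img), text features
# (d_txt), and one-hot class labels (n_classes).
n, d_img, d_txt, n_classes, n_atoms = 200, 128, 64, 10, 50
X_img = rng.standard_normal((n, d_img))
X_txt = rng.standard_normal((n, d_txt))
Y = np.eye(n_classes)[rng.integers(0, n_classes, n)]  # class labels

# Step 1 (simplified): learn a dictionary per modality by alternating
# sparse coding and a least-squares dictionary update. A coupled model
# would additionally tie the codes of paired samples together and draw
# on unpaired samples; here each modality is encoded independently.
def learn_dictionary(X, n_atoms, n_iter=10, alpha=0.1):
    D = rng.standard_normal((n_atoms, X.shape[1]))
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    for _ in range(n_iter):
        A = sparse_encode(X, D, algorithm='lasso_lars', alpha=alpha)
        D = np.linalg.lstsq(A, X, rcond=None)[0]       # dictionary update
        D /= np.linalg.norm(D, axis=1, keepdims=True) + 1e-12
    return D

D_img = learn_dictionary(X_img, n_atoms)
D_txt = learn_dictionary(X_txt, n_atoms)
A_img = sparse_encode(X_img, D_img, algorithm='lasso_lars', alpha=0.1)
A_txt = sparse_encode(X_txt, D_txt, algorithm='lasso_lars', alpha=0.1)

# Step 2: map each modality's sparse codes into the label space with a
# ridge-regression mapping, so matching happens in a common subspace
# defined by the class label information.
def ridge_map(A, Y, lam=1.0):
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Y)

P_img = ridge_map(A_img, Y)
P_txt = ridge_map(A_txt, Y)

# Cross-modal retrieval: rank texts for an image query by cosine
# similarity of the projected representations.
q = A_img[0] @ P_img
G = A_txt @ P_txt
scores = (G @ q) / (np.linalg.norm(G, axis=1) * np.linalg.norm(q) + 1e-12)
print("top-5 retrieved text indices:", np.argsort(-scores)[:5])

The key difference from this sketch is that the paper's full model learns the two dictionaries jointly, so paired images and texts obtain correlated sparse codes and unpaired samples enrich the dictionaries; the independent per-modality encoding above is exactly what the coupling is meant to replace.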

    • Published in

      MM '15: Proceedings of the 23rd ACM International Conference on Multimedia
      October 2015, 1402 pages
      ISBN: 978-1-4503-3459-4
      DOI: 10.1145/2733373

      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Qualifiers

      • short-paper

      Acceptance Rates

      MM '15 paper acceptance rate: 56 of 252 submissions, 22%
      Overall acceptance rate: 995 of 4,171 submissions, 24%
