DOI: 10.1145/3581783.3610949
Deep Multimodal Learning for Information Retrieval

Published: 27 October 2023

ABSTRACT

Information retrieval (IR) is a fundamental technique that aims to acquire information from a collection of documents, web pages, or other sources. While traditional text-based IR has achieved great success, the under-utilization of data sources in other modalities (i.e., images, audio, and video) hinders IR techniques from reaching their full potential and thus limits their real-world applications. In recent years, the rapid development of deep multimodal learning has paved the way for advancing IR with multiple modalities. Benefiting from a variety of data types and modalities, recent prevailing techniques such as CLIP, ChatGPT, and GPT-4 have greatly facilitated multimodal learning and IR. In the context of IR, deep multimodal learning has shown prominent potential to improve the performance of retrieval systems by enabling them to better understand and process the diverse types of data they encounter. Given the great potential shown by multimodal-empowered IR, there remain unsolved challenges and open questions in related directions. With this workshop, we aim to provide a platform for discussion of multimodal IR among scholars, practitioners, and other interested parties.
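The core mechanism behind contrastive models such as CLIP can be sketched in a few lines: separate encoders map images and text into one shared embedding space, and cross-modal retrieval then reduces to ranking by cosine similarity. The sketch below is illustrative only; random vectors stand in for encoder outputs, and the dimensions and collection size are arbitrary assumptions, not details from any cited work.

```python
import math
import random

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

random.seed(0)

# Stand-ins for encoder outputs: in a real system these would come from a
# trained image encoder and text encoder projecting into a shared space.
image_embs = [l2_normalize([random.gauss(0, 1) for _ in range(8)])
              for _ in range(4)]                       # a "collection" of 4 images
text_emb = l2_normalize([random.gauss(0, 1) for _ in range(8)])  # one text query

# Cross-modal retrieval: rank images by cosine similarity to the text query.
scores = [dot(img, text_emb) for img in image_embs]
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])   # best match first
```

Because both modalities live in the same normalized space, the same index can serve text-to-image and image-to-text queries; production systems simply swap the brute-force dot products for an approximate nearest-neighbor index.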

References

  1. Hui Cui, Lei Zhu, Jingjing Li, Yang Yang, and Liqiang Nie. 2019. Scalable deep hashing for large-scale social image retrieval. IEEE Transactions on Image Processing, Vol. 29 (2019), 1271--1284.
  2. Yali Du, Yinwei Wei, Wei Ji, Fan Liu, Xin Luo, and Liqiang Nie. 2023. Multi-queue Momentum Contrast for Microvideo-Product Retrieval. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining. 1003--1011.
  3. Wei Ji, Long Chen, Yinwei Wei, Yiming Wu, and Tat-Seng Chua. 2022. MRTNet: Multi-Resolution Temporal Network for Video Sentence Grounding. arXiv preprint arXiv:2212.13163 (2022).
  4. Wei Ji, Xi Li, Fei Wu, Zhijie Pan, and Yueting Zhuang. 2019. Human-centric clothing segmentation via deformable semantic locality-preserving network. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 30, 12 (2019), 4837--4848.
  5. Wei Ji, Yicong Li, Meng Wei, Xindi Shang, Junbin Xiao, Tongwei Ren, and Tat-Seng Chua. 2021. VidVRD 2021: The Third Grand Challenge on Video Relation Detection. In Proceedings of the 29th ACM International Conference on Multimedia. 4779--4783.
  6. Wei Ji, Renjie Liang, Zhedong Zheng, Wenqiao Zhang, Shengyu Zhang, Juncheng Li, Mengze Li, and Tat-Seng Chua. 2023. Are Binary Annotations Sufficient? Video Moment Retrieval via Hierarchical Uncertainty-based Active Learning. (2023).
  7. Hao Jiang, Wenjie Wang, Yinwei Wei, Zan Gao, Yinglong Wang, and Liqiang Nie. 2020. What aspect do you like: Multi-scale time-aware user interest modeling for micro-video recommendation. In Proceedings of the 28th ACM International Conference on Multimedia. 3487--3495.
  8. Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. 2019. Visual semantic reasoning for image-text matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4654--4662.
  9. Yicong Li, Xiang Wang, Junbin Xiao, Wei Ji, and Tat-Seng Chua. 2022. Invariant grounding for video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2928--2937.
  10. Yaxin Liu, Jianlong Wu, Leigang Qu, Tian Gan, Jianhua Yin, and Liqiang Nie. 2022. Self-supervised Correlation Learning for Cross-Modal Retrieval. IEEE Transactions on Multimedia (2022).
  11. Peng Qi, Yuyan Bu, Juan Cao, Wei Ji, Ruihao Shui, Junbin Xiao, Danding Wang, and Tat-Seng Chua. 2023. FakeSV: A Multimodal Benchmark with Rich Social Context for Fake News Detection on Short Video Platforms. AAAI (2023).
  12. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning. 8748--8763.
  13. Xindi Shang, Yicong Li, Junbin Xiao, Wei Ji, and Tat-Seng Chua. 2021. Video visual relation detection via iterative inference. In Proceedings of the 29th ACM International Conference on Multimedia. 3654--3663.
  14. Hao Tan and Mohit Bansal. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 5100--5111.
  15. Meng Wei, Long Chen, Wei Ji, Xiaoyu Yue, and Tat-Seng Chua. 2022. Rethinking the two-stage framework for grounded situation recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 2651--2658.
  16. Junbin Xiao, Angela Yao, Zhiyuan Liu, Yicong Li, Wei Ji, and Tat-Seng Chua. 2022. Video as Conditional Graph Hierarchy for Multi-Granular Question Answering. AAAI.
  17. Shuyu Yang, Yinan Zhou, Yaxiong Wang, Yujiao Wu, Li Zhu, and Zhedong Zheng. 2023. Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark. In Proceedings of the 2023 ACM on Multimedia Conference.
  18. Xun Yang, Fuli Feng, Wei Ji, Meng Wang, and Tat-Seng Chua. 2021. Deconfounded video moment retrieval with causal intervention. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1--10.
  19. Chenchen Ye, Lizi Liao, Fuli Feng, Wei Ji, and Tat-Seng Chua. 2022. Structured and natural responses co-generation for conversational search. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 155--164.
  20. Xuzheng Yu, Tian Gan, Yinwei Wei, Zhiyong Cheng, and Liqiang Nie. 2020. Personalized item recommendation for second-hand trading platform. In Proceedings of the 28th ACM International Conference on Multimedia. 3478--3486.
  21. Zhedong Zheng, Liang Zheng, Michael Garrett, Yi Yang, Mingliang Xu, and Yi-Dong Shen. 2020. Dual-path convolutional image-text embeddings with instance loss. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Vol. 16, 2 (2020), 1--23.
  22. Yaoyao Zhong, Wei Ji, Junbin Xiao, Yicong Li, Weihong Deng, and Tat-Seng Chua. 2022. Video question answering: datasets, algorithms and challenges. EMNLP (2022).

Published in

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

      Copyright © 2023 Owner/Author

      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • abstract

      Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions, 24%
