ABSTRACT
Information retrieval (IR) is a fundamental technique that aims to acquire information from a collection of documents, web pages, or other sources. While traditional text-based IR has achieved great success, the under-utilization of varied data sources in different modalities (i.e., text, images, audio, and video) prevents IR techniques from reaching their full potential and thus limits their real-world applications. In recent years, the rapid development of deep multimodal learning has paved the way for advancing IR with multiple modalities. Benefiting from a variety of data types and modalities, recent techniques such as CLIP, ChatGPT, and GPT-4 have greatly facilitated multimodal learning and IR. In the context of IR, deep multimodal learning has shown prominent potential to improve the performance of retrieval systems by enabling them to better understand and process the diverse types of data they encounter. Despite the great potential shown by multimodal-empowered IR, unsolved challenges and open questions remain in related directions. With this workshop, we aim to provide a platform for discussion about multimodal IR among scholars, practitioners, and other interested parties.