Open Access

Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models

Authors:
Hongyi Zhu, University of Amsterdam, Amsterdam, Netherlands (ORCID: 0009-0006-0298-0905)
Jia-Hong Huang, University of Amsterdam, Amsterdam, Netherlands (ORCID: 0000-0001-7943-2591)
Stevan Rudinac, University of Amsterdam, Amsterdam, Netherlands (ORCID: 0000-0003-1904-8736)
Evangelos Kanoulas, University of Amsterdam, Amsterdam, Netherlands (ORCID: 0000-0002-8312-0694)

ICMR '24: Proceedings of the 2024 International Conference on Multimedia Retrieval, May 2024, Pages 978–987. https://doi.org/10.1145/3652583.3658032

Published: 07 June 2024

ABSTRACT

Image search is a central task in multimedia and computer vision, with applications ranging from internet search to medical diagnostics. Conventional image search systems accept textual or visual queries and retrieve the most relevant candidates from a database. However, prevalent methods often rely on single-turn procedures, which can introduce inaccuracies and limit recall. They also face challenges such as vocabulary mismatch and the semantic gap, which constrain their overall effectiveness. To address these issues, we propose an interactive image retrieval system that refines queries based on user relevance feedback in a multi-turn setting. The system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries, yielding more informative queries with each iteration. Moreover, we introduce a large language model (LLM) based denoiser to refine text-based query expansions, mitigating inaccuracies in the image descriptions generated by captioning models. To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task, offering multiple relevant ground truth images for each query. Through comprehensive experiments, we validate the effectiveness of the proposed system against baseline methods, achieving state-of-the-art performance with a notable 10% improvement in recall. Our contributions are an interactive image retrieval system, the integration of an LLM-based denoiser, a curated evaluation dataset, and thorough experimental validation.
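
The multi-turn loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the retriever is a toy token-overlap ranker, `toy_captioner` is a dictionary lookup standing in for the VLM captioner, `toy_denoiser` is plain stopword removal standing in for the LLM denoiser, and the image collection, captions, and all function names are hypothetical. User relevance feedback is simulated by marking retrieved images that belong to the ground-truth set, and recall is computed against multiple relevant images per query.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical toy collection: image id -> text the retriever can match against.
INDEX: Dict[str, str] = {
    "img1": "a man playing guitar on stage",
    "img2": "a band performing with guitar and drums on stage",
    "img3": "a cat sleeping on a sofa",
    "img4": "a drummer and singer at a concert with lights",
}

# Hypothetical captions a VLM captioner might produce for each image.
CAPTIONS: Dict[str, str] = {
    "img1": "a man plays guitar at a concert",
    "img2": "a band with a drummer and singer on stage",
    "img3": "a cat on a couch",
    "img4": "a drummer and a singer under stage lights",
}

STOPWORDS = {"a", "an", "the", "on", "at", "with", "and", "of", "in"}


def tokenize(text: str) -> Set[str]:
    return set(text.lower().split())


def retrieve(query: str, index: Dict[str, str], k: int) -> List[str]:
    """Rank images by token overlap with the query (ties broken by image id)."""
    q = tokenize(query)
    ranked = sorted(index, key=lambda img: (-len(q & tokenize(index[img])), img))
    return ranked[:k]


def recall_at_k(ranked: List[str], relevant: Set[str]) -> float:
    """Fraction of the relevant images that appear in the returned list."""
    return len(set(ranked) & relevant) / len(relevant)


def toy_captioner(image_id: str) -> str:
    """Stand-in for a VLM captioner (assumption: a simple lookup)."""
    return CAPTIONS[image_id]


def toy_denoiser(query: str, captions: List[str]) -> str:
    """Stand-in for an LLM denoiser: merge query and captions, drop stopwords
    and duplicate tokens. A real LLM would be prompted to rewrite the query."""
    seen: Set[str] = set()
    kept: List[str] = []
    for tok in (query + " " + " ".join(captions)).lower().split():
        if tok not in STOPWORDS and tok not in seen:
            seen.add(tok)
            kept.append(tok)
    return " ".join(kept)


def interactive_search(
    query: str,
    index: Dict[str, str],
    captioner: Callable[[str], str],
    denoiser: Callable[[str, List[str]], str],
    relevant: Set[str],
    k: int = 3,
    turns: int = 3,
) -> Tuple[str, List[float]]:
    """Run several retrieval turns, rewriting the query after each one."""
    recalls: List[float] = []
    for _ in range(turns):
        ranked = retrieve(query, index, k)
        recalls.append(recall_at_k(ranked, relevant))
        # Simulated user feedback: the user marks the relevant results.
        liked = [img for img in ranked if img in relevant]
        if not liked:
            break
        # Expand the query with captions of liked images, then denoise it.
        query = denoiser(query, [captioner(img) for img in liked])
    return query, recalls


final_query, recalls = interactive_search(
    "guitar", INDEX, toy_captioner, toy_denoiser, relevant={"img1", "img2", "img4"}
)
print(final_query)
print(recalls)
```

In this toy run the initial query "guitar" matches only two of the three relevant images; after one round of feedback the rewritten query picks up terms such as "concert" and "drummer" from the captions, which brings the third relevant image into the top results.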

Index Terms

1. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Multimedia and multimodal retrieval
        Image search
    2. Users and interactive retrieval

Published in
ICMR '24: Proceedings of the 2024 International Conference on Multimedia Retrieval
May 2024
1379 pages
ISBN:9798400706196
DOI:10.1145/3652583
General Chairs: Cathal Gurrin (Dublin City University, Ireland), Rachada Kongkachandra (Thammasat University, Thailand), Klaus Schoeffmann (Klagenfurt University, Austria)
Program Chairs: Duc-Tien Dang-Nguyen (University of Bergen, Norway), Luca Rossetto (University of Zurich, Switzerland), Shin'ichi Satoh (National Institute of Informatics, Japan), Liting Zhou (Dublin City University, Ireland)
Copyright © 2024 Owner/Author
This work is licensed under a Creative Commons Attribution International 4.0 License.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
interactive image retrieval
large language models
query rewriting
vision language models
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 254 of 830 submissions, 31%