DOI: 10.1145/3652583.3658032 (ICMR conference proceedings)
research-article
Open Access

Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models

Published: 07 June 2024

ABSTRACT

Image search is a pivotal task in multimedia and computer vision, with applications ranging from internet search to medical diagnostics. Conventional image search systems accept textual or visual queries and retrieve the most relevant candidates from a database. However, prevalent methods often rely on single-turn procedures, which can introduce inaccuracies and limit recall. They also face challenges such as vocabulary mismatch and the semantic gap, which constrain their overall effectiveness. To address these issues, we propose an interactive image retrieval system that refines queries based on user relevance feedback in a multi-turn setting. The system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries, resulting in more informative queries with each iteration. Moreover, we introduce a large language model (LLM) based denoiser to refine text-based query expansions, mitigating inaccuracies in the image descriptions generated by captioning models. To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task, offering multiple relevant ground-truth images for each query. Through comprehensive experiments, we validate the effectiveness of the proposed system against baseline methods, achieving state-of-the-art performance with a notable 10% improvement in recall. Our contributions are the development of an interactive image retrieval system, the integration of an LLM-based denoiser, the curation of a carefully designed evaluation dataset, and thorough experimental validation.
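The multi-turn loop described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the image index, the sampled captions, and the helper names (`retrieve`, `caption`, `denoise`, `interactive_search`) are hypothetical placeholders. In particular, the VLM captioner is replaced by a lookup table of hand-written captions, and the LLM-based denoiser is replaced by a simple agreement heuristic that keeps only terms appearing in at least two captions of the same image. Retrieval is plain term overlap rather than a learned model.

```python
from collections import Counter

STOPWORDS = {"a", "at", "in", "next", "on", "to", "with"}

def tokenize(text):
    """Lower-case a string and return its set of non-stopword terms."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

# Hypothetical toy database: image id -> text the retriever can match against.
IMAGE_INDEX = {
    "img1": "a man plays guitar on a stage",
    "img2": "a guitarist performs at a concert",
    "img3": "a cat sleeps on a sofa",
    "img4": "a band performs at a concert on a stage",
    "img5": "a dog runs in a park",
}

# Stand-in for a VLM captioner: several sampled captions per image.
# The last caption contains a hallucinated object ("cat").
SAMPLED_CAPTIONS = {
    "img1": [
        "a man plays guitar at a concert",
        "a man with a guitar on a stage at a concert",
        "a man plays guitar next to a cat",
    ],
}

def retrieve(query_terms, k):
    """Rank images by term overlap with the query; drop zero-score images."""
    scored = [(len(query_terms & tokenize(text)), image_id)
              for image_id, text in IMAGE_INDEX.items()]
    ranked = sorted((s for s in scored if s[0] > 0),
                    key=lambda s: (-s[0], s[1]))
    return [image_id for _, image_id in ranked[:k]]

def caption(image_id):
    """Placeholder captioner: look up pre-written captions."""
    return SAMPLED_CAPTIONS.get(image_id, [])

def denoise(captions, min_support=2):
    """Placeholder denoiser (assumption, not an LLM): keep terms that
    occur in at least `min_support` of the captions."""
    counts = Counter(term for c in captions for term in tokenize(c))
    return {term for term, n in counts.items() if n >= min_support}

def recall_at_k(retrieved, relevant):
    """Fraction of the relevant images that appear in the retrieved list."""
    return len(set(retrieved) & relevant) / len(relevant)

def interactive_search(query, relevant, k=3, turns=2):
    """Run a multi-turn search with simulated relevance feedback."""
    query_terms = tokenize(query)
    history = []
    for turn in range(turns):
        results = retrieve(query_terms, k)
        history.append(recall_at_k(results, relevant))
        if turn == turns - 1:
            break
        # Simulated user: marks retrieved ground-truth images as relevant.
        for image_id in (r for r in results if r in relevant):
            query_terms |= denoise(caption(image_id))
    return history, query_terms

history, final_terms = interactive_search(
    "a man plays guitar", relevant={"img1", "img2", "img4"})
print(history, sorted(final_terms))
```

In this toy run the first turn retrieves only `img1`, so recall is 1/3. After feedback, the denoised captions add the term `concert` while the hallucinated `cat` and the weakly supported `stage` are dropped, and the second turn retrieves all three relevant images. A real system would swap the lookup table for a captioning model and the agreement heuristic for an LLM prompt.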

