ABSTRACT
Image search is a central task in multimedia and computer vision, with applications ranging from internet search to medical diagnostics. Conventional image search systems accept textual or visual queries and retrieve the most relevant candidates from a database. However, prevalent methods often rely on single-turn procedures, which can introduce inaccuracies and limit recall. These methods also face challenges such as vocabulary mismatch and the semantic gap, which constrain their overall effectiveness. To address these issues, we propose an interactive image retrieval system that refines queries based on user relevance feedback in a multi-turn setting. The system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries, yielding more informative queries with each iteration. Moreover, we introduce a large language model (LLM) based denoiser to refine text-based query expansions, mitigating inaccuracies in the image descriptions generated by captioning models. To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task, offering multiple relevant ground-truth images for each query. Through comprehensive experiments, we validate the effectiveness of the proposed system against baseline methods, achieving state-of-the-art performance with a notable 10% improvement in recall. Our contributions are the development of an interactive image retrieval system, the integration of an LLM-based denoiser, the curation of a carefully designed evaluation dataset, and thorough experimental validation.
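The multi-turn loop described in the abstract can be illustrated with a minimal sketch. This is not the system's implementation: the captioner and denoiser below are trivial placeholders for the VLM and LLM components, the retrieval score is a simple bag-of-words cosine similarity, and the image ids, captions, and all function names are hypothetical and chosen only for illustration.

```python
from collections import Counter
import math

# Hypothetical toy database: image id -> caption a VLM captioner might produce.
CAPTIONS = {
    "img1": "a man is playing guitar on a stage",
    "img2": "a band playing music on a stage with lights",
    "img3": "a dog running on the beach",
    "img4": "a man cooking pasta in a kitchen",
}

STOPWORDS = {"a", "an", "the", "is", "on", "in", "with", "of"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def caption_image(image_id):
    """Placeholder for a VLM captioner (assumption): returns a canned caption."""
    return CAPTIONS[image_id]


def denoise(query, caption):
    """Placeholder for an LLM denoiser (assumption): merges query and caption
    terms and drops duplicates. A real denoiser would also remove hallucinated
    or irrelevant caption content."""
    seen, merged = set(), []
    for word in tokenize(query) + tokenize(caption):
        if word not in seen:
            seen.add(word)
            merged.append(word)
    return " ".join(merged)


def score(query, doc):
    """Bag-of-words cosine similarity between a query and a caption."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    dot = sum(q[w] * d[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, k):
    """Return the k best-scoring image ids (ties broken by id)."""
    ranked = sorted(CAPTIONS, key=lambda i: (-score(query, CAPTIONS[i]), i))
    return ranked[:k]


def recall_at_k(retrieved, relevant):
    """Fraction of the relevant images that appear in the retrieved list."""
    return len(set(retrieved) & set(relevant)) / len(relevant)


def interactive_search(query, relevant, turns=2, k=2):
    """Run up to `turns` rounds; after each round, simulated user feedback
    (the first relevant result) is captioned and folded into the query."""
    history = []
    for _ in range(turns):
        results = retrieve(query, k)
        history.append((query, results, recall_at_k(results, relevant)))
        feedback = next((i for i in results if i in relevant), None)
        if feedback is None:
            break
        query = denoise(query, caption_image(feedback))
    return history


if __name__ == "__main__":
    for q, results, recall in interactive_search("man guitar", {"img1", "img2"}):
        print(f"query={q!r} results={results} recall@2={recall:.2f}")
```

In this toy setting the first turn retrieves only one of the two relevant images; after the caption of the image marked relevant is merged into the query, the second turn retrieves both. The example only shows the control flow of feedback-driven query expansion and says nothing about the reported results.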
Index Terms
- Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models