ABSTRACT
We propose an approach that enhances the performance of arbitrary existing cross-modal image retrieval methods. Most cross-modal image retrieval methods focus on computing similarities between a text query and candidate images as accurately as possible. However, their retrieval performance suffers from the ambiguity of text queries and the bias of target databases (DBs). Handling ambiguous text queries and biased DBs is essential for accurate cross-modal image retrieval in real-world applications. In this paper, we propose a DB-adaptive re-ranking method using modality-driven spaces, which can extend arbitrary cross-modal image retrieval methods to enhance their performance. The proposed method comprises two components: "DB-adaptive re-ranking" and "modality-driven clue information extraction". Our method estimates clue information that effectively distinguishes the desired image from the rest of the target DB and then receives the user's feedback on the estimated information. Furthermore, by focusing on modality-driven spaces, our method extracts more detailed information from the query text and the target DB, enabling more accurate re-ranking. As a result, users can reach their single desired image simply by answering questions. Experimental results on MSCOCO, Visual Genome, and newly introduced datasets containing images with a particular object show that the proposed method enhances the performance of state-of-the-art cross-modal image retrieval methods.
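To make the interactive re-ranking idea concrete, the following sketch illustrates one plausible reading of the feedback loop: select the object-level clue whose presence best splits the current candidate set (so a yes/no answer is maximally informative), then filter candidates by the user's answer. All data structures and helper names here are hypothetical illustrations, not the paper's actual modality-driven implementation.

```python
def select_clue(candidates, object_labels):
    """Pick the object whose presence comes closest to splitting the
    candidate set in half (an entropy-style yes/no question heuristic)."""
    best, best_gap = None, float("inf")
    for obj in sorted(object_labels):  # sorted for deterministic ties
        count = sum(obj in labels for labels in candidates.values())
        gap = abs(count - len(candidates) / 2)  # distance from a 50/50 split
        if gap < best_gap:
            best, best_gap = obj, gap
    return best

def rerank(candidates, clue, answer):
    """Keep only the candidates consistent with the user's yes/no answer."""
    return {img: labels for img, labels in candidates.items()
            if (clue in labels) == answer}

# Toy database: image id -> set of detected object labels.
db = {"img1": {"dog", "grass"}, "img2": {"dog", "beach"},
      "img3": {"cat", "sofa"}, "img4": {"cat", "grass"}}

clue = select_clue(db, {"dog", "cat", "grass", "beach", "sofa"})
filtered = rerank(db, clue, answer=True)  # user answers "yes" to the clue
```

Repeating this question-and-filter step shrinks the candidate set toward the single desired image, which is the behavior the abstract describes; the paper's method additionally exploits modality-driven spaces to choose richer clue information than bare object presence.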
Supplemental Material: additional experiments (available for download).