ABSTRACT
We propose an approach that enhances the performance of arbitrary existing cross-modal image retrieval methods. Most cross-modal image retrieval methods focus on computing similarities between a text query and candidate images as accurately as possible. However, their retrieval performance suffers from the ambiguity of text queries and the bias of target databases (DBs). Handling ambiguous text queries and biased DBs is essential for accurate cross-modal image retrieval in real-world applications. In this paper, we propose a DB-adaptive re-ranking method using modality-driven spaces, which can extend arbitrary cross-modal image retrieval methods to enhance their performance. The proposed method comprises two components: "DB-adaptive re-ranking" and "modality-driven clue information extraction". Our method estimates clue information that effectively distinguishes the desired image from the rest of the target DB and then receives the user's feedback on the estimated information. Furthermore, by focusing on modality-driven spaces, our method extracts more detailed information from the query text and the target DB, enabling more accurate re-ranking. As a result, users can reach their single desired image simply by answering questions. Experimental results on MSCOCO, Visual Genome, and newly introduced datasets containing images with a particular object show that the proposed method enhances the performance of state-of-the-art cross-modal image retrieval methods.
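To make the interactive re-ranking idea concrete, the following sketch illustrates one plausible reading of the feedback loop: select the object-level clue whose presence best splits the current candidate set (so a yes/no answer is maximally informative), then filter candidates by the user's answer. All data structures and helper names here are hypothetical illustrations, not the paper's actual modality-driven implementation.

```python
def select_clue(candidates, object_labels):
    """Pick the object whose presence comes closest to splitting the
    candidate set in half (an entropy-style yes/no question heuristic)."""
    best, best_gap = None, float("inf")
    for obj in sorted(object_labels):  # sorted for deterministic ties
        count = sum(obj in labels for labels in candidates.values())
        gap = abs(count - len(candidates) / 2)  # distance from a 50/50 split
        if gap < best_gap:
            best, best_gap = obj, gap
    return best

def rerank(candidates, clue, answer):
    """Keep only the candidates consistent with the user's yes/no answer."""
    return {img: labels for img, labels in candidates.items()
            if (clue in labels) == answer}

# Toy database: image id -> set of detected object labels.
db = {"img1": {"dog", "grass"}, "img2": {"dog", "beach"},
      "img3": {"cat", "sofa"}, "img4": {"cat", "grass"}}

clue = select_clue(db, {"dog", "cat", "grass", "beach", "sofa"})
filtered = rerank(db, clue, answer=True)  # user answers "yes" to the clue
```

Repeating this question-and-filter step shrinks the candidate set toward the single desired image, which is the behavior the abstract describes; the paper's method additionally exploits modality-driven spaces to choose richer clue information than bare object presence.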
Supplemental Material: additional experiments (available for download).