DOI: 10.1145/3474085.3475681
research-article

Database-adaptive Re-ranking for Enhancing Cross-modal Image Retrieval

Published: 17 October 2021

ABSTRACT

We propose an approach that enhances the performance of arbitrary existing cross-modal image retrieval methods. Most cross-modal image retrieval methods focus on computing similarities between a text query and candidate images as accurately as possible. However, their retrieval performance suffers from the ambiguity of text queries and the bias of target databases (DBs). Handling ambiguous text queries and biased DBs is therefore essential for accurate cross-modal image retrieval in real-world applications. In this paper, we propose a DB-adaptive re-ranking method using modality-driven spaces, which can extend arbitrary cross-modal image retrieval methods to enhance their performance. The proposed method consists of two components: "DB-adaptive re-ranking" and "modality-driven clue information extraction". Our method estimates clue information that effectively distinguishes the desired image from the rest of a target DB and then solicits the user's feedback on the estimated information. Furthermore, by focusing on modality-driven spaces, our method extracts more detailed information from a query text and a target DB, enabling more accurate re-ranking. Our method allows users to reach their single desired image simply by answering questions. Experimental results on MSCOCO, Visual Genome, and newly introduced datasets containing images with a particular object show that the proposed method can enhance the performance of state-of-the-art cross-modal image retrieval methods.
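The abstract describes a feedback loop: the system asks a clarifying question about the target DB, and the user's yes/no answer is used to re-rank the candidates. The sketch below is a minimal, hypothetical illustration of that general idea (it is not the paper's actual algorithm; the tag-based candidates, the `boost`/`penalty` weights, and the `rerank` function are all assumptions made for the example). Candidates carry a base score from some cross-modal retriever plus a set of attribute tags, and a single answer re-weights them.

```python
# Hypothetical sketch of question-driven re-ranking (illustrative only;
# not the method proposed in the paper). Each candidate is a tuple of
# (name, base retrieval score, set of attribute tags).

def rerank(candidates, question_tag, answer_yes, boost=2.0, penalty=0.5):
    """Re-weight base scores using one yes/no answer about `question_tag`.

    Candidates consistent with the user's answer are boosted;
    inconsistent ones are demoted. The weights are arbitrary choices
    for this sketch.
    """
    reranked = []
    for name, score, tags in candidates:
        has_tag = question_tag in tags
        new_score = score * (boost if has_tag == answer_yes else penalty)
        reranked.append((name, new_score, tags))
    reranked.sort(key=lambda c: c[1], reverse=True)
    return reranked

candidates = [
    ("img_a", 0.9, {"dog", "park"}),
    ("img_b", 0.8, {"dog", "beach"}),
    ("img_c", 0.7, {"cat", "park"}),
]

# User answers "no" to "Does the desired image show a beach?":
# img_b is demoted, and img_a stays on top.
top = rerank(candidates, "beach", answer_yes=False)
```

In the actual paper this loop is driven by clue information estimated from the target DB in modality-driven spaces rather than by hand-assigned tags; the sketch only shows how a single answer can reshape a ranking.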



Published in
MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021, 5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085

            Copyright © 2021 ACM


Publisher
Association for Computing Machinery, New York, NY, United States



Acceptance Rates
Overall acceptance rate: 995 of 4,171 submissions (24%)

