
Multi-level, multi-modal interactions for visual question answering over text in images


Abstract

Visual scenes containing text in the TextVQA task require a simultaneous understanding of images, questions, and the text in images to reason about answers. However, most existing cross-modal tasks involve only two modalities, so there are few methods for modeling interactions across three modalities. To bridge this gap, in this work we propose cross- and intra-modal interaction modules for multiple (more than two) modalities, in which scaled dot-product attention is applied to model inter- and intra-modal relationships. In addition, we introduce guidance information to help the attention mechanism learn a more accurate relationship distribution. We construct a Multi-level Complete Interaction (MLCI) model for the TextVQA task by stacking multiple blocks composed of our proposed interaction modules, and we design a multi-level feature joint prediction approach that exploits the output representations of each block in a complementary way to predict answers. Experimental results on the TextVQA dataset show that our model obtains a 5.42% improvement in accuracy over the baseline. Extensive ablation studies are carried out for a comprehensive analysis of the proposed method. Our code is publicly available at https://github.com/zhangshengHust/mlci.
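As a concrete illustration of the attention described above, the following is a minimal sketch of scaled dot-product attention applied across modalities, written in PyTorch. The tensor names (q_feat, v_feat, t_feat) and the dimensions are illustrative assumptions rather than the paper's actual architecture; the authors' implementation is available at the repository linked above.

```python
import math

import torch
import torch.nn.functional as F


def cross_modal_attention(query, context, dim):
    """Attend from one modality (query) over another (context).

    query:   (batch, n_q, dim), e.g. question word features
    context: (batch, n_c, dim), e.g. visual-object or OCR-token features
    """
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    scores = torch.bmm(query, context.transpose(1, 2)) / math.sqrt(dim)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, context)  # (batch, n_q, dim)


# With three modalities, each one can attend over the two others
# (cross-modal) and over itself (intra-modal); shapes are hypothetical:
batch, dim = 2, 512
q_feat = torch.randn(batch, 14, dim)  # question word features
v_feat = torch.randn(batch, 36, dim)  # detected visual-object features
t_feat = torch.randn(batch, 50, dim)  # OCR-token (text-in-image) features

q_from_v = cross_modal_attention(q_feat, v_feat, dim)  # question -> vision
q_from_t = cross_modal_attention(q_feat, t_feat, dim)  # question -> OCR text
q_intra = cross_modal_attention(q_feat, q_feat, dim)   # question intra-modal
```

Stacking blocks built from such cross- and intra-modal attention, and combining the per-block outputs for prediction, is the pattern the MLCI model follows.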



Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant Nos. 61672246, 61272068, 61672254, and 62102159, the Program for HUST Academic Frontier Youth Team, the Natural Science Foundation of Hubei Province under Grant No. 2020CFB492, and the Humanities and Social Science Fund of the Ministry of Education of China under Grant No. 21YJC870002. In addition, we gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

Author information


Corresponding author

Correspondence to Jiangfeng Zeng.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on Synthetic Media on the Web. Guest Editors: Huimin Lu, Xing Xu, Jože Guna, and Gautam Srivastava.


About this article


Cite this article

Chen, J., Zhang, S., Zeng, J. et al. Multi-level, multi-modal interactions for visual question answering over text in images. World Wide Web 25, 1607–1623 (2022). https://doi.org/10.1007/s11280-021-00976-2

