
Multi-level, multi-modal interactions for visual question answering over text in images


Abstract

Visual scenes containing text in the TextVQA task require a simultaneous understanding of images, questions, and the text in images to reason about answers. However, most existing cross-modal tasks involve only two modalities, so there are few methods for modeling interactions across three modalities. To bridge this gap, in this work we propose cross- and intra-modal interaction modules for multiple (more than two) modalities, in which scaled dot-product attention is applied to model inter- and intra-modal relationships. In addition, we introduce guidance information to help the attention mechanism learn a more accurate relationship distribution. We construct a Multi-level Complete Interaction (MLCI) model for the TextVQA task by stacking multiple blocks composed of our proposed interaction modules, and we design a multi-level feature joint prediction approach that exploits the output representations of each block in a complementary way to predict answers. Experimental results on the TextVQA dataset show that our model obtains a 5.42% improvement in accuracy over the baseline. Extensive ablation studies are carried out for a comprehensive analysis of the proposed method. Our code is publicly available at https://github.com/zhangshengHust/mlci.
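As a concrete illustration of the attention described above, the following is a minimal sketch of scaled dot-product attention applied across modalities, written in PyTorch. The tensor names (q_feat, v_feat, t_feat) and the dimensions are illustrative assumptions rather than the paper's actual architecture; the authors' implementation is available at the repository linked above.

```python
import math

import torch
import torch.nn.functional as F


def cross_modal_attention(query, context, dim):
    """Attend from one modality (query) over another (context).

    query:   (batch, n_q, dim), e.g. question word features
    context: (batch, n_c, dim), e.g. visual-object or OCR-token features
    """
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    scores = torch.bmm(query, context.transpose(1, 2)) / math.sqrt(dim)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, context)  # (batch, n_q, dim)


# With three modalities, each one can attend over the two others
# (cross-modal) and over itself (intra-modal); shapes are hypothetical:
batch, dim = 2, 512
q_feat = torch.randn(batch, 14, dim)  # question word features
v_feat = torch.randn(batch, 36, dim)  # detected visual-object features
t_feat = torch.randn(batch, 50, dim)  # OCR-token (text-in-image) features

q_from_v = cross_modal_attention(q_feat, v_feat, dim)  # question -> vision
q_from_t = cross_modal_attention(q_feat, t_feat, dim)  # question -> OCR text
q_intra = cross_modal_attention(q_feat, q_feat, dim)   # question intra-modal
```

Stacking blocks built from such cross- and intra-modal attention, and combining the per-block outputs for prediction, is the pattern the MLCI model follows.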



Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant Nos. 61672246, 61272068, 61672254, and 62102159, the Program for HUST Academic Frontier Youth Team, the Natural Science Foundation of Hubei Province under Grant No. 2020CFB492, and the Humanities and Social Science Fund of the Ministry of Education of China under Grant No. 21YJC870002. In addition, we gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

Author information


Corresponding author

Correspondence to Jiangfeng Zeng.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on Synthetic Media on the Web. Guest Editors: Huimin Lu, Xing Xu, Jože Guna, and Gautam Srivastava.


About this article


Cite this article

Chen, J., Zhang, S., Zeng, J. et al. Multi-level, multi-modal interactions for visual question answering over text in images. World Wide Web 25, 1607–1623 (2022). https://doi.org/10.1007/s11280-021-00976-2

