
Comprehensive Relation Modelling for Image Paragraph Generation

Research Article · Machine Intelligence Research

Abstract

Image paragraph generation aims to produce a long description composed of multiple sentences, in contrast to traditional image captioning, which produces only one sentence. Most previous methods are dedicated to extracting rich features from image regions but ignore modelling visual relationships. In this paper, we propose a novel method that generates a paragraph by modelling visual relationships comprehensively. First, we parse an image into a scene graph, where each node represents a specific object and each edge denotes the relationship between two objects. Second, we enrich the object features by implicitly encoding visual relationships through a graph convolutional network (GCN). We further explore high-order relations among different relation features using another GCN. In addition, we obtain linguistic features by projecting the predicted object labels and their relationships into a semantic embedding space. With these features, we present an attention-based topic generation network that selects relevant features and produces a set of topic vectors, which are then used to generate multiple sentences. We evaluate the proposed method on the Stanford image-paragraph dataset, currently the only available dataset for image paragraph generation, and our method achieves competitive performance in comparison with other state-of-the-art (SOTA) methods.
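To ground the pipeline described above, the sketch below shows, in minimal PyTorch, the two core components the abstract names: a graph-convolution layer that enriches object features with relational context from the scene graph, and an attention-based topic generator that turns the pooled features into one topic vector per sentence. Every module name, shape, and the specific aggregation and attention forms are illustrative assumptions based only on the abstract, not the authors' released implementation.

```python
# A minimal sketch of the abstract's pipeline; all design details here
# (mean aggregation, additive attention, GRU topic state) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationGCN(nn.Module):
    """One GCN layer: each object node aggregates messages from its
    scene-graph neighbours through a shared linear map (residual update)."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)

    def forward(self, feats, adj):
        # feats: (N, dim) node features; adj: (N, N) 0/1 adjacency
        # (treated as symmetric here for simplicity).
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)  # avoid divide-by-zero
        neighbour_mean = adj @ self.msg(feats) / deg     # mean aggregation
        return F.relu(feats + neighbour_mean)


class TopicAttention(nn.Module):
    """Attention-based topic generator: a recurrent state attends over the
    feature set at each step and emits one topic vector per sentence."""
    def __init__(self, dim, n_topics):
        super().__init__()
        self.rnn = nn.GRUCell(dim, dim)
        self.att = nn.Linear(2 * dim, 1)
        self.n_topics = n_topics

    def forward(self, feats):
        # feats: (N, dim) mix of visual, relation, and linguistic features.
        h = feats.mean(dim=0)                            # initial topic state
        topics = []
        for _ in range(self.n_topics):
            scores = self.att(torch.cat([feats, h.expand_as(feats)],
                                        dim=-1)).squeeze(-1)
            ctx = (F.softmax(scores, dim=0).unsqueeze(-1) * feats).sum(0)
            h = self.rnn(ctx.unsqueeze(0), h.unsqueeze(0)).squeeze(0)
            topics.append(h)                             # one vector/sentence
        return torch.stack(topics)                       # (n_topics, dim)


# Toy usage: relation-aware object features, then four topic vectors.
dim, n_obj = 512, 6
obj = torch.randn(n_obj, dim)                            # region features
adj = (torch.rand(n_obj, n_obj) > 0.5).float()           # toy scene graph
topics = TopicAttention(dim, n_topics=4)(RelationGCN(dim)(obj, adj))
```

In the full model described in the abstract, a second GCN of the same form would run over the relation (edge) features to capture high-order relations, and each topic vector would then condition a sentence decoder to produce one sentence of the paragraph.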



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Nos. 61721004, 61976214, 62076078 and 62176246).

Author information


Corresponding author

Correspondence to Zhang Zhang.

Ethics declarations

The authors declare that they have no conflicts of interest related to this work.

Additional information

Colored figures are available in the online version at https://link.springer.com/journal/11633

Xianglu Zhu received the B. Sc. degree in automation from the University of Science and Technology of China (USTC), China in 2016. He is currently a Ph. D. candidate in automation at USTC and an intern at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), China.

His research interests include image captioning, pose estimation, and deep learning.

Zhang Zhang received the B. Sc. degree in computer science and technology from Hebei University of Technology, China in 2002, and the Ph. D. degree in pattern recognition and intelligent systems from the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, China in 2009. Currently, he is an associate professor at the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences (CASIA), China. He has published more than 40 research papers on computer vision and pattern recognition in highly ranked journals and conferences, e.g., IEEE TPAMI, IEEE TIP, CVPR, and ECCV.

His research interests include action and activity recognition, human attribute recognition, person re-identification, and large-scale person retrieval.

Wei Wang received the B. Eng. degree in automation from Wuhan University, China in 2004, and the Ph. D. degree in information science and engineering from the University of Chinese Academy of Sciences, China in 2011. He is currently an associate professor at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), China. He has published more than fifty papers in refereed international journals and conferences such as TPAMI, TIP, CVPR, ICCV, and NeurIPS.

His research interests include computer vision and machine learning, particularly the computational modelling of visual attention and memory, and vision and language understanding.

Zilei Wang received the B. Sc. and Ph. D. degrees in control science and engineering from the University of Science and Technology of China (USTC), China in 2002 and 2007, respectively. He is currently an associate professor with the Department of Automation, USTC, and the founding leader of the Vision and Multimedia Research Group. Before joining USTC as a faculty member, he was a postdoctoral research fellow at the National University of Singapore, Singapore.

His research interests include computer vision, multimedia and deep learning.


About this article


Cite this article

Zhu, X., Zhang, Z., Wang, W. et al. Comprehensive Relation Modelling for Image Paragraph Generation. Mach. Intell. Res. 21, 369–382 (2024). https://doi.org/10.1007/s11633-022-1408-2

