Abstract
Image captioning is the task of generating a textual description of the objects and activities present in a given image. It connects two fields of artificial intelligence: computer vision and natural language processing, which deal with image understanding and language modeling, respectively. Most existing work on image captioning has been carried out for the English language. This article presents a novel method for image captioning in the Hindi language using an encoder–decoder deep learning architecture with efficient channel attention. The key contribution of this work is the integration of an efficient channel attention mechanism with Bahdanau attention and a gated recurrent unit to build an image captioning model for Hindi. Color images usually consist of three channels, namely red, green, and blue. The channel attention mechanism focuses on an image’s important channels during convolution, essentially assigning higher importance to specific channels over others, and has been shown to have great potential for improving the efficiency of deep convolutional neural networks (CNNs). The proposed encoder–decoder architecture uses the recently introduced ECA-Net CNN to integrate the channel attention mechanism. Hindi, the fourth most spoken language globally, is widely spoken in India and South Asia and is India’s official language. A dataset for image captioning in Hindi is created by manually translating the well-known MSCOCO dataset from English to Hindi. The proposed method is compared with other baselines in terms of Bilingual Evaluation Understudy (BLEU) scores, and the results show that it outperforms all of them. Relative to the state of the art, the proposed method attains improvements of 0.59%, 2.51%, 4.38%, and 3.30% in BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores, respectively. The quality of the generated captions is further assessed manually in terms of adequacy and fluency to demonstrate the proposed method’s efficacy.
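For readers who want a concrete picture of the channel attention mechanism the abstract describes, the following is a minimal PyTorch sketch of an ECA-style module in the spirit of the ECA-Net paper: global average pooling over the spatial dimensions, a 1D convolution across channels (local cross-channel interaction without dimensionality reduction), and a sigmoid gate that reweights each channel. The class name, parameter names, and defaults here are illustrative assumptions, not the authors' implementation.

```python
import math
import torch
import torch.nn as nn


class ECALayer(nn.Module):
    """Sketch of an Efficient Channel Attention (ECA) block.

    Squeezes each channel to a single statistic via global average
    pooling, models local cross-channel interaction with a 1D
    convolution, and rescales the input channels with a sigmoid gate.
    """

    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Adaptive kernel size: k grows logarithmically with the number
        # of channels, as proposed in ECA-Net (rounded up to an odd value).
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        y = self.avg_pool(x)                  # (B, C, 1, 1)
        y = y.squeeze(-1).transpose(-1, -2)   # (B, 1, C): channels as a sequence
        y = self.conv(y)                      # 1D conv across channels
        y = self.sigmoid(y.transpose(-1, -2).unsqueeze(-1))  # (B, C, 1, 1)
        return x * y.expand_as(x)             # channel-wise reweighting


if __name__ == "__main__":
    feat = torch.randn(1, 512, 14, 14)  # e.g., a CNN feature map
    eca = ECALayer(512)                 # adaptive kernel size k = 5 here
    out = eca(feat)                     # same shape, channels reweighted
    print(out.shape)                    # torch.Size([1, 512, 14, 14])
```

Because the 1D convolution replaces the fully connected bottleneck used in squeeze-and-excitation blocks, the module adds only a handful of parameters, which is why channel attention of this kind is considered efficient.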