Abstract
Image captioning is the task of generating a textual description of the objects and activities present in a given image. It connects two fields of artificial intelligence: computer vision and natural language processing, which deal with image understanding and language modeling, respectively. Most existing work on image captioning has been carried out for the English language. This article presents a novel method for image captioning in the Hindi language using an encoder–decoder deep learning architecture with efficient channel attention. The key contribution of this work is the integration of an efficient channel attention mechanism with Bahdanau attention and a gated recurrent unit to build an image captioning model for Hindi. Color images usually consist of three channels, namely red, green, and blue. The channel attention mechanism focuses on an image’s important channels during convolution, essentially assigning higher importance to specific channels over others, and has been shown to have great potential for improving the efficiency of deep convolutional neural networks (CNNs). The proposed encoder–decoder architecture uses the recently introduced ECA-Net CNN to integrate the channel attention mechanism. Hindi, the fourth most spoken language globally, is widely spoken in India and South Asia and is India’s official language. A dataset for image captioning in Hindi is created by manually translating the well-known MSCOCO dataset from English to Hindi. The proposed method is compared with other baselines in terms of Bilingual Evaluation Understudy (BLEU) scores, and the results show that it outperforms all of them. Relative to the state of the art, the proposed method attains improvements of 0.59%, 2.51%, 4.38%, and 3.30% in BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores, respectively. The quality of the generated captions is further assessed manually in terms of adequacy and fluency to demonstrate the proposed method’s efficacy.
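For readers who want a concrete picture of the channel attention mechanism the abstract describes, the following is a minimal PyTorch sketch of an ECA-style module in the spirit of the ECA-Net paper: global average pooling over the spatial dimensions, a 1D convolution across channels (local cross-channel interaction without dimensionality reduction), and a sigmoid gate that reweights each channel. The class name, parameter names, and defaults here are illustrative assumptions, not the authors' implementation.

```python
import math
import torch
import torch.nn as nn


class ECALayer(nn.Module):
    """Sketch of an Efficient Channel Attention (ECA) block.

    Squeezes each channel to a single statistic via global average
    pooling, models local cross-channel interaction with a 1D
    convolution, and rescales the input channels with a sigmoid gate.
    """

    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Adaptive kernel size: k grows logarithmically with the number
        # of channels, as proposed in ECA-Net (rounded up to an odd value).
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        y = self.avg_pool(x)                  # (B, C, 1, 1)
        y = y.squeeze(-1).transpose(-1, -2)   # (B, 1, C): channels as a sequence
        y = self.conv(y)                      # 1D conv across channels
        y = self.sigmoid(y.transpose(-1, -2).unsqueeze(-1))  # (B, C, 1, 1)
        return x * y.expand_as(x)             # channel-wise reweighting


if __name__ == "__main__":
    feat = torch.randn(1, 512, 14, 14)  # e.g., a CNN feature map
    eca = ECALayer(512)                 # adaptive kernel size k = 5 here
    out = eca(feat)                     # same shape, channels reweighted
    print(out.shape)                    # torch.Size([1, 512, 14, 14])
```

Because the 1D convolution replaces the fully connected bottleneck used in squeeze-and-excitation blocks, the module adds only a handful of parameters, which is why channel attention of this kind is considered efficient.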