
Efficient Channel Attention Based Encoder–Decoder Approach for Image Captioning in Hindi

Published: 13 December 2021

Abstract

Image captioning is the task of generating a textual description of the objects and activities present in a given image. It connects two fields of artificial intelligence: computer vision, which handles image understanding, and natural language processing, which handles language modeling. In the existing literature, most work on image captioning has targeted the English language. This article presents a novel method for image captioning in Hindi using an encoder–decoder deep learning architecture with efficient channel attention. The key contribution of this work is the integration of an efficient channel attention mechanism with Bahdanau attention and a gated recurrent unit to build an image captioning model for Hindi. Color images usually consist of three channels, namely red, green, and blue. The channel attention mechanism focuses on the important channels of an image's feature representation while performing the convolution, assigning higher importance to some channels over others, and has been shown to have great potential for improving the efficiency of deep convolutional neural networks (CNNs). The proposed encoder–decoder architecture uses the recently introduced ECA-Net CNN to integrate the channel attention mechanism. Hindi, the fourth most widely spoken language globally, is spoken across India and South Asia and is India's official language. A dataset for image captioning in Hindi was created by manually translating the well-known MSCOCO dataset from English to Hindi. The proposed method is compared with other baselines in terms of Bilingual Evaluation Understudy (BLEU) scores, and the results show that it outperforms them, attaining improvements of 0.59%, 2.51%, 4.38%, and 3.30% in BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores, respectively, over the state of the art. The quality of the generated captions is further assessed manually in terms of adequacy and fluency to illustrate the proposed method's efficacy.
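To make the channel-attention step concrete, the following is a minimal sketch of an ECA-style layer in PyTorch. It is not the authors' implementation; the class name `ECALayer` is a placeholder, and the adaptive kernel-size rule with hyperparameters `gamma` and `b` follows the formulation given in the ECA-Net paper. The layer simply re-weights the channels of a convolutional feature map.

```python
import math
import torch
import torch.nn as nn

class ECALayer(nn.Module):
    """Efficient channel attention: global average pooling followed by a
    1-D convolution across channels and a sigmoid gate (sketch, assuming
    the standard ECA-Net formulation)."""

    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Kernel size is chosen adaptively from the channel dimension,
        # as proposed for ECA-Net, and forced to be odd.
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        y = self.avg_pool(x)                   # (B, C, 1, 1) channel descriptors
        y = y.squeeze(-1).transpose(-1, -2)    # (B, 1, C)
        y = self.conv(y)                       # local cross-channel interaction
        y = y.transpose(-1, -2).unsqueeze(-1)  # (B, C, 1, 1)
        return x * self.sigmoid(y)             # re-weight each channel of x
```

In an encoder of the kind described above, such a layer would typically be applied to the CNN feature maps before they are flattened into the region vectors consumed by the Bahdanau-attention GRU decoder.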



      • Published in

        ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 3
        May 2022
        413 pages
        ISSN: 2375-4699
        EISSN: 2375-4702
        DOI: 10.1145/3505182


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 13 December 2021
        • Accepted: 1 August 2021
        • Revised: 1 July 2021
        • Received: 1 December 2020
        Published in TALLIP Volume 21, Issue 3


        Qualifiers

        • research-article
        • Refereed
