LCEval: Learned Composite Metric for Caption Evaluation

International Journal of Computer Vision (2019)

Abstract

Automatic evaluation metrics are of fundamental importance to the development and fine-grained analysis of captioning systems. While current evaluation metrics tend to achieve an acceptable correlation with human judgements at the system level, they fail to do so at the caption level. In this work, we propose a neural network-based learned metric to improve caption-level evaluation. To gain deeper insight into the parameters that impact a learned metric’s performance, this paper investigates the relationship between different linguistic features and the caption-level correlation of learned metrics. We also compare metrics trained with different training examples to measure the variations in their evaluation. Moreover, we perform a robustness analysis, which highlights the sensitivity of learned and handcrafted metrics to various sentence perturbations. Our empirical analysis shows that the proposed metric not only outperforms existing metrics in terms of caption-level correlation but also shows a strong system-level correlation with human assessments.

Notes

  1. http://cocodataset.org/#captions-leaderboard.

  2. https://www.flickr.com/.

  3. https://github.com/NaehaSharif/LCEVal.

References

  • Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., et al. (2016). Tensorflow: A system for large-scale machine learning. OSDI., 16, 265–283.

  • Aditya, S., Yang, Y., Baral, C., Aloimonos, Y., & Fermüller, C. (2017). Image understanding using vision and reasoning through scene description graph. Computer Vision and Image Understanding, 173, 33–45.

  • Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016) Spice: Semantic propositional image caption evaluation. In European conference on computer vision (pp. 382–398). Springer.

  • Banerjee, S., & Lavie, A. (2005). Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization (pp. 65–72).

  • Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.

  • Bojar, O., Graham, Y., Kamran, A., & Stanojević, M. (2016). Results of the wmt16 metrics shared task. In Proceedings of the first conference on machine translation: Volume 2, shared task papers (vol. 2, pp. 199–231).

  • Bojar, O., Helcl, J., Kocmi, T., Libovický, J., & Musil, T. (2017). Results of the wmt17 neural MT training task. In Proceedings of the second conference on machine translation (pp. 525–533).

  • Chen, X., Fang, H., Lin, T., Vedantam, R., Gupta, S., Dollár, P., et al. (2015a). Microsoft coco captions: Data collection and evaluation server. arXiv:1504.00325.

  • Chen, X., Fang, H., Lin, T. Y., Vedantam, R., Gupta, S., Dollár, P. et al. (2015b). Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.

  • Corston-Oliver, S., Gamon, M., Brockett, C. (2001). A machine learning approach to the automatic evaluation of machine translation. In Proceedings of the 39th annual meeting on association for computational linguistics (pp. 148–155). Association for Computational Linguistics.

  • Cui, Y., Yang, G., Veit, A., Huang, X., & Belongie, S. (2018). Learning to evaluate image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5804–5812).

  • Dancey, C. P., & Reidy, J. (2004). Statistics without maths for psychology. Harlow: Prentice Hall.

  • Denkowski, M., & Lavie, A. (2014). Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation (pp. 376–380).

  • Elliott, D., & Keller, F. (2014). Comparing automatic evaluation measures for image description. In Proceedings of the 52nd annual meeting of the association for computational linguistics (Volume 2: Short Papers)

  • Farhadi, A., Hejrati, M., Sadeghi, M. A., Young, P., Rashtchian, C., & Hockenmaier, J. et al. (2010). Every picture tells a story: Generating sentences from images. In European conference on computer vision (pp. 15–29). Springer.

  • Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics (pp. 249–256).

  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735–1780.

  • Hodosh, M., & Hockenmaier, J. (2016). Focused evaluation for image description with binary forced-choice tasks. In Proceedings of the 5th workshop on vision and language (pp. 19–28).

  • Hodosh, M., Young, P., & Hockenmaier, J. (2013). Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47, 853–899.

  • Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3128–3137).

  • Karpathy, A., Joulin, A., & Fei-Fei, L. F. (2014). Deep fragment embeddings for bidirectional image sentence mapping. In Advances in neural information processing systems (pp. 1889–1897).

  • Khosrovian, K., Pfahl, D., & Garousi, V. (2008). Gensim 2.0: A customizable process simulation model for software process evaluation. In International conference on software process (pp. 294–306). Springer.

  • Kilickaya, M., Erdem, A., Ikizler-Cinbis, N., & Erdem, E. (2016). Re-evaluating automatic metrics for image captioning. arXiv preprint arXiv:1612.07600.

  • Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

  • Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097–1105).

  • Kulesza, A., & Shieber, S. M. (2004). A learning approach to improving sentence-level MT evaluation. In Proceedings of the 10th international conference on theoretical and methodological issues in machine translation (pp. 75–84).

  • Kulkarni, G., Premraj, V., Ordonez, V., Dhar, S., Li, S., Choi, Y., et al. (2013). Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12), 2891–2903.

  • Lin, C. Y. (2004). Rouge: A package for automatic evaluation of summaries. In Text summarization branches out.

  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., & Ramanan, D. et al. (2014). Microsoft coco: Common objects in context. In European conference on computer vision (pp. 740–755). Springer.

  • Liu, D., & Gildea, D. (2005). Syntactic features for evaluation of machine translation. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization (pp. 25–32).

  • Liu, S., Zhu, Z., Ye, N., Guadarrama, S., & Murphy, K. (2016). Improved image captioning via policy gradient optimization of spider. arXiv preprint arXiv:1612.00370.

  • Lu, J., Xiong, C., Parikh, D., & Socher, R. (2017). Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (vol. 6).

  • Ma, Q., Bojar, O., Graham, Y. (2018). Results of the wmt18 metrics shared task: Both characters and embeddings achieve good performance. In Proceedings of the third conference on machine translation: shared task papers (pp. 671–688).

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).

  • Mitchell, M., Han, X., Dodge, J., Mensch, A., Goyal, A., & Berg, A. et al. (2012). Midge: Generating image descriptions from computer vision detections. In Proceedings of the 13th conference of the european chapter of the association for computational linguistics (pp. 747–756). Association for Computational Linguistics.

  • Ng, A. Y. (2004). Feature selection, l1 vs. l2 regularization, and rotational invariance. In Proceedings of the twenty-first international conference on machine learning (p. 78). ICML ’04, ACM, New York, NY, USA. http://doi.acm.org/10.1145/1015330.1015435.

  • Ordonez, V., Han, X., Kuznetsova, P., Kulkarni, G., Mitchell, M., Yamaguchi, K., et al. (2016). Large scale retrieval and generation of image descriptions. International Journal of Computer Vision, 119(1), 46–59.

  • Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics (pp. 311–318). Association for Computational Linguistics.

  • Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543).

  • Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., & Lazebnik, S. (2015). Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In 2015 IEEE international conference on computer vision (ICCV) (pp. 2641–2649). IEEE.

  • Ritter, S., Long, C., Paperno, D., Baroni, M., Botvinick, M., & Goldberg, A. (2015). Leveraging preposition ambiguity to assess compositional distributional models of semantics. In The fourth joint conference on lexical and computational semantics.

  • Rohrbach, M., Qiu, W., Titov, I., Thater, S., Pinkal, M., & Schiele, B. (2013). Translating video content to natural language descriptions. In Proceedings of the IEEE international conference on computer vision (pp. 433–440).

  • Sharif, N., White, L., Bennamoun, M., & Shah, S. A. A. (2018a). Nneval: Neural network based evaluation metric. In Proceedings of the 15th European conference on computer vision. Springer Lecture Notes in Computer Science.

  • Sharif, N., White, L., Bennamoun, M., Shah, S. A. A. (2018b). Learning-based composite metrics for improved caption evaluation. In Proceedings of ACL 2018, student research workshop (pp. 14–20).

  • van Miltenburg, E., & Elliott, D. (2017). Room for improvement in automatic image description: An error analysis. arXiv preprint arXiv:1704.04198.

  • Vedantam, R., Lawrence Zitnick, C., & Parikh, D. (2015). Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4566–4575).

  • Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In 2015 IEEE conference on Computer vision and pattern recognition (CVPR) (pp. 3156–3164) IEEE.

  • White, L., Togneri, R., Liu, W., & Bennamoun, M. (2015). How well sentence embeddings capture meaning. In Proceedings of the 20th Australasian document computing symposium (pp. 9:1–9:8). ADCS ’15, ACM. http://clic.cimec.unitn.it/marco/publications/ritter-etal-prepositions-starsem-2015.pdf.

  • Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., & Salakhudinov, R. et al. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning (pp. 2048–2057).

  • Yao, T., Pan, Y., Li, Y., Qiu, Z., & Mei, T. (2016). Boosting image captioning with attributes. OpenReview, 2(5), 8.

  • You, Q., Jin, H., & Luo, J. (2018). Image captioning at will: A versatile scheme for effectively injecting sentiments into image descriptions. arXiv preprint arXiv:1801.10121.

  • You, Q., Jin, H., Wang, Z., Fang, C., & Luo, J. (2016). Image captioning with semantic attention. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4651–4659).

  • Young, P., Lai, A., Hodosh, M., & Hockenmaier, J. (2014). From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2, 67–78.

Acknowledgements

We are grateful to NVIDIA for providing the Titan-Xp GPU that was used for the experiments. We also thank Somak Aditya for sharing the COMPOSITE dataset and Ramakrishna Vedantam for sharing the PASCAL50S and ABSTRACT50S datasets. Thanks to Yin Cui for providing the dataset containing the captions of the 12 teams that participated in the 2015 COCO captioning challenge. This work is supported by the Australian Research Council, Grant ARC DP150100294.

Author information

Correspondence to Naeha Sharif.

Communicated by Jakob Verbeek.

Appendix A

A.1 Performance on Machine Translations

To analyse the performance of our learned metric on sentences from a different domain, we experiment with a Machine Translation (MT) dataset. The MT datasets used in the WMT shared metrics tasks (Ma et al. 2018) are publicly available along with the human judgements. Metrics are required to evaluate translation quality based on a linguistic comparison against reference translations. To the best of our knowledge, such an analysis has not previously been performed for learned caption evaluation metrics. For this experiment we use a dataset from the WMT18 Metrics Shared Task (Ma et al. 2018), namely the Direct Assessment (DA) segment-level newstest2016 set, which contains segment-level human scores for machine translations.

We focus on tasks involving translation into English from other languages. DAseg WMT16 contains translations from six source languages: Finnish (FI), Czech (CS), Russian (RU), German (DE), Romanian (RO) and Turkish (TR), with 560 sentences each. We assess our metrics in terms of Kendall’s correlation with human judgements for each language, and also report the macro average of the Kendall correlations across languages. The human quality judgements are based on the similarity between the candidate and the reference translation: the more similar a candidate is to the reference, the better its quality.
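In practical terms, this segment-level evaluation amounts to computing Kendall’s correlation between a metric’s scores and the DA human scores separately for each language pair and then macro-averaging the per-language values. The following is a minimal sketch of that computation; the language-pair list follows the text above, but the score arrays are random placeholders standing in for actual metric outputs and human assessments, and none of the names are taken from the LCEval code.

```python
# Minimal sketch: per-language Kendall correlation plus macro average.
# The scores below are random placeholders; in the experiment they would be
# the metric scores and the WMT16 DA human scores (560 segments per pair).
import numpy as np
from scipy.stats import kendalltau

language_pairs = ["fi-en", "cs-en", "ru-en", "de-en", "ro-en", "tr-en"]

rng = np.random.default_rng(0)
metric_scores = {lp: rng.random(560) for lp in language_pairs}
human_scores = {lp: rng.random(560) for lp in language_pairs}

taus = []
for lp in language_pairs:
    tau, p_value = kendalltau(metric_scores[lp], human_scores[lp])
    taus.append(tau)
    print(f"{lp}: tau = {tau:.3f} (p = {p_value:.3g})")

# Macro average: unweighted mean of the per-language correlations.
print(f"macro-averaged tau = {np.mean(taus):.3f}")
```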

As can be seen in Table 14, the machine translation metric METEOR achieves the highest sentence-level correlation with human judgements across all languages, whereas captioning-specific metrics show comparatively lower performance. Amongst the captioning-specific metrics, LCEval and its variants show the best sentence-level correlation on most of the language pairs. One reason for this is that LCEval is a learned composite measure, and we believe that its performance could be improved further by training it on machine translations.

Table 14 Sentence-level Kendall correlation on WMT shared metrics segment-level newstest2016. All p-values are less than 0.001
Table 15 A comparison of the impact of three different feature groups on the learned metric in terms of correlation coefficients

A.2 Impact of Using Min and Mean Operations Over the Reference Captions

Our analysis shows that LCEval is sensitive to the quality and number of reference captions used. Moreover, some of the metrics whose scores are used as features in LCEval apply a ‘max’ operation over the reference captions, which limits their ability to exploit a larger number of references. Therefore, as a proof of concept, we trained two further versions of LCEval: \(\text {LCEval}_{min}\) and \(\text {LCEval}_{mean}\). For \(\text {LCEval}_{min}\) we used a ‘min’ operation over the reference captions for the following features: n-gram precision, unigram recall, HWCM and MOWE; the implementations of the other metrics were left unchanged. For \(\text {LCEval}_{mean}\) we used a ‘mean’ operation over the reference captions, whereas LCEval itself uses the ‘max’ operation.
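To make the difference between the three variants concrete, the sketch below shows the aggregation step in isolation, using a toy unigram-precision feature as a stand-in for the per-reference feature scores (n-gram precision, unigram recall, HWCM and MOWE); the helper functions and example sentences are illustrative assumptions, not code from the LCEval implementation.

```python
# Minimal sketch of aggregating a per-reference feature score with
# 'max' (LCEval default), 'min' (LCEval_min) or 'mean' (LCEval_mean).
from statistics import mean

def unigram_precision(candidate: str, reference: str) -> float:
    # Toy stand-in for a per-reference feature score.
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    return sum(w in ref for w in cand) / max(len(cand), 1)

def aggregate_over_references(candidate, references, how="max"):
    scores = [unigram_precision(candidate, r) for r in references]
    return {"max": max, "min": min, "mean": mean}[how](scores)

candidate = "a man is riding a bike"
references = [
    "a man rides a bicycle down the street",
    "a person on a bike",
    "someone is walking a dog",
]
for how in ("max", "min", "mean"):
    print(how, round(aggregate_over_references(candidate, references, how), 3))
```

Intuitively, with the ‘max’ operation an additional reference only changes the feature value if it beats the current best match, whereas the ‘min’ and ‘mean’ operations respond to every reference added, which is consistent with the behaviour discussed above.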

We test the trained metrics on the PASCAL50S dataset while varying the number of references, using a step size of five reference sentences. Figure 11 shows the results of this experiment. It is evident that using a ‘min’ or ‘mean’ operation helps the metric exploit a larger number of reference captions compared to the ‘max’ operation. However, we did not analyse the impact of the ‘min’ and ‘mean’ operations on correlation and robustness, and we leave this for future work.
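For context, a minimal sketch of how such an accuracy curve can be computed is given below. It assumes the usual PASCAL50S-style protocol, in which each test item is a pair of candidate captions with a known human preference and a metric is counted as correct when it scores the preferred caption at least as high as the alternative; the toy metric and data are placeholders rather than the scores behind Fig. 11.

```python
# Minimal sketch: pairwise accuracy as a function of the number of references.
def toy_metric(candidate, references):
    # Word-overlap stand-in for a real caption metric such as LCEval.
    cand = set(candidate.lower().split())
    return max(len(cand & set(r.lower().split())) / max(len(cand), 1)
               for r in references)

def pairwise_accuracy(pairs, score_fn, n_refs):
    # A pair is (human-preferred caption, other caption, reference captions).
    correct = 0
    for preferred, other, refs in pairs:
        refs = refs[:n_refs]  # restrict to the first n_refs references
        if score_fn(preferred, refs) >= score_fn(other, refs):
            correct += 1
    return correct / len(pairs)

refs = ["a man rides a bike", "a person cycling on a road",
        "a cyclist on the street", "a man on a bicycle", "a man riding"]
pairs = [("a man riding a bike", "a dog sleeping on a couch", refs)]

# Sweep the number of references in steps of five, as in Fig. 11.
for n_refs in range(5, len(refs) + 1, 5):
    print(n_refs, pairwise_accuracy(pairs, toy_metric, n_refs))
```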

Fig. 11 Accuracy graphs with a variable number of reference captions for LCEval, \(\text {LCEval}_{min}\) and \(\text {LCEval}_{mean}\). LCEval uses the ‘max’ operation over the reference captions, whereas \(\text {LCEval}_{min}\) and \(\text {LCEval}_{mean}\) use the ‘min’ and ‘mean’ operations, respectively. Using a ‘min’ or ‘mean’ operation helps the metric exploit a larger number of reference captions compared to the ‘max’ operation

A.3 Qualitative Results

Fig. 12 Candidate caption: ‘a man is helping a little boy learn how to ride a bicycle’. Reference: ‘The young boy learns how to ride a bike with his dad’. SPICE score: 0.59. \(\text {CIDEr}_D\) score: 1.38. METEOR score: 0.43. LCEval score: 0.99

Fig. 13 Candidate caption: ‘a young boy tossing a soccer ball across a field’. Reference: ‘A young boy deflecting a soccer ball away from the net’. SPICE score: 0.59. \(\text {CIDEr}_D\) score: 0.93. METEOR score: 0.32. LCEval score: 0.95

Various examples of candidate captions, along with the ground-truth captions and their evaluation scores generated by SPICE, \(\text {CIDEr}_D\), METEOR and our proposed metric LCEval, are shown in Figs. 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 and 24. Note that \(\text {CIDEr}_D\) scores are in the range of 0–10, whereas all other metric scores are in the range of 0–1.

Fig. 14 Candidate caption: ‘a man on a motorcycle is riding a dirt bike’. Reference: ‘Two motocross bikes are being raced around a dirt track’. SPICE score: 0.07. \(\text {CIDEr}_D\) score: 0.49. METEOR score: 0.19. LCEval score: 0.55

Fig. 15 Candidate caption: ‘a man is standing on a bench with his arms outstretched’. Reference: ‘People travel up the elevator together’. SPICE score: 0.0. \(\text {CIDEr}_D\) score: 0.0. METEOR score: 0.03. LCEval score: 0.25

Fig. 16 Candidate caption: ‘a toddler girl plays with another younger toddler girl’. Reference: ‘Two young children sitting together playing’. SPICE score: 0.0. \(\text {CIDEr}_D\) score: 0.0. METEOR score: 0.02. LCEval score: 0.19

Fig. 17 Candidate caption: ‘a brown and white dog is sitting in some grass and a red and white fire hydrant’. Reference: ‘a brown and white dog is sitting in some grass and a red and white fire hydrant’. SPICE score: 0.66. \(\text {CIDEr}_D\) score: 3.09. METEOR score: 0.49. LCEval score: 0.99

Fig. 18 Candidate caption: ‘a woman handing another woman a birthday cake filled with candles’. Reference: ‘A woman holding a blue birthday cake with stars and candles on it and another woman in front of the cake’. SPICE score: 0.27. \(\text {CIDEr}_D\) score: 1.13. METEOR score: 0.23. LCEval score: 0.80

Fig. 19 Candidate caption: ‘a few people walking in a store near each other’. Reference: ‘two people walking near a person using a cell phone’. SPICE score: 0.13. \(\text {CIDEr}_D\) score: 0.86. METEOR score: 0.21. LCEval score: 0.74

Fig. 20 Candidate caption: ‘a man wearing a reflective vest sits on the sidewalk and holds up pamphlets with bicycles on the cover’. Reference: ‘Man in bright yellow vest displays bicycle safety information on street’. SPICE score: 0.14. \(\text {CIDEr}_D\) score: 0.30. METEOR score: 0.23. LCEval score: 0.65

Fig. 21 Candidate caption: ‘a man is surfing in a large wave’. Reference: ‘A man is standing on an elephant lying in some water’. SPICE score: 0.09. \(\text {CIDEr}_D\) score: 0.19. METEOR score: 0.17. LCEval score: 0.26

Fig. 22 Candidate caption: ‘an elderly women is sitting in a chair on one side of a room with a group of musicians including a cellist and guitar player and on the other side of the room a group of people are sitting on chairs’. Reference: ‘Music being played by several individuals while a happy crowd sits and listens’. SPICE score: 0.2. \(\text {CIDEr}_D\) score: 0.22. METEOR score: 0.16. LCEval score: 0.85

Fig. 23 Candidate caption: ‘the iguanas wrestled along the rocky water bank’. Reference: ‘Two oriental lizards are fighting for dominance in a small pond’. SPICE score: 0.0. \(\text {CIDEr}_D\) score: 0.04. METEOR score: 0.18. LCEval score: 0.36

Fig. 24 Candidate caption: ‘a man in a red shirt is holding a large sign’. Reference: ‘A young lady and man dressed in warriors costume wielding sticks with a group of people in the background’. SPICE score: 0.04. \(\text {CIDEr}_D\) score: 0.14. METEOR score: 0.09. LCEval score: 0.12

Cite this article

Sharif, N., White, L., Bennamoun, M. et al. LCEval: Learned Composite Metric for Caption Evaluation. Int J Comput Vis 127, 1586–1610 (2019). https://doi.org/10.1007/s11263-019-01206-z
