Abstract
Automatic video captioning is one of the ultimate challenges of Natural Language Processing, boosted by the omnipresence of video and the release of large-scale annotated video benchmarks. However, the specificity and quality of the captions vary considerably, having an adverse effect on the quality of the trained captioning models. In this work, we address this issue by proposing automatic strategies for optimizing the annotations of video material, removing annotations that are not semantically relevant and generating new and more informative captions. We evaluate our approach on the MSR-VTT challenge with a state-of-the-art deep learning video-to-language model. Our code is available at https://github.com/lpmayos/mcv_thesis.
Notes
- 1.
- 2.
Note that 1.5 × IQR, where IQR denotes the interquartile range, is a standard definition for suspected outliers.
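The 1.5 × IQR outlier rule mentioned in the note can be sketched as follows; the function names are illustrative, not taken from the paper.

```python
from statistics import quantiles

def iqr_outlier_bounds(scores):
    """Return the (low, high) fences of the standard 1.5 * IQR rule:
    values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are suspected outliers."""
    q1, _, q3 = quantiles(scores, n=4)  # first quartile, median, third quartile
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def filter_outliers(scores):
    """Keep only the values that fall inside the IQR fences."""
    low, high = iqr_outlier_bounds(scores)
    return [s for s in scores if low <= s <= high]
```

For example, in the list `[1, 2, 3, 4, 5, 100]` only the value 100 falls outside the fences and is discarded.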
Acknowledgment
This work is partly supported by the Spanish Ministry of Economy and Competitiveness under the Ramón y Cajal fellowships, and by the KRISTINA project, funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 645012. The Titan X GPU used for this research was donated by the NVIDIA Corporation.
Copyright information
© 2018 Springer International Publishing AG
Cite this paper
Pérez-Mayos, L., Sukno, F.M., Wanner, L. (2018). Improving the Quality of Video-to-Language Models by Optimizing Annotation of the Training Material. In: Schoeffmann, K., et al. (eds.) MultiMedia Modeling (MMM 2018). Lecture Notes in Computer Science, vol. 10704. Springer, Cham. https://doi.org/10.1007/978-3-319-73603-7_23
DOI: https://doi.org/10.1007/978-3-319-73603-7_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73602-0
Online ISBN: 978-3-319-73603-7