DOI: 10.1145/3503161.3548299
research-article

Combining Vision and Language Representations for Patch-based Identification of Lexico-Semantic Relations

Published: 10 October 2022

ABSTRACT

Although a wide range of applications have been proposed in the field of multimodal natural language processing, very few works have tackled multimodal relational lexical semantics. In this paper, we propose the first attempt to use visual clues to identify lexico-semantic relations, which embody linguistic phenomena such as synonymy, co-hyponymy, or hypernymy. While traditional methods take advantage of the paradigmatic approach and/or the distributional hypothesis, we hypothesize that visual information can supplement textual information, relying on the apperceptum subcomponent of the semiotic textology linguistic theory. For that purpose, we automatically extend two gold-standard datasets with visual information and develop different fusion techniques to combine the textual and visual modalities following the patch-based strategy. Experimental results over the multimodal datasets show that visual information can supplement the missing semantics of textual encodings with reliable performance improvements.

Published in

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161

Copyright © 2022 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%
