ABSTRACT
Although a wide range of applications have been proposed in the field of multimodal natural language processing, very few works have tackled multimodal relational lexical semantics. In this paper, we present the first attempt to use visual clues to identify lexico-semantic relations, which embody linguistic phenomena such as synonymy, co-hyponymy, and hypernymy. While traditional methods rely on the paradigmatic approach and/or the distributional hypothesis, we hypothesize that visual information can supplement textual information, drawing on the apperceptum subcomponent of the semiotic textology linguistic theory. To that end, we automatically extend two gold-standard datasets with visual information and develop several fusion techniques that combine the textual and visual modalities following the patch-based strategy. Experimental results on the multimodal datasets show that visual information can supply semantics missing from textual encodings, yielding reliable performance improvements.
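As a rough illustration of what combining textual and visual modalities can mean, the sketch below contrasts two generic fusion families: feature-level (early) fusion by concatenation and decision-level (late) fusion by score averaging. The abstract does not specify the fusion techniques actually developed, so every function name, the normalization step, and the toy vectors here are assumptions made for illustration, and the patch-based strategy is not modeled.

```python
from typing import List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm > 0 else list(v)


def early_fusion(text_vec: Sequence[float], visual_vec: Sequence[float]) -> Vector:
    """Hypothetical feature-level fusion: normalize each modality, then concatenate."""
    return l2_normalize(text_vec) + l2_normalize(visual_vec)


def late_fusion(text_score: float, visual_score: float, weight: float = 0.5) -> float:
    """Hypothetical decision-level fusion: weighted average of per-modality scores."""
    return weight * text_score + (1.0 - weight) * visual_score


def pair_features(word_a: Vector, word_b: Vector) -> Vector:
    """Represent a word pair by concatenating the two fused word vectors."""
    return word_a + word_b


# Toy embeddings (made-up numbers, not taken from any dataset).
cat = early_fusion([3.0, 4.0], [0.0, 2.0])
animal = early_fusion([1.0, 0.0], [5.0, 0.0])
features = pair_features(cat, animal)  # input to a relation classifier
combined = late_fusion(text_score=0.8, visual_score=0.4)
```

In this sketch, `features` would feed a classifier that predicts the relation holding between the two words, while `late_fusion` instead merges the outputs of two separately trained, single-modality classifiers.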
Index Terms
- Combining Vision and Language Representations for Patch-based Identification of Lexico-Semantic Relations