Language pattern learning for word sense induction and disambiguation☆
Introduction
Limited contextual information, poor conformity to grammatical rules, and high redundancy make semantic annotation of social media texts challenging, especially for word sense disambiguation systems [1].
A word can be interpreted in multiple ways depending on the context in which it occurs, a phenomenon known as lexical ambiguity [2], [3]. Sometimes words carry implicit semantics (e.g., to convey humor, irony, or wordplay). For example, the tweet “Blond, brunette or red-headed Devassa ???” plays with English words that usually refer to hair color and the Portuguese word devassa. The intended meaning of this word in this context is not the one you may find in a dictionary, but beer! It can be inferred by considering the pattern that appears in the tweets presented in Table 2, as explained below. We have found that implicit semantics induced by the language patterns investigated in this paper are quite common in colloquial language, particularly in social media. In such cases, current annotation methods frequently fail to capture the correct meaning of certain words, leading to results with low precision and recall [4].
Some of the most prominent tasks for modeling and resolving the lexical ambiguity problem are Word Sense Disambiguation (WSD) and Word Sense Induction (WSI) [3], [5]. Both tasks are fundamental in Natural Language Processing (NLP). WSI automatically discovers the possible senses of target words [6] in text documents, based on the context in which each word appears. WSD, in turn, automatically chooses among the possible meanings to assign the most probable one to each target word. Some WSD methods rely on fixed inventories of word senses such as WordNet [7]. Many solutions for other widely used semantic annotation tasks such as Named Entity Recognition/Normalization/Disambiguation (NER/NEN/NED) or Entity Linking (EL) [8], [9] also rely on pre-defined sense inventories.
However, building and constantly updating large inventories of word senses or semantic knowledge bases (e.g., WordNet [10], BabelNet [11], Yago [12]) is an expensive and difficult task. As a result, important current senses of words used in social media, by specific groups, in certain geographic regions, or in particular domains may not appear in sense inventories. For example, consider tweets (i) to (iii) transcribed below and the sense of the word polar in each of them.
- (i) And the journey continues: the Arctic Sunrise reaches the polar region.
- (ii) This is the first-ever view of the polar region of Jupiter.
- (iii) I wish to drink an ice-cold polar.
Consider the sense inventory extract for the word polar presented in Table 1. The intended sense of polar in tweets (i) and (ii) is that of line #2 in Table 1, because of the clear references to regions of Earth and Jupiter, respectively. However, in tweet (iii), polar plays the role of a proper name, and the verb to drink induces a beverage sense that is outside the scope of some sense inventories. Any sense inventory that does not know about the beer brand called Polar will have difficulty capturing the exact word sense. Sometimes the intended word sense exists in the inventory, but it is hard to devise algorithms that disambiguate to the correct sense. For example, by manually searching BabelNet, we found 19 senses (12 nouns and 6 adjectives) for the word polar. Two of these senses are related to beer. We then submitted the sentence of tweet (iii) to the Babelfy tool, which automatically annotated the word polar in that sentence with the wrong sense of arctic (extremely cold).
In this paper, we propose a new approach and automated methods to induce and disambiguate the correct sense of words in text. Like many WSI/WSD approaches, it takes advantage of contextual information. However, to the best of our knowledge, our proposal is the first to automatically learn context information from a collection of short texts as language patterns defined by sequences of Morpho-Semantic Components (MSC), which we call morpho-semantic patterns. Each MSC has the most probable senses (e.g., candidate concepts or named entities) and the most probable morphosyntactic classes (e.g., Part-of-Speech (PoS) tags) for a word in a text document. A pattern can be seen as a sequence of pairs of morphosyntactic class and sense that repeatedly appear in the MSCs of distinct short texts (e.g., tweets). An instance of a pattern is a sequence of not necessarily adjacent but consecutive MSCs (i.e., words) in some text, in which the most probable morphosyntactic class and sense of each MSC match those in the respective position of the pattern.
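The notions of MSC, pattern, and pattern instance can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `MSC`, `Pattern`, `element_matches`, and `find_instance` are hypothetical, confidence scores are assumed to be plain floats, the sense label "Any" is assumed to act as a wildcard, and "not necessarily adjacent" is read as an in-order match that may skip MSCs.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class MSC:
    """A word with candidate PoS tags and candidate senses (hypothetical layout)."""
    word: str
    pos: Dict[str, float]     # candidate PoS tag -> confidence (assumed scale)
    senses: Dict[str, float]  # candidate sense -> confidence (assumed scale)

    def top_pos(self) -> Optional[str]:
        return max(self.pos, key=self.pos.get) if self.pos else None

    def top_sense(self) -> Optional[str]:
        return max(self.senses, key=self.senses.get) if self.senses else None


# A pattern is a sequence of (PoS tag, sense) pairs; "Any" is assumed to be a wildcard.
Pattern = List[Tuple[str, str]]


def element_matches(msc: MSC, pos: str, sense: str) -> bool:
    """True if the MSC's most probable PoS and sense agree with one pattern element."""
    return msc.top_pos() == pos and (sense == "Any" or msc.top_sense() == sense)


def find_instance(mscs: List[MSC], pattern: Pattern) -> Optional[List[int]]:
    """Return the indices of the first in-order instance (gaps allowed), or None."""
    indices: List[int] = []
    start = 0
    for pos, sense in pattern:
        for i in range(start, len(mscs)):
            if element_matches(mscs[i], pos, sense):
                indices.append(i)
                start = i + 1
                break
        else:
            return None  # this pattern element has no match in the rest of the text
    return indices


# Illustrative data only; tags, senses, and scores are made up for the example.
tweet = [
    MSC("wish", {"verb": 0.9}, {"Desire": 0.8}),
    MSC("drink", {"verb": 0.9, "noun": 0.1}, {"Ingest": 0.9}),
    MSC("ice-cold", {"adjective": 0.95}, {"Cold": 0.7}),
    MSC("beer", {"noun": 0.99}, {"Beer": 0.9}),
]
pattern: Pattern = [("verb", "Ingest"), ("adjective", "Any"), ("noun", "Beer")]
instance = find_instance(tweet, pattern)
```

Here `instance` holds the positions of the MSCs for drink, ice-cold, and beer; the MSC for wish is skipped because its most probable sense does not match the first pattern element.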
First, we provide formal definitions of MSC and patterns, and an unsupervised algorithm to mine these patterns. Then, we show that these patterns are frequent in tweets. Finally, we exploit these linguistic patterns to perform correct WSI/WSD in cases where current tools fail, leading to better precision and recall of semantic annotations.
Table 2 provides an example of a pattern arising in short texts (#1 to #5) extracted from Twitter. Each of these tweets (one per line) has an instance of the pattern, i.e., a sequence of distinct MSCs (words) whose candidate morphosyntactic classes and senses match a three-element sequence of PoS tag and sense pairs. Each instance of this pattern is a sequence of 3 MSCs (columns MSC1, MSC2 and MSC3) highlighted in bold in the respective tweet (line). These sequences are similar in terms of the PoS tags and meanings of their respective MSCs. MSC1 is a verb referring to liquid ingestion (sense = Ingest). MSC2 is an adjective that can refer to cold, extremely cold or hair color, among other possibilities (sense = Any). MSC3 is a noun, but only for tweets #1 and #2 is its sense correctly disambiguated, to beer and beer brand, respectively, by current automatic approaches.
Target words (whose sense has yet to be resolved) in the other tweets are highlighted in red. Thanks to the pattern established by the MSC sequences of the first two tweets, our method can also disambiguate them. Notice the partial adherence of the MSC sequence in #3 to this pattern. Each MSC in #3 matches the respective one in the pattern, except the word polar, whose PoS class is also a noun, but whose sense is initially undetermined. This partial adherence to the pattern induces the sense of the corresponding MSC in the previous tweets onto the word polar in tweet #3, and allows it to be disambiguated to beer.
Analogously, one can also induce the sense of the word Devassa and disambiguate it to beer in tweets #4 and #5, as well as in other short texts (from authors with similar language habits) in which this word appears with the same pattern or a similar one (e.g., the tweet “Blond, brunette or red-headed Devassa ???”). We employ this rationale to automatically mine patterns and use them to perform WSI and WSD when there is a partial match with some mined pattern involving a word that current automatic approaches have difficulty disambiguating.
Some approaches for WSI/WSD learn and employ lexical patterns [13]. Nevertheless, our patterns are an alternative for some situations in which current approaches fail, as exemplified above. These patterns can be used to improve semantic annotations produced by a variety of annotation methods [14], [15]. The improved semantic annotations can then help to boost a variety of computational tasks and applications, ranging from text simplification [16], summarization [17] and knowledge base enrichment [1], [10], [11], [18], [19] to event detection [20], sentiment analysis [21], and question answering [22].
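The partial-matching rationale described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' method: `induce_sense` is a hypothetical helper, each word is reduced to a `(word, PoS tag, sense)` triple with `None` standing for an undetermined sense, the sequence is assumed to be already aligned position by position with the pattern, and "Any" is assumed to be a wildcard sense.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str, Optional[str]]  # (word, PoS tag, sense or None if unresolved)
PatternElem = Tuple[str, str]           # (PoS tag, sense); "Any" assumed to be a wildcard


def induce_sense(tokens: List[Token],
                 pattern: List[PatternElem]) -> Optional[Tuple[str, str]]:
    """Return (word, induced sense) when exactly one token has an undetermined
    sense and every other position agrees with the pattern; otherwise None."""
    if len(tokens) != len(pattern):
        return None
    unresolved: List[int] = []
    for idx, ((word, pos, sense), (p_pos, p_sense)) in enumerate(zip(tokens, pattern)):
        if pos != p_pos:
            return None            # the PoS class must match at every position
        if p_sense == "Any" or sense == p_sense:
            continue               # this position adheres to the pattern
        if sense is None:
            unresolved.append(idx)  # candidate target word
        else:
            return None            # a known sense conflicts with the pattern
    if len(unresolved) != 1:
        return None
    target = unresolved[0]
    return tokens[target][0], pattern[target][1]


# Illustrative data modeled on tweet (iii); the labels are assumptions.
pattern = [("verb", "Ingest"), ("adjective", "Any"), ("noun", "Beer")]
tweet_3 = [("drink", "verb", "Ingest"),
           ("ice-cold", "adjective", "Cold"),
           ("polar", "noun", None)]
induced = induce_sense(tweet_3, pattern)
```

With these inputs, `induced` pairs the word polar with the sense Beer. A sequence whose senses are all resolved, or whose PoS tags disagree with the pattern, yields `None`.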
The main contributions of this paper can be stated as follows:
- 1. the introduction and formal definition of MSC, MSC sequences, patterns and pattern instances;
- 2. an approach for word sense induction and disambiguation based on patterns;
- 3. an algorithm to find the most frequent patterns in a set of documents that have been previously annotated with candidate morphosyntactic classes and candidate senses of semantically relevant words;
- 4. two methods for word sense induction and disambiguation based on patterns.
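To give a rough idea of what finding frequent patterns by support can look like, here is a minimal Python sketch. It is an assumption-laden simplification rather than the algorithm contributed by the paper: `mine_frequent_patterns` is a hypothetical name, each document is assumed to be already reduced to one `(PoS tag, sense)` pair per relevant word, only contiguous n-grams are counted, and no generalization to a wildcard sense is attempted.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (most probable PoS tag, most probable sense)


def mine_frequent_patterns(docs: List[List[Pair]],
                           n: int,
                           min_support: float) -> Dict[Tuple[Pair, ...], float]:
    """Return every contiguous n-gram of pairs whose support (the fraction of
    documents containing it at least once) is at least min_support."""
    if not docs:
        return {}
    counts: Counter = Counter()
    for doc in docs:
        # A set ensures each n-gram is counted once per document.
        seen = {tuple(doc[i:i + n]) for i in range(len(doc) - n + 1)}
        counts.update(seen)
    total = len(docs)
    return {gram: count / total
            for gram, count in counts.items()
            if count / total >= min_support}


# Illustrative annotated documents; the labels are made up for the example.
docs = [
    [("verb", "Ingest"), ("adjective", "Cold"), ("noun", "Beer")],
    [("verb", "Ingest"), ("adjective", "Cold"), ("noun", "Beer")],
    [("noun", "Region"), ("adjective", "Polar"), ("noun", "Planet")],
]
frequent = mine_frequent_patterns(docs, n=3, min_support=0.5)
```

In this toy corpus only the trigram shared by the first two documents passes the 50% threshold. Counting each n-gram once per document keeps support interpretable as a fraction of documents.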
The results of experiments reveal major characteristics of patterns mined in a set of tweets, and the contribution of word sense induction and disambiguation based on the mined patterns to improve precision and recall of semantic annotations produced by state-of-the-art systems. They also show that the variation of our method for word sense induction and disambiguation relying on both PoS tagging and word sense confidence leads to the best results.
The remainder of this paper is organized as follows. Section 2 provides some foundations necessary for understanding our proposal. Section 3 discusses related work. Section 4 formally defines key concepts of our approach such as MSC and patterns. Section 5 describes our approach in a top-down fashion, i.e., first in terms of main stages and then in terms of the details of key tasks. Section 6 and Section 7 report the experiments for performance evaluation and discuss their results. Finally, Section 8 presents conclusions and indications of future work.
Section snippets
Preliminaries
This section provides an overview of methods for capturing word senses, the problem tackled in this paper. It also reviews sequence pattern mining and how it could be used to find textual patterns that are less elaborate than our patterns. This background helps to understand our proposal and its distinguishing traits.
Related work
Many WSD/WSI approaches are based on unsupervised clustering techniques. Here, we consider four approaches according to the existing literature: (i) clustering based on contextual information, (ii) approaches based on word information, (iii) graph based techniques, and (iv) probabilistic models.
The standard approach is the first one, in which the contexts of word instances are represented in a first-order or second-order vector space model (e.g., bag-of-words). The context vectors are obtained by
Morpho-semantic components
Some words in text can have their meaning induced and disambiguated by exploiting morpho-semantic patterns. A morpho-semantic pattern is characterized by a sequence of Morpho-Semantic Components (MSC) whose length is based on n-grams. An MSC refers to a word in a text document and its associated sets of candidate morphosyntactic classes (e.g., from some PoS tag set) and senses (e.g., from some sense class set), as stated in Definition 1.
Definition 1 (Morpho-Semantic Component, MSC). Given a text document with identifier , a set of
The proposed approach
This section describes our approach for coping with challenging instances of the WSI and WSD problems in sequences of texts such as social media posts. Fig. 1 provides an overview of the proposed process, which is composed of five stages: (i) Pre-processing, (ii) Semantic and morphosyntactic annotation, (iii) MSC extraction, (iv) Matching MSCs & Mining patterns, and (v) WSI/WSD based on patterns. In this figure, the continuous lines indicate data flow and the dotted lines linking
Experiments
The experiments aim to investigate the prevalence of patterns in tweets, their support by perfectly matching sequences in these tweets, and the gains obtained by doing WSI/WSD based on the mined patterns on words whose sense is left unresolved by typical annotation tools. This section describes the infrastructure, including off-the-shelf tools, the parameters, the dataset, and the evaluation metrics used in these experiments. It also presents a characterization of the mined patterns and their
WSI/WSD results
Table 8 presents the WSI/WSD performance for FreeLing, FreeLing&Babelfy, and some variations of our proposal based on patterns (below the dashed line). Table 9 and Table 10, in turn, detail the gains obtained by applying variations of our approach with support 2% and 5%, respectively. Each table shows the total number of words whose sense has been handled by the respective approach (#Word), the number of matchings in terms of just surface name (Mention) or surface name and sense
Conclusions and future work
This paper presented an approach for WSI/WSD that exploits language patterns composed of sequences of morpho-semantic components (MSC) that are quite frequent in some short text documents such as tweets. We presented an algorithm to mine these patterns, which we call morpho-semantic patterns, from sets of text documents. It relies on annotations of morphosyntactic classes and senses of relevant words. These annotations can be produced by a variety of alternative off-the-shelf tools for NLP. Our pattern
Acknowledgments
This work was conducted during a doctorate partially supported by grants from CAPES (Brazilian Coordination of Superior Level Staff Improvement), a research support agency of the Ministry of Education of Brazil. CAPES also supported an internship for international cooperation with the TALN (Natural Language Processing Research Group) at the Pompeu Fabra University in Barcelona, Spain. The last author acknowledges support from the Spanish Government under the María de Maeztu Units of Excellence
References (53)
- et al., Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities, Artificial Intelligence (2016)
- et al., Extracting information from social media with GATE
- et al., The baquara 2 knowledge-based framework for semantic enrichment and analysis of movement data, Data Knowl. Eng. (2015)
- et al., A text summarization method based on fuzzy rules and applicable to automated assessment, Expert Syst. Appl. (2019)
- et al., Automatising the learning of lexical patterns: An application to the enrichment of wordnet by extracting semantic relationships from wikipedia, Data Knowl. Eng. (2007)
- et al., Spreading semantic information by word sense disambiguation, Knowl.-Based Syst. (2017)
- Hyperlex: Lexical cartography for information retrieval, Comput. Speech Lang. (2004)
- et al., Word sense induction with multilingual features representation
- D. Alagić, J. Šnajder, S. Padó, Leveraging lexical substitutes for unsupervised word sense induction, in: Thirty-Second...
- et al., Semeval-2013 task 11: Word sense induction and disambiguation within an end-user application
- Automatic word sense discrimination, Comput. Linguist.
- A quick tour of word sense disambiguation, induction and related approaches
- Entity linking leveraging: Automatically generated annotation
- Joint inference of named entity recognition and normalization for tweets
- WordNet
- BabelNet: Building a very large multilingual semantic network
- Yago3: A knowledge base from multilingual wikipedias
- Pengyuan@ PKU: Extracting infrequent sense instance with the same n-gram pattern for the semeval-2010 task 15
- Entity linking meets word sense disambiguation: A unified approach, Trans. Assoc. Comput. Linguist.
- Making it simplext: Implementation and evaluation of a text simplification system for spanish, ACM Trans. Access. Comput.
- An approach for building lexical-semantic resources based on heterogeneous information sources
- What is new in our city? a framework for event extraction using social media posts
- Computational advertising in social networks: An opinion mining-based approach
- Lexical disambiguation in natural language questions (NLQs)
- Speech and Language Processing
- Word sense disambiguation: A survey, ACM Comput. Surv.
Cited by (3)
Enhancement of a multi-dialectal sentiment analysis system by the detection of the implied sarcastic features
2021, Knowledge-Based Systems
Citation excerpt: They evaluated the proposed model by applying product reviews corpus and obtained an accuracy of 82.07%. Goularte et al. [19] proposed an approach to disambiguate word senses of short texts through the use of fuzzy lexico-semantic patterns. Orkphol and Yang [20] proposed a Word2vec method to construct a context sentence vector and give a cosine similarity score to each word sense to compute the similarity between sentence vectors.
Context-Based Text-Graph Embeddings in Word-Sense Induction Tasks
2022, Communications in Computer and Information Science
The construction of an accurate Arabic sentiment analysis system based on resources alteration and approaches comparison
2022, Applied Computing and Informatics
☆ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.105017.