MSC+: Language pattern learning for word sense induction and disambiguation

doi:10.1016/j.knosys.2019.105017

Knowledge-Based Systems

Volume 188, 5 January 2020, 105017

https://doi.org/10.1016/j.knosys.2019.105017

Highlights

• The use of fuzzy lexico-semantic patterns in WSI/WSD systems.
• The use of Morpho-semantic Components (MSC) in WSI/WSD systems.
• An algorithm for mining sequences of MSC.
• Two methods implemented as a prototype to evaluate the proposed approach.
• Experimental results showing improvements over existing methods.

Abstract

Identifying the correct meaning of words in context or discovering new word senses is particularly useful for several tasks such as question answering, information extraction, information retrieval, and text summarization. However, especially in the context of user-generated content and online communication (e.g. Twitter), speakers continuously craft new meanings by using existing words in novel contexts. Consequently, lexical semantic inventories and systems have difficulty coping with semantic drift. In this work, we propose an approach to induce and disambiguate the senses of some target words in collections of short texts, such as tweets, through the use of fuzzy lexico-semantic patterns that we define as sequences of Morpho-semantic Components (MSC). We learn these patterns, which we call $MSC^{+}$ patterns, automatically from text data. Experimental results show that instances of some $MSC^{+}$ patterns arise in a number of tweets, although different tweets sometimes use different words to convey the sense of the respective MSC. Exploiting $MSC^{+}$ patterns when they induce semantics on target words enables effective word sense disambiguation mechanisms, leading to improvements over the state of the art.

Introduction

Limited contextual information, poor conformity to grammatical rules, and high redundancy make semantic annotation of social media texts challenging, specifically for word sense disambiguation systems [1].

A word can be interpreted in multiple ways depending on the context in which it occurs, a phenomenon known as lexical ambiguity [2], [3]. Sometimes words carry implicit semantics (e.g., for humor, irony, or wordplay). For example, the tweet “Blond, brunette or red-headed Devassa ???”¹ plays with English words that usually refer to hair color and the Portuguese word devassa. The intended meaning of this word in this context is not the one found in a dictionary, but beer. It can be inferred by considering the pattern appearing in the tweets presented in Table 2, as explained below. We have found that implicit semantics induced by the language patterns investigated in this paper is quite common in colloquial language, particularly in social media. In such cases, current annotation methods frequently fail to capture the correct meaning of certain words, leading to results with low precision and recall [4].

Some of the most prominent tasks for modeling and resolving the lexical ambiguity problem are Word Sense Disambiguation (WSD) and Word Sense Induction (WSI) [3], [5]. Both tasks are fundamental in Natural Language Processing (NLP). WSI automatically discovers the possible senses of target words [6] in text documents, according to the context in which each word appears. WSD, in turn, automatically chooses among the possible meanings and assigns the most probable one to each target word. Some methods for WSD rely on fixed inventories of word senses such as WordNet [7]. Many solutions for other widely used semantic annotation tasks such as Named Entity Recognition/Normalization/Disambiguation (NER/NEN/NED) or Entity Linking (EL) [8], [9] also rely on pre-defined sense inventories.

However, building and constantly updating large inventories of word senses or semantic knowledge bases (e.g., WordNet [10], BabelNet [11], Yago [12]) is an expensive and difficult task. As a result, important current senses of words used in social media, by specific groups, in certain geographic regions or in particular domains may not appear in sense inventories. For example, consider the tweets² (i) to (iii) transcribed below and the sense of the word polar in each one of them.

(i) And the journey continues: the Arctic Sunrise reaches the polar region.
(ii) This is the first-ever view of the polar region of Jupiter.
(iii) I wish to drink an ice-cold polar.

Consider the sense inventory extract for the word polar presented in Table 1. The intended sense for polar in tweets (i) and (ii) is that of line #2 in Table 1, because of clear references to regions of Earth and Jupiter, respectively. However, in tweet (iii), polar plays the role of a proper name and the verb to drink induces a sense of beverage, which is out of the scope of some sense inventories. Any sense inventory that does not know about the beer brand called Polar will have difficulty capturing the exact word sense. Sometimes, the intended word sense exists in the inventory, but it is hard to devise algorithms that disambiguate to the correct sense. For example, by manually searching BabelNet, we found 19 senses (12 nouns and 6 adjectives) for the word polar. Two of these senses are related to beer. Then, we submitted the sentence of message (iii) to the Babelfy tool,³ which automatically annotated the word polar in that sentence with the wrong sense of arctic (extremely cold).

In this paper, we propose a new approach and automated methods to induce and disambiguate the correct sense of words in text. Like many WSI/WSD approaches, it takes advantage of contextual information. However, to the best of our knowledge, our proposal is the first one to automatically learn context information from a collection of short texts as language patterns defined by sequences of Morpho-Semantic Components (MSC), which we call $MSC^{+}$ patterns. Each MSC has the most probable senses (e.g. candidate concepts or named entities) and the most probable morphosyntactic classes (e.g. Part-of-Speech (PoS) tags) for a word in a text document. An $MSC^{+}$ pattern can be seen as a sequence of pairs $\langle morphosyntactic\ class, Sense \rangle$ that repeatedly appear in the MSCs of distinct short texts (e.g. tweets). An instance of an $MSC^{+}$ pattern is a sequence of not necessarily adjacent but consecutive MSCs (i.e. words) in some text, in which the most probable morphosyntactic class and sense for each MSC match those in the respective position of the pattern.
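The notions of MSC, pattern and pattern instance described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `MSC`, the function `find_instance`, the confidence scores, the sense labels and the example tweet are all hypothetical, and the sketch assumes that a pattern instance only needs to preserve the order of the pattern, so that other words may occur in between.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Candidate = Tuple[str, float]  # (label, confidence); scores are illustrative


@dataclass(frozen=True)
class MSC:
    """Hypothetical container: a word with candidate PoS tags and senses."""
    word: str
    pos_candidates: Tuple[Candidate, ...]
    sense_candidates: Tuple[Candidate, ...]

    def top_pos(self) -> Optional[str]:
        # Most probable morphosyntactic class, or None if there is no candidate
        if not self.pos_candidates:
            return None
        return max(self.pos_candidates, key=lambda c: c[1])[0]

    def top_sense(self) -> Optional[str]:
        # Most probable sense, or None if there is no candidate
        if not self.sense_candidates:
            return None
        return max(self.sense_candidates, key=lambda c: c[1])[0]


Pattern = Sequence[Tuple[str, str]]  # sequence of (PoS, sense) pairs


def find_instance(mscs: Sequence[MSC], pattern: Pattern) -> Optional[List[int]]:
    """Return the positions of an in-order match of the pattern, or None."""
    positions: List[int] = []
    start = 0
    for pos_tag, sense in pattern:
        for i in range(start, len(mscs)):
            if mscs[i].top_pos() == pos_tag and mscs[i].top_sense() == sense:
                positions.append(i)
                start = i + 1
                break
        else:
            return None  # this pattern position has no matching MSC
    return positions


# Hypothetical annotated tweet: "drink a really cold beer"
tweet = [
    MSC("drink", (("Verb", 0.9), ("Noun", 0.1)), (("Ingest", 0.8),)),
    MSC("really", (("Adv", 0.95),), (("Intensity", 0.6),)),
    MSC("cold", (("Adj", 0.9),), (("Cold", 0.7), ("Illness", 0.2))),
    MSC("beer", (("Noun", 0.99),), (("Beer", 0.9),)),
]
pattern = [("Verb", "Ingest"), ("Adj", "Cold"), ("Noun", "Beer")]
print(find_instance(tweet, pattern))
```

Scanning left to right and taking the first MSC that fits each pattern position is enough to decide whether an in-order match exists; a fuller treatment would also need to handle ties between candidates and several instances per text.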

First, we provide formal definitions for MSC and $MSC^{+}$ patterns, and an unsupervised algorithm to mine these patterns. Then, we show that these patterns are frequent in tweets. Finally, we exploit these linguistic patterns to perform correct WSI/WSD in cases for which current tools fail, leading to better precision and recall of semantic annotations.

Table 2 provides an example of an $MSC^{+}$ pattern arising in short texts (#1 to #5) extracted from Twitter.⁴ Each one of these tweets (one per line) has an instance of the $MSC^{+}$ pattern, that is, a sequence of distinct MSCs (i.e. words) with respective candidate morphosyntactic classes and senses matching the sequence: $MSC_{1} \langle Verb, Ingest \rangle$, $MSC_{2} \langle Adj, Any \rangle$, $MSC_{3} \langle Noun, Beer \rangle$. Each instance of this pattern is a sequence of 3 MSCs (columns MSC₁, MSC₂ and MSC₃) highlighted in bold in the respective tweet (line). These sequences are similar in terms of the PoS tags and meanings of their respective MSCs. MSC₁ is a verb referring to liquid ingestion (sense = Ingest). MSC₂ is an adjective that can refer to cold, extremely cold or hair color, among other possibilities (sense = Any). MSC₃ is a noun, but current automatic approaches correctly disambiguate its sense only for tweets #1 and #2, to beer and beer brand, respectively.

Target words (whose sense is yet to be resolved) in the other tweets are highlighted in red. Thanks to the $MSC^{+}$ pattern established by the MSC sequences of the first two tweets, our method can also disambiguate them. Notice the partial adherence of the MSC sequence in #3 to this pattern. Each MSC in #3 matches the respective one in the pattern, except the word polar ($MSC_{3}$), whose PoS class is also a noun, but whose sense is initially undetermined. This partial adherence to the $MSC^{+}$ pattern induces the sense of $MSC_{3}$ from the previous tweets on the word polar in tweet #3, and allows it to be disambiguated to beer.
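The partial-adherence reasoning above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method evaluated in the paper: the names `Token`, `item_matches` and `induce_senses` are hypothetical, each word is reduced to its single most probable PoS tag and sense (with `None` standing for an unresolved sense), confidence scores are ignored, and the sense `Any` is treated as a wildcard.

```python
from typing import List, Optional, Sequence, Tuple

# (word, most probable PoS, most probable sense or None when unresolved)
Token = Tuple[str, str, Optional[str]]
# (PoS, sense); the sense "Any" accepts every sense
PatternItem = Tuple[str, str]


def item_matches(token: Token, item: PatternItem) -> bool:
    """A token fits a pattern position if the PoS agrees and the sense is
    equal, covered by the wildcard, or still unresolved (partial adherence)."""
    _, pos, sense = token
    p_pos, p_sense = item
    if pos != p_pos:
        return False
    return p_sense == "Any" or sense == p_sense or sense is None


def induce_senses(tokens: Sequence[Token],
                  pattern: Sequence[PatternItem]) -> List[Token]:
    """Copy the tokens, filling unresolved senses from a partially matched pattern."""
    matched: List[Tuple[int, PatternItem]] = []
    start = 0
    for item in pattern:
        for i in range(start, len(tokens)):
            if item_matches(tokens[i], item):
                matched.append((i, item))
                start = i + 1
                break
        else:
            return list(tokens)  # the pattern does not apply; nothing is induced
    result = list(tokens)
    for i, (_, p_sense) in matched:
        word, pos, sense = result[i]
        if sense is None and p_sense != "Any":
            result[i] = (word, pos, p_sense)  # sense induced by the pattern
    return result


pattern = [("Verb", "Ingest"), ("Adj", "Any"), ("Noun", "Beer")]
# Tweet (iii): "I wish to drink an ice-cold polar" (annotations are illustrative)
tweet_iii = [("drink", "Verb", "Ingest"),
             ("ice-cold", "Adj", "Cold"),
             ("polar", "Noun", None)]
print(induce_senses(tweet_iii, pattern))
```

In this sketch a single unresolved word is enough to trigger induction; a practical system would probably also bound how many positions may be unresolved and weigh the confidence of the annotations, which this example leaves out.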

Analogously, one can also induce the sense of the word Devassa and disambiguate it to beer in tweets #4 and #5, as well as in other short texts (from authors with similar language habits) in which this word appears with the same $MSC^{+}$ pattern or a similar one (e.g. the tweet “Blond, brunette or red-headed Devassa ???”). We employ this rationale to automatically mine $MSC^{+}$ patterns and use them for WSI and WSD when there is a partial match with some mined pattern involving a word that current automatic approaches have difficulty disambiguating.

Some approaches for WSI/WSD learn and employ lexical patterns [13]. Nevertheless, $MSC^{+}$ patterns are an alternative for some situations in which current approaches fail, as exemplified above. These patterns can be used to improve semantic annotations produced by a variety of annotation methods [14], [15]. The improved semantic annotations can then help to boost a variety of computational tasks and applications, ranging from text simplification [16], summarization [17] and knowledge base enrichment [1], [10], [11], [18], [19] to event detection [20], sentiment analysis [21], and question answering [22].

The main contributions of this paper can be stated as follows:

1. the introduction and formal definition of MSC, MSC sequences, $MSC^{+}$ patterns and $MSC^{+}$ pattern instances;
2. an approach for word sense induction and disambiguation based on $MSC^{+}$ patterns;
3. an algorithm to find the most frequent $MSC^{+}$ patterns in a set of documents that have been previously annotated with candidate morphosyntactic classes and candidate senses of semantically relevant words;
4. two methods for word sense induction and disambiguation based on $MSC^{+}$ patterns.
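To make the idea of finding the most frequent patterns in annotated documents more concrete, the following Python sketch counts, for each sequence of $\langle PoS, sense \rangle$ pairs, the fraction of documents in which it occurs and keeps those above a minimum support. It is a deliberately naive stand-in, not the mining algorithm proposed in the paper: the function `mine_frequent_patterns`, its parameters and the toy documents are hypothetical, and it only considers contiguous n-grams, whereas the pattern instances described in the paper need not be adjacent.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

Pair = Tuple[str, str]           # (most probable PoS, most probable sense)
PatternKey = Tuple[Pair, ...]    # a candidate pattern of fixed length n


def mine_frequent_patterns(docs: Sequence[Sequence[Pair]],
                           n: int,
                           min_support: float) -> Dict[PatternKey, float]:
    """Return the n-grams of (PoS, sense) pairs whose document support
    (fraction of documents containing them) is at least min_support."""
    doc_counts: Counter = Counter()
    for doc in docs:
        # A set ensures each pattern is counted at most once per document
        seen = {tuple(doc[i:i + n]) for i in range(len(doc) - n + 1)}
        doc_counts.update(seen)
    total = len(docs)
    return {pattern: count / total
            for pattern, count in doc_counts.items()
            if count / total >= min_support}


# Toy annotated documents (labels are illustrative)
docs = [
    [("Verb", "Ingest"), ("Adj", "Cold"), ("Noun", "Beer")],
    [("Pron", "Person"), ("Verb", "Ingest"), ("Adj", "Cold"), ("Noun", "Beer")],
    [("Verb", "Arrive"), ("Adj", "Polar"), ("Noun", "Region")],
]
frequent = mine_frequent_patterns(docs, n=3, min_support=0.5)
for pattern, support in frequent.items():
    print(pattern, round(support, 2))
```

Counting each pattern once per document keeps the support value interpretable as a share of the collection; extending the sketch to gapped sequences would call for a sequential pattern mining technique rather than plain n-gram counting.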

The results of experiments reveal major characteristics of $MSC^{+}$ patterns mined in a set of tweets, and the contribution of word sense induction and disambiguation based on the mined patterns to improve precision and recall of semantic annotations produced by state-of-the-art systems. They also show that the variation of our method for word sense induction and disambiguation relying on both PoS tagging and word sense confidence leads to the best results.

The remainder of this paper is organized as follows. Section 2 provides some foundations necessary for understanding our proposal. Section 3 discusses related works. Section 4 formally defines key concepts of our approach such as MSC and $MSC^{+}$ patterns. Section 5 describes our approach in a top-down fashion, i.e., first in terms of main stages and then details of key tasks. Section 6 and Section 7 report the experiments for performance evaluation and discuss their results. Finally, Section 8 presents conclusions and indications of future work.

Section snippets

Preliminaries

This section provides an overview of methods for capturing word senses, the problem tackled in this paper. It also reviews sequence pattern mining and how it can be used to find textual patterns that are less elaborate than $MSC^{+}$ patterns, which helps to understand our proposal and its distinguishing traits.

Related work

Many WSD/WSI approaches are based on unsupervised clustering techniques. Here, we consider four approaches according to the existing literature: (i) clustering based on contextual information, (ii) approaches based on word information, (iii) graph-based techniques, and (iv) probabilistic models.

The standard approach is the first one, in which the contexts of word instances are represented in a vector space model (e.g., bag-of-words) of first or second order. The context vectors are obtained by

Morpho-semantic components

Some words in text can have their meaning induced and disambiguated by exploiting morpho-semantic patterns. A morpho-semantic pattern is characterized by a sequence of Morpho-semantic Components (MSC) with length based on n-gram. An MSC refers to a word in a text document $d$ and its associated sets of candidate morphosyntactic classes (e.g. from some PoS tag set) and senses (e.g. from some sense class set), as stated in Definition 1.

Definition 1 Morpho-Semantic Component — MSC

Given a text document $d$ with identifier $id_{d}$, a set of

The proposed approach

This section describes our approach for coping with challenging instances of the WSI and WSD problems in sequences of texts such as social media posts. Fig. 1 provides an overview of the proposed process, which is composed of five stages: (i) Pre-processing, (ii) Semantic and morphosyntactic annotation, (iii) MSC extraction, (iv) Matching MSCs & Mining $MSC^{+}$ patterns, and (v) WSI/WSD based on $MSC^{+}$ patterns. In this figure, the continuous lines indicate data flow and the dotted lines linking

Experiments

The experiments aim to investigate the prevalence of $MSC^{+}$ patterns in tweets, their support by perfectly matching $MSC^{+}$ sequences in these tweets, and the gains obtained by doing WSI/WSD based on the mined patterns on words whose sense is left unresolved by typical annotation tools. This section describes the infrastructure, including off-the-shelf tools, the parameters, the dataset, and the evaluation metrics used in these experiments. It also presents a characterization of the mined patterns and their

WSI/WSD results

Table 8 presents the WSI/WSD performance for FreeLing, FreeLing&Babelfy, and some variations of our proposal based on $MSC^{+}$ patterns (below the dashed line). Table 9 and Table 10, in turn, detail the gains obtained by applying variations of our approach with support 2% and 5%, respectively. Each table shows the total number of words whose sense has been handled by the respective approach (#Word), the number of matchings in terms of just surface name (Mention) or surface name and sense

Conclusions and future work

This paper presented an approach for WSI/WSD that exploits language patterns composed of sequences of morpho-semantic components (MSC), which are quite frequent in some short text documents such as tweets. We presented an algorithm to mine these patterns, which we call $MSC^{+}$ patterns, from sets of text documents. It relies on annotations of morphosyntactic classes and senses of relevant words. These annotations can be produced by a variety of alternative off-the-shelf tools for NLP. Our $MSC^{+}$ pattern

Acknowledgments

This work was conducted during a doctorate partially supported by grants of CAPES (Brazilian Coordination of Superior Level Staff Improvement), a research support agency from the Ministry of Education of Brazil. CAPES also supported an internship for international cooperation with the TALN (Natural Language Processing Research Group) at the Pompeu Fabra University in Barcelona, Spain. The last author acknowledges support from the Spanish Government under the María de Maeztu Units of Excellence

References (53)

Camacho-Collados J. et al., Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities, Artificial Intelligence (2016)
Bontcheva K. et al., Extracting information from social media with GATE
Fileto R. et al., The baquara 2 knowledge-based framework for semantic enrichment and analysis of movement data, Data Knowl. Eng. (2015)
Goularte F.B. et al., A text summarization method based on fuzzy rules and applicable to automated assessment, Expert Syst. Appl. (2019)
Ruiz-Casado M. et al., Automatising the learning of lexical patterns: An application to the enrichment of wordnet by extracting semantic relationships from wikipedia, Data Knowl. Eng. (2007)
Gutiérrez Y. et al., Spreading semantic information by word sense disambiguation, Knowl.-Based Syst. (2017)
Véronis J., Hyperlex: Lexical cartography for information retrieval, Comput. Speech Lang. (2004)
Albano L. et al., Word sense induction with multilingual features representation
D. Alagić, J. Šnajder, S. Padó, Leveraging lexical substitutes for unsupervised word sense induction, in: Thirty-Second...
Navigli R. et al., Semeval-2013 task 11: Word sense induction and disambiguation within an end-user application
Schütze H., Automatic word sense discrimination, Comput. linguist. (1998)
Navigli R., A quick tour of word sense disambiguation, induction and related approaches
Zhang W. et al., Entity linking leveraging: Automatically generated annotation
Liu X. et al., Joint inference of named entity recognition and normalization for tweets
Fellbaum C., WordNet (1998)
Navigli R. et al., BabelNet: Building a very large multilingual semantic network
Mahdisoltani F. et al., Yago3: A knowledge base from multilingual wikipedias
Liu P.-Y. et al., Pengyuan@ PKU: Extracting infrequent sense instance with the same n-gram pattern for the semeval-2010 task 15
Moro A. et al., Entity linking meets word sense disambiguation: A unified approach, Trans. Assoc. Comput. Linguist. (2014)
Saggion H. et al., Making it simplext: Implementation and evaluation of a text simplification system for spanish, ACM Trans. Access. Comput. (2015)
A. Júnior J.G. et al., An approach for building lexical-semantic resources based on heterogeneous information sources
Xia C. et al., What is new in our city? a framework for event extraction using social media posts
Dragoni M., Computational advertising in social networks: An opinion mining-based approach
Al-Harbi O. et al., Lexical disambiguation in natural language questions (NLQs) (2017)
Jurafsky D. et al., Speech and Language Processing (2018)