The latent words language model

https://doi.org/10.1016/j.csl.2012.04.001

Abstract

We present a new generative model of natural language, the latent words language model. This model uses, for every word in a text, a latent variable that represents synonyms or related words of that word in the given context. We develop novel methods to train this model and to find the expected value of these latent variables for a given unseen text. The learned word similarities help to reduce the sparseness problems of traditional n-gram language models. We show that the model significantly outperforms interpolated Kneser–Ney smoothing and class-based language models on three different corpora. Furthermore, the latent variables are useful features for information extraction. We show that both for semantic role labeling and word sense disambiguation, the performance of a supervised classifier increases when incorporating these variables as extra features. This improvement is especially large when using only a small annotated corpus for training.

Highlights

► We propose a novel generative model for learning synonyms and semantically related words from texts.
► The model improves word sense disambiguation.
► The model reduces the need for supervision in information extraction tasks.

Introduction

Since the very beginning of the field of natural language processing, researchers have tried to create models of natural language that offer a computer program a means of generating, manipulating and classifying a given text. Early models were constructed by hand and were limited to a small domain (Winograd, 1972). It was soon found that these models were too rigid for natural language, which contains a lot of variation that is not easily captured in fixed rules (Winston, 1976). With the introduction of large, hand-annotated corpora (e.g. the Penn treebank (Marcus et al., 1994)), most researchers turned to shallow methods for the classification of natural language. Examples of these include support vector machines (Cortes and Vapnik, 1995), hidden Markov models (Baum et al., 1970), maximum entropy classifiers (Berger et al., 1996) and others, which have been successfully used for many tasks, including named entity recognition (Sang and De Meulder, 2003), part-of-speech labeling (Brants, 2000) and phrase chunking (Kudo and Matsumoto, 2001). This approach has two limitations, however: for every new task a new corpus needs to be manually annotated, which is very costly in man-hours, and features that capture the correlations between the input data and the output data need to be crafted by hand.

Ideally we would like to develop models that have some deeper understanding of natural language. Large improvements in computational speed and storage size, together with the availability of enormous unlabeled corpora, have sparked an interest in unsupervised models that learn structures or statistics from unlabeled texts. Examples of these are topic models (Blei et al., 2003, Griffiths et al., 2005), unsupervised part-of-speech taggers (Goldwater and Griffiths, 2007) and deep neural networks (Collobert and Weston, 2008). Research has shown that the statistics or structures learned by these models can be successfully used in many supervised tasks, such as document classification (Blei et al., 2003), part-of-speech tagging, word segmentation in Chinese (Li and McCallum, 2005), dependency parsing (Koo et al., 2008) and named entity recognition (Miller et al., 2004). However, until now no unsupervised generative model had been proposed that learns accurate, fine-grained semantic similarities between words.

In this article we develop the latent words language model (LWLM), a novel statistical model that learns context-dependent hidden (or latent) words from unlabeled texts. The hidden word associated with a given word in a specific textual context is a variable in a Bayesian network that generates the observed word. This variable holds a probability distribution over all words in the vocabulary of the corpus on which the model is trained, indicating for each vocabulary word the probability that it occurs in the considered context. Words that have a high probability of replacing the original word in the given context are often synonyms of, or at least have a strong semantic relationship with, the original word. Table 1 shows the actual output of the model on an example sentence. This model has the following attractive properties: (1) it is trained on unlabeled texts, which are abundantly available in electronic format for most languages and do not require any manual annotation, (2) it does not require any knowledge of the syntactic or semantic structure of the text, and as such is completely language independent, and (3) it learns fine-grained semantic classes (i.e. the hidden words) that capture semantic and syntactic similarities.
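To make this generative story concrete, the sketch below simulates it under simplifying assumptions: a hidden word depends only on the previous hidden word, and each observed word is emitted from its hidden word. The toy tables p_context and p_emit and the tiny vocabulary are hypothetical placeholders, not the parameters learned by the actual model.

    import random

    # Toy, hypothetical parameters; the real LWLM estimates these from a large
    # unlabeled corpus.  p_context[h_prev] is a distribution over the hidden
    # word at the current position, p_emit[h] a distribution over observed
    # words given the hidden word h.
    p_context = {
        "<s>":     {"company": 0.5, "firm": 0.3, "bank": 0.2},
        "company": {"said": 0.6, "stated": 0.4},
        "firm":    {"said": 0.7, "stated": 0.3},
        "bank":    {"said": 0.5, "stated": 0.5},
    }
    p_emit = {
        "company": {"company": 0.7, "firm": 0.2, "bank": 0.1},
        "firm":    {"firm": 0.6, "company": 0.4},
        "bank":    {"bank": 0.8, "company": 0.2},
        "said":    {"said": 0.8, "stated": 0.2},
        "stated":  {"stated": 0.7, "said": 0.3},
    }

    def sample(dist):
        """Draw one key from a {value: probability} dictionary."""
        r, acc = random.random(), 0.0
        for value, p in dist.items():
            acc += p
            if r <= acc:
                return value
        return value  # guard against floating-point underflow

    def generate(length=2):
        """For each position, first draw a hidden word given the previous
        hidden word, then emit the observed word from the hidden word."""
        hidden, observed, prev = [], [], "<s>"
        for _ in range(length):
            h = sample(p_context.get(prev, p_context["<s>"]))
            w = sample(p_emit[h])
            hidden.append(h)
            observed.append(w)
            prev = h
        return hidden, observed

    if __name__ == "__main__":
        h, w = generate()
        print("hidden:  ", h)
        print("observed:", w)

Running the script prints one sampled hidden-word sequence next to the observed words it emitted, which is exactly the relationship the model inverts during inference.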

The learned hidden words can be used to overcome the sparseness problem fundamental to natural language processing. We show this in three applications. In the first application we incorporate the hidden words in a language model and show that it outperforms Kneser–Ney and cluster language models on three different corpora. We then incorporate the hidden words into a semantic role labeling system and a word sense disambiguation system. For both systems this extra knowledge improves the accuracy significantly. The resulting word sense disambiguation system outperforms all other published systems on the corpus of the Senseval-3 workshop. Furthermore, the hidden words are included using a simple method (i.e. as extra features), which generalizes to virtually all feature-based natural language processing systems.
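To illustrate how simply the hidden words can be added as extra features, the sketch below (a hypothetical interface, not the authors' implementation) appends the k most probable hidden words at a token position to an otherwise standard feature vector; hidden_word_distribution stands in for the trained model's posterior at that position.

    from typing import Dict, List

    def hidden_word_distribution(tokens: List[str], i: int) -> Dict[str, float]:
        """Hypothetical stand-in for the LWLM posterior over hidden words at
        position i; a real system would query the trained model here."""
        return {"firm": 0.4, "company": 0.3, "bank": 0.1}

    def token_features(tokens: List[str], i: int, k: int = 3) -> List[str]:
        """Standard lexical features plus the k most probable hidden words,
        encoded as string features for a feature-based classifier."""
        feats = [
            "word=" + tokens[i],
            "prev=" + (tokens[i - 1] if i > 0 else "<s>"),
            "next=" + (tokens[i + 1] if i + 1 < len(tokens) else "</s>"),
        ]
        posterior = hidden_word_distribution(tokens, i)
        top_k = sorted(posterior.items(), key=lambda kv: -kv[1])[:k]
        feats += ["hidden%d=%s" % (rank, h) for rank, (h, _) in enumerate(top_k, 1)]
        return feats

    print(token_features(["the", "enterprise", "announced", "profits"], 1))

Because the hidden words enter the system only as additional string features, the same recipe carries over to any feature-based classifier, which is the generality argued for above.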

We fully describe the LWLM in Section 2. The LWLM is heavily influenced by existing work on n-gram language models, which we first introduce in Section 2.1. We then develop methods for training the LWLM and for performing inference with it, and compare it to existing language models. In Section 3 we show that the latent variables learned by this model can be successfully employed in supervised information extraction tasks, which is demonstrated with a semantic role labeling classifier and a word sense disambiguation classifier. Finally we discuss related work in Section 4 and conclude this article in Section 5.

Section snippets

Introduction to language models

Language models are statistical models that assign a probability to every sequence of words $w = [w_1 \ldots w_{N_t}]$, ideally reflecting the probability that this sequence is used by a human user of the language. These models have been used in a wide range of applications, such as speech recognition (Jelinek et al., 1975), machine translation (Brown et al., 1990), spelling correction (Kernighan et al., 1990) and handwriting recognition (Srihari and Baltus, 1992). The most successful class of language models are n-gram models …
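For reference, n-gram language models factorize the sequence probability with the chain rule and then apply a Markov approximation of order n−1; in LaTeX notation:

    P(w_1, \ldots, w_{N_t}) = \prod_{i=1}^{N_t} P(w_i \mid w_1, \ldots, w_{i-1})
                            \approx \prod_{i=1}^{N_t} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})

The conditional probabilities are estimated from corpus counts and smoothed, for example with interpolated Kneser–Ney, to assign non-zero probability to n-grams that never occur in the training data.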

Using the LWLM in natural language processing

In the previous section we have shown that the hidden words in the LWLM capture semantically and syntactically similar words and that these hidden words depend on the context, essentially disambiguating the word. We have also seen how these hidden words reduce the sparseness problems of traditional n-gram models. In this section we turn to a second use of the latent words: as an extra representation of a text that can be used in natural language processing (NLP) tasks. Most supervised NLP methods face …

Related work

In this article we have proposed the LWLM and shown how it (1) improves language modeling and (2) learns latent classes of words that help in information extraction tasks such as word sense disambiguation and semantic role labeling. We relate these contributions to existing research in the following sections.

Conclusions and future work

In this article we have introduced the latent words language model to learn fine-grained semantic similarities in an unsupervised fashion. This model uses a hidden variable at every position to represent the words that could occur at that position, given the context and given the observed word. We developed two methods to compute the expected value of these variables, the forward–forward beam search and a method based on Gibbs sampling, and we developed a novel method for training that …
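As an indication of what the Gibbs-sampling side of inference looks like, the sketch below resamples a single hidden word given its neighbours in a deliberately simplified variant where a hidden word depends only on the previous hidden word; the probability tables p_trans and p_emit are hypothetical placeholders, and the forward–forward beam search is not reproduced here.

    import random
    from typing import Dict, List

    def resample_hidden(i: int, hidden: List[str], words: List[str],
                        candidates: List[str],
                        p_trans: Dict[str, Dict[str, float]],
                        p_emit: Dict[str, Dict[str, float]]) -> str:
        """One Gibbs step for the hidden word at position i in a simplified,
        bigram-context variant: the conditional is proportional to
        P(h_i | h_{i-1}) * P(h_{i+1} | h_i) * P(w_i | h_i)."""
        left = hidden[i - 1] if i > 0 else "<s>"
        right = hidden[i + 1] if i + 1 < len(hidden) else None

        scores = []
        for h in candidates:
            score = p_trans.get(left, {}).get(h, 1e-9)
            if right is not None:
                score *= p_trans.get(h, {}).get(right, 1e-9)
            score *= p_emit.get(h, {}).get(words[i], 1e-9)
            scores.append(score)

        # Normalise and draw a new hidden word from the conditional.
        r, acc = random.random() * sum(scores), 0.0
        for h, s in zip(candidates, scores):
            acc += s
            if r <= acc:
                return h
        return candidates[-1]

Sweeping this update over all positions for several iterations and averaging the sampled values gives an estimate of the expected hidden word at each position.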

Dissemination

The implementation of the latent words language model, together with the trained models, can be found at http://www.cs.kuleuven.be/groups/liir/software.php. On that page one can also browse the list of automatically learned synonyms and related words.

Acknowledgments

The presented research was supported by the EU-FP6-IST project CLASS (Cognitive-Level Annotation using Latent Statistical Structure, IST-027978) and by the IWT-SBO project AMASS++ (Advanced Multimedia Alignment and Structured Summarization, IWT 060051).

References (80)

  • H. Ney et al. On structuring probabilistic dependences in stochastic language modelling. Computer Speech and Language (1994).
  • P. Winston. The psychology of computer vision. Pattern Recognition (1976).
  • R. Ando et al. A framework for learning predictive structures from multiple tasks and unlabeled data. The Journal of Machine Learning Research (2005).
  • G. Andrew et al. Scalable training of L1-regularized log-linear models.
  • L.R. Bahl et al. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (1983).
  • J. Baker. The DRAGON system – an overview. IEEE Transactions on Acoustics, Speech and Signal Processing (1975).
  • L. Baum et al. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics (1970).
  • H. Bay et al. SURF: speeded up robust features.
  • Y. Bengio et al. A neural probabilistic language model. Journal of Machine Learning Research (2003).
  • A. Berger et al. A maximum entropy approach to natural language processing. Computational Linguistics (1996).
  • C.M. Bishop. Pattern Recognition and Machine Learning (2006).
  • D.M. Blei et al. Latent Dirichlet allocation. Journal of Machine Learning Research (2003).
  • T. Brants et al. Large language models in machine translation.
  • T. Brants. TnT – a statistical part-of-speech tagger.
  • P. Brown et al. A statistical approach to machine translation. Computational Linguistics (1990).
  • P. Brown et al. Class-based n-gram models of natural language. Computational Linguistics (1992).
  • E. Charniak. Immediate-head parsing for language models.
  • S. Chen et al. An empirical study of smoothing techniques for language modeling.
  • M. Ciaramita et al. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger.
  • R. Collobert et al. A unified architecture for natural language processing: deep neural networks with multitask learning.
  • C. Cortes et al. Support-vector networks. Machine Learning (1995).
  • K. Deschacht et al. Semi-supervised semantic role labeling using the Latent Words Language Model.
  • C. Fellbaum. WordNet: An Electronic Lexical Database (1998).
  • C.J. Fillmore. The case for case.
  • D. Gildea et al. Automatic labeling of semantic roles. Computational Linguistics (2002).
  • S. Goldwater et al. A fully Bayesian approach to unsupervised part-of-speech tagging. Proceedings of the Annual Meeting of the Association for Computational Linguistics (2007).
  • J.T. Goodman. A bit of progress in language modeling, extended version. Technical report, Microsoft... (2001).
  • G. Grefenstette. Explorations in Automatic Thesaurus Discovery (1994).
  • T. Griffiths et al. Integrating topics and syntax. Advances in Neural Information Processing Systems (2005).
  • R. Grishman et al. Generalizing automatically generated selectional patterns.
  • J. Gruber. Studies in Lexical Relations (1970).
  • Z.S. Harris. Distributional structure. Word (1954).
  • M. Hearst. Automatic acquisition of hyponyms from large text corpora.
  • R. Iyer et al. Transforming out-of-domain estimates to improve in-domain language models.
  • F. Jelinek et al. Design of a linguistic statistical decoder for the recognition of continuous speech. IEEE Transactions on Information Theory (1975).
  • R. Johansson et al. Dependency-based syntactic–semantic analysis with PropBank and NomBank.
  • M. Johnson. Why doesn't EM find good HMM POS-taggers?
  • D. Jurafsky et al. Speech and Language Processing (2008).
  • M. Kernighan et al. A spelling correction program based on a noisy channel model.
  • A. Kilgarriff. Senseval: an exercise in evaluating word sense disambiguation programs.


This paper has been recommended for acceptance by Edward J. Briscoe.
