The latent words language model

https://doi.org/10.1016/j.csl.2012.04.001

Abstract

We present a new generative model of natural language, the latent words language model. This model uses, for every word in a text, a latent variable that represents synonyms or related words of that word in the given context. We develop novel methods to train this model and to find the expected value of these latent variables for a given unseen text. The learned word similarities help to reduce the sparseness problems of traditional n-gram language models. We show that the model significantly outperforms interpolated Kneser–Ney smoothing and class-based language models on three different corpora. Furthermore, the latent variables are useful features for information extraction. We show that both for semantic role labeling and word sense disambiguation, the performance of a supervised classifier increases when incorporating these variables as extra features. This improvement is especially large when using only a small annotated corpus for training.

Highlights

► We propose a novel generative model for learning synonyms and semantically related words from texts.
► The model improves word sense disambiguation.
► The model reduces the need for supervision in information extraction tasks.

Introduction

Since the very beginning of the field of natural language processing, researchers have tried to create models of natural language that offer a computer program a means of generating, manipulating and classifying a given text. Early models were constructed by hand and were limited to a small domain (Winograd, 1972). It was soon found that these models were too rigid for natural language, which contains a lot of variation that is not easily captured in fixed rules (Winston, 1976). With the introduction of large, hand-annotated corpora (e.g. the Penn treebank (Marcus et al., 1994)), most researchers turned to shallow methods for the classification of natural language. Examples of these include support vector machines (Cortes and Vapnik, 1995), hidden Markov models (Baum et al., 1970), maximum entropy classifiers (Berger et al., 1996) and others, which have been successfully used for many tasks, including named entity recognition (Sang and De Meulder, 2003), part-of-speech labeling (Brants, 2000) and phrase chunking (Kudo and Matsumoto, 2001). This approach has two limitations, however: for every new task a new corpus needs to be manually annotated, which is very costly in man-hours, and features that capture the correlations between the input data and the output data need to be crafted by hand.

Ideally we would like to develop models that have some deeper understanding of natural language. Large improvements in computational speed and storage size, together with the availability of enormous unlabeled corpora, have sparked an interest in unsupervised models that learn structures or statistics from unlabeled texts. Examples of these are topic models (Blei et al., 2003, Griffiths et al., 2005), unsupervised part-of-speech taggers (Goldwater and Griffiths, 2007) and deep neural networks (Collobert and Weston, 2008). Research has shown that the statistics or structures learned by these models can be successfully used in many supervised tasks, such as document classification (Blei et al., 2003), part-of-speech tagging, word segmentation in Chinese (Li and McCallum, 2005), dependency parsing (Koo et al., 2008) and named entity recognition (Miller et al., 2004). However, until now no unsupervised generative model had been proposed that learns accurate, fine-grained semantic similarities between words.

In this article we develop the latent words language model (LWLM), a novel statistical model that learns context-dependent hidden (or latent) words from unlabeled texts. The hidden word associated with a given word in a specific textual context is a variable in a Bayesian network that generates the observed word. This variable holds a probability distribution over all words in the vocabulary of the corpus on which the model is trained, indicating for each vocabulary word the probability that it occurs in the considered context. Words that have a high probability of replacing the original word in the given context are often synonyms of, or at least have a strong semantic relationship with, the original word. Table 1 shows the actual output of the model on an example sentence. This model has the following attractive properties: (1) it is trained on unlabeled texts, which are abundantly available in electronic format for most languages and do not require any manual annotation, (2) it does not require any knowledge of the syntactic or semantic structure of the text, and as such is completely language independent, and (3) it learns fine-grained semantic classes (i.e. the hidden words) that capture semantic and syntactic similarities.
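To make this generative story concrete, the sketch below simulates it under simplifying assumptions: a hidden word depends only on the previous hidden word, and each observed word is emitted from its hidden word. The toy tables p_context and p_emit and the tiny vocabulary are hypothetical placeholders, not the parameters learned by the actual model.

    import random

    # Toy, hypothetical parameters; the real LWLM estimates these from a large
    # unlabeled corpus.  p_context[h_prev] is a distribution over the hidden
    # word at the current position, p_emit[h] a distribution over observed
    # words given the hidden word h.
    p_context = {
        "<s>":     {"company": 0.5, "firm": 0.3, "bank": 0.2},
        "company": {"said": 0.6, "stated": 0.4},
        "firm":    {"said": 0.7, "stated": 0.3},
        "bank":    {"said": 0.5, "stated": 0.5},
    }
    p_emit = {
        "company": {"company": 0.7, "firm": 0.2, "bank": 0.1},
        "firm":    {"firm": 0.6, "company": 0.4},
        "bank":    {"bank": 0.8, "company": 0.2},
        "said":    {"said": 0.8, "stated": 0.2},
        "stated":  {"stated": 0.7, "said": 0.3},
    }

    def sample(dist):
        """Draw one key from a {value: probability} dictionary."""
        r, acc = random.random(), 0.0
        for value, p in dist.items():
            acc += p
            if r <= acc:
                return value
        return value  # guard against floating-point underflow

    def generate(length=2):
        """For each position, first draw a hidden word given the previous
        hidden word, then emit the observed word from the hidden word."""
        hidden, observed, prev = [], [], "<s>"
        for _ in range(length):
            h = sample(p_context.get(prev, p_context["<s>"]))
            w = sample(p_emit[h])
            hidden.append(h)
            observed.append(w)
            prev = h
        return hidden, observed

    if __name__ == "__main__":
        h, w = generate()
        print("hidden:  ", h)
        print("observed:", w)

Running the script prints one sampled hidden-word sequence next to the observed words it emitted, which is exactly the relationship the model inverts during inference.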

The learned hidden words can be used to overcome the sparseness problem fundamental to natural language processing. We show this in three applications. In the first application we incorporate the hidden words in a language model and show that it outperforms Kneser–Ney and cluster language models on three different corpora. We then incorporate the hidden words into a semantic role labeling system and a word sense disambiguation system. For both systems this extra knowledge improves the accuracy significantly. The resulting word sense disambiguation system outperforms all other published systems on the corpus of the Senseval-3 workshop. Furthermore, the hidden words are included using a simple method (i.e. as extra features), which generalizes to virtually all feature-based natural language processing systems.
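To illustrate how simply the hidden words can be added as extra features, the sketch below (a hypothetical interface, not the authors' implementation) appends the k most probable hidden words at a token position to an otherwise standard feature vector; hidden_word_distribution stands in for the trained model's posterior at that position.

    from typing import Dict, List

    def hidden_word_distribution(tokens: List[str], i: int) -> Dict[str, float]:
        """Hypothetical stand-in for the LWLM posterior over hidden words at
        position i; a real system would query the trained model here."""
        return {"firm": 0.4, "company": 0.3, "bank": 0.1}

    def token_features(tokens: List[str], i: int, k: int = 3) -> List[str]:
        """Standard lexical features plus the k most probable hidden words,
        encoded as string features for a feature-based classifier."""
        feats = [
            "word=" + tokens[i],
            "prev=" + (tokens[i - 1] if i > 0 else "<s>"),
            "next=" + (tokens[i + 1] if i + 1 < len(tokens) else "</s>"),
        ]
        posterior = hidden_word_distribution(tokens, i)
        top_k = sorted(posterior.items(), key=lambda kv: -kv[1])[:k]
        feats += ["hidden%d=%s" % (rank, h) for rank, (h, _) in enumerate(top_k, 1)]
        return feats

    print(token_features(["the", "enterprise", "announced", "profits"], 1))

Because the hidden words enter the system only as additional string features, the same recipe carries over to any feature-based classifier, which is the generality argued for above.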

We fully describe the LWLM in Section 2. The LWLM is heavily influenced by existing work on n-gram language models, which we first introduce in Section 2.1. We then develop methods for training the LWLM and for performing inference with it, and compare it to existing language models. In Section 3 we show that the latent variables learned by this model can be successfully employed in supervised information extraction tasks, which is demonstrated with a semantic role labeling classifier and a word sense disambiguation classifier. Finally we discuss related work in Section 4 and conclude this article in Section 5.

Section snippets

Introduction to language models

Language models are statistical models that assign a probability to every sequence of words $w = [w_1 \ldots w_{N_t}]$, ideally reflecting the probability that this sequence is used by a human user of the language. These models have been used in a wide range of applications, such as speech recognition (Jelinek et al., 1975), machine translation (Brown et al., 1990), spelling correction (Kernighan et al., 1990) and handwriting recognition (Srihari and Baltus, 1992). The most successful class of language models are n-gram models …
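For reference, n-gram language models factorize the sequence probability with the chain rule and then apply a Markov approximation of order n−1; in LaTeX notation:

    P(w_1, \ldots, w_{N_t}) = \prod_{i=1}^{N_t} P(w_i \mid w_1, \ldots, w_{i-1})
                            \approx \prod_{i=1}^{N_t} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})

The conditional probabilities are estimated from corpus counts and smoothed, for example with interpolated Kneser–Ney, to assign non-zero probability to n-grams that never occur in the training data.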

Using the LWLM in natural language processing

In the previous section we have shown that the hidden words in the LWLM capture semantically and syntactically similar words and that these hidden words depend on the context, essentially disambiguating the word. We have also seen how these hidden words reduce the sparseness problems of traditional n-gram models. In this section we turn to a second use of the latent words: as an extra representation of a text that can be used in natural language processing (NLP) tasks. Most supervised NLP methods face …

Related work

In this article we have proposed the LWLM and shown how it (1) improves language modeling and (2) learns latent classes of words that help in information extraction tasks such as word sense disambiguation and semantic role labeling. We relate these contributions to existing research in the following sections.

Conclusions and future work

In this article we have introduced the latent words language model to learn fine-grained semantic similarities in an unsupervised fashion. This model uses a hidden variable at every position to represent the words that could occur at that position, given the context and given the observed word. We developed two methods to compute the expected value of these variables, the forward–forward beam search and a method based on Gibbs sampling, and we developed a novel method for training that …
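As an indication of what the Gibbs-sampling side of inference looks like, the sketch below resamples a single hidden word given its neighbours in a deliberately simplified variant where a hidden word depends only on the previous hidden word; the probability tables p_trans and p_emit are hypothetical placeholders, and the forward–forward beam search is not reproduced here.

    import random
    from typing import Dict, List

    def resample_hidden(i: int, hidden: List[str], words: List[str],
                        candidates: List[str],
                        p_trans: Dict[str, Dict[str, float]],
                        p_emit: Dict[str, Dict[str, float]]) -> str:
        """One Gibbs step for the hidden word at position i in a simplified,
        bigram-context variant: the conditional is proportional to
        P(h_i | h_{i-1}) * P(h_{i+1} | h_i) * P(w_i | h_i)."""
        left = hidden[i - 1] if i > 0 else "<s>"
        right = hidden[i + 1] if i + 1 < len(hidden) else None

        scores = []
        for h in candidates:
            score = p_trans.get(left, {}).get(h, 1e-9)
            if right is not None:
                score *= p_trans.get(h, {}).get(right, 1e-9)
            score *= p_emit.get(h, {}).get(words[i], 1e-9)
            scores.append(score)

        # Normalise and draw a new hidden word from the conditional.
        r, acc = random.random() * sum(scores), 0.0
        for h, s in zip(candidates, scores):
            acc += s
            if r <= acc:
                return h
        return candidates[-1]

Sweeping this update over all positions for several iterations and averaging the sampled values gives an estimate of the expected hidden word at each position.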

Dissemination

The implementation of the latent words language model, together with the trained models, can be found at http://www.cs.kuleuven.be/groups/liir/software.php. On that page one can also browse the list of automatically learned synonyms and related words.

Acknowledgments

The presented research was supported by the EU-FP6-IST project CLASS (Cognitive-Level Annotation using Latent Statistical Structure, IST-027978) and by the IWT-SBO project AMASS++ (Advanced Multimedia Alignment and Structured Summarization, IWT 060051).

References (80)

  • H. Ney et al. On structuring probabilistic dependences in stochastic language modelling. Computer Speech and Language (1994).
  • P. Winston. The psychology of computer vision. Pattern Recognition (1976).
  • R. Ando et al. A framework for learning predictive structures from multiple tasks and unlabeled data. The Journal of Machine Learning Research (2005).
  • G. Andrew et al. Scalable training of L1-regularized log-linear models.
  • L.R. Bahl et al. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (1983).
  • J. Baker. The DRAGON system – an overview. IEEE Transactions on Acoustics, Speech and Signal Processing (1975).
  • L. Baum et al. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics (1970).
  • H. Bay et al. SURF: speeded up robust features.
  • Y. Bengio et al. A neural probabilistic language model. Journal of Machine Learning Research (2003).
  • A. Berger et al. A maximum entropy approach to natural language processing. Computational Linguistics (1996).
  • C.M. Bishop. Pattern Recognition and Machine Learning (2006).
  • D.M. Blei et al. Latent Dirichlet allocation. Journal of Machine Learning Research (2003).
  • T. Brants et al. Large language models in machine translation.
  • T. Brants. TnT – a statistical part-of-speech tagger.
  • P. Brown et al. A statistical approach to machine translation. Computational Linguistics (1990).
  • P. Brown et al. Class-based n-gram models of natural language. Computational Linguistics (1992).
  • E. Charniak. Immediate-head parsing for language models.
  • S. Chen et al. An empirical study of smoothing techniques for language modeling.
  • M. Ciaramita et al. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger.
  • R. Collobert et al. A unified architecture for natural language processing: deep neural networks with multitask learning.
  • C. Cortes et al. Support-vector networks. Machine Learning (1995).
  • K. Deschacht et al. Semi-supervised semantic role labeling using the Latent Words Language Model.
  • C. Fellbaum. WordNet: An Electronic Lexical Database (1998).
  • C.J. Fillmore. The case for case.
  • D. Gildea et al. Automatic labeling of semantic roles. Computational Linguistics (2002).
  • S. Goldwater et al. A fully Bayesian approach to unsupervised part-of-speech tagging. Proceedings of the Annual Meeting of the Association for Computational Linguistics (2007).
  • J.T. Goodman. A bit of progress in language modeling, extended version. Technical report, Microsoft... (2001).
  • G. Grefenstette. Explorations in Automatic Thesaurus Discovery (1994).
  • T. Griffiths et al. Integrating topics and syntax. Advances in Neural Information Processing Systems (2005).
  • R. Grishman et al. Generalizing automatically generated selectional patterns.
  • J. Gruber. Studies in Lexical Relations (1970).
  • Z.S. Harris. Distributional structure. Word (1954).
  • M. Hearst. Automatic acquisition of hyponyms from large text corpora.
  • R. Iyer et al. Transforming out-of-domain estimates to improve in-domain language models.
  • F. Jelinek et al. Design of a linguistic statistical decoder for the recognition of continuous speech. IEEE Transactions on Information Theory (1975).
  • R. Johansson et al. Dependency-based syntactic–semantic analysis with PropBank and NomBank.
  • M. Johnson. Why doesn't EM find good HMM POS-taggers?
  • D. Jurafsky et al. Speech and Language Processing (2008).
  • M. Kernighan et al. A spelling correction program based on a noisy channel model.
  • A. Kilgarriff. Senseval: an exercise in evaluating word sense disambiguation programs.


This paper has been recommended for acceptance by Edward J. Briscoe.
