ABSTRACT
In (Chen, 2009), we show that for a variety of language models belonging to the exponential family, the test set cross-entropy of a model can be accurately predicted from its training set cross-entropy and its parameter values. In this work, we show how this relationship can be used to motivate two heuristics for "shrinking" the size of a language model to improve its performance. We use the first heuristic to develop a novel class-based language model that outperforms a baseline word trigram model by 28% in perplexity and 1.9% absolute in speech recognition word-error rate on Wall Street Journal data. We use the second heuristic to motivate a regularized version of minimum discrimination information models and show that this method outperforms other techniques for domain adaptation.
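The prediction rule underlying this abstract can be sketched concretely. The Python fragment below is a minimal illustration, assuming the form reported in Chen (2008; 2009): predicted test-set cross-entropy equals training-set cross-entropy plus a penalty proportional to the sum of the magnitudes of the model's parameters, scaled by the training set size. The function name and the constant `gamma` are assumptions for illustration, not code or values taken from this paper; the cross-entropy units must match those used when the constant was fit.

```python
# Minimal sketch of the performance-prediction relationship assumed from
# Chen (2008; 2009). Both the functional form and the constant gamma are
# assumptions here, not the paper's own code.

def predict_test_cross_entropy(train_cross_entropy, lambdas,
                               num_train_events, gamma=0.938):
    """Predict test-set cross-entropy for an exponential language model.

    train_cross_entropy -- cross-entropy of the model on its training set
    lambdas             -- iterable of the model's parameter values
    num_train_events    -- number of events (words) in the training set
    gamma               -- empirical constant (value assumed here)
    """
    l1_norm = sum(abs(lam) for lam in lambdas)  # sum of parameter magnitudes
    return train_cross_entropy + gamma * l1_norm / num_train_events

# Toy usage: 4 parameters of magnitude 0.5 trained on 100 events adds a
# predicted penalty of 0.938 * 2.0 / 100 ~= 0.019 to the training value.
print(predict_test_cross_entropy(7.0, [0.5] * 4, 100))
```

Under a relationship of this shape, the "shrinking" heuristics in the abstract amount to reducing the parameter-magnitude term without raising training cross-entropy by as much, so that the predicted (and actual) test performance improves.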
REFERENCES
- Michiel Bacchiani, Michael Riley, Brian Roark, and Richard Sproat. 2006. MAP adaptation of stochastic grammars. Computer Speech and Language, 20(1):41--68.
- Jerome R. Bellegarda. 2004. Statistical language model adaptation: review and perspectives. Speech Communication, 42(1):93--108.
- Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jennifer C. Lai, and Robert L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467--479, December.
- Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University.
- Stanley F. Chen. 2008. Performance prediction for exponential language models. Technical Report RC 24671, IBM Research Division, October.
- Stanley F. Chen. 2009. Performance prediction for exponential language models. In Proc. of HLT-NAACL.
- Chuang-Hua Chueh and Jen-Tzung Chien. 2008. Reliable feature selection for language model adaptation. In Proc. of ICASSP, pp. 5089--5092.
- Stephen Della Pietra, Vincent Della Pietra, Robert L. Mercer, and Salim Roukos. 1992. Adaptive language modeling using minimum discriminant estimation. In Proc. of the Speech and Natural Language DARPA Workshop, February.
- Marcello Federico. 1996. Bayesian estimation methods for n-gram language model adaptation. In Proc. of ICSLP, pp. 240--243.
- Marcello Federico. 1999. Efficient language model adaptation through MDI estimation. In Proc. of Eurospeech, pp. 1583--1586.
- Joshua T. Goodman. 2001. A bit of progress in language modeling. Technical Report MSR-TR-2001-72, Microsoft Research.
- Rukmini Iyer, Mari Ostendorf, and Herbert Gish. 1997. Using out-of-domain data to improve in-domain language models. IEEE Signal Processing Letters, 4(8):221--223, August.
- Frederick Jelinek, Bernard Merialdo, Salim Roukos, and Martin Strauss. 1991. A dynamic language model for speech recognition. In Proc. of the DARPA Workshop on Speech and Natural Language, pp. 293--295, Morristown, NJ, USA.
- Slava M. Katz. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3):400--401, March.
- Jun'ichi Kazama and Jun'ichi Tsujii. 2003. Evaluation and extension of maximum entropy models with inequality constraints. In Proc. of EMNLP, pp. 137--144.
- Dietrich Klakow. 1998. Log-linear interpolation of language models. In Proc. of ICSLP.
- Reinhard Kneser, Jochen Peters, and Dietrich Klakow. 1997. Language model adaptation using dynamic marginals. In Proc. of Eurospeech.
- Hirokazu Masataki, Yoshinori Sagisaka, Kazuya Hisaki, and Tatsuya Kawahara. 1997. Task adaptation using MAP estimation in n-gram language modeling. In Proc. of ICASSP, volume 2, pp. 783--786, Washington, DC, USA. IEEE Computer Society.
- Douglas B. Paul and Janet M. Baker. 1992. The design for the Wall Street Journal-based CSR corpus. In Proc. of the DARPA Speech and Natural Language Workshop, pp. 357--362, February.
- P. Srinivasa Rao, Michael D. Monkowski, and Salim Roukos. 1995. Language model adaptation via minimum discrimination information. In Proc. of ICASSP, volume 1, pp. 161--164.
- P. Srinivasa Rao, Satya Dharanipragada, and Salim Roukos. 1997. MDI adaptation of language models across corpora. In Proc. of Eurospeech, pp. 1979--1982.
- George Saon, Daniel Povey, and Geoffrey Zweig. 2005. Anatomy of an extremely fast LVCSR decoder. In Proc. of Interspeech, pp. 549--552.
- Hagen Soltau, Brian Kingsbury, Lidia Mangu, Daniel Povey, George Saon, and Geoffrey Zweig. 2005. The IBM 2004 conversational telephony system for rich transcription. In Proc. of ICASSP, pp. 205--208.
- Andreas Stolcke. 1998. Entropy-based pruning of backoff language models. In Proc. of the DARPA Broadcast News Transcription and Understanding Workshop, pp. 270--274, Lansdowne, VA, February.
- Wen Wang and Mary P. Harper. 2002. The Super-ARV language model: Investigating the effectiveness of tightly integrating multiple knowledge sources. In Proc. of EMNLP, pp. 238--247.
- Wen Wang, Yang Liu, and Mary P. Harper. 2002. Rescoring effectiveness of language models using different levels of knowledge and their integration. In Proc. of ICASSP, pp. 785--788.
- Wen Wang, Andreas Stolcke, and Mary P. Harper. 2004. The use of a linguistically motivated language model in conversational speech recognition. In Proc. of ICASSP, pp. 261--264.
- Hirofumi Yamamoto and Yoshinori Sagisaka. 1999. Multi-class composite n-gram based on connection direction. In Proc. of ICASSP, pp. 533--536.
- Hirofumi Yamamoto, Shuntaro Isogai, and Yoshinori Sagisaka. 2003. Multi-class composite n-gram language model. Speech Communication, 41(2--3):369--379.