
Shrinking exponential language models

Published: 31 May 2009

ABSTRACT

In (Chen, 2009), we show that for a variety of language models belonging to the exponential family, the test set cross-entropy of a model can be accurately predicted from its training set cross-entropy and its parameter values. In this work, we show how this relationship can be used to motivate two heuristics for "shrinking" the size of a language model to improve its performance. We use the first heuristic to develop a novel class-based language model that outperforms a baseline word trigram model by 28% in perplexity and 1.9% absolute in speech recognition word-error rate on Wall Street Journal data. We use the second heuristic to motivate a regularized version of minimum discrimination information models and show that this method outperforms other techniques for domain adaptation.
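The performance-prediction relationship referred to above is not stated in the abstract itself. As best recalled from Chen (2009), it predicts a model's test set cross-entropy as its training set cross-entropy plus a penalty proportional to the summed magnitudes of its parameters. The following minimal Python sketch only illustrates that general form; the constant gamma (roughly 0.94), the function name, and the argument names are assumptions for illustration, not code or values taken from the paper.

    # Hedged sketch of the performance-prediction relationship (Chen, 2009), assuming
    # the form  H_test ~ H_train + (gamma / D) * sum_i |lambda_i|,  where D is the
    # number of training events and gamma is an empirical constant (roughly 0.94 as
    # recalled from that work). Names and values here are illustrative only.
    def predicted_test_cross_entropy(train_cross_entropy, lambdas, num_training_events, gamma=0.94):
        """Predict test cross-entropy from training cross-entropy and parameter magnitudes."""
        size_penalty = gamma * sum(abs(l) for l in lambdas) / num_training_events
        return train_cross_entropy + size_penalty

    # Example: for a fixed training cross-entropy, shrinking the parameter magnitudes
    # lowers the predicted test cross-entropy, which motivates "shrinking" the model.
    print(predicted_test_cross_entropy(7.5, [0.3, -1.2, 0.8], num_training_events=1000))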

References

  1. Michiel Bacchiani, Michael Riley, Brian Roark, and Richard Sproat. 2006. MAP adaptation of stochastic grammars. Computer Speech and Language, 20(1):41--68.
  2. Jerome R. Bellegarda. 2004. Statistical language model adaptation: review and perspectives. Speech Communication, 42(1):93--108.
  3. Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jennifer C. Lai, and Robert L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467--479, December.
  4. Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University.
  5. Stanley F. Chen. 2008. Performance prediction for exponential language models. Technical Report RC 24671, IBM Research Division, October.
  6. Stanley F. Chen. 2009. Performance prediction for exponential language models. In Proc. of HLT-NAACL.
  7. Chuang-Hua Chueh and Jen-Tzung Chien. 2008. Reliable feature selection for language model adaptation. In Proc. of ICASSP, pp. 5089--5092.
  8. Stephen Della Pietra, Vincent Della Pietra, Robert L. Mercer, and Salim Roukos. 1992. Adaptive language modeling using minimum discriminant estimation. In Proc. of the Speech and Natural Language DARPA Workshop, February.
  9. Marcello Federico. 1996. Bayesian estimation methods for n-gram language model adaptation. In Proc. of ICSLP, pp. 240--243.
  10. Marcello Federico. 1999. Efficient language model adaptation through MDI estimation. In Proc. of Eurospeech, pp. 1583--1586.
  11. Joshua T. Goodman. 2001. A bit of progress in language modeling. Technical Report MSR-TR-2001-72, Microsoft Research.
  12. Rukmini Iyer, Mari Ostendorf, and Herbert Gish. 1997. Using out-of-domain data to improve in-domain language models. IEEE Signal Processing Letters, 4(8):221--223, August.
  13. Frederick Jelinek, Bernard Merialdo, Salim Roukos, and Martin Strauss. 1991. A dynamic language model for speech recognition. In Proc. of the DARPA Workshop on Speech and Natural Language, pp. 293--295, Morristown, NJ, USA.
  14. Slava M. Katz. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3):400--401, March.
  15. Jun'ichi Kazama and Jun'ichi Tsujii. 2003. Evaluation and extension of maximum entropy models with inequality constraints. In Proc. of EMNLP, pp. 137--144.
  16. Dietrich Klakow. 1998. Log-linear interpolation of language models. In Proc. of ICSLP.
  17. Reinhard Kneser, Jochen Peters, and Dietrich Klakow. 1997. Language model adaptation using dynamic marginals. In Proc. of Eurospeech.
  18. Hirokazu Masataki, Yoshinori Sagisaka, Kazuya Hisaki, and Tatsuya Kawahara. 1997. Task adaptation using MAP estimation in n-gram language modeling. In Proc. of ICASSP, volume 2, pp. 783--786, Washington, DC, USA. IEEE Computer Society.
  19. Douglas B. Paul and Janet M. Baker. 1992. The design for the Wall Street Journal-based CSR corpus. In Proc. of the DARPA Speech and Natural Language Workshop, pp. 357--362, February.
  20. P. Srinivasa Rao, Michael D. Monkowski, and Salim Roukos. 1995. Language model adaptation via minimum discrimination information. In Proc. of ICASSP, volume 1, pp. 161--164.
  21. P. Srinivasa Rao, Satya Dharanipragada, and Salim Roukos. 1997. MDI adaptation of language models across corpora. In Proc. of Eurospeech, pp. 1979--1982.
  22. George Saon, Daniel Povey, and Geoffrey Zweig. 2005. Anatomy of an extremely fast LVCSR decoder. In Proc. of Interspeech, pp. 549--552.
  23. Hagen Soltau, Brian Kingsbury, Lidia Mangu, Daniel Povey, George Saon, and Geoffrey Zweig. 2005. The IBM 2004 conversational telephony system for rich transcription. In Proc. of ICASSP, pp. 205--208.
  24. Andreas Stolcke. 1998. Entropy-based pruning of backoff language models. In Proc. of the DARPA Broadcast News Transcription and Understanding Workshop, pp. 270--274, Lansdowne, VA, February.
  25. Wen Wang and Mary P. Harper. 2002. The Super-ARV language model: Investigating the effectiveness of tightly integrating multiple knowledge sources. In Proc. of EMNLP, pp. 238--247.
  26. Wen Wang, Yang Liu, and Mary P. Harper. 2002. Rescoring effectiveness of language models using different levels of knowledge and their integration. In Proc. of ICASSP, pp. 785--788.
  27. Wen Wang, Andreas Stolcke, and Mary P. Harper. 2004. The use of a linguistically motivated language model in conversational speech recognition. In Proc. of ICASSP, pp. 261--264.
  28. Hirofumi Yamamoto and Yoshinori Sagisaka. 1999. Multi-class composite n-gram based on connection direction. In Proc. of ICASSP, pp. 533--536.
  29. Hirofumi Yamamoto, Shuntaro Isogai, and Yoshinori Sagisaka. 2003. Multi-class composite n-gram language model. Speech Communication, 41(2--3):369--379.

      • Published in

        NAACL '09: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
        May 2009
        716 pages
        ISBN: 9781932432411

        Publisher

        Association for Computational Linguistics

        United States

        Publication History

        • Published: 31 May 2009

        Qualifiers

        • research-article

        Acceptance Rates

        Overall acceptance rate: 21 of 29 submissions, 72%
