DOI: 10.3115/1075096.1075102

Generalized algorithms for constructing statistical language models

Published: 7 July 2003

ABSTRACT

Recent text and speech processing applications such as speech mining raise new and more general problems related to the construction of language models. We present and describe in detail several new and efficient algorithms to address these more general problems and report experimental results demonstrating their usefulness. We give an algorithm for computing efficiently the expected counts of any sequence in a word lattice output by a speech recognizer or any arbitrary weighted automaton; describe a new technique for creating exact representations of n-gram language models by weighted automata whose size is practical for offline use even for a vocabulary size of about 500,000 words and an n-gram order n = 6; and present a simple and more general technique for constructing class-based language models that allows each class to represent an arbitrary weighted automaton. An efficient implementation of our algorithms and techniques has been incorporated in a general software library for language modeling, the GRM Library, that includes many other text and grammar processing functionalities.
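To make the notion of an expected count concrete, here is a minimal Python sketch. It assumes the expected count of a sequence is the sum, over all paths through the lattice, of the path probability times the number of occurrences of the sequence on that path. The toy lattice, the arc format, and the names `ARCS`, `paths`, and `expected_count` are hypothetical and chosen only for illustration. The sketch enumerates paths by brute force, which is only feasible for tiny acyclic lattices; it is not the efficient algorithm the abstract refers to, and it does not use the GRM Library.

```python
from collections import defaultdict

# Hypothetical toy lattice: each arc is (source, destination, word, probability).
# Probabilities leaving each state sum to 1, so path probabilities sum to 1.
ARCS = [
    (0, 1, "a", 0.6),
    (0, 1, "b", 0.4),
    (1, 2, "a", 1.0),
    (2, 3, "a", 0.5),
    (2, 3, "b", 0.5),
]
START, FINAL = 0, 3


def paths(state, final, arcs_from):
    """Yield (word tuple, probability) for every path from state to final."""
    if state == final:
        yield (), 1.0
        return
    for dst, word, prob in arcs_from[state]:
        for words, p in paths(dst, final, arcs_from):
            yield (word,) + words, prob * p


def expected_count(arcs, start, final, ngram):
    """Sum over paths of P(path) * (occurrences of ngram on that path)."""
    arcs_from = defaultdict(list)
    for src, dst, word, prob in arcs:
        arcs_from[src].append((dst, word, prob))
    ngram = tuple(ngram)
    n = len(ngram)
    total = 0.0
    for words, p in paths(start, final, arcs_from):
        # Count (possibly overlapping) occurrences of the n-gram on this path.
        occurrences = sum(
            1 for i in range(len(words) - n + 1) if words[i:i + n] == ngram
        )
        total += p * occurrences
    return total
```

For this lattice, `expected_count(ARCS, START, FINAL, ("a", "a"))` is about 1.1: the four paths "a a a", "a a b", "b a a", and "b a b" have probabilities 0.3, 0.3, 0.2, and 0.2 and contain the bigram 2, 1, 1, and 0 times respectively.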

Published in: ACL '03: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, July 2003, 571 pages

Publisher: Association for Computational Linguistics, United States

Overall acceptance rate: 85 of 443 submissions, 19%
