Generalized algorithms for constructing statistical language models

Authors:
Cyril Allauzen, Mehryar Mohri, Brian Roark
AT&T Labs - Research, Florham Park, NJ

ACL '03: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
July 2003, Pages 40–47
https://doi.org/10.3115/1075096.1075102

Published: 07 July 2003

ABSTRACT

Recent text and speech processing applications such as speech mining raise new and more general problems related to the construction of language models. We present and describe in detail several new and efficient algorithms to address these more general problems and report experimental results demonstrating their usefulness. We give an algorithm for computing efficiently the expected counts of any sequence in a word lattice output by a speech recognizer or any arbitrary weighted automaton; describe a new technique for creating exact representations of n-gram language models by weighted automata whose size is practical for offline use even for a vocabulary size of about 500,000 words and an n-gram order n = 6; and present a simple and more general technique for constructing class-based language models that allows each class to represent an arbitrary weighted automaton. An efficient implementation of our algorithms and techniques has been incorporated in a general software library for language modeling, the GRM Library, that includes many other text and grammar processing functionalities.
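
To make the notion of an expected count concrete, here is a minimal sketch in Python. It is not the algorithm of the paper, which operates on weighted automata; it simply enumerates every path of a tiny acyclic lattice by brute force. The lattice `ARCS`, its probabilities, and the function names are hypothetical and chosen only for illustration. The expected count of a word sequence is the sum, over all paths, of the path probability times the number of times the sequence occurs on that path.

```python
from collections import defaultdict

# Hypothetical toy lattice (an assumption, not data from the paper):
# ARCS[state] = [(word, probability, next_state), ...]
ARCS = {
    0: [("a", 0.6, 1), ("b", 0.4, 1)],
    1: [("a", 0.5, 2), ("b", 0.5, 2)],
    2: [],
}
START, FINAL = 0, 2


def paths(state=START, words=(), prob=1.0):
    """Yield (word_tuple, probability) for every path from START to FINAL."""
    if state == FINAL:
        yield words, prob
        return
    for word, p, nxt in ARCS[state]:
        yield from paths(nxt, words + (word,), prob * p)


def expected_counts(order):
    """Expected counts of all word sequences of length 1..order.

    Brute-force path enumeration: exponential in general, so only
    suitable for illustrating the definition on a very small lattice.
    """
    counts = defaultdict(float)
    for words, prob in paths():
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                # Each occurrence on this path contributes the path probability.
                counts[words[i:i + n]] += prob
    return dict(counts)


if __name__ == "__main__":
    for ngram, value in sorted(expected_counts(2).items()):
        print(" ".join(ngram), round(value, 3))
```

Enumerating paths is only feasible for very small lattices, which is why an efficient algorithm working directly on the automaton matters for real recognizer output.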

Published in
ACL '03: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
July 2003, 571 pages
Program Chairs: Erhard W. Hinrichs, Dan Roth
Publisher: Association for Computational Linguistics, United States
Overall Acceptance Rate: 85 of 443 submissions, 19%