doi:10.1016/j.csl.2005.10.001
Copyright © 2005 Elsevier Ltd All rights reserved.
Morphology-based language modeling for conversational Arabic speech recognition
a Department of Electrical Engineering, University of Washington, Box 352500, Seattle, WA 98195-2500, USA
b SRI International, Menlo Park, CA 94720, USA
Received 26 May 2004;
revised 19 March 2005;
accepted 5 October 2005.
Available online 2 November 2005.
Abstract
Language modeling for large-vocabulary conversational Arabic speech recognition is faced with the problem of the complex morphology of Arabic, which increases the perplexity and out-of-vocabulary rate. This problem is compounded by the enormous dialectal variability and differences between spoken and written language. In this paper, we investigate improvements in Arabic language modeling by developing various morphology-based language models. We present four different approaches to morphology-based language modeling, including a novel technique called factored language models. Experimental results are presented for both rescoring and first-pass recognition experiments.
Fig. 1. Morphological structure for taskuniyna (you (f.sg.) live).
Fig. 2. Vocabulary growth rates for CallHome data in English and ECA.
Fig. 3. Vocabulary growth rates for stemmed CallHome data in English and ECA.
Fig. 4. Backoff path in a standard word-based language model.
Fig. 5. Backoff graph for a factored language model (4-gram).
Fig. 6. FLM bigram for CH ECA. The model attempts to predict the current word based on the previous two words, the previous morph class and the previous stem. When this combination is not found, the most distant word is dropped, followed by the word at the previous time step. When backing off from the combination of previous morph and stems, both estimates obtained by just using either the previous morph or the previous stem are computed, and the larger of them is utilized. The final backoff node is always the word unigram probability.
Fig. 7. Gene activates graph grammar production rules (a); generation of backoff graph by activated rules 1, 3, 4 (b).
Fig. A.1. Backoff graph of the Arabic bigram FLM used for rescoring and first-pass recognition. The characters W, P, S, R, M represent the previous word, pattern, stem, root, and morphological class factors, respectively. The top node thus stands for the probability distribution $P(W_i \mid W_{i-1}, P_{i-1}, S_{i-1}, R_{i-1}, M_{i-1})$. At each of the lower-level nodes, one of the conditioning factors is dropped. Multiple paths entering one node indicate the weighted mean combination of the corresponding probability estimates. Note that although we have more than one conditioning variable, we retain the term “bigram” for a model of this type, to indicate that only factors pertaining to the preceding word are required. This allows us in principle to use the model in a bigram decoding framework.
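The backoff order described in the Fig. 6 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the class name `BackoffSketch`, the factor keys `w1`, `w2`, `m1`, `s1`, and the toy tokens are all hypothetical, and the estimates are unsmoothed relative frequencies. An actual FLM would use discounted estimates and backoff weights so that the result is a normalized distribution, which this sketch does not attempt.

```python
from collections import Counter

# Hypothetical factor keys: w1/w2 = word at t-1/t-2,
# m1 = morph class at t-1, s1 = stem at t-1.
# Linear part of the path: drop the most distant word, then the previous word.
LINEAR_PATH = [("w1", "w2", "m1", "s1"), ("w1", "m1", "s1"), ("m1", "s1")]
# Parallel part: morph-only and stem-only estimates; the larger one is used.
PARALLEL = [("m1",), ("s1",)]


class BackoffSketch:
    """Unsmoothed illustration of the backoff order (assumption: no discounting)."""

    def __init__(self):
        self.joint = Counter()    # (factor names, context values, word) -> count
        self.context = Counter()  # (factor names, context values) -> count
        self.unigram = Counter()  # word -> count
        self.total = 0

    def observe(self, word, parents):
        """Record one training event: a word and the factor values preceding it."""
        for names in LINEAR_PATH + PARALLEL:
            ctx = tuple(parents[n] for n in names)
            self.joint[(names, ctx, word)] += 1
            self.context[(names, ctx)] += 1
        self.unigram[word] += 1
        self.total += 1

    def _estimate(self, names, word, parents):
        """Relative frequency for one context, or None if the event was never seen."""
        ctx = tuple(parents[n] for n in names)
        hits = self.joint[(names, ctx, word)]
        return hits / self.context[(names, ctx)] if hits else None

    def score(self, word, parents):
        # 1. Walk the linear path until some context has seen this word.
        for names in LINEAR_PATH:
            p = self._estimate(names, word, parents)
            if p is not None:
                return p
        # 2. Compute the morph-only and stem-only estimates and keep the larger.
        found = [self._estimate(names, word, parents) for names in PARALLEL]
        found = [p for p in found if p is not None]
        if found:
            return max(found)
        # 3. Final backoff node: the word unigram.
        return self.unigram[word] / self.total if self.total else 0.0


# Toy usage with made-up tokens.
model = BackoffSketch()
model.observe("c", {"w1": "b", "w2": "a", "m1": "M1", "s1": "S1"})
model.observe("d", {"w1": "x", "w2": "y", "m1": "M1", "s1": "S2"})
print(model.score("c", {"w1": "b", "w2": "zz", "m1": "M1", "s1": "S1"}))
```

Keeping the linear path and the parallel nodes as plain lists of factor names makes it easy to try a different backoff graph by editing those two constants only.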
Table 1.
Linguistic differences between MSA and ECA

Table 2.
Words derived from the roots KTB (‘write’) and DRS (‘study’)

Root consonants are marked in boldface.
Table 3.
LDC ECA CH data sets

Table 4.
Perplexities (ppl) obtained by word-based models and n-gram token coverage rates on the CH development and evaluation sets

Trigram-I refers to a trigram trained on data processed with method I; trigram-II is a trigram trained on data processed by method II.
Table 5.
Affixes used for word decomposition

sg, singular; pl, plural; poss, possessive; dir, direct object; and ind, indirect object.
Table 6.
Number of unique lexical items and perplexities on the CH dev set obtained by particle-based language models

Table 7.
Perplexities of class-based models on the CH dev set. Classes are defined by stems, morph tags, roots, and patterns, respectively

Table 8.
Number of unique words and factors in the CH lexicon and perplexities of the corresponding trigram models on the CH dev set

Table 9.
Word error rates (%) on the CH dev set obtained by various combinations of the baseline language models with morphology-based language model scores

Table 10.
Word error rates (%) obtained by N-best list rescoring with morphology-based language models

The leftmost column refers to the different recognition passes as described in Section 5. Word error rates for the first pass are the same for both systems since the morphology-based language models are only used during rescoring.
Table 11.
Perplexity for word-based LMs and FLMs with parameters optimized using manual, random, and GA search

Table 12.
Bigram and trigram perplexities obtained by: the word-based baseline model (I), the FLM (II), the baseline model rescored with the FLM without adding additional n-grams (III), and with added n-grams (IV), on the different CH sets

Table 13.
Word error rates (%) obtained by using approximations of FLMs during the first recognition pass, in addition to morphological stream and class models during rescoring
