Abstract
This article describes a novel method for modeling the correlation among acoustic observations in contiguous speech segments. The basic idea is that acoustic observations are conditioned not only on the phonetic context but also on the acoustic observation of the preceding segment. The correlation between consecutive acoustic observations is modeled by mean trajectory polynomial segment models (PSMs). The method extends conventional segment modeling approaches in that it describes the correlation of acoustic observations between contiguous segments as well as inside segments. It also generalizes phonetic context (e.g., triphone) modeling approaches because it can model acoustic context and phonetic context at the same time. In a speaker-independent phoneme classification test, the proposed method achieved a 7 to 9% relative reduction in error rate compared with the traditional triphone segmental model system and a 31% reduction compared with a similar triphone hidden Markov model (HMM) system.
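To make the term "mean trajectory polynomial segment model" concrete, the following is a minimal sketch of the general idea only: the mean of a feature over a segment is described by a low-order polynomial in normalized time, and its coefficients are estimated by least squares. It assumes a single scalar feature per frame, time normalized to [0, 1], and one training segment; the function names (`fit_mean_trajectory`, `mean_trajectory`, `solve_linear`) are hypothetical and not taken from the article. It does not reproduce the article's conditioning on the preceding segment, whose details are not given in the abstract.

```python
def solve_linear(a, y):
    """Solve a x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    m = [row[:] + [y[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def normalized_times(n):
    """Map frame indices 0..n-1 onto [0, 1] (assumed normalization)."""
    return [i / (n - 1) for i in range(n)]


def fit_mean_trajectory(frames, order=2):
    """Least-squares polynomial coefficients b_0..b_order for one segment."""
    n = len(frames)
    if n < 2 or n <= order:
        raise ValueError("segment too short for the requested polynomial order")
    times = normalized_times(n)
    size = order + 1
    # Normal equations (Z^T Z) b = Z^T c, where Z[i][r] = t_i ** r.
    ztz = [[sum(t ** (i + j) for t in times) for j in range(size)]
           for i in range(size)]
    ztc = [sum((t ** i) * c for t, c in zip(times, frames))
           for i in range(size)]
    return solve_linear(ztz, ztc)


def mean_trajectory(coeffs, n):
    """Evaluate the polynomial mean at n normalized time points."""
    return [sum(b * t ** r for r, b in enumerate(coeffs))
            for t in normalized_times(n)]


if __name__ == "__main__":
    # Synthetic segment generated from 1 + 2t - 3t^2 (illustrative data only).
    segment = [1 + 2 * t - 3 * t * t for t in normalized_times(5)]
    print(fit_mean_trajectory(segment, order=2))
```

Because time is normalized, the same coefficients can be evaluated with `mean_trajectory` for segments of any length, which is what lets a single trajectory describe segments of varying duration.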
Cite this article
Szarvas, M., Matsunaga, S. Improving Phoneme Classification Performance Using Observation Context–Dependent Segment Models. International Journal of Speech Technology 3, 253–262 (2000). https://doi.org/10.1023/A:1026502830036
DOI: https://doi.org/10.1023/A:1026502830036