ABSTRACT
For the generation of highly natural synthetic speech, the control of prosody is of primary importance. The fundamental frequency (F0) is one of the most important components of speech prosody. This research investigates the variation of F0 in continuous Cantonese speech, with the goal of establishing an effective mechanism of prosody control in Cantonese text-to-speech (TTS) applications. Cantonese is a commonly used Chinese dialect that is well known for being rich in tones. This article describes a simple yet effective approach to the analysis and modeling of F0. The surface F0 contour of a continuous Cantonese utterance is considered to be the combination of a global component (the phrase-level intonation curve) and local components (the syllable-level tone contours). A novel method of F0 normalization is proposed to separate the local components from the global one. As a result, the variation in tone contours is greatly reduced. Statistical analysis is performed on the phrase curves and context-dependent tone contours extracted from a large corpus of 1,200 utterances. Specifically, the analysis focuses on co-articulated tone contours for disyllabic words, cross-word contours, and phrase-initial tone contours. Based on the results of the analysis, a template-based model for F0 generation is established and integrated with a Cantonese TTS system. Subjective listening tests show that the proposed model significantly improves the naturalness of the output speech.
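The global/local decomposition described above can be illustrated with a minimal sketch. The abstract does not specify the normalization procedure, so the code below makes its own assumptions: the phrase curve is a straight line fitted to per-syllable mean log-F0, and each tone contour is the residual after that line is subtracted. The function names (`fit_line`, `normalize_f0`, `generate_f0`) and the sample values are hypothetical and are not taken from the article.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b

def normalize_f0(syllables):
    """Split a phrase into a global line and local tone contours.

    syllables: list of (times, f0_hz) pairs, one pair per syllable.
    Returns ((a, b), residuals), both in the log-F0 domain.
    Assumption: the phrase curve is linear in log-F0 over time.
    """
    centers = [sum(t) / len(t) for t, _ in syllables]
    means = [sum(math.log(f) for f in f0) / len(f0) for _, f0 in syllables]
    a, b = fit_line(centers, means)
    residuals = [
        [math.log(f) - (a + b * x) for x, f in zip(t, f0)]
        for t, f0 in syllables
    ]
    return (a, b), residuals

def generate_f0(phrase, times, template):
    """Superpose a tone template (log-domain residual) on the phrase line."""
    a, b = phrase
    return [math.exp(a + b * x + r) for x, r in zip(times, template)]

# Hypothetical two-syllable phrase: sample times in seconds, F0 in Hz.
phrase_data = [
    ([0.0, 0.1, 0.2], [200.0, 190.0, 180.0]),
    ([0.3, 0.4, 0.5], [170.0, 175.0, 180.0]),
]
phrase_line, tone_contours = normalize_f0(phrase_data)
# Recombining the two components reproduces the first syllable's F0.
rebuilt = generate_f0(phrase_line, phrase_data[0][0], tone_contours[0])
```

In a template-based generator of the kind the abstract mentions, the residual contours would be replaced by stored templates selected by tone and context. How the article selects and stores those templates is not stated here, so the sketch leaves it out.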
REVIEW
"Peter C. Patton : Reviewer"
The fundamental frequency (F0) of human speech is the critical factor in creating synthetic speech with natural prosody, the temporal and rhythmic properties of human utterance that make speech sound natural rather than robotic. Mechanical techniq