Quantitative measurement of prosodic strength in Mandarin
Introduction
Intonation production is generally considered a two-step process: an accent or tone class is predicted from the available information, and the tone class is then used to generate f0 as a function of time. Historically, most attention has been paid to the first, high-level step of the process. Here we show that by focusing on f0 generation, one can build a model that starts with acoustic data and reaches far enough up to connect directly to linguistic factors such as part-of-speech, word length, and position in the text.
Specifically, we present a model of Mandarin Chinese intonation that makes quantitative f0 predictions in terms of the lexical tones and the prosodic strength of each word. The model generates tonal variations from a few tone templates corresponding to the lexical tones, and accurately reproduces f0 in continuous Mandarin speech with a 13 Hz RMS error. This result is comparable to that of machine-learning systems that may use more than one hundred tone templates to account for Mandarin tonal variations.
We find that some parameters of the model can be interpreted as the prosodic strength of a tone. We determine the prosodic strengths (and the values of the other global parameters) by a least-squares fit of the model to the time-series of f0 from a corpus of speech data. The resulting best-fit strengths, tone shapes, and metrical patterns of words can be associated with linguistic properties. We show that the strengths computed from the model exhibit the strong and weak alternation of metrical phonology (Liberman and Prince, 1977), and that their values are correlated with the part-of-speech of words, with mutual information, and with positions in the prosodic hierarchy (Ladd, 1996; Pierrehumbert and Beckman, 1988; Selkirk, 1984), such as the beginnings and endings of sentences, clauses, phrases, and words.
We will also show that values of parameters from a fit to one half of the corpus match equivalent parameters fit to the other half of the corpus. Further, we can change the details of the model, and show that the values of many parameters are essentially unaffected by the change. This consistency is important because if we hope to interpret these parameters (and thus the models that contain them) as statements about the language as a whole, they must at least be consistent across the corpus and between similar models.
The model we use is described in Section 3. It is written in Soft Template Mark-up Language (Stem-ML) (Kochanski and Shih, 2003; Kochanski and Shih, 2000), and depends upon its underlying mathematical model of prosody control. We write a Stem-ML model in terms of a set of tags (parameters), then find the parameter values that best reproduce f0 in a training corpus. Fitting the model to the data can be done automatically.
Stem-ML calculates an intonational contour from a set of tags. Some of the tags set global parameters that correspond to speaker characteristics, such as pitch range, while others represent intonational events such as lexical tone categories and accent types. The tags can contain adjustable parameters that can explain surface variations.
Stem-ML does not impose restrictions on how one defines tags. In our view, a meaningful approach is to use the tags to represent linguistic hypotheses such as Mandarin lexical tones or English accent types. We call tags that define tones or accents templates, because they define the ideal shapes of f0 in their vicinity. In this paper, our tone tags (tone templates) correspond directly to Mandarin lexical tone categories, and we interpret the Stem-ML strength parameters as the prosodic strengths of these tone templates. The actual realization of f0 depends on the templates, their neighbors, and the prosodic strengths. We show in this paper that this treatment successfully generates continuous tonal variations from lexical tones.
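The equations of Stem-ML are not reproduced here, but the core idea, that the realized f0 is a compromise between strength-weighted tone templates and a smoothness (articulatory effort) constraint, can be sketched in a toy form. The following is an illustrative quadratic trade-off of our own devising, not the actual Stem-ML dynamics; the function name `realize_f0`, the `smooth` weight, and all sample values are assumptions for the example:

```python
def realize_f0(targets, strengths, smooth=4.0):
    """Toy compromise between templates and smoothness (NOT actual Stem-ML).

    targets: per-sample template values (Hz), one flat list for the utterance.
    strengths: per-sample prosodic strength; larger = template followed closely.
    Minimizes  sum_t s_t*(f_t - g_t)^2 + smooth * sum_t (f_{t+1} - f_t)^2,
    whose normal equations form a tridiagonal linear system, solved here
    with the Thomas algorithm.
    """
    n = len(targets)
    a = [0.0] * n  # sub-diagonal
    b = [0.0] * n  # diagonal
    c = [0.0] * n  # super-diagonal
    d = [0.0] * n  # right-hand side
    for t in range(n):
        b[t] = strengths[t]
        d[t] = strengths[t] * targets[t]
        if t > 0:
            b[t] += smooth
            a[t] = -smooth
        if t < n - 1:
            b[t] += smooth
            c[t] = -smooth
    # Thomas algorithm: forward elimination, then back substitution.
    for t in range(1, n):
        m = a[t] / b[t - 1]
        b[t] -= m * c[t - 1]
        d[t] -= m * d[t - 1]
    f = [0.0] * n
    f[-1] = d[-1] / b[-1]
    for t in range(n - 2, -1, -1):
        f[t] = (d[t] - c[t] * f[t + 1]) / b[t]
    return f

# A strong syllable followed by a weak one: the weak syllable's f0 is
# dragged toward its strong neighbor, mimicking tonal reduction.
curve = realize_f0([200.0] * 5 + [120.0] * 5, [10.0] * 5 + [0.1] * 5)
```

In this toy version, lowering a syllable's strength lets smoothness dominate, so its surface f0 drifts toward the neighboring templates, which is qualitatively the behavior attributed to prosodic strength in the text.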
Described another way, a Stem-ML model is a function that produces a curve of f0 vs. time. The resulting curve depends on a set of adjustable (free) parameters that describe things like the shape of tones, how tones interact, and the prosodic strength of each syllable. When Stem-ML generates an f0 curve, these parameters can be set to any values, and each setting yields a different curve. Conversely, one can find the best values for the parameters via data-fitting procedures.
We use a least-squares fitting algorithm to find the values for the parameters that best describe the data. The algorithm operates iteratively by adjusting the parameter values, and accepting steps that reduce the sum of the squared differences between the model and the data. The values of the parameters that make the summed squared difference as small as possible, for a given model, are called the best-fit (or fitted) parameters.
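As a sketch of this kind of iterative fitting, here is a greedy coordinate search that accepts any parameter step reducing the summed squared error and shrinks the step size when no move helps. It is a much simpler stand-in for the Levenberg-Marquardt-style least squares actually used; `fit`, the toy linear model, and the synthetic data are invented for the example:

```python
def sse(params, model, times, data):
    """Sum of squared differences between model predictions and data."""
    return sum((model(t, params) - y) ** 2 for t, y in zip(times, data))

def fit(model, times, data, params, step=1.0, tol=1e-6):
    """Greedy coordinate search: nudge each parameter up/down by `step`,
    keep any move that lowers the summed squared error, and halve the
    step when a full sweep yields no improvement."""
    params = list(params)
    best = sse(params, model, times, data)
    while step > tol:
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = list(params)
                trial[i] += delta
                e = sse(trial, model, times, data)
                if e < best:
                    params, best, improved = trial, e, True
        if not improved:
            step *= 0.5
    return params, best

# Recover a declining f0 baseline a + b*t from exact synthetic data.
times = list(range(10))
data = [200.0 - 5.0 * t for t in times]
model = lambda t, p: p[0] + p[1] * t
best_params, err = fit(model, times, data, [150.0, 0.0], step=16.0)
```

On this noise-free toy problem the search recovers the generating parameters (a near 200, b near -5); real fits, as in the paper, stop at a residual error reflecting what the model cannot explain.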
Chinese tones
Tonal languages, such as Chinese, use variations in pitch to distinguish otherwise identical syllables. Mandarin Chinese has four lexical tones with distinctive shapes: high level (tone 1), rising (tone 2), low (tone 3), and high falling (tone 4). The syllable ma with a high level tone means mother, but it means horse with a low tone. Thus, in a text-to-speech (TTS) system, good pitch prediction is important not just for natural sounding speech but also for good intelligibility. There is a
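The four tone shapes named above can be caricatured as pitch templates over a normalized syllable. The shapes and frequencies below are invented for illustration and are not the paper's fitted templates:

```python
import math

# Coarse caricatures of the four Mandarin lexical tones over a
# normalized syllable x in [0, 1] (illustrative values, not fitted).
TONE_TEMPLATES = {
    1: lambda x: 220.0,                                  # tone 1: high level
    2: lambda x: 180.0 + 40.0 * x,                       # tone 2: rising
    3: lambda x: 170.0 - 30.0 * math.sin(math.pi * x),   # tone 3: low/dipping
    4: lambda x: 230.0 - 70.0 * x,                       # tone 4: high falling
}

def sample_tone(tone, n=5):
    """Sample a tone template at n evenly spaced points across the syllable."""
    return [TONE_TEMPLATES[tone](i / (n - 1)) for i in range(n)]
```

For example, `sample_tone(1)` is flat (ma "mother") while `sample_tone(3)` dips low in mid-syllable (ma "horse"), the minimal pair cited above.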
Modeling intonation
We build our model for Mandarin on top of Stem-ML (Kochanski and Shih, 2003) because it captures several desirable properties: the representation is understandable and adjustable, and it can be transported from one situation to another.
Unlike most engineering approaches, this model cleanly separates into local (word-dependent) and global (speaker-dependent) parameters. For instance, one can generate acceptable speech by using the templates of one speaker with
Data collection
The corpus was obtained from a male native Mandarin speaker reading paragraphs from newspaper articles, selected for broad coverage of factors in the text that are associated with prosodic effects, including tonal patterns in the beginning, medial, and final positions of utterances, phrases, and words. To select sentences from a corpus, we used the greedy algorithm described in (van Santen and Buchsbaum, 1997). Pause and emphasis were transcribed manually after text selection and recording. A
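The greedy text-selection step can be sketched as greedy set cover in the spirit of van Santen and Buchsbaum (1997): repeatedly pick the sentence that covers the most still-uncovered factor combinations. The data structures and the tone-pair labels below are our own illustrative assumptions, not the paper's actual factor inventory:

```python
def greedy_select(sentences, wanted):
    """Greedy set cover for corpus design.

    sentences: {sentence_id: set of factor combinations it covers}
    wanted: set of factor combinations the corpus should cover
    Returns the chosen sentence ids in selection order.
    """
    chosen, uncovered = [], set(wanted)
    while uncovered:
        # Pick the sentence covering the most still-uncovered combinations.
        best = max(sentences, key=lambda s: len(sentences[s] & uncovered))
        gain = sentences[best] & uncovered
        if not gain:
            break  # remaining combinations are unattainable from this pool
        chosen.append(best)
        uncovered -= gain
    return chosen

# Toy pool: factor combinations here are hypothetical tone-pair labels.
pool = {'a': {'T1T2', 'T2T3'}, 'b': {'T1T2'}, 'c': {'T3T4'}}
selection = greedy_select(pool, {'T1T2', 'T2T3', 'T3T4'})
```

Greedy selection does not guarantee the minimal sentence set, but it is the standard practical choice because exact minimal set cover is NP-hard.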
Results of fit
Overall, our word-based models fit the data with a 13 Hz RMS error, approximately 1.5 semitones. In Fig. 3, we show the beginning of an utterance from the best-fit model (subset1-J-wA). In Fig. 4, we show the phrase with median error from that model, and in Fig. 5, the phrase containing the worst-fit pair of syllables in the worst of the converged models (subset2-S-wAT). Generally, the worst-fitting syllables tend to be the ones with the largest and fastest pitch excursions. These are
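The Hz-to-semitone correspondence quoted above can be checked with the standard conversion, 12 * log2(f2/f1). A 13 Hz deviation equals about 1.5 semitones only relative to some reference pitch; the ~145 Hz male baseline used below is our assumption, back-solved from the figures in the text rather than stated in it:

```python
import math

def semitones(f_ref, delta_hz):
    """Size of a delta_hz deviation from f_ref, in semitones:
    12 * log2((f_ref + delta_hz) / f_ref)."""
    return 12.0 * math.log2((f_ref + delta_hz) / f_ref)

# With an assumed ~145 Hz baseline, a 13 Hz error is about 1.5 semitones.
error_st = semitones(145.0, 13.0)
```

The same 13 Hz error would be a smaller semitone deviation for a higher-pitched voice, which is why RMS errors are often reported in both units.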
Conclusion
We have used Stem-ML to build a model of continuous Mandarin speech that connects the acoustic level up to the results of text analysis (part-of-speech information, and word, phrase, clause, and sentence boundaries). When fit to a corpus, the model shows that prosody is used in a consistent way to mark divisions in the text: sentences, clauses, phrases, and words start strong and end weak. Our prosodic measurements also show a useful correlation with word length, and the part-of-speech of
References (73)
- Church, K.W., Gale, W.A., 1991. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Comput. Speech Lang.
- Physiologically based criterion of muscle force prediction in locomotion. J. Biomech., 1981.
- Hirschberg, J., 1993. Pitch accent in context: predicting intonational prominence from text. Artif. Intell.
- Kochanski, G., Shih, C., 2003. Prosody modeling with soft templates. Speech Comm.
- Xu, Y., Wang, Q.E., 2001. Pitch targets and their realization: evidence from Mandarin Chinese. Speech Comm.
- Statistical prosodic modeling: from corpus design to parameter estimation. IEEE Trans. Speech Audio Process., 2001.
- Tiers in articulatory phonology, with some implications for casual speech.
- Chen, Y., Gao, W., Zhu, T., Ma, J., 2000. Multi-strategy data mining on Mandarin prosodic patterns. In: Proceedings of...
- Chen, S.-H., Hwang, S.H., Tsai, C.-Y., 1992. A first study of neural net based generation of prosodic and spectral...
- Computational Linguistic Society of the Republic of China, 1993. ROCLING Chinese Corpus. Institute of Information...
- Segmental durations in connected speech signals: syllabic stress. J. Acoust. Soc. Am.
- The articulatory kinematics of final lengthening. J. Acoust. Soc. Am.
- Transmission of Information.
- Flash, T., Hogan, N., 1985. The coordination of arm movements: an experimentally confirmed mathematical model. J. Neurosci.
- Scalar and categorical phenomena in a unified model of phonetics and phonology. Phonology.
- Dynamic characteristics of voice fundamental frequency in speech and singing.
- In search of vocal frequency control mechanisms.
- Interaction between two factors that influence vowel duration. J. Acoust. Soc. Am.
- Hierarchical structure and word strength prediction of Mandarin prosody. Internat. J. Speech Technol.
- The control of multi-muscle systems: human jaw and hyoid movements. Biol. Cybernet.
- Ladd, D.R., 1996. Intonational Phonology.
- Segmental and suprasegmental influences on fundamental frequency contours.
- Improved tone concatenation rules in a formant-based Chinese text-to-speech system. IEEE Trans. Speech Audio Process.
- Levenberg, K., 1944. A method for the solution of certain problems in least squares. Quart. Appl. Math.
- Liberman, M., Pierrehumbert, J., 1984. Intonational invariance under changes in pitch range and length.
- Liberman, M., Prince, A., 1977. On stress and linguistic rhythm. Linguist. Inq.
- Lindblom, B., 1963. Spectrographic study of vowel reduction. J. Acoust. Soc. Am.
- Mechanical properties of single motor units in speech musculature. J. Acoust. Soc. Am.
- Marquardt, D.W., 1963. An algorithm for least-squares estimation of nonlinear parameters. SIAM J. Appl. Math.
1. Present address: University of Illinois, Urbana-Champaign, IL, USA.
2. Present address: IBM, T.J. Watson Research Center, Yorktown Heights, NY, USA.