Elsevier

Speech Communication

Volume 41, Issue 4, November 2003, Pages 625-645
Quantitative measurement of prosodic strength in Mandarin

https://doi.org/10.1016/S0167-6393(03)00100-6

Abstract

We describe models of Mandarin prosody that allow us to make quantitative measurements of prosodic strengths. These models use Stem-ML, which is a phenomenological model of the muscle dynamics and planning process that controls the tension of the vocal folds, and therefore the pitch of speech. Because Stem-ML describes the interactions between nearby tones, we were able to capture surface tonal variations using a highly constrained model with only one template for each lexical tone category, and a single prosodic strength per word. The model accurately reproduces the intonation of the speaker, capturing 87% of the variance of f0 with these strength parameters. The result reveals alternating metrical patterns in words, and shows that the speaker marks a hierarchy of boundaries by controlling the prosodic strength of words. The strengths we obtain are also correlated with syllable duration, mutual information and part-of-speech.

Introduction

Intonation production is generally considered a two-step process: an accent or tone class is predicted from available information, and then the tone class is used to generate f0 as a function of time. Historically, most attention has been paid to the first, high-level step of the process. Here we show that by focusing on f0 generation, one can build a model that starts with acoustic data and reaches far enough up to connect directly to linguistic factors such as part-of-speech, word length and position in the text.

Specifically, we present a model of Mandarin Chinese intonation that makes quantitative f0 predictions in terms of the lexical tones and the prosodic strength of each word. The model is able to generate tonal variations from a few tone templates that correspond to lexical tones, and accurately reproduce f0 in continuous Mandarin speech with a 13 Hz RMS error. The result is comparable to machine learning systems that may use more than one hundred tone templates to account for Mandarin tonal variations.

We find that some parameters of the model can be interpreted as the prosodic strength of a tone. We determine the prosodic strengths (and the values of the other global parameters) by executing a least-squares fit of the model to the time-series of f0 from a corpus of speech data. The resulting best-fit strengths, tone shapes, and metrical patterns of words can be associated with linguistic properties. We show that strengths computed from the model exhibit strong and weak alternation as in metrical phonology (Liberman and Prince, 1977), and the values are correlated with the part-of-speech of words, with mutual information, and with the hierarchy of the prosodic structure (Ladd, 1996; Pierrehumbert and Beckman, 1988; Selkirk, 1984) such as the beginning and ending of sentences, clauses, phrases, and words.

We will also show that values of parameters from a fit to one half of the corpus match equivalent parameters fit to the other half of the corpus. Further, we can change the details of the model, and show that the values of many parameters are essentially unaffected by the change. This consistency is important because if we hope to interpret these parameters (and thus the models that contain them) as statements about the language as a whole, they must at least be consistent across the corpus and between similar models.

The model we use is described in Section 3. It is written in Soft Template Mark-up Language (Stem-ML) (Kochanski and Shih, 2003; Kochanski and Shih, 2000), and depends upon its underlying mathematical model of prosody control. We write a Stem-ML model in terms of a set of tags (parameters) then find the parameter values that best reproduce f0 in a training corpus. Fitting the model to the data can be done automatically.

Stem-ML calculates an intonational contour from a set of tags. Some of the tags set global parameters that correspond to speaker characteristics, such as pitch range, while others represent intonational events such as lexical tone categories and accent types. The tags can contain adjustable parameters that can explain surface variations.

Stem-ML does not impose restrictions on how one defines tags. In our view, a meaningful way is to use the tags to represent linguistic hypotheses such as Mandarin lexical tones, or English accent types. We call tags that define tones or accents templates because they define the ideal shapes of f0 in their vicinity. In this paper, our usage of tone tags (tone templates) corresponds directly to Mandarin lexical tone categories, and we interpret the Stem-ML strength parameters as the prosodic strengths of these tone templates. The actual realization of f0 depends on the templates, their neighbors, and the prosodic strengths. We show in the paper that this treatment successfully generates continuous tonal variations from lexical tones.

Described another way, a Stem-ML model is a function that produces a curve of f0 vs. time. The resulting curve depends on a set of adjustable (free) parameters that describe such things as the shape of tones, how tones interact, and the prosodic strength of each syllable. When Stem-ML generates an f0 curve, one can set these parameters to any values, and each setting yields a different curve. Conversely, one can find the best values for the parameters via data-fitting procedures.

We use a least-squares fitting algorithm to find the values for the parameters that best describe the data. The algorithm operates iteratively by adjusting the parameter values, and accepting steps that reduce the sum of the squared differences between the model and the data. The values of the parameters that make the summed squared difference as small as possible, for a given model, are called the best-fit (or fitted) parameters.
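The accept-if-better idea behind such a fit can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_f0` is a two-parameter stand-in (a baseline plus a scaled falling shape), not Stem-ML, and the search is a simple coordinate search rather than the fitting algorithm actually used for the paper. The data are synthetic.

```python
import math


def toy_f0(params, times):
    """Hypothetical stand-in for an f0 model (NOT Stem-ML):
    a constant baseline plus a falling shape scaled by a 'strength'."""
    base, strength = params
    return [base + strength * math.cos(math.pi * t) for t in times]


def sum_squared_error(params, times, observed):
    """Sum of squared differences between the model curve and the data."""
    predicted = toy_f0(params, times)
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def fit(times, observed, start, step=10.0, min_step=1e-6):
    """Coordinate search: nudge one parameter at a time, accept a nudge
    only if it reduces the summed squared error, and halve the step
    size whenever no nudge helps."""
    params = list(start)
    best = sum_squared_error(params, times, observed)
    while step > min_step:
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = list(params)
                trial[i] += delta
                err = sum_squared_error(trial, times, observed)
                if err < best:
                    params, best, improved = trial, err, True
        if not improved:
            step /= 2.0
    return params, best


# Synthetic "observed" f0 generated from known parameters (Hz).
times = [i / 20.0 for i in range(21)]
observed = toy_f0([147.0, 28.5], times)

fitted, error = fit(times, observed, start=[100.0, 0.0])
print(fitted, error)
```

Starting from a poor guess, the search should recover values close to the parameters that generated the synthetic data. A real fit has many more parameters and noisy data, so a more efficient optimizer would be needed in practice.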

Section snippets

Chinese tones

Tonal languages, such as Chinese, use variations in pitch to distinguish otherwise identical syllables. Mandarin Chinese has four lexical tones with distinctive shapes: high level (tone 1), rising (tone 2), low (tone 3), and high falling (tone 4). The syllable ma with a high level tone means mother, but it means horse with a low tone. Thus, in a text-to-speech (TTS) system, good pitch prediction is important not just for natural sounding speech but also for good intelligibility. There is a

Modeling intonation

We build our model for Mandarin on top of Stem-ML (Kochanski and Shih, 2003) because it captures several desirable properties. A positive feature of Stem-ML is that the representation is understandable, adjustable, and can be transported from one situation to another.

Unlike most engineering approaches, this model cleanly separates into local (word-dependent) and global (speaker-dependent) parameters. For instance, one can generate acceptable speech by using the templates of one speaker with

Data collection

The corpus was obtained from a male native Mandarin speaker reading paragraphs from newspaper articles, selected for broad coverage of factors in the text that are associated with prosodic effects, including tonal patterns in the beginning, medial, and final positions of utterances, phrases, and words. To select sentences from a corpus, we used the greedy algorithm described in (van Santen and Buchsbaum, 1997). Pause and emphasis were transcribed manually after text selection and recording. A

Results of fit

Overall, our word-based models fit the data with a 13 Hz RMS error, approximately 1.5 semitones. In Fig. 3, we show the beginning of an utterance from the best-fit model (subset1-J-wA). In Fig. 4, we show the phrase with median error from that model, and in Fig. 5, the phrase containing the worst-fit pair of syllables in the worst of the converged models (subset2-S-wAT). Generally, the worst-fitting syllables tend to be the ones with the largest and fastest pitch excursions. These are
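The conversion from an error in Hz to semitones depends on the pitch at which the error is measured. The sketch below uses the standard definition of a semitone (12 per octave) and assumes a reference f0 of 150 Hz, chosen only as a plausible value for a male speaker. The source does not state which reference was used.

```python
import math


def hz_interval_in_semitones(f_low, f_high):
    """Size of the interval between two frequencies, in semitones
    (12 semitones per doubling of frequency)."""
    return 12.0 * math.log2(f_high / f_low)


reference_f0 = 150.0  # Hz; assumed for illustration, not taken from the paper
rms_error_hz = 13.0   # Hz; the RMS error reported in the text

semitones = hz_interval_in_semitones(reference_f0, reference_f0 + rms_error_hz)
print(round(semitones, 2))
```

Under this assumption the result is a little under 1.5 semitones, in line with the figure quoted above. A lower reference f0 would give a larger value and a higher one a smaller value.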

Conclusion

We have used Stem-ML to build a model of continuous Mandarin speech that connects the acoustic level up to the results of text analysis (part-of-speech information, and word, phrase, clause, and sentence boundaries). When fit to a corpus, the model shows that prosody is used in a consistent way to mark divisions in the text: sentences, clauses, phrases, and words start strong and end weak. Our prosodic measurements also show a useful correlation with word length, and the part-of-speech of

References (73)

  • T.H. Crystal et al.

    Segmental durations in connected speech signals: syllabic stress

    J. Acoust. Soc. Am.

    (1988)
  • J. Edwards et al.

    The articulatory kinematics of final lengthening

    J. Acoust. Soc. Am.

    (1991)
  • R. Fano

    Transmission of Information

    (1961)
  • Feldman, A.G., Adamovich, S.V., Ostry, D.J., Flanagan, J.R., 1990. The origin of electromyograms––explanations based on...
  • T. Flash et al.

    The coordination of arm movements: an experimentally confirmed mathematical model

    J. Neurosci.

    (1985)
  • Flemming, E., 1997. Phonetic optimization: compromise in speech production. University of Maryland Working Papers in...
  • E. Flemming

    Scalar and categorical phenomena in a unified model of phonetics and phonology

    Phonology

    (2001)
  • H. Fujisaki

    Dynamic characteristics of voice fundamental frequency in speech and singing

  • Hirschberg, J., Pierrehumbert, J., 1986. The intonational structuring of discourse. In: Proceedings of the 24th Annual...
  • Hogan, N., Winters, J.M., 1990. Principles underlying movement organization: upper limb. In: Winters and Woo (1990),...
  • H. Hollien

    In search of vocal frequency control mechanisms

  • D.H. Klatt

    Interaction between two factors that influence vowel duration

    J. Acoust. Soc. Am.

    (1973)
  • Kochanski, G.P., Shih, C., 2000. Stem-ML: language independent prosody description. In: Proceedings of the...
  • Kochanski, G., Shih, C., 2001. Automated modelling of Chinese intonation in continuous speech. In: Proceedings of...
  • G. Kochanski et al.

    Hierarchical structure and word strength prediction of Mandarin prosody

    Internat. J. Speech Technol.

    (2003)
  • R. Laboissière et al.

    The control of multi-muscle systems: human jaw and hyoid movements

    Biol. Cybernet.

    (1996)
  • D.R. Ladd

    Intonational Phonology

    (1996)
  • W. Lea

    Segmental and suprasegmental influences on fundamental frequency contours

  • L.-S. Lee et al.

    Improved tone concatenation rules in a formant-based Chinese text-to-speech system

    IEEE Trans. Speech Audio Process.

    (1993)
  • K. Levenberg

    A method for the solution of certain problems in least squares

    Quart. Appl. Math.

    (1944)
  • M.Y. Liberman et al.

    Intonational invariance under changes in pitch range and length

  • M.Y. Liberman et al.

    On stress and linguistic rhythm

    Linguist. Inq.

    (1977)
  • Lin, M.-C., Yan, J., 1983. The stress pattern and its acoustic correlates in Beijing Mandarin. In: Proc. 10th Internat....
  • B. Lindblom

    Spectrographic study of vowel reduction

    J. Acoust. Soc. Am.

    (1963)
  • P.F. MacNeilage et al.

    Mechanical properties of single motor units in speech musculature

    J. Acoust. Soc. Am.

    (1979)
  • D. Marquardt

    An algorithm for least-squares estimation of nonlinear parameters

    SIAM J. Appl. Math.

    (1963)
    1

    Present address: University of Illinois, Urbana-Champaign, IL, USA.

    2

    Present address: IBM, T.J. Watson Research Center, Yorktown Heights, NY, USA.
