Elsevier

Speech Communication

Volume 30, Issue 1, January 2000, Pages 55-74

Estimation of stability and accuracy of inverse problem solution for the vocal tract

https://doi.org/10.1016/S0167-6393(99)00031-X

Abstract

The inverse problem for the vocal tract is under consideration from the viewpoint of the ill-posed problem theory. The proposed approach, which permits overcoming the difficulties related to ambiguity and instability, is based on the variational regularization with constraints. The work of articulators is used as a functional of regularization and a criterion of optimality for finding an approximate solution. The measured acoustical parameters of the speech signal serve as external constraints while the geometry of the vocal tract, the mechanics of the articulation, and the phonetic properties of the language play the role of internal constraints. An effective numerical implementation of the proposed approach is based on a local piecewise linear approximation of the articulatory-to-acoustics mapping and a polynomial approximation of the discrepancy measure. A heuristic method named the “calibrating curves method” is applied for estimating the accuracy of the obtained approximate solution. It was shown that in some cases the error of the inverse problem solution is weakly dependent on the errors of formant frequency measurements. The vocal tract shapes obtained by virtue of the proposed approach are very close to those measured in X-ray experiments.

Résumé

The inverse problem for the vocal tract is analyzed from the viewpoint of the theory of ill-posed problems. A method is proposed that uses variational regularization with constraints to overcome the difficulties related to ambiguity and instability. The work of the articulatory organs is used as a regularizer and as an optimality criterion for the approximate solution sought. Measured acoustical parameters serve as external constraints, while the geometry of the vocal tract, the mechanics of articulation and the phonetic properties of the language serve as internal constraints. An effective numerical implementation of the proposed approach is based on a linear approximation of the articulatory-to-acoustic mapping. A heuristic method named the “calibrating curves method” was used to estimate the accuracy of the approximate solutions obtained. It is shown that in some cases the error of the inverse problem solution depends only very weakly on the errors of the formant frequency measurements. The vocal tract shapes estimated with the proposed method closely resemble the shapes measured in radiographic experiments.

Introduction

Historically, the inverse problem for the vocal tract was studied for technical applications. It was supposed that determining the vocal tract parameters from a speech signal would provide considerable compression of the speech signal, improve the automatic speech recognition rate, solve the problem of training articulatory synthesizers, and create new methods in logopedics and foreign language teaching.

The compensating ability of the articulatory control system and some hypotheses on the mechanisms of speech perception in terms of the articulatory parameters indicate the possibility of solving the inverse problem in the systems of articulatory control and speech perception (Fowler, 1986; Liberman et al., 1967; Liberman and Mattingly, 1985; Sorokin, 1996). A study of the mathematical properties inherent in the inverse problem for the vocal tract paves the way to understanding the processes of articulatory control and speech perception.

Initially, the inverse problem for the vocal tract was considered as a purely mathematical problem of the determination of its area function by means of solving a one-dimensional wave equation (Webster's equation). Borg (1946) and Levinson (1949) have shown that for a lossless system, the unique solution of the problem can be obtained only when complete information on the spectrum of the solution is available, and if the area function is “smooth”. Following this analysis of the inverse problem, Gårding (1977) found the solution for the cases when all resonance frequencies and their dampings are known. He assumed that the measured fields of the acoustical pressure and velocity are analytic functions, and the solution is infinitely differentiable. The latter condition may be avoided if all resonance frequencies are known at least for two different boundary conditions. Obviously, these assumptions are not valid in practice for the inverse problem for the vocal tract.
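For reference, Webster's equation for the sound pressure $p(x,t)$ in a lossless tube with area function $A(x)$ is commonly written in the form below. The symbols $p$, $x$, $t$ and the sound speed $c$ are notation assumed here, not taken from the paper.

$$\frac{\partial}{\partial x}\left(A(x)\,\frac{\partial p}{\partial x}\right) = \frac{A(x)}{c^{2}}\,\frac{\partial^{2} p}{\partial t^{2}}$$

In this setting the inverse problem is to recover $A(x)$ from spectral data of the equation, such as its resonance frequencies under given boundary conditions.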

The limitations of the analytical approach stimulated the development of approximate and numerical methods. Using the technique of small perturbations, Schroeder (1967) and Mermelstein (1967) derived analytic expressions describing the correspondence between the resonance frequencies of the wave equation and the coefficients of the Fourier series expansion of the logarithmic area function. Yehia and Itakura (1996) used this approach with the addition of some “morphological” constraints on the area function. This method requires knowledge of the vocal tract length and of the boundary conditions, i.e., values that cannot be measured in the speech signal. Gopinath and Sondhi (1970) have shown that the area function can be calculated from the measured resonance frequencies if the function is twice continuously differentiable, the vocal slit is closed, there is no radiation impedance, and some additional functionals are known. However, these functionals cannot be measured directly in the speech signal.
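The first-order relation behind the perturbation approach can be written out as follows. This is a sketch under assumed conventions (a uniform reference tube of length $L$ and area $A_0$, closed at the glottis at $x=0$ and ideally open at the lips at $x=L$); the coefficient symbols $a_m$ are notation introduced here for illustration.

$$\ln A(x) = \ln A_0 + \sum_{m \ge 1} a_m \cos\frac{m\pi x}{L}, \qquad \frac{\delta F_n}{F_n} \approx -\frac{a_{2n-1}}{2}, \qquad F_n = \frac{(2n-1)\,c}{4L}$$

To first order, only the odd-indexed coefficients shift the resonance frequencies of such a tube, so the even-indexed ones cannot be recovered from the resonance frequencies of this configuration alone. The dependence on $L$ and on the boundary conditions is also visible here, which is why both must be known in advance.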

Also, a problem of unmeasurable parameters arose when the area of a sequence of cylindrical sections approximating the vocal tract was computed from the speech signal parameters (Atal, 1970; Paige and Zue, 1970; Wakita, 1973; Wakita and Gray, 1975). It was found that the computed area function depended on the radiation impedance, the vocal source characteristics and the distribution of losses in the vocal tract.

The problem of unmeasurable parameters can be avoided if the error in predicting the acoustical parameters is minimized by varying the parameters of a speech production model (the so-called “analysis by synthesis”) (Flanagan et al., 1980; Nakajima, 1977; Shirai and Honda, 1976). This approach was used by Charpentier (1984) when the number of measured acoustical parameters was less than that of the area function parameters. Båvegard and Fant (1995), Kobayashi et al. (1991) and Saltzman and Munhall (1989) applied the analysis by synthesis method when the number of the area function parameters was equal to the number of the measured acoustical parameters.
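The analysis-by-synthesis loop described above can be sketched in a few lines. This is a minimal illustration, not any of the cited implementations: `synthesize` is a hypothetical two-parameter stand-in for a speech production model, and the optimizer is a plain coordinate (compass) search chosen only because it needs no external libraries.

```python
def synthesize(z):
    """Hypothetical synthesizer: maps two model parameters to two
    formant-like values in Hz. It is NOT a vocal tract model."""
    return (500.0 + 300.0 * z[0], 1500.0 + 400.0 * z[1] - 200.0 * z[0])


def prediction_error(z, measured):
    """Squared error between synthesized and measured acoustic parameters."""
    return sum((p - m) ** 2 for p, m in zip(synthesize(z), measured))


def analysis_by_synthesis(measured, z0, step=0.5, tol=1e-9, max_iter=10000):
    """Vary the model parameters until the prediction error stops improving."""
    z = list(z0)
    err = prediction_error(z, measured)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for i in range(len(z)):
            for delta in (step, -step):
                trial = list(z)
                trial[i] += delta
                trial_err = prediction_error(trial, measured)
                if trial_err < err:
                    z, err, improved = trial, trial_err, True
        if not improved:
            step *= 0.5  # refine the search once no neighbour is better
    return z, err


# "Measured" values generated from known parameters, then recovered.
measured = synthesize((0.4, -0.2))
z_found, final_error = analysis_by_synthesis(measured, (0.0, 0.0))
print(z_found, final_error)
```

In this toy case the number of model parameters equals the number of acoustic parameters and the map is invertible, so the search has a single minimum. With more model parameters than measurements, many parameter vectors would give the same error, which is the ambiguity discussed in the following paragraphs.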

It is well known that there is a non-denumerable infinity of area functions consistent with a given set of formant frequencies (Schroeter and Sondhi, 1994). In particular, the anti-symmetric transformation of the area function retains the same resonance frequencies. It is difficult to overcome the ambiguity of the inverse problem with respect to the area function parameters since information on the physically justified constraints on the admissible vocal tract shapes is largely lacking. On the other hand, certain constraints may be found when solving the inverse problem for the articulatory parameters. Ladefoged et al. (1978) and Shirai and Honda (1976) solved the problem when the number of articulatory parameters was equal to the number of measured acoustical parameters. However, a solution to the inverse problem with respect to only a few articulatory parameters is not sufficient for applications like automatic speech recognition, speech compression, or speech synthesis. For these applications, we need a model which is able to generate all articulatory states inherent in a given language. Articulatory models with 7–11 articulatory parameters developed by Coker (1973), Maeda (1979), Mermelstein (1973) and Shirai and Honda (1976) were used for solving inverse problems. A model with 15 articulatory parameters was used in inverse problems for vowels and fricatives (Sorokin, 1992a, 1992b, 1994; Sorokin and Trushkin, 1996).
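The non-uniqueness can be checked numerically on a simple tube model. The sketch below assumes a lossless tube of equal-length cylindrical sections, closed at the glottis and ideally open at the lips, and it reads the anti-symmetric transformation as A(x) → 1/A(L − x); both the model and that reading are assumptions made for illustration. The sound speed and the example areas are arbitrary values.

```python
import math

SOUND_SPEED = 35000.0  # cm/s, an assumed round value


def lip_pressure(freq_hz, areas, section_len):
    """Pressure at the open end for unit pressure and zero flow at the
    closed glottis (lossless plane-wave propagation, sections ordered
    from glottis to lips)."""
    k = 2.0 * math.pi * freq_hz / SOUND_SPEED
    c, s = math.cos(k * section_len), math.sin(k * section_len)
    p, v = 1.0, 0.0  # v is the (scaled) volume velocity amplitude
    for a in areas:
        p, v = p * c + (v / a) * s, v * c - (p * a) * s
    return p


def formants(areas, section_len, f_max=4000.0, step=10.0):
    """Resonances below f_max: zeros of the lip pressure, located by a
    coarse scan followed by bisection."""
    roots = []
    f_lo = 0.5 * step
    g_lo = lip_pressure(f_lo, areas, section_len)
    while f_lo < f_max:
        f_hi = f_lo + step
        g_hi = lip_pressure(f_hi, areas, section_len)
        if g_lo * g_hi < 0.0:
            a, b, g_a = f_lo, f_hi, g_lo
            for _ in range(60):
                mid = 0.5 * (a + b)
                g_mid = lip_pressure(mid, areas, section_len)
                if g_a * g_mid <= 0.0:
                    b = mid
                else:
                    a, g_a = mid, g_mid
            roots.append(0.5 * (a + b))
        f_lo, g_lo = f_hi, g_hi
    return roots


areas = [1.0, 2.0, 0.5, 4.0, 3.0]                # cm^2, glottis to lips
mirrored = [1.0 / a for a in reversed(areas)]    # A(x) -> 1/A(L - x)
print(formants(areas, 3.5))
print(formants(mirrored, 3.5))
```

The two printed lists agree although the two tubes have very different shapes, so formant frequencies alone cannot distinguish them in this idealized model.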

An efficient initial search for articulatory parameters can be provided with the use of associative memory. The associative memory can be implemented as a code-book or as an artificial neural net. In both cases, during training, a vector of acoustic parameters is computed for each vector of the articulatory parameters. In one method, the articulatory parameters and their acoustical counterparts are stored in a code-book. In the other method, the parameters of the articulatory-to-acoustic mapping are learnt in a neural net. The code-book technique for the static inverse problem was first proposed by Atal et al. (1978). It was further elaborated by Larar et al. (1988), Schroeter and Sondhi (1992) and Sorokin and Trushkin (1996). The neural net technique was studied in application to solving both the static and dynamic inverse problems by Jordan and Rumelhart (1992), Kawato et al. (1987), Kobayashi et al. (1991), Rahim and Goodyear (1990) and Rahim et al. (1993). An advantage of the associative memory is that the articulatory parameters found can be used both as a final solution and as an initial approximation for further optimization. The complicated problem of ensuring that the training set is representative is inherent in both the code-book technique and the neural net technique. Schroeter and Sondhi (1994) reviewed some approaches to the inverse problem by neural nets, concluding that “no clear advantage has so far been shown for them compared to other approaches”.
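A code-book of this kind can be sketched as a table of (articulatory vector, acoustic vector) pairs searched by nearest neighbour. The forward map `toy_forward`, the sampling grid and the query values below are hypothetical and chosen only to keep the example small; a real code-book would be filled by an articulatory synthesizer.

```python
import itertools
import math


def toy_forward(z):
    """Hypothetical articulatory-to-acoustic map (not a vocal tract model)."""
    return (z[0] + z[1], z[0] * z[1])


def build_codebook(grid):
    """Training step: compute the acoustic vector for every sampled
    articulatory vector and store the pairs."""
    return [(z, toy_forward(z)) for z in itertools.product(grid, repeat=2)]


def lookup(codebook, u, k=2):
    """Return the k entries whose acoustic vectors are closest to u."""
    ranked = sorted(codebook, key=lambda entry: math.dist(entry[1], u))
    return ranked[:k]


grid = [i / 10 for i in range(11)]
codebook = build_codebook(grid)
for z, u in lookup(codebook, (0.9, 0.18)):
    print(z, u)
```

The two best matches have the same acoustic vector but different articulatory vectors, so the look-up alone supplies candidates rather than a unique answer. Either candidate could serve as an initial approximation for further optimization.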

The ambiguity of the inverse problem solution with respect to the articulatory parameters cannot be resolved by the code-book technique alone because of the necessity to make a choice among multiple articulatory vectors corresponding to the same acoustical vector (Atal et al., 1978). Neural nets do not explicitly deal with the non-uniqueness, usually outputting a weighted average of the different inverse solutions. The constraints derived from the dynamics of the articulatory parameters can decrease the uncertainty of the solution (Chenoukh et al., 1997; McGowan, 1994; Saltzman and Munhall, 1989; Schoentgen and Ciocea, 1997; Shirai and Honda, 1976; Shirai, 1977; Shirai and Kobayashi, 1986). Sorokin and Trushkin (1996) formed a code-book only with articulatory vectors sampled on the trajectories of the articulatory parameters providing synthetic diphones. This way, the dynamic constraints were implicitly used.
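One simple way to picture how dynamics can reduce the ambiguity is a continuity rule: among the candidates found for each frame, keep the one closest to the vector chosen for the previous frame. The sketch below is a greedy stand-in for the dynamic constraints mentioned above, not a reproduction of any cited method, and the candidate lists are made-up values.

```python
import math


def pick_by_continuity(candidates, previous_z):
    """Choose the candidate articulatory vector nearest to the previous one."""
    return min(candidates, key=lambda z: math.dist(z, previous_z))


def track(frames, start):
    """Greedy frame-by-frame selection under a continuity assumption."""
    path, previous = [], start
    for candidates in frames:
        previous = pick_by_continuity(candidates, previous)
        path.append(previous)
    return path


# Each frame offers two articulatory vectors assumed to be acoustically
# equivalent (hypothetical values).
frames = [
    [(0.3, 0.6), (0.6, 0.3)],
    [(0.6, 0.35), (0.35, 0.6)],
    [(0.4, 0.6), (0.6, 0.4)],
]
print(track(frames, start=(0.25, 0.6)))
```

A greedy rule like this depends on the starting vector and can lock onto the wrong branch, which is one reason trajectory-level constraints are of interest.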

In spite of reports on successful solutions for some particular cases, it is not clear whether the problem can be solved for all kinds of speech sounds. The mathematical models in the inverse articulatory problem are heuristic in general. Moreover, important mathematical properties of the problem, such as the existence of the solution and the accuracy estimation for the obtained solution, have not been investigated theoretically in the literature concerning the speech inverse problem.

Let us consider the inverse articulatory problem for the vocal tract as a mathematical operator equation

$$Az = u \tag{1}$$

with a nonlinear continuous operator $A: Z \to U$. Here $Z$ and $U$ are respectively the sets of admissible articulatory and acoustical parameters. The element $z \in Z$ is an unknown vector of articulatory parameters to be found and $u \in U$ is the corresponding vector of acoustical parameters. The exact value of $u$ is unknown because only its approximation can be measured in the speech signal. In fact, the operator $A$ specifying the relationship of articulatory and acoustical parameters is also unknown. We are concerned only with a mathematical model of this relationship. Thus, the operator $A$ is given approximately. Problems like Eq. (1), as a rule, are ill-posed. The operator equation (1) is said to be well-posed (in the sense of Hadamard) if (i) its solution $z$ exists on the given set $Z$ for any admissible exact data $(A,u)$; (ii) the solution is unique in this set; (iii) the solution continuously depends on $(A,u)$ (or, in other words, the solution is stable with respect to admissible perturbations of $(A,u)$). The latter means that for every sequence of data $\{(A_n,u_n)\}_{n=1}^{\infty}$ such that $A_n \to A$ and $u_n \to u$ (in a certain sense) as $n \to \infty$, the corresponding sequence $\{z_n\}$ of solutions of Eq. (1) converges to $z$. The problem becomes ill-posed if at least one of these requirements is violated.

As a rule, the inverse problems for the vocal tract are ill-posed. Indeed, the fulfillment of the first requirement depends on the adequacy of the mathematical model of speech production. In practice, it is impossible to prove whether or not any particular model is able to reproduce all vectors of acoustical parameters which can appear in the speech signal. This especially concerns models with few articulatory parameters. Below in Section 3.2, we will present an example of measurements in which even a 16-parameter model was not able to provide the reproduction. The scatterplots used to find the range of acoustical parameters for a given range of the articulatory parameters (Atal et al., 1978; Boë et al., 1992; Schroeter et al., 1990) cannot confirm the adequacy of the articulatory model used. Therefore, it is likely that the first condition is not fulfilled.

We have already discussed the non-unique relationship between the acoustical and articulatory parameters. Thus, the second requirement of well-posedness is not fulfilled either.

The stability of the solution to Eq. (1) depends on the properties of the inverse operator $A^{-1}$. More precisely, the solution appears to be stable if the operator $A^{-1}$ is continuous, i.e., $A^{-1}u_n \to A^{-1}u$ as $n \to \infty$, and if the inversion is stable with respect to admissible perturbations of the operator: $A_n^{-1} \to A^{-1}$. Since we have a perturbed operator $A_n$ only, these requirements are hard to verify theoretically. Moreover, the operator $A_n^{-1}$ is usually represented as a computing procedure rather than an analytical expression. Thus, the examination of whether or not our inverse problem is stable should be carried out by computation only. Some examples of the speech inverse problem demonstrate the instability of the solution. For example, when the area function is calculated with the use of linear prediction coefficients (Markel and Gray, 1976), the estimated shape of the vocal tract changes considerably from one frame of the speech signal to another. Therefore, a solution for some particular task may be unstable.
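A computational stability check of the kind described here can be illustrated on a deliberately simple case. The sketch below uses a nearly singular 2×2 linear operator as a stand-in; the operator in the paper is nonlinear, so the matrix and data are assumptions chosen only to show how a small change in the data can produce a large change in the solution.

```python
import math


def solve_2x2(a, u):
    """Solve the 2x2 linear system a z = u by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    z0 = (u[0] * a[1][1] - a[0][1] * u[1]) / det
    z1 = (a[0][0] * u[1] - a[1][0] * u[0]) / det
    return (z0, z1)


# Hypothetical, nearly singular operator and two almost identical data vectors.
A = [[1.0, 1.0], [1.0, 1.0001]]
u_exact = (2.0, 2.0001)
u_noisy = (2.0, 2.0002)

z_exact = solve_2x2(A, u_exact)
z_noisy = solve_2x2(A, u_noisy)

data_change = math.dist(u_exact, u_noisy)
solution_change = math.dist(z_exact, z_noisy)
amplification = solution_change / data_change
print(z_exact, z_noisy, amplification)
```

Here a data change of about 0.0001 moves the solution from roughly (1, 1) to roughly (0, 2). Repeating such a perturbation test over the admissible data is one way to probe stability when no analytical expression for the inverse is available.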

The above considerations confirm that the inverse articulatory problem (1) is ill-posed in general. Therefore, for its solution we can benefit from the specific notions and methods developed in the theory of ill-posed problems. The theory and the corresponding numerical methods have evolved considerably since Tikhonov (1963) formulated its main principles (see e.g., Bakushinsky and Goncharsky, 1994; Tikhonov and Arsenin, 1977; Tikhonov et al., 1998). Sorokin (1992b, 1994) and Sorokin and Trushkin (1996) used the recommendations and the results of the theory, though without an explicit description of the mathematical background. The majority of recent approaches to solving speech inverse problems do not directly use the results of the ill-posed problem theory, and some of them do not even consider the problems as ill-posed. For example, it is not usual practice to investigate whether approximate solutions converge to the exact one. Another point which is not clearly understood is what to do when we are aware that the inverse problem may have no solution.
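For orientation, the classical Tikhonov form of variational regularization replaces Eq. (1) by a minimization problem. The notation below ($\tilde{u}$ for the measured data, $\Omega$ for the stabilizing functional, $\alpha$ for the regularization parameter) is assumed here for illustration and is not necessarily the formulation adopted later in the paper.

$$z_\alpha = \arg\min_{z \in Z}\left\{ \|Az - \tilde{u}\|^{2} + \alpha\,\Omega(z) \right\}, \qquad \alpha > 0$$

In the classical theory, $\alpha$ is matched to the error level of the data so that, under suitable conditions, $z_\alpha$ tends to the exact solution as the error tends to zero. This is the convergence property referred to above.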

The next subject to be analyzed from the ill-posed problem viewpoint is the accuracy estimation of an approximate solution to the inverse problem. Only a few known studies compare the real vocal tract parameters with those obtained as a result of solving an inverse problem. For example, measured and calculated tongue shapes were presented by Sorokin (1992b) and Sorokin and Trushkin (1996). Badin et al. (1995) compared measured and computed midsagittal functions. Even fewer studies give quantitative estimates of the difference between the measured and computed parameters.

Möller et al. (1976) reported that the mean square errors were 21–37% of the measured variance of the velum height. Hogden et al. (1996) estimated the errors for the upper lip, lower lip, jaw and tongue position as 0.5–2%. McGowan and Lee (1996) estimated the rms errors for the lip protrusion and the coordinates of some points on the tongue as 0.2–2.5 mm. Schoentgen and Ciocea (1997) investigated the stability of solutions with respect to the area function for triplets of vowels. They found that the area perturbation quotient is in the range of 0.01–0.07, while the formant frequencies perturbation quotient is in the range of 0.02–0.07.
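The error measures quoted above are of two kinds: an absolute rms error and a mean square error expressed relative to the variance of the measured quantity. A minimal sketch of both follows; the function names and the sample numbers are made up for illustration and are not data from the cited studies.

```python
import math


def rms_error(measured, estimated):
    """Root-mean-square difference, in the units of the measurements."""
    n = len(measured)
    return math.sqrt(sum((m - e) ** 2 for m, e in zip(measured, estimated)) / n)


def mse_percent_of_variance(measured, estimated):
    """Mean square error as a percentage of the (population) variance
    of the measured values."""
    n = len(measured)
    mean = sum(measured) / n
    variance = sum((m - mean) ** 2 for m in measured) / n
    mse = sum((m - e) ** 2 for m, e in zip(measured, estimated)) / n
    return 100.0 * mse / variance


# Made-up positions in mm: "measured" values and inverse-problem estimates.
measured = [1.0, 2.0, 3.0, 4.0]
estimated = [1.1, 1.9, 3.2, 3.8]
print(rms_error(measured, estimated))
print(mse_percent_of_variance(measured, estimated))
```

The relative measure depends on how much the measured parameter varies in the data set, so figures reported in the two forms are not directly comparable.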

The question arises whether it is possible to estimate the accuracy of approximate solutions theoretically, or whether this can be done only experimentally. Note that if the stability of an inverse problem solution is not guaranteed, then the numerical accuracy estimation becomes meaningless.

The paper is structured as follows. Section 2 discusses the notion of the regularization of ill-posed problems in application to the speech inverse problem, considers conceivable constraints and optimality criteria, and describes an approximation of the mapping operator and the discrepancy measure. Section 3 discusses the imprecision of a speech production model. A theoretical and experimental estimation of the accuracy of inverse problem solutions will be described in Section 4. Section 5 discusses the obtained results, and Section 6 concludes this paper.

Section snippets

Generalized solution and regularization

In practical problems, input data $u$ are usually measured with some error $\delta$. It means that a value $u_\delta$ such that $\|u - u_\delta\| \leq \delta$ is available instead of $u$. Here $\|\cdot\|$ is a norm introduced on the set $U$. Moreover, since we have only approximate mathematical models for the articulatory-to-acoustics mapping, the exact operator $A$ is unknown and we are forced to use another operator $A_h$ instead. Here, the value $h$ is an accuracy estimation for the approximation of operator $A$ by $A_h$. Thus, even though the problem with

Vocal tract model mismatch and measurement errors

A preliminary assessment of algorithms for solving the inverse problem may be carried out with the use of an articulatory synthesizer because in this case both the articulatory parameters and the acoustical parameters are known. When a real speech signal is taken as an input for the inverse problem, a mismatch between the mathematical models and the real processes of speech production as well as measurement errors may affect the obtained solution. A model mismatch increases solution errors (

Accuracy of solution

Unfortunately, the theory of ill-posed problems presents mostly negative results for the attainable accuracy of approximate solutions to inverse problems. It was found that for solving an ill-posed problem by a concrete method, it is impossible to obtain a uniform quantitative estimation of accuracy without detailed knowledge about the exact solution (Bakushinsky and Goncharsky, 1994; Leonov and Yagola, 1995). Therefore, it is impossible, e.g., to guarantee that an algorithm using data with 1%

Discussion

A high solution accuracy was obtained for non-perturbed formant frequencies. Apparently, for these particular vowels, it was achieved due to the use of proper constraints and an adequate criterion of optimality. Another reason for the good accuracy achieved may be the minimal mismatch between the vocal tract model and the vocal tract of the speaker because the shape of the fixed parts of the vocal tract model and some dimensions were measured for the same speaker.

Fant (1960) has shown that the

Conclusion

The inverse problem of calculating articulatory parameters from measured acoustical parameters has principally, in its initial mathematical statement, a non-unique and unstable solution. The problem can be regularized in order to achieve a stable solution if

  1. optimality criteria matched with the criteria of articulatory control system activity are found and

  2. physically and phonetically based constraints are exploited as much as possible.

A theoretical evaluation of attainable accuracy of the

Acknowledgements

We are grateful to J. Hogden and the anonymous reviewer for helpful comments on an earlier version of this paper.

References (75)

  • Shirai, K., et al., 1986. Estimating articulatory motion from speech wave. Speech Communication.
  • Sorokin, V.N., 1992. Determination of vocal tract shape for vowels. Speech Communication.
  • Sorokin, V.N., 1994. Inverse problem for fricatives. Speech Communication.
  • Sorokin, V.N., et al., 1996. Articulatory-to-acoustic mapping for inverse problem. Speech Communication.
  • Wood, S.A.J., 1979. A radiographic analysis of constriction locations for vowels. J. Phonetics.
  • Yehia, H., et al., 1996. A method to combine acoustic and morphological constraints in the speech production inverse problem. Speech Communication.
  • Atal, B.S., 1970. Determination of vocal-tract shape directly from the speech wave. J. Acoust. Soc. Amer.
  • Atal, B.S., et al., 1978. Inversion of articulatory-to-acoustic transformation in the vocal tract by a computer sorting technique. J. Acoust. Soc. Amer.
  • Bakushinsky, A., Goncharsky, A., 1994. Ill-Posed Problems: Theory and Applications. Kluwer Academic Publishers,...
  • Båvegard, M., et al., 1995. From formant frequencies to VT area function parameters. STL-QPSR.
  • Bernstein, M.A., 1967. The Coordination and Regulation of Movements. Pergamon Press,...
  • Brillouin, L., 1956. Science and Information Theory. Academic Press, New...
  • Borg, G., 1946. Eine Umkehrung der Sturm-Liouvilleschen Eigenwertaufgabe. Acta Math.
  • Chenoukh, S., Sinder, D., Richard, G., Flanagan, J.L., 1997. Voice mimic system using an articulatory codebook for...
  • Coker, C.H., 1973. A model of articulatory dynamics and control. In: Proc. IEEE 64, pp....
  • Fant, G., 1960. Acoustic Theory of Speech Production. Mouton, The...
  • Flanagan, J.L., et al., 1980. Signal models for low bit-rate coding of speech. J. Acoust. Soc. Amer.
  • Gårding, L., 1977. The inverse of vowel articulation. Arkiv för Matematik.
  • Gopinath, B., et al., 1970. Determination of the shape of the human vocal tract from acoustical measurements. Bell Syst. Tech. J.
  • Hatze, H., 1980. Neuromusculoskeletal control systems modeling – a critical review of recent developments. IEEE Trans....
  • Heinz, J.M., Stevens, K.N., 1965. On the relations between lateral cineradiograph, area functions, and acoustic spectra...
  • Hogden, J., et al., 1996. Accurate recovery of articulator positions from acoustics: New conclusions based on human data. J. Acoust. Soc. Amer.
  • Kawato, M., et al., 1987. A hierarchical neural network model for control and learning of voluntary movement. Biol. Cybernetics.
  • Kiritani, S., Tatenaka, E., Sawashima, M., 1978. Computer tomography of the vocal tract. Ann. Bull. Res. Inst....
  • Kobayashi, T., Yagyu, M., Shirai, K., 1991. Application of neural networks to articulatory motion estimation. In:...
  • Ladefoged, P., et al., 1978. Generating vocal tract shapes from formant frequencies. J. Acoust. Soc. Amer.
  • Larar, J.N., Schroeter, J., Sondhi, M.M., 1988. Vector-quantization of the articulatory space. IEEE Trans. Acoust....