
Speech Communication

Volume 48, Issue 8, August 2006, Pages 1009-1023

Effect of voice quality on frequency-warped modeling of vowel spectra

https://doi.org/10.1016/j.specom.2006.01.003

Abstract

The perceptual accuracy of an all-pole representation of the spectral envelope of voiced sounds may be enhanced by the use of frequency-scale warping prior to LP modeling. For the representation of harmonic amplitudes in the sinusoidal coding of voiced sounds, the effectiveness of frequency warping was shown to depend on the underlying signal spectral shape as determined by phoneme quality. In this paper, the previous work is extended to the other important dimension of spectral shape variation, namely voice quality. The influence of voice quality attributes on the perceived modeling error in frequency-warped LP modeling of the spectral envelope is investigated through subjective and objective measures applied to synthetic and natural steady sounds. Experimental results are presented that demonstrate the feasibility and advantage of adapting the warping function to the signal spectral envelope in the context of a sinusoidal speech coding scheme.

Introduction

A successful low bit rate speech coding algorithm requires a good model for the speech signal together with an effective parameter quantization algorithm. A popular model for low bit rate speech coding has been the sinusoidal model, an important example of which is the multiband excitation (MBE) model (Griffin and Lim, 1988). In the case of voiced speech, the parameters of the model are the fundamental frequency (or pitch), and the amplitudes and phases of the harmonics. The harmonic amplitudes represent the product of the source excitation and vocal tract spectra. At low bit rates, estimated phases are usually dispensed with, and the accurate representation of the pitch and harmonic amplitudes becomes critical to the perceptual quality of decoded speech. Steady vowel sounds are particularly sensitive to harmonic amplitude representation errors.
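For concreteness, the sketch below (in Python; not taken from the paper, and the function and parameter names are illustrative) shows how a voiced frame is reconstructed in a harmonic sinusoidal model from the pitch and the per-harmonic amplitudes and phases.

```python
import numpy as np

def synthesize_voiced_frame(f0, amplitudes, phases, fs=8000, n_samples=160):
    """Sum-of-harmonics reconstruction of one voiced frame.

    f0         : fundamental frequency (Hz)
    amplitudes : harmonic amplitudes A_k, k = 1..K
    phases     : harmonic phases; at low bit rates these are usually
                 replaced by synthetic (e.g. minimum-phase) values
    """
    t = np.arange(n_samples) / fs
    frame = np.zeros(n_samples)
    for k, (a_k, phi_k) in enumerate(zip(amplitudes, phases), start=1):
        frame += a_k * np.cos(2.0 * np.pi * k * f0 * t + phi_k)
    return frame
```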

The quantization of the harmonic amplitudes places the greatest demand on the bit budget in sinusoidal coding, and methods for their efficient quantization have been an important topic of research. A widely used method for quantizing the harmonic amplitudes is based on modeling a spectral envelope fitted to the harmonic peaks (MacAulay and Quatieri, 1995). The spectral amplitudes are then reconstructed from samples of the modeled spectral envelope at the harmonic frequencies. Representing the spectral envelope by the coefficients of an all-pole filter enables the use of one of the many efficient quantization methods available in the speech coding literature. The order of the all-pole model has a significant effect on the accuracy of the modeled spectral amplitudes. While the all-pole representation of the spectral envelope is expected to capture local resonances accurately, capturing additional features such as overall spectral tilt and spectral zeros due to nasality typically leads to an increase in the number of poles required for an adequate approximation. Further, for a similar spectral envelope, low-pitched sounds require a higher LP model order to reach similar perceived quality levels (Champion et al., 1994, Rao and Patwardhan, 2005).
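As a hedged illustration of the decoder side of such a scheme (our own sketch, not the authors' implementation), the harmonic amplitudes can be recovered by sampling the magnitude response of the quantized all-pole envelope at the harmonic frequencies:

```python
import numpy as np

def harmonic_amplitudes_from_allpole(a, gain, f0, fs, n_harmonics):
    """Sample an all-pole envelope H(z) = gain / A(z) at harmonic frequencies.

    a : prediction polynomial coefficients [1, a_1, ..., a_p] of A(z)
    Returns |H(e^{j w_k})| for w_k = 2*pi*k*f0/fs, k = 1..n_harmonics.
    """
    a = np.asarray(a, dtype=float)
    k = np.arange(1, n_harmonics + 1)
    w = 2.0 * np.pi * k * f0 / fs                 # harmonic frequencies, rad/sample
    m = np.arange(len(a))
    A_w = np.exp(-1j * np.outer(w, m)) @ a        # A(e^{jw}) at each harmonic
    return gain / np.abs(A_w)
```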

In the interest of achieving low bit rates, however, it is necessary to keep the model order as low as possible. Frequency-scale warping before all-pole modeling of the spectral envelope is a widely used method to improve the perceptual accuracy of modeling for a given model order. Frequency-scale warping leads to a more accurate representation of the low-frequency spectrum at the cost of increased errors in the high-frequency region. Although perceptual scales such as the Bark scale and its variants have been widely used in LP modeling of speech spectra, our recent work (Rao and Patwardhan, 2005) on synthetically generated steady vowel sounds (using fixed excitation source parameters) indicated that the performance of frequency warping depended to a great extent on the nature of the underlying sound spectrum. It was observed that front vowels such as [ei] and [i] were in fact better modeled without Bark-scale warping. This was explained by the low-frequency first-formant structure of these vowels, which fails to mask high-frequency distortion adequately. In a subjective experiment using natural speech, it was found that the comparative behaviour of the modeling error under different warping conditions for the non-front vowels was inconsistent across speakers, indicating a further dependence on speaker voice quality as reflected in the overall spectral envelope.
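One common Hz-to-Bark approximation is sketched below (the exact warping function used in the paper may differ, e.g. a bilinear all-pass warping); it maps the linear frequency axis to a Bark-like axis rescaled back to [0, fs/2] for use before envelope modeling.

```python
import numpy as np

def hz_to_bark(f_hz):
    """Zwicker-style Hz-to-Bark mapping (one common approximation)."""
    f_hz = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

def bark_warp_axis(f_hz, fs):
    """Warp linear frequencies onto a Bark-like axis renormalized to [0, fs/2]."""
    return hz_to_bark(f_hz) / hz_to_bark(fs / 2.0) * (fs / 2.0)
```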

In the present study we attempt to extend our previous work by exploring aspects of voice quality that could influence the spectrum modeling error. The influence of voice quality on the perceived modeling error in frequency-warped all-pole modeling of the spectral envelope is studied within the framework of MBE model analysis–synthesis.

While the overall spectral envelope of voiced speech is determined by the glottal source waveform, the vocal tract transfer function and lip radiation, it is the glottal source waveform that most directly affects the relative strengths of the low- and high-frequency harmonics. The scope of the present study is restricted to variations in laryngeal voice quality, since these are closely linked to gross differences in the spectral envelope. An investigation based on subjective and objective experimental evaluation is presented. Finally, the applicability of the results to improving speech quality in a low bit rate sinusoidal coder is discussed.

Section snippets

Frequency-warped LP modeling of discrete spectra

A discrete spectrum, characterized by a fundamental frequency and the amplitudes of the components at harmonic frequencies, can be represented by the coefficients of an all-pole model. A smooth spectral envelope is first derived to fit the harmonic amplitudes using a suitable interpolation method such as linear interpolation of log amplitudes (Hermansky et al., 1985, Rao and Patwardhan, 2005). The power spectrum obtained from the interpolated envelope is used to compute the autocorrelation…
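A minimal Python sketch of this pipeline is given below. It is our own illustration under stated assumptions, not the authors' implementation: it assumes a monotone warping function normalized to [0, fs/2] (such as bark_warp_axis from the earlier sketch) and uses the standard autocorrelation method with the Levinson-Durbin recursion.

```python
import numpy as np

def warped_lp_from_harmonics(f0, amplitudes, fs, order=10, n_fft=512, warp=None):
    """All-pole model of a discrete (harmonic) spectrum, with optional warping.

    1. Linearly interpolate log amplitudes between harmonic frequencies.
    2. Optionally resample the envelope uniformly on a warped frequency axis.
    3. Take the inverse DFT of the power spectrum as the autocorrelation sequence.
    4. Run the Levinson-Durbin recursion for the LP coefficients and gain.
    """
    f_harm = np.arange(1, len(amplitudes) + 1) * f0
    f_grid = np.linspace(0.0, fs / 2.0, n_fft // 2 + 1)

    # 1. Smooth log-amplitude envelope (np.interp holds the end values
    #    constant outside the harmonic range).
    log_env = np.interp(f_grid, f_harm, np.log(amplitudes))

    # 2. Resample so that the warped axis is sampled uniformly: the envelope
    #    value at frequency w^{-1}(omega) lands at the uniform grid point omega.
    if warp is not None:
        log_env = np.interp(f_grid, warp(f_grid), log_env)

    # 3. Power spectrum -> autocorrelation lags r[0..order].
    power = np.exp(2.0 * log_env)
    r = np.fft.irfft(power, n=n_fft)[: order + 1]

    # 4. Levinson-Durbin recursion: a = [1, a_1, ..., a_p], gain^2 = final error.
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k_i = -acc / err
        a[1:i + 1] = a[1:i + 1] + k_i * a[i - 1::-1]
        err *= (1.0 - k_i ** 2)
    return a, np.sqrt(err)
```

For example, passing warp=lambda f: bark_warp_axis(f, fs) applies the Bark-like warping from the earlier sketch, while warp=None gives conventional (unwarped) LP modeling of the envelope.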

Voice quality and its spectral correlates

Voice quality refers to the auditory impression a listener forms upon hearing the speech of another talker. It is determined by the articulators of the vocal tract as well as by the characteristics of the vocal folds. For speech, the vocal folds are of predominant importance. We refer to the aspect of voice quality that results from differences in vocal fold vibratory patterns as laryngeal voice quality (Childers and Lee, 1991). Periodic glottal excitation is a characteristic of…
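As a simple illustration of one spectral correlate of laryngeal voice quality (a crude brightness measure of our own for illustration, not a measure defined in the paper), the overall spectral tilt can be estimated as the slope of a line fitted to the log harmonic amplitudes against frequency; a flatter (less negative) slope generally corresponds to a brighter, tenser voice.

```python
import numpy as np

def spectral_tilt_db_per_khz(f0, amplitudes):
    """Crude brightness correlate: slope (dB/kHz) of a least-squares line
    fitted to the harmonic amplitude levels versus frequency."""
    f_khz = np.arange(1, len(amplitudes) + 1) * f0 / 1000.0
    level_db = 20.0 * np.log10(np.asarray(amplitudes, dtype=float))
    slope, _intercept = np.polyfit(f_khz, level_db, deg=1)
    return slope
```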

Experimental evaluation

Experiments were designed to investigate the influence of voice quality on the modeling error arising from frequency-warped LP modeling of the spectral envelope as presented in Section 2. Vowels uttered in low-pitched modal voices of different brightness were obtained both by synthesis and by extraction from natural speech.

Results and discussion

Tables 3 and 4 summarise the results of the subjective and objective ranking for the experiment involving the natural vowels. Tables 5 and 6 give the corresponding results for the synthetic vowels.

The best rank (rank 1) implies the least perceived degradation among the three test sounds, while the worst rank (rank 3) implies the greatest. A higher score, and a lower PL value, implies lower perceived degradation. The last column contains Spearman's rank…
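For reference, rank agreement between a subjective ordering and the ordering induced by an objective measure can be quantified with Spearman's rank correlation coefficient. A minimal sketch with made-up rankings (not data from the paper) follows.

```python
from scipy.stats import spearmanr

# Hypothetical rankings (1 = least perceived degradation) of three warping
# conditions, concatenated over several test vowels; not data from the paper.
subjective_rank = [1, 2, 3, 1, 3, 2, 2, 1, 3]
objective_rank  = [1, 3, 2, 1, 2, 3, 2, 1, 3]

rho, p_value = spearmanr(subjective_rank, objective_rank)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```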

Conclusion

Frequency warping according to a perceptual scale is often applied to improve the perceptual accuracy of low-order LP modeling of the speech spectral envelope in sinusoidal speech coding. Understanding the factors that influence the subjective perception of spectral envelope modeling errors can be useful in improving the attained speech quality. Steady vowel sounds are particularly sensitive to spectral envelope modeling errors. Experimental investigations of the relative improvement in…

References (17)

  • P. Rao et al., Frequency warped modeling of vowel spectra: dependence on vowel quality, Speech Commun. (2005)
  • J. Burred et al., Hierarchical automatic audio signal classification, J. Audio Eng. Soc. (2004)
  • T. Champion, R. MacAulay, J. Quatieri, High-order all-pole modeling of the spectral envelope, In: Proceedings... (1994)
  • D. Childers, Speech Processing and Synthesis Toolboxes (2000)
  • D. Childers et al., Vocal quality factors: analysis, synthesis and perception, J. Acoust. Soc. Am. (1991)
  • B. Doval, C. d'Alessandro, Spectral correlates of glottal waveform models: an analytic study, In: Proceedings... (1997)
  • G. Feng et al., Some acoustic features of nasal and nasalized vowels: a target for vowel nasalization, J. Acoust. Soc. Am. (1996)
  • D. Griffin et al., Multiband excitation vocoder, IEEE Trans. Acoust. Speech Signal Process. (1988)


A portion of this work was presented at the International Conference on Spoken Language Technology-04, New Delhi.
