Abstract
Choosing a pitch estimation algorithm is not simple: one must trade off the accuracy of the estimates against their reliability. Two classes of methods are available. The first, the “block methods” class, yields noise-robust solutions and has an intrinsic averaging property, but is not very accurate, especially in transition regions. The second, the “instantaneous (or event-based) methods” class, yields very accurate estimates but is considered inadequate in the presence of noise.
In this paper, we present potential enhancements to pitch estimation performance for both block and instantaneous methods. We discuss mainly two algorithms: a nonlinear cepstral algorithm and a wavelet-based one. The first algorithm, through the proposed nonlinear model, improves on the classical linear model in both the accuracy of the estimated pitch in transition regions and the robustness in the presence of noise. The second algorithm retains the inherent accuracy of its pitch estimates and adds robustness in the presence of noise, based on the multiresolution properties of an improved wavelet transform. The enhancements were evaluated on a hand-labeled speech database, and the improved algorithms are now being applied in our research on speech compression and prosody.
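For orientation, the classical linear cepstral estimator that the nonlinear algorithm is compared against can be sketched as follows. This is a minimal illustration of the textbook real-cepstrum block method only, not the nonlinear model proposed in the paper. The function name `cepstral_pitch`, the Hamming window, the 60–400 Hz search range, and the zero-padding length are all assumptions made for the example.

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate f0 (Hz) of one voiced frame with the classical real cepstrum.

    Hypothetical helper for illustration; the search range and the window
    are assumed values, not taken from the paper.
    """
    frame = np.asarray(frame, dtype=float)
    # Block method: the whole frame is windowed and analysed at once.
    windowed = frame * np.hamming(len(frame))
    # Zero-pad to a power of two at least twice the frame length.
    n_fft = 1 << (2 * len(frame) - 1).bit_length()
    spectrum = np.fft.rfft(windowed, n_fft)
    # Real cepstrum: inverse transform of the log magnitude spectrum.
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    cepstrum = np.fft.irfft(log_mag, n_fft)
    # Search for the strongest peak in the plausible pitch-period range.
    q_lo = int(fs / fmax)
    q_hi = int(fs / fmin)
    peak = q_lo + int(np.argmax(cepstrum[q_lo:q_hi + 1]))
    return fs / peak
```

Because the estimate comes from a single peak over the whole frame, it is averaged over several pitch periods, which illustrates both the noise robustness and the limited accuracy in transition regions attributed to block methods above.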
Cite this article
Gavat, I., Zirra, M. & Sabac, B. Pitch Estimation by Block and Instantaneous Methods. International Journal of Speech Technology 5, 269–279 (2002). https://doi.org/10.1023/A:1020201125377