Elsevier

Applied Acoustics

Volume 64, Issue 1, January 2003, Pages 25-41
Second generation wavelet transform-based pitch period estimation and voiced/unvoiced decision for speech signals

https://doi.org/10.1016/S0003-682X(02)00055-5

Abstract

Pitch detection is an important part of speech recognition and speech processing. In this paper, a pitch detection algorithm based on the second generation wavelet transform is developed. The proposed algorithm reduces the computational load relative to algorithms based on the classical wavelet transform. It was tested on both real and synthetic speech signals. Experiments were also carried out under noisy conditions to evaluate the accuracy and robustness of the algorithm. The results showed that the algorithm was robust to noise and provided accurate estimates of the pitch period for both low-pitched and high-pitched speakers. Moreover, different wavelet filters obtained using the second generation wavelet transform were considered to examine their effect on the algorithm. The Haar filter performed well compared with the other wavelet filters.

Introduction

The pitch period is the reciprocal of the fundamental frequency of an audio waveform and is an essential component in various speech processing applications such as speech segregation, speaker identification and verification, diagnostics for the hearing impaired, and speech coding. Because pitch detection plays an important role in speech processing, a wide variety of pitch detection algorithms (PDAs) have been proposed in the speech processing literature. They can generally be classified as either event detection algorithms or non-event detection algorithms.

An event detection algorithm based on the autocorrelation function computes the pitch by finding the instants at which the glottis closes and then measuring the time interval between two such glottal closures. So far only a few event detection algorithms have been proposed. They are based on the calculation of the autocorrelation function, which displays a fairly prominent peak at the pitch period. The disadvantage of event detection based on the autocorrelation function is that it can estimate the pitch period exactly only in certain vowels. Its performance also degrades for non-stationary pitch periods.

Non-event pitch detectors estimate the pitch period by a direct approach over a segment of the speech signal that is obtained by using a window. They are therefore computationally simple compared with event detectors. Some non-event methods are: (1) the modified autocorrelation method using clipping (AUTOC) [1], (2) the cepstrum method [2], (3) the simplified inverse filtering technique [3], and (4) the average magnitude difference function (AMDF) [4]. However, non-event methods are insensitive to non-stationary variations in the pitch period during the measurement interval. They are also not suitable for a wide range of speakers.
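As an illustration of the window-based (non-event) approach described above, the following is a minimal sketch of a plain autocorrelation pitch estimator. It is not the clipped AUTOC method of [1] and is not taken from the paper; the function name, the search range of 60 to 400 Hz, the 8 kHz sampling rate and the two-harmonic test signal are all assumptions made for the example.

```python
import math

def autocorr_pitch_period(frame, fs, f_min=60.0, f_max=400.0):
    """Return the pitch period (seconds) of one frame from its autocorrelation peak.

    Hypothetical helper for illustration; f_min and f_max bound the lag search.
    """
    lag_min = int(fs / f_max)
    lag_max = min(int(fs / f_min), len(frame) - 1)
    best_lag, best_val = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Short-time autocorrelation of the windowed segment at this lag.
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if r > best_val:
            best_lag, best_val = lag, r
    return best_lag / fs

# Assumed test signal: 125 Hz fundamental plus one harmonic, sampled at 8 kHz,
# so that one pitch period is exactly 64 samples.
fs = 8000
f0 = 125.0
frame = [math.sin(2 * math.pi * f0 * n / fs)
         + 0.5 * math.sin(2 * math.pi * 2 * f0 * n / fs)
         for n in range(512)]

period = autocorr_pitch_period(frame, fs)
print(period, 1.0 / period)
```

Because a single period is returned for the whole 512-sample frame, any change of pitch inside the frame is averaged out, which is the insensitivity to non-stationary variations noted above.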

Several factors make it difficult to estimate the pitch period of a speech signal accurately. First, the formants of the vocal tract have a significant effect on the structure of the glottal waveform, which complicates the computation of the pitch period. Second, it is difficult to distinguish low-level voiced speech from unvoiced speech. Third, it is difficult to determine the beginning and end points of each pitch period during voiced speech segments.

PDAs based on the classical wavelet transform (CWT) in the literature [5], [6], [7] estimate the pitch period by determining the glottal closure instants (GCIs) and measuring the time between two such events, because when a GCI occurs in a speech waveform, maxima occur in the adjacent scales of the wavelet transform. However, construction of the CWT relies on the Fourier transform (FT) and requires cumbersome mathematical operations.

In this work, a novel approach is proposed for pitch period estimation. The proposed method is based on second-generation wavelets, which are constructed by the lifting scheme. The lifting scheme is a new method for constructing wavelets, and the basic idea behind it is very simple. It starts with a trivial wavelet, which does nothing but has the formal properties of a wavelet. The lifting scheme then gradually constructs a new wavelet with improved properties. Classical wavelets, also called first-generation wavelets, are translations and dilations of one fixed function, so the Fourier transform is a very important tool for constructing them. A construction with the lifting scheme, on the other hand, is entirely spatial and is therefore ideally suited for building second-generation wavelets when no Fourier transform is available. In addition, constructing a second-generation wavelet not only needs less computational time than the CWT but is also an easier process.
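The lifting idea can be sketched with the simplest case, an unnormalized Haar wavelet. The sketch below is illustrative only and is not the paper's implementation: the split, predict and update step names follow common lifting terminology, the function names are hypothetical, and an even-length input is assumed.

```python
def haar_lift_forward(x):
    """One level of an unnormalized Haar transform built by lifting."""
    # Split: the trivial ("lazy") wavelet just separates even and odd samples.
    even, odd = x[0::2], x[1::2]
    # Predict: each odd sample is predicted by its even neighbour;
    # the prediction error becomes the detail (wavelet) coefficient.
    detail = [o - e for e, o in zip(even, odd)]
    # Update: lift the even samples so that they hold the pairwise average.
    approx = [e + d / 2 for e, d in zip(even, detail)]
    return approx, detail

def haar_lift_inverse(approx, detail):
    """Undo the lifting steps in reverse order with the signs flipped."""
    even = [a - d / 2 for a, d in zip(approx, detail)]
    odd = [d + e for d, e in zip(detail, even)]
    out = []
    for e, o in zip(even, odd):
        out.extend([e, o])
    return out

signal = [2, 4, 6, 8, 5, 1, 0, 0]
approx, detail = haar_lift_forward(signal)
restored = haar_lift_inverse(approx, detail)
print(approx)    # [3.0, 7.0, 3.0, 0.0]
print(detail)    # [2, 2, -4, 0]
print(restored == signal)  # True
```

Every step works directly on neighbouring samples, so no Fourier transform is involved, and the inverse follows by running the same steps backwards.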

Performance analysis on real speech signals shows relatively accurate and reliable pitch detection. It also shows that the algorithm improves performance considerably compared with the method based on the CWT.

The rest of the paper is organized as follows. Section 2 reviews the classical wavelet transform. Section 3 describes the construction of the second-generation wavelet transform (SGWT). Section 4 discusses the pitch period determination. Section 5 gives the results and the relevant discussion, and Section 6 gives the conclusions.

Section snippets

Classical wavelet transform

Classical wavelets are functions generated from one single function ψ by dilations and translations [8], [9], [10]:

$$\psi_{a,b}(t) = a^{-1/2}\,\psi\!\left(\frac{t-b}{a}\right)$$

where b is real valued and called the shift parameter. The function set $\psi_{a,b}(t)$ is called a wavelet family. Since the parameters (a, b) are continuous valued, the transform is called continuous wavelet transform. The definition of classical wavelets as dilates of one function means that high frequency wavelets correspond to a<1 or narrow width, while low frequency

Construction of the second generation wavelet transform

Wavelets are a versatile tool for representing general functions. They can be considered as building blocks for data. Moreover, they capture the core of a data set with only a small set of coefficients. This relies on the correlation structure that exists between neighboring points, which is exploited to obtain a sparse representation of the signal. The correlation structure is typically local in time and frequency. The SGWT, which is constructed by the lifting scheme, is where wavelets are not necessarily translates and

Pitch period determination and voiced/unvoiced decision

The proposed pitch determination algorithm is based on the SGWT. Here, we use the assumption that when a GCI occurs in a speech waveform, maxima also occur in the adjacent scales of the wavelet transform. Although we were inspired by the studies of Kadambe and Obaidat [5], [6], [7], there are many differences in the construction of the wavelets. Unlike the pitch determination algorithms based on the CWT, the proposed algorithm utilizes a lifting scheme to construct the SGWT. The SGWT is
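The assumption that a GCI produces maxima in adjacent scales can be illustrated with a small sketch. It is not the algorithm of the paper, whose details are not given in this excerpt: it uses two levels of an unnormalized Haar lifting transform, a hypothetical threshold of half the largest coefficient at each scale, and a synthetic pulse train with an assumed period of 64 samples at 8 kHz (125 Hz). The period is deliberately a multiple of 4 so that every pulse falls on the same position of the dyadic grid, since a decimated transform is shift-variant.

```python
def haar_lift_forward(x):
    """One level of an unnormalized Haar lifting transform (split, predict, update)."""
    even, odd = x[0::2], x[1::2]
    detail = [o - e for e, o in zip(even, odd)]
    approx = [e + d / 2 for e, d in zip(even, detail)]
    return approx, detail

def strong_positions(detail, level, ratio=0.5):
    """Sample positions whose |detail| exceeds ratio times the largest |detail|."""
    peak = max(abs(d) for d in detail)
    step = 2 ** level  # one coefficient at this level spans `step` samples
    return [k * step for k, d in enumerate(detail) if abs(d) > ratio * peak]

def pitch_period_samples(x, levels=2):
    """Median spacing of events that show up in every one of the first scales."""
    approx, per_level = list(x), []
    for level in range(1, levels + 1):
        approx, detail = haar_lift_forward(approx)
        per_level.append(strong_positions(detail, level))
    tol = 2 ** levels
    # Keep a finest-scale candidate only if each coarser scale agrees nearby.
    events = [p for p in per_level[0]
              if all(any(abs(p - q) <= tol for q in other)
                     for other in per_level[1:])]
    gaps = sorted(b - a for a, b in zip(events, events[1:]))
    return gaps[len(gaps) // 2], events

# Assumed synthetic voiced signal: unit pulses every 64 samples, starting at
# sample 9, smoothed by a one-pole decay so each closure has a sharp onset.
fs, period_true, n_samples = 8000, 64, 512
x, prev = [], 0.0
for n in range(n_samples):
    pulse = 1.0 if n >= 9 and (n - 9) % period_true == 0 else 0.0
    prev = 0.9 * prev + pulse
    x.append(prev)

period, events = pitch_period_samples(x)
print(period, fs / period, events[:3])
```

With real speech the thresholds, the number of scales and the handling of misaligned pulses would all need tuning; the sketch only shows how agreement across adjacent scales can be turned into event times and then into a period.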

Experimental results

The proposed algorithm was tested on a wide variety of speech data, covering both synthetic and real speech signals. The real speech data consisted of two utterances of “should we chase those cowboys”, spoken by an American female and an American male. First, a synthetic voiced signal, which is shown in Fig. 5a, was produced with pitch frequency 120 Hz. In Fig. 5b–d, the SGWT coefficients of the synthetic speech were plotted at different scales. As shown in Fig. 5b–d, the SGWT gives local maxima across all

Conclusions

The PDA based on the SGWT was presented for a wide range of pitch periods. It was tested on both real and synthetic speech signals. As can be seen in all illustrations, the pitch periods of the synthetic and real speech signals were estimated accurately. Different wavelet filters such as Haar, Daubechies 4, Daubechies 6, (9–7), and cubic B-splines were considered to examine their effects on the proposed PDA. The algorithm showed better performance with the Haar filter. The proposed PDA is superior to


