1 Introduction

One of the most important issues in the field of speech processing is the estimation of the pitch in the presence of interferences added to the main signal, such as environmental noise, reverberation or another person speaking at the same time as the target speaker. During the last decades, various sound separation methods have been proposed, such as blind source separation (BSS), which extracts unknown speech signals from mixture signals with no a priori information about the mixtures or the sources [1,2,3,4,5,6]. We also mention spectral subtraction [7], subspace analysis [8], hidden Markov modeling [9] and sinusoidal modeling [10]. These approaches assume certain interference properties and segregate composite speech based on these hypotheses, which limits them in comparison with human sound separation.

Other approaches have been proposed, such as the CASA (computational auditory scene analysis) model [11,12,13,14,15], which is based on the principles of auditory scene analysis (ASA) [16] described by Bregman for the separation of composite signals. Bregman supposed that the brain organizes the data according to general properties of the signal (harmonicity, synchronization, etc.) by using automatic and innate mechanisms. This low-level, data-driven processing groups the signal components, thus providing auditory streams corresponding to external events.

A CASA system generally follows four steps: peripheral analysis, extraction of sound properties, segmentation and grouping. The peripheral processing decomposes the auditory scene into a two-dimensional time–frequency (TF) representation through a filter bank and time windowing. The second step extracts auditory properties corresponding to the principles of ASA, which are used later in the segmentation and grouping. In the segmentation and grouping, the system produces segments for the target voice and the concurrent voice (interference) and then groups the segments from the target into a stream. This stream corresponds to a sound source. Finally, the separated target signal is synthesized from this stream [13].

In the different methods of speech separation, the pitch is a major parameter. Indeed, several multi-pitch search algorithms have been proposed, for instance for the automatic transcription of polyphonic music [17] or for multi-pitch estimation in composite speech. The first systems typically had significant limitations and were only capable of detecting two harmonic sounds [18, 19]. Among the multi-pitch estimation methods, we find mainly those based on statistical approaches such as Bayesian modeling [20] or the hidden Markov model (HMM) [21]. These techniques have been validated only in the case of two speakers or a single speaker with noise.

In this work, we use comb filters in the frequency domain for the separation of composite speech signals. Comb filters have been used in [22, 23] on synthesized signals for the separation of polyphonic sounds. Their use on real signals is more complicated because of phenomena intrinsic to the production of voiced sounds, such as noise effects due to the vocal tract, the variability of the frequency of the glottal excitation (jitter) or the interaction of the pitch with the first formant.

Our principal goal is thus the estimation of the target signal in voiced zones.

This paper is organized as follows. Section 2 describes related works. Section 3 presents the principle of the proposed method for voiced sounds separation. The results and the comparison are given in Sect. 4. Finally, Sect. 5 concludes this work.

2 Related works

The multiple comb filters and autocorrelation of the multi-scale product approach for multi-pitch estimation presents a method “that estimates the fundamental frequency in the case of a real noisy environment when many persons speak at the same time and considers the case of two speakers” [24].

On the other hand, the evaluation of multi-pitch estimation by multi-scale product analysis describes “a multi-scale product comparative study for dominant pitch estimation in a mixture” [25].

Besides, multi-pitch estimation based on multi-scale product analysis, improved comb filter and dynamic programming describes a method based on “the autocorrelation function of the multi-scale product calculation of the mixture signal, its filtered version by a rectangular improved comb filter and the dynamic programming of the residual signal spectral density for multi-pitch detection” [26].

Finally, the comparison of several computational auditory scene analysis (CASA) techniques for monaural speech segregation presents and compares “several studies that have used computational auditory scene analysis (CASA) for speech separation and recognition” [27].

Nowadays, researchers are increasingly interested in building systems capable of separating signals. In our case, we focus on developing a method for separating voiced sounds from a composite signal, mainly based on a modified comb filter that is keyed to the average values of the estimated pitch. This estimation is performed through the autocorrelation of a multi-scale product analysis in order to separate the effects of the source and the vocal tract.

Our present work thus addresses the estimation of the target signal in voiced zones using the fundamental frequency of the dominant signal. We indeed reuse techniques presented in previous works, such as autocorrelation, multi-scale product analysis and comb filtering, but we have modified the structure of the filter and used these tools for the separation after the pitch detection.

3 Description of voiced sounds separation system

For the separation of composite speech, we propose an approach based on the “autocorrelation of multi-scale product (AMP)” and modified comb filtering (MCF). This filtering is applied only in the voiced regions. To design our comb filter, we need to determine the true value of the fundamental frequency f0 in the voiced regions of the composite signal.

The use of comb filters therefore depends on the pitch estimation of the dominant speaker. To this end, we use a method based on the “AMP analysis.” The proposed algorithm operates as shown in the block diagram given in Fig. 1.

Fig. 1 Flow diagram of separation approach

3.1 Pitch estimation by the “AMP”

To decompose the composite signal into frames, we use a Hamming window of 64-ms duration.

In order to obtain the “multi-scale product (MP)” signal, we use a continuous wavelet transform (WT) at three successive scales (\(2^{-1}\), \(2^{0}\) and \(2^{1}\)) and calculate the product of the coefficients.

In fact, the wavelet transform allows the characterization and the detection of signal singularities [24]. The wavelet used in this MP analysis is the quadratic spline function with a support of 0.8 ms. It is the first-order derivative of the scaling function, which closely approximates a Gaussian function.

The MP detects singularities present in the signal, filters the signals and generates simpler periodic structures. Then, we calculate the autocorrelation function, which allows the estimation of the pitch of the target speech signal [28].

The “AMP” for the time interval i is described as follows:

$${\text{AMP}}_{i} \left( k \right) = \mathop \sum \limits_{l = 0}^{N - 1} P_{i} \left( l \right)P_{i} \left( {l + k} \right)$$
(1)

where N is the length of the analyzed frame and Pi is the multi-scale product.

On the other hand, for a signal x, the MP at scales sj is given by:

$$P\left( n \right) = \mathop \prod \limits_{j = 1}^{l} Wx\left( {n,s_{j} } \right)$$
(2)

where Wx (n, sj) is the WT of x(n) at the scale \(s_{j} = 2^{j}\).
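As an illustration of Eqs. (1) and (2), the following Python sketch estimates the pitch of one frame. It is a minimal, non-authoritative rendering of the AMP analysis: the quadratic spline wavelet is replaced here by the first derivative of a Gaussian, which the text notes it closely approximates, and the kernel width and the 60–400 Hz pitch search range are assumptions.

```python
import numpy as np

FS = 16000                      # sampling frequency (Hz), as in the paper
FRAME = int(0.064 * FS)         # 64-ms analysis frame (1024 samples)

def dog_kernel(scale, fs=FS):
    """First derivative of a Gaussian, dilated by `scale` (width is an assumption)."""
    sigma = max(1.0, scale * 0.4e-3 * fs)
    t = np.arange(-4.0 * sigma, 4.0 * sigma + 1.0)
    g = np.exp(-t ** 2 / (2.0 * sigma ** 2))
    return -t * g / sigma ** 2

def multiscale_product(frame, scales=(0.5, 1.0, 2.0)):
    """Eq. (2): point-wise product of WT coefficients at scales 2^-1, 2^0, 2^1."""
    p = np.ones(len(frame))
    for s in scales:
        p *= np.convolve(frame, dog_kernel(s), mode="same")
    return p

def amp_pitch(frame, fmin=60.0, fmax=400.0, fs=FS):
    """Eq. (1): autocorrelation of the MP; the strongest admissible lag gives f0."""
    p = multiscale_product(frame * np.hamming(len(frame)))
    amp = np.correlate(p, p, mode="full")[len(p) - 1:]   # lags k >= 0
    lo, hi = int(fs / fmax), int(fs / fmin)              # admissible pitch lags
    k = lo + int(np.argmax(amp[lo:hi]))
    return fs / k
```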

3.2 Improved comb filter and speakers separation

Comb filters are based on adding a signal to an identical, slightly delayed copy of itself. These filters allow selected frequencies to pass while blocking all others.

For an input signal x[n], this process is described by the following equation:

$$y\left[ n \right] = x\left[ n \right] + g \cdot x\left[ {n - D} \right]$$
(3)

where g is a gain factor that scales the filter response between 1 + g and 1 − g, and D is the delay in samples.

There are two types of comb filters:

  • a comb filter with finite impulse response (FIR)

  • a comb filter with infinite impulse response (IIR)

These filters are described, respectively, by the following equations:

$$Y\left[ n \right] = X\left[ n \right] + \alpha X\left[ {n - k} \right]$$
(4)
$$Y\left[ n \right] = X\left[ n \right] + \alpha Y\left[ {n - k} \right]$$
(5)

where k represents the delay associated with the period T and α is a fixed parameter applied to the delayed signal.

The transfer functions associated with these filters are, respectively:

$$H\left( z \right) = 1 + \alpha z^{ - k}$$
(6)
$$H\left( z \right) = \frac{1}{{1 - \alpha z^{ - k} }}$$
(7)

Comb filters are used in numerous applications such as music analysis, speech enhancement, speech separation and multi-pitch estimation. In this context, Sambur [29] proposed a least mean square (LMS) adaptive filtering approach for removing the deleterious effects of additive noise on the speech signal. Moreover, Darlington and Campbell [30] developed a multi-channel and sub-band adaptive system for the enhancement of speech signals corrupted by background noise. We also mention the Iliev and Kasabov technique [31], based on adaptive filtering with averaging (AFA) and used for echo cancelation, and the Christensen and Jakobsson method [32], which introduced a number of new filter designs constructed for periodic signals and giving full parameterizations of these signals.

In this work, a rectangular comb filter is used to obtain the dominant signal. We apply the comb filter in the frequency domain with a rectangular frequency response because its passbands must properly select the fundamental frequency and its harmonics.

This filter is tuned by the estimated f0 and has an adaptable amplitude that varies according to the maximum amplitude of the mixture signal at each frame.

The signal resulting from this filtering represents the first speaker, or dominant signal. To obtain the second speaker, or intrusion, we subtract the signal estimated by the designed comb filter from the mixture signal.
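A minimal sketch of this frequency-domain step is given below. It is an illustration under stated assumptions rather than the exact filter of Fig. 2: the passband half-width (20 Hz) is an assumed value, the per-frame amplitude adaptation is omitted, and the function name is hypothetical.

```python
import numpy as np

def rectangular_comb_separate(mix_frame, f0, fs=16000, half_bw=20.0):
    """Pass narrow bands around f0 and its harmonics; subtract for the intrusion."""
    n = len(mix_frame)
    spec = np.fft.rfft(mix_frame)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    mask = np.zeros_like(freqs)
    for h in np.arange(f0, fs / 2.0, f0):          # f0 and all its harmonics
        mask[np.abs(freqs - h) <= half_bw] = 1.0   # rectangular passband
    dominant = np.fft.irfft(spec * mask, n)        # first speaker estimate
    return dominant, mix_frame - dominant          # intrusion by subtraction
```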

Figure 2 shows the frequency response of the designed comb filter for a fundamental frequency f0 = 160 Hz and a sampling frequency Fs = 16 kHz.

Fig. 2 Frequency response of the rectangular comb filter

4 Experiments and comparison

4.1 Experiments

Separating composite sounds consists in keeping only the harmonics of the target signal and rejecting the intermediate frequency components. For the separation of our composite signal, the proposed approach uses a comb filter with a rectangular bandwidth. The filter amplifies all multiples of the frequency f0 = 1/T0 within the width of each band, where f0 is given by the “AMP analysis.”

This approach has been evaluated using the Cooke database [33], which contains 100 mixture sounds obtained by mixing ten male voiced speech signals with ten other signals representing a variety of sounds. Interferences can be classified into three categories:

  • Interferences without pitch formed by white and impulse noises.

  • Interferences having a pitch quality formed by pure frequency of 1 kHz, cocktail party noise, rock music, siren and ring tone.

  • Speech interferences, which are a speech signal uttered by woman 1, a speech signal uttered by man 2 and a speech signal uttered by woman 2.

The text pronounced by male target signals is “Why are you all weary.” On the other hand, the text pronounced by the speech interference is “Don’t ask me to carry an oily rag like that.”

The signals constituting the Cooke database are monophonic and sampled at 16 kHz.

The mixture signal is decomposed into frames using a Hamming window of 64-ms duration.
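Tying the earlier sketches together, the following hedged loop frames the mixture, estimates the dominant pitch with amp_pitch and separates each frame with rectangular_comb_separate (both defined in the sketches above); the non-overlapping framing and the absence of a voiced/unvoiced decision are simplifying assumptions.

```python
import numpy as np

def separate_mixture(mix, fs=16000, frame_len=1024):   # 1024 samples = 64 ms
    """Frame-by-frame separation of a mixture into dominant and intrusion parts."""
    target = np.zeros_like(mix)
    intrusion = np.zeros_like(mix)
    for start in range(0, len(mix) - frame_len + 1, frame_len):
        frame = mix[start:start + frame_len]
        f0 = amp_pitch(frame, fs=fs)                   # dominant pitch via AMP
        dom, intr = rectangular_comb_separate(frame, f0, fs=fs)
        target[start:start + frame_len] = dom
        intrusion[start:start + frame_len] = intr
    return target, intrusion
```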

Figures 3, 4 and 5 present the mixture, the reference target and the estimated target by our approach, respectively, with a noise from the first and the second categories of interferences (white noise, pure frequency of 1 kHz and cocktail party noise).

Fig. 3 Separated signals from composite speech from the Cooke database (male voice and white noise)

Fig. 4 Separated signals from composite speech from the Cooke database (male voice and pure frequency of 1 kHz)

Fig. 5 Separated signals from composite speech from the Cooke database (male voice and cocktail party noise)

For this type of mixture signal, we thus obtain good results: the estimated signals compare well with the reference signals.

Figures 6, 7 and 8 show the composite signal followed by the reference and estimated signals of the first and second speakers obtained from the separation operation using the rectangular comb filter.

Fig. 6 Mixture and both reference and estimated target and intrusion frames of male voices (from the Cooke database)

Fig. 7 Mixture and both reference and estimated target and intrusion frames of male and female voices (from the Cooke database)

Fig. 8 Mixture and both reference and estimated target and intrusion frames of male and female voices (from the Cooke database)

Figure 6 describes the case where the mixture contains two male voices. This is the most difficult case in the separation task because the two speakers are of the same gender and have very similar characteristics, such as the fundamental frequency. This explains why the estimated intrusion is not very similar to the reference one.

We note that the estimated signal is approximately similar to the reference signal for both speakers composing the mixture signal in the three presented cases, especially for the mixture formed by male and female voices.

On the other hand, the proposed approach has also been evaluated on the VCTK database.

This database includes speech data uttered by 109 native speakers of English with various accents. Each speaker reads out about 400 sentences, most of which were selected from a newspaper plus the Rainbow Passage and an elicitation paragraph [34].

We have only used ten speakers from this database: two mixtures formed by two males, two mixtures formed by two females and six mixtures formed by one male and one female [35].

Figures 9, 10 and 11 present the mixture, the reference target, the estimated target by our approach and the estimated target by Quan et al.’s system, respectively, in the case of a mixture formed by two males, in the case of a mixture formed by two females and in the case of a mixture formed by one male and one female.

Fig. 9 Mixture, reference and both estimated target frames of two male voices (from the VCTK database)

Fig. 10 Mixture, reference and both estimated target frames of two female voices (from the VCTK database)

Fig. 11 Mixture, reference and both estimated target frames of one male and one female voice (from the VCTK database)

From these figures, we can conclude that the target signal estimated by our proposed approach is approximately similar to the reference target signal in all three presented cases.

These results show the robustness of our proposed method.

4.2 Comparison and discussion

Our approach performed well for the estimation of both speakers. It is compared to the Wang and Brown [36], Hu and Wang [37] and Zhang and Liu [38] systems for speech separation.

The Wang–Brown model [36] is based on an oscillatory correlation framework supported by recent neurobiological findings, in which f0 information is available in the pooled correlogram. The correlogram is formed by computing a running autocorrelation of the auditory nerve activity in each filter channel; these correlograms are computed at 10-ms intervals, forming a three-dimensional volume in which time, channel center frequency and autocorrelation lag are represented on orthogonal axes. Additionally, a “pooled” correlogram is formed at each time frame by summing the periodicity information in the correlogram over frequency. The largest peak in the pooled function occurs at the period of the dominant fundamental frequency (f0).
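To make the pooled correlogram idea concrete, here is a minimal, non-authoritative sketch (not the Wang–Brown implementation itself): it assumes `channels` holds the filterbank outputs for one frame (a gammatone front end in the original model) and a 60–400 Hz pitch search range.

```python
import numpy as np

def pooled_correlogram_f0(channels, fs=16000, fmin=60.0, fmax=400.0):
    """Sum per-channel autocorrelations over frequency; the peak lag gives f0."""
    n = channels.shape[1]
    acfs = np.array([np.correlate(c, c, mode="full")[n - 1:] for c in channels])
    pooled = acfs.sum(axis=0)                  # pool periodicity over channels
    lo, hi = int(fs / fmax), int(fs / fmin)    # admissible pitch lags
    k = lo + int(np.argmax(pooled[lo:hi]))
    return fs / k
```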

On the other hand, Hu and Wang [37] proposed a tandem algorithm that performs the pitch estimation of the target signal and the segregation of voiced portions of target speech jointly and iteratively. They used the ideal binary mask (IBM) at the voiced frames of the target utterance to estimate a pitch period at each frame.
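As an aside, the ideal binary mask is conventionally defined per time–frequency unit: a unit is labeled 1 when the target energy exceeds the interference energy by a local criterion, and 0 otherwise. A minimal sketch of this standard definition (not of the full tandem algorithm) follows, assuming magnitude spectrograms of the premixed signals.

```python
import numpy as np

def ideal_binary_mask(target_spec, interference_spec, lc_db=0.0):
    """IBM: 1 where the target dominates the interference by lc_db dB, else 0."""
    ratio = 20.0 * np.log10(
        (np.abs(target_spec) + 1e-12) / (np.abs(interference_spec) + 1e-12))
    return (ratio > lc_db).astype(float)
```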

In addition, the Zhang–Liu approach [38] concerns monaural voiced speech separation and is based on pitch estimation and a comb filter. The authors decompose the composite speech signal into time–frequency units, which are used to determine the pitch contour according to their zero-crossing rates. A comb filter is then used to label each unit as target speech or as intrusion.

Finally, we mention Quan et al.'s approach [39], which separates the voice of a target speaker from the composite signal by making use of a reference signal from the target speaker. This is achieved by training two separate neural networks:

  • A speaker recognition network that produces speaker-discriminative embeddings;

  • A spectrogram-masking network that takes both noisy spectrogram and speaker embedding as input and produces a mask.

We have compared our SNR results from the VCTK database with this method.

Comparing the signals (Fig. 12a–c), we conclude that the proposed method is as efficient as those using the CASA system.

Fig. 12 Reference target signal followed by the estimated target, respectively, by a our proposed approach, b the Wang–Brown system and c the Hu–Wang system

Figure 12 shows a composite signal (male and female voices) followed by the reference target signal and the signals estimated using our method and those of Wang–Brown and Hu–Wang. The separated signals are approximately similar to the reference one, showing the performance of these methods.

For all the studied mixture signals formed by male and female voices, the calculation of the signal-to-noise ratio (SNR) gives the following results for the four mentioned approaches.

Table 1 shows that the Zhang–Liu and Hu–Wang systems are the best for the target separation. However, our contribution has produced considerable results using only simple techniques.

Table 1 SNR results for speech separation
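As a point of reference, separation SNRs of this kind are commonly computed as the ratio of the reference target energy to the energy of the estimation error. The following sketch shows that common definition; it is an assumption that the cited systems use exactly this measure.

```python
import numpy as np

def snr_db(reference, estimate):
    """Separation SNR in dB: reference energy over estimation-error energy."""
    error = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))
```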

In Table 2, we present the computing times of our method, the Hu–Wang system and the Zhang–Liu system.

Table 2 Computing time

From Table 2, we note that our proposed approach has the best computing time. In fact, the computing times of the Hu–Wang and Zhang–Liu models are much higher than ours.

For the studied mixture signals formed by male and female voices from the VCTK database, the calculation of the signal-to-noise ratio (SNR) gives the following results (Table 3).

Table 3 SNR results

Quan et al.'s system [39] is the best for the target separation since it uses a complex and more efficient method based on the training of two separate neural networks. However, our contribution for the target separation has produced significant results by using only simple techniques (MP analysis and comb filtering).

5 Conclusion

In this work, we presented a composite sound separation algorithm based on a designed comb filter tuned by the fundamental frequency f0. The f0 is determined by analyzing “the autocorrelation of multi-scale product (AMP).” The composite signal is formed by mixing two speech signals, either from two male speakers or from a male and a female speaker. The “AMP analysis” method estimates peaks used to measure the mean value of the dominant speaker's pitch. This approach allows the separation of all voiced sounds in the mixture.

As prospects, we want to improve this approach and to separate mixture signals composed of more than two speakers. In addition, we would like to separate unvoiced composite signals.