1 Introduction

Speaker discrimination (by voice) is an important field of biometrics, since the voice remains the only biometric modality usable at a distance (e.g., over the telephone). This particularity has given speaker discrimination great importance, especially in secure applications that require very high accuracy. Speaker discrimination consists in checking whether two pronunciations (speech segments) are uttered by the same speaker or by two different speakers. One way to compare the utterances is to extract the vocal characteristics of each segment and measure the degree of similarity between them.

Speaker discrimination has applications in several domains, such as speaker verification, biometrics, multimedia segmentation and speaker-based clustering.

Different approaches have been developed for this purpose; two of them are investigated in this paper, namely a neural network and a second-order statistical measure, and we also propose two further approaches based on the association of these two classifiers.

These approaches are evaluated on a subset of Broadcast News (1996) [1], and our results show that the fusion of the two classifiers is clearly beneficial.

2 Some Techniques Related to Speaker Discrimination and Parameterization

Several techniques have been developed for the task of speaker discrimination, such as GMMs (Gaussian Mixture Models) [2], NNs (Neural Networks) [3], statistical measures [4] and HMMs (Hidden Markov Models) [5]. In our research work, we have approached the discrimination problem with four methods: an MLP (Multi-Layer Perceptron), a statistical measure, a hybrid method and a fusion based on the sum of weighted scores. These methods are described below.

For the parameterization, we used 37 MFSC coefficients (Mel Frequency Spectral Coefficients) obtained from the energies computed on the mel spectral scale [6, 7]. This dimension was chosen after a thorough investigation of the optimal spectral resolution [8, 9].
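As an illustration of this front-end, mel filter-bank log-energies can be obtained as sketched below; the 37 filters come from this section and the 35 ms analysis window from Sect. 2.2, while the librosa library, the sampling rate, the FFT size and the hop length are assumptions, not the paper's exact tooling.

```python
# Sketch of an MFSC front-end (mel filter-bank log-energies). librosa,
# sr, n_fft and hop are assumptions; n_mels=37 and the 35 ms window
# follow the paper.
import numpy as np
import librosa

def mfsc(signal, sr=16000, n_mels=37, win_ms=35):
    win = int(sr * win_ms / 1000)               # 35 ms analysis window
    mel = librosa.feature.melspectrogram(
        y=signal, sr=sr, n_fft=1024, win_length=win,
        hop_length=win // 2, n_mels=n_mels)
    return np.log(mel + 1e-10).T                # one MFSC vector per frame
```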

2.1 Statistical Method

One of the reference methods for the task of speaker discrimination is the statistical measure of similarity (µGc), which is based on the covariance matrix. This measure is used to determine the degree of similarity (with regard to the speakers' features) between the different speech segments.

We recall below the most important properties of this approach [10, 11].

Let \( \left\{ {x_{t} } \right\}_{1 \le t \le M} \) be a sequence of M vectors resulting from the P-dimensional acoustic analysis of a speech signal uttered by speaker x. These vectors are summarized by the mean vector \( \bar{x} \) and the covariance matrix X:

$$ \bar{x} = \frac{1}{M}\sum\limits_{t = 1}^{M} x_{t} $$
(1)

and

$$ X = \frac{1}{M}\sum\limits_{t = 1}^{M} (x_{t} - \bar{x})(x_{t} - \bar{x})^{T} $$
(2)
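As a brief illustration, Eqs. (1) and (2) correspond to a column-wise mean and a biased (1/M) covariance estimate; the NumPy sketch below is illustrative only and not the authors' code.

```python
# Mean vector and covariance matrix of Eqs. (1)-(2); each row of feats
# is one P-dimensional acoustic vector x_t.
import numpy as np

def segment_stats(feats):
    x_bar = feats.mean(axis=0)                    # Eq. (1)
    X = np.cov(feats, rowvar=False, bias=True)    # Eq. (2): 1/M scaling
    return x_bar, X
```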

Similarly, for a speech signal uttered by speaker y, a sequence of N vectors \( \left\{ y_{t} \right\}_{1 \le t \le N} \) can be extracted.

By assuming that all acoustic vectors extracted from the speech signal uttered by speaker x follow a Gaussian distribution, the likelihood of a single vector \( y_{t} \) uttered by speaker y under the model of speaker x is:

$$ G(y_{t} \mid \mathbf{x}) = \frac{1}{(2\pi )^{P/2}\,(\det X)^{1/2}}\; e^{-\frac{1}{2}(y_{t} - \bar{x})^{T} X^{-1} (y_{t} - \bar{x})} $$
(3)

If we assume that all vectors \( y_{t} \) are independent observations, the average log-likelihood of \( \left\{ y_{t} \right\}_{1 \le t \le N} \) can be written as

$$ \bar{L}_{x}(y_{1}^{N}) = \frac{1}{N}\log G(y_{1} \cdots y_{N} \mid \mathbf{x}) = \frac{1}{N}\sum\limits_{t = 1}^{N} \log G(y_{t} \mid \mathbf{x}) $$
(4)
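For completeness, Eqs. (3) and (4) are best evaluated in the log domain to avoid numerical underflow; the vectorized sketch below is an implementation assumption, not the paper's code.

```python
# Average log-likelihood of the test frames under speaker x's Gaussian
# model, per Eqs. (3)-(4); Y_frames has shape (N, P).
import numpy as np

def avg_log_likelihood(Y_frames, x_bar, X):
    P = X.shape[0]
    X_inv = np.linalg.inv(X)
    _, logdet_X = np.linalg.slogdet(X)                # stable log(det X)
    d = Y_frames - x_bar                              # centered frames
    quad = np.einsum('np,pq,nq->n', d, X_inv, d)      # Mahalanobis terms
    log_G = -0.5 * (P * np.log(2 * np.pi) + logdet_X + quad)  # Eq. (3)
    return log_G.mean()                               # Eq. (4)
```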

We also define the minus-log-likelihood \( \mu (\mathbf{x}, y_{t}) \), which acts as a similarity measure between vector \( y_{t} \) (uttered by y) and the model of speaker x, so that

$$ \mathop{\mathrm{Arg\,max}}\limits_{x}\; G(y_{t} \mid \mathbf{x}) = \mathop{\mathrm{Arg\,min}}\limits_{x}\; \mu (\mathbf{x}, y_{t}) $$
(5)

We have then:

$$ \mu (\mathbf{x}, y_{t}) = -\log G(y_{t} \mid \mathbf{x}) $$
(6)

The similarity measure between the test utterance \( \left\{ y_{t} \right\}_{1 \le t \le N} \) of speaker y and the model of speaker x is then

$$ \mu (\mathbf{x}, \mathbf{y}) = \mu (\mathbf{x}, y_{1}^{N}) = \frac{1}{N}\sum\limits_{t = 1}^{N} \mu (\mathbf{x}, y_{t}) = -\bar{L}_{x}(y_{1}^{N}) $$
(7)

After simplifications, we obtain

$$ \mu (\mathbf{x}, \mathbf{y}) = \frac{1}{P}\left[ \log\frac{\det X}{\det Y} + \mathrm{tr}(Y X^{-1}) + (\bar{y} - \bar{x})^{T} X^{-1} (\bar{y} - \bar{x}) \right] - 1 $$
(8)

This measure is equivalent to the standard Gaussian likelihood measure (asymmetric µG) defined in [8].

A variant of this measure, called µGc, is deduced from the previous one by assuming that \( \bar{y} = \bar{x} \) (the inter-speaker variability of the mean vector is negligible).

Thus, the new formula becomes:

$$ \mu_{Gc}(\mathbf{x}, \mathbf{y}) = \frac{1}{P}\left[ \log\frac{\det X}{\det Y} + \mathrm{tr}(Y X^{-1}) \right] - 1 $$
(9)
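As an illustration, Eq. (9) can be computed directly from the two covariance matrices. The sketch below is a minimal NumPy implementation, assuming full-rank covariances; it is not the authors' original code.

```python
# Minimal NumPy sketch of the muGc measure of Eq. (9); X and Y are the
# P x P covariance matrices of the reference and test segments.
import numpy as np

def mu_gc(X, Y):
    P = X.shape[0]
    _, logdet_X = np.linalg.slogdet(X)      # numerically stable log(det)
    _, logdet_Y = np.linalg.slogdet(Y)
    trace_term = np.trace(Y @ np.linalg.inv(X))
    return (logdet_X - logdet_Y + trace_term) / P - 1.0
```

Note that \( \mu_{Gc}(\mathbf{x}, \mathbf{x}) = 0 \), since \( \mathrm{tr}(XX^{-1}) = P \); larger values indicate greater dissimilarity between the two segments.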

2.2 Neural Approach for Speaker Discrimination

Given the well-known discriminative capacity of NNs (neural networks) [12], we opted for an MLP (Multi-Layer Perceptron) with one or two hidden layers and a single output. Experiments are done on audio signals of three or four seconds each, extracted from Hub-4 Broadcast News.

The goal of this neural network [13] is to discriminate between the different speakers through their speech signals. For this purpose, an input vector derived from the MFSC coefficients is used.

The NN must have a number of input cells equal to the dimension of the example vector [7]. Thus, when using a vector of N MFSC coefficients [6, 7], the number of input cells is 2·N (corresponding to the two utterances under comparison).

The training is performed with the back-propagation algorithm, and the NN output then gives an indication of the similarity between the two utterances:

  • If the NN output = 0, then it is the same speaker,

  • If the NN output = 1, then the speakers are different.

If the two segments (utterances) exhibit similar characteristics (characterizing the speaker), we can conclude that they belong to the same speaker; otherwise, they belong to two different speakers.

Concerning the acoustic-spectral analysis of the signal, each segment is divided into 35 ms windows (which ensures quasi-stationarity), on each of which a spectral analysis is performed, yielding one series of MFSC vectors per segment [6, 7].

This vector set then goes through a statistical stage that extracts the diagonal elements of the covariance matrix of each segment. Thereafter, a feature reduction is applied using the RSC, or Relative Speaker Characteristic (see Sect. 2.3). These elements are injected directly into the input of the NN, which decides whether the two segments belong to the same speaker or not (see Fig. 1 and the sketch that follows it).

Fig. 1. Comparison between two utterances and discrimination decision.
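To make the pipeline of Fig. 1 concrete, a possible sketch is given below: the covariance diagonals of both segments are concatenated into one 2·N input vector and fed to an MLP with a single output. scikit-learn, the hidden-layer size and the training settings are assumptions, not the paper's exact configuration.

```python
# Illustrative discriminator for segment pairs; scikit-learn and the
# hyper-parameters are assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier

def pair_features(mfsc_a, mfsc_b):
    """Concatenate the covariance diagonals of two segments (2 x 37 inputs)."""
    d_a = np.var(mfsc_a, axis=0)       # diagonal of Eq. (2) for segment a
    d_b = np.var(mfsc_b, axis=0)
    return np.concatenate([d_a, d_b])

# Single output unit: 0 -> same speaker, 1 -> different speakers.
mlp = MLPClassifier(hidden_layer_sizes=(32,), activation='logistic',
                    max_iter=500)
# mlp.fit(pairs, labels), where each row of pairs is pair_features(a, b).
```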

2.3 Hybrid Method

Since NNs have proved to possess excellent discriminative properties, we chose to mix the statistical measure with the neural inputs in order to improve the NN performance: this is the hybrid method.

Thus, a new input is added to the NN, into which we inject, for each pair of segments, the discrimination result given by the statistical measure along with the corresponding segment features; the NN is then trained with this augmented input, as shown in Fig. 2 below.

Fig. 2. The hybrid method.

The hybrid method is summarized as follows: first, the features are extracted from the two segments; then the statistical measure µGc is computed and injected into the NN together with the reduced features, called RSC (Relative Speaker Characteristic) [14]. The training is thus enhanced by the information brought by the statistical approach.
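As an illustration, building the hybrid input amounts to appending the µGc score of Eq. (9) to the reduced feature vector; the helper below reuses the mu_gc sketch from Sect. 2.1 and is hypothetical.

```python
# Hypothetical sketch of the hybrid input: the muGc score of Eq. (9) is
# appended to the reduced (RSC) features as one extra NN input cell.
import numpy as np

def hybrid_features(rsc_features, X_cov, Y_cov):
    score = mu_gc(X_cov, Y_cov)             # statistical evidence
    return np.append(rsc_features, score)   # augmented MLP input
```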

2.4 Fusion

In order to enhance discrimination performance, several classifiers are usually combined to obtain better precision: this combination is called fusion. Fusion in the broad sense can be performed at different hierarchical levels or processing stages. A very commonly encountered taxonomy of data fusion is given by the following three-level hierarchy [15, 16]:

  (a) Feature level, where the feature sets of the different modalities are combined. Fusion at this level provides the highest flexibility, but classification problems may arise due to the large dimension of the combined (concatenated) feature vectors.

  (b) Score (matching) level, which is the most common level at which fusion takes place. The scores of the classifiers are usually normalized and then combined in a consistent manner.

  (c) Decision level, where the outputs of the classifiers establish the decision via techniques such as majority voting. Fusion at the decision level is considered rigid for information integration.

In our case, we chose the fusion at the score level.

If the individual scores are denoted by \( S_{j} \), then the fusion score \( S_{f} \) is given by:

$$ S_{f} = \sum\limits_{j = 1}^{N} C_{j} S_{j} $$
(10)

where \( C_{j} \) represents the weighting coefficient (confidence) of classifier j and N denotes the number of classifiers.

With

$$ \sum\limits_{j} C_{j} = 1 $$
(11)

and \( C_{j} \in [0.1,\, 0.9] \).

The coefficient \( C_{j} \) represents the relevance of classifier j.
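A minimal sketch of the weighted-sum fusion of Eqs. (10) and (11) is given below; it assumes the individual scores have already been normalized to a common range, as noted for level (b) above.

```python
# Score-level fusion of Eq. (10): Sf = sum_j C_j * S_j, with the
# constraints of Eq. (11) on the confidence weights C_j.
import numpy as np

def fuse_scores(scores, weights):
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0)                 # Eq. (11)
    assert np.all((weights >= 0.1) & (weights <= 0.9))    # C_j in [0.1, 0.9]
    return float(weights @ scores)

# e.g. fuse_scores([s_nn, s_mugc], [0.6, 0.4])  # weights are illustrative
```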

3 Results and Discussion

The audio database used in our experiments is an extract of Broadcast News (“CNN Early Edition”), whose SNR is rather low (presence of music, telephone calls, noise, etc.) and whose training subset is different from the testing one.

In order to evaluate the different techniques described above, several speaker discrimination experiments were carried out on this database; each experiment concerns one particular method, and the corresponding results are presented in Figs. 3 and 4.

Fig. 3. Speaker discrimination: hybrid method.

Fig. 4. Speaker discrimination: fusion NN/µGc.

The figures show the ROC curves of the two basic classifiers, the NN and the statistical measure. We notice that the NN gives an EER (Equal Error Rate) of 9.25 %, while the EER given by the statistical measure is 11.75 %. The NN performs better than the statistical method in the middle area of the ROC curve, whereas towards its borders the statistical measure performs better.
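For reference, the EER is the operating point of the ROC curve where the false-acceptance and false-rejection rates coincide; a common way to estimate it is sketched below with scikit-learn, which is an assumption about tooling rather than the paper's exact procedure.

```python
# Sketch of EER estimation from pair labels (1 = different speakers) and
# discrimination scores; scikit-learn is an assumption.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    fpr, tpr, _ = roc_curve(labels, scores)
    idx = np.nanargmin(np.abs(fpr - (1.0 - tpr)))   # FAR == FRR point
    return (fpr[idx] + (1.0 - tpr[idx])) / 2.0
```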

On the other hand, the EER is about 9.95 % when we use the hybrid method (Fig. 3), which means an improvement of 1.8 points with respect to the statistical measure and a degradation of 0.7 points with respect to the NN.

Results of the fusion of the two classifiers, NN and µGc, are shown in Fig. 4, where we can notice that the fusion gives a better EER than either method alone. The fusion EER is only 7.88 % (Table 1), which shows that the fusion is worthwhile. The overall results are summarized in Table 1.

Table 1. Equal Error Rates for the different methods.

  Method                          EER (%)
  Statistical measure (µGc)       11.75
  Neural network (MLP)             9.25
  Hybrid method (MLP-µGc)          9.95
  Fusion NN/µGc (score level)      7.88

4 Conclusion

Speaker discrimination consists in checking whether two different speech segments are uttered by the same speaker or by two different speakers. Several techniques have been developed to deal with this problem. In this paper, we investigated four methods for the task of discrimination, namely: an MLP-based method, a statistical-measure-based method (µGc), a hybrid method (MLP-µGc) and a score-level fusion method. All these methods were evaluated on a subset extracted from the Hub-4 Broadcast News database, and the scores obtained by each method are presented as ROC curves.

The results allow us to compare these four methods according to their corresponding EERs.

On the one hand, we notice that the NN EER is better than the µGc one, which confirms once again the high discriminative capacity of neural networks [12]. On the other hand, the hybrid method resulting from the mixture of the NN and the statistical method has an intermediate EER of 9.95 %. The fourth method tested here is the fusion technique carried out with the two basic classifiers; it combines the scores obtained by each method using specific confidence weighting coefficients. This fusion has markedly improved the precision of speaker discrimination, with an EER of 7.88 % (the best score obtained).

Overall, this research work has shown the difficulties encountered in speaker discrimination, the high discriminative properties of NNs and the relevance of the fusion technique. In future work, we hope to extend our experiments to other fusion techniques for the same task.