Digital Signal Processing

Volume 88, May 2019, Pages 66-79

Quality measures for speaker verification with short utterances

https://doi.org/10.1016/j.dsp.2019.01.023

Highlights

  • This work proposes new speech quality measures for quality-based score fusion in automatic speaker verification (ASV).

  • The proposed quality measures are formulated from the zero-order Baum–Welch statistics.

  • We fuse two popular ASV systems: Gaussian mixture model with universal background model (UBM) and i-vector.

  • The experiments are conducted on NIST SRE 2008 and 2010 for various duration conditions.

  • The proposed quality-metric-based ASV methods yield improved recognition performance.

Abstract

The performance of automatic speaker verification (ASV) systems degrades as the amount of speech available for enrollment and verification is reduced. Combining multiple systems based on different features and classifiers considerably reduces the speaker verification error rate with short utterances. This work attempts to incorporate supplementary information during the system combination process. We use the quality of the estimated model parameters as supplementary information. We introduce a class of novel quality measures formulated from the zero-order sufficient statistics used during the i-vector extraction process. We use the proposed quality measures as side information for combining ASV systems based on the Gaussian mixture model–universal background model (GMM–UBM) and the i-vector. The proposed methods demonstrate considerable improvement in speaker recognition performance on NIST SRE corpora, especially in short duration conditions. We also observe improvements over existing systems based on different duration-based quality measures.

Introduction

The automatic speaker verification (ASV) technology uses the characteristics of the human voice for the detection of individuals [1], [2]. The technology provides a low-cost biometric solution suitable for real-world applications such as banking [3], finance [4], and forensics [5]. Similar to other traditional pattern recognition applications, an ASV system includes three fundamental modules [1], [6]: an acoustic feature extraction unit that extracts relevant information from the speech signal in a compact manner, a modeling block to represent those features, and a scoring and decision scheme to distinguish between genuine speakers and impostors. The state-of-the-art ASV system uses i-vector technology, which represents a speech utterance with a single vector of fixed length using either the Gaussian mixture model–universal background model (GMM–UBM) [7] or deep neural network (DNN) technology [8]. More recently, DNN-based embeddings have been used for speaker recognition [9]. First, a DNN is trained in a supervised manner to classify different speakers with known labels. Then, the trained DNN is employed to find a fixed-dimensional representation, known as an x-vector [9], corresponding to a variable-length speech utterance.
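To make the embedding idea concrete, here is a minimal PyTorch sketch of the two-step recipe just described: a network is trained to classify speakers, and its pooled hidden representation is then reused as a fixed-dimensional utterance embedding. The architecture, layer sizes, and pooling are simplified illustrative assumptions, not the exact x-vector configuration of [9].

```python
# Toy DNN speaker embedding, x-vector style: train to classify speakers,
# then keep the pooled fixed-length representation as the embedding.
import torch
import torch.nn as nn

class ToySpeakerNet(nn.Module):
    def __init__(self, feat_dim=19, emb_dim=128, n_speakers=100):
        super().__init__()
        self.frame_layers = nn.Sequential(            # frame-level processing
            nn.Conv1d(feat_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.embedding = nn.Linear(2 * 256, emb_dim)  # after stats pooling
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, x):                             # x: (batch, feat_dim, frames)
        h = self.frame_layers(x)
        # Statistics pooling: concatenate mean and std over the time axis,
        # mapping a variable-length utterance to a fixed-length vector.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        emb = self.embedding(stats)                   # fixed-dimensional embedding
        return self.classifier(emb), emb

net = ToySpeakerNet()
utt = torch.randn(1, 19, 300)          # e.g., 300 frames of 19-dim features
logits, embedding = net(utt)           # after training, keep `embedding`
```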

Despite these recent technological advancements, mismatch issues remain a major concern for real-world applications [10]. The performance of an ASV system considerably degrades in the presence of mismatch due to intra-speaker variability caused by variations in speech duration [10], [11], background noise [12], vocal effort [13], spoken language [14], emotion [15], channel [16], room reverberation [17], etc. In this paper, we focus on one of the most important mismatch factors: speech duration, i.e., the amount of speech data used in enrollment and verification.

State-of-the-art ASV systems exhibit satisfactory performance with adequately long (2 minutes) speech data. However, a reduction in the amount of speech drastically degrades ASV performance [10], [12], [18], [19], [20]. The requirement of sufficiently long speech for training or testing, especially in the presence of large intersession variability, has limited the potential for widespread real-world deployment. An ASV system in the real world is naturally constrained in the amount of speech data. Though this requirement can be fulfilled during training in some special cases, it is not always possible to do the same in verification without sacrificing end-user convenience. In forensic applications, it is unlikely that sufficient data are available even for enrollment [10], [19]. Therefore, obtaining reliable performance with short duration speech is one of the most important requirements in ASV applications.

The performance of ASV systems degrades notably with the reduction of the amount of speech due to the lack of information available in the short utterance condition [19], [18], [21], [22]. In [7], it is reported that i-vector based ASV systems are less sensitive to limited duration utterances than support vector machine (SVM) and joint factor analysis (JFA) based systems. Still, the performance deteriorates considerably with limited duration utterances, as reported in [20], [18]. In [23], the duration variability problem is handled by extracting the duration pattern from automatic speech recognition prior to the modeling and scoring process. In [24], the short duration problem is approached by demonstrating the potential of fusing GMM–UBM and SVM based systems using logistic regression. The work in [25] attempted to model the duration variability as noise and also as a synthetic process. The work in [26] attempted to model the variability caused by short duration segments in the i-vector domain. In [27], [28], an i-vector based ASV system is calibrated for short duration using duration-based quality measures. The work in [29] attempted to improve short utterance speaker recognition by modeling speech unit classes.

The latest DNN-based speaker embedding approaches have shown promising results for speaker recognition with short utterances [9], [30]. Another recent work demonstrates that DNN-based i-vector mapping is useful for speaker recognition with short utterances [31]. Even though the DNN-based methods give good recognition accuracy, they require a massive amount of training data, careful selection of the network architecture, and related parameter tuning. In the current work, we aim to improve speaker recognition performance by efficiently combining two popular ASV systems based on the GMM–UBM and the i-vector representation, which require fewer tuning parameters and less training data than the DNN-based methods. Moreover, the GMM–UBM and i-vector methods are suitable for platforms with limited computational resources.

Research dealing with the effect of duration in speaker recognition has concentrated mostly on the consequences for classification performance, expressed in terms of equal error rate (EER) and minimum detection cost function (DCF), assuming that the speaker model parameters are estimated satisfactorily. However, the speaker models themselves are affected by duration variability in short duration conditions. The idea of a quality metric has been successfully applied in biometric authentication systems [32], [33]. Quality metrics have been employed to improve the efficiency of multi-modal biometric systems [34], [35], [36]. The work in [37] was motivated by the need to test claims that quality measures are predictive of matching performance; it evaluated this by quantifying the association between estimated quality metric values and observed matching results.

Quality metrics have also been successfully used in speech-based biometric systems [38], [39]. The work in [39] studied a frame-level quality measure, obtaining encouraging results. The work in [38] presented a conventional user-independent multilevel SVM-based score fusion, adapted for the inclusion of quality information in the fusion process. The work in [40] focused on quality measure based system fusion, with emphasis on noisy and short duration test conditions using the NIST 2012 database. Commonly used ASV systems, such as i-vector and GMM–UBM, do not include information about the quality of the estimated speaker models or about duration variability. The work documented in [41] analyzed several quality measures for speaker verification from the point of view of their utility in an authentication task, selecting quality measures derived from classic indicators such as the ITU P.563 estimator of subjective quality, signal-to-noise ratio, and the kurtosis of linear predictive coefficients. Moreover, [41] proposed a novel quality measure derived from what the authors called the universal background model likelihood (UBML). The work in [42] analyzed the factors that negatively impact biometric quality and also reviewed the overall framework for the challenges of biometric quality.

The work in [28] used the duration of speech segments to formulate quality metrics and subsequently utilized them for the calibration of recognition scores. However, duration-based quality metrics may not improve performance where the duration is fixed for enrollment, verification, or both. These duration-based quality measures also ignore the quality of the speaker-model estimation. The quality of the speaker-model parameters depends not only on duration and noise but also on the phonetic distribution, the intelligibility of the speech, etc. To develop a solution targeting the basic building blocks of an ASV system, we attempt to incorporate information about the duration variability that degrades the quality of the speaker models. The concept of quality may be defined as the degree of goodness of an element [39], [38], which, in our case, is the speaker model. We treat the Baum–Welch (BW) statistics not only as the source of speaker information but also as a source of information about the quality of the estimated speaker models.

The BW statistics, which represent the speech features in the intermediate step of the i-vector extraction process, are affected by duration variability. Consequently, the variability propagates into the subsequent representation, i.e., the i-vector. We hypothesize that the BW statistics can help to estimate the quality of the speaker-model parameters. We demonstrate through graphical analysis that the utterance duration is associated with dissimilarity measures between the intermediate statistics and the background model parameters. We propose to formulate this quality measure from the BW statistics and the universal background model (UBM) parameters.
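As a concrete illustration, the following Python sketch computes the zero-order BW statistics of an utterance against a UBM, together with their normalized version. The scikit-learn `GaussianMixture` UBM and the `features` matrix are assumed inputs; this is an illustrative sketch, not the paper's exact implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def zero_order_stats(ubm: GaussianMixture, features: np.ndarray) -> np.ndarray:
    """N_c = sum over frames of the posterior probability of UBM component c.

    `features` is a (num_frames x feat_dim) matrix of acoustic features.
    """
    posteriors = ubm.predict_proba(features)   # (num_frames, num_components)
    return posteriors.sum(axis=0)              # (num_components,)

def normalized_stats(ubm: GaussianMixture, features: np.ndarray) -> np.ndarray:
    """Normalized zero-order statistics: a probability vector that, for long
    and phonetically balanced speech, tends toward the UBM weights."""
    N = zero_order_stats(ubm, features)
    return N / N.sum()
```

For a long utterance, the normalized counts approach the UBM weights `ubm.weights_`; the gap between the two vectors is what the proposed quality measures quantify.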

The proposed quality measures can be incorporated as additional information in the ASV pipeline to improve system performance. Quality measures can be incorporated at four possible stages of an ASV system: feature extraction, speaker-model training, score computation, and fusion of scores [38]. Using quality measures in the score fusion stage is the most straightforward option and has been successfully applied in speech-, fingerprint-, and face-based multimodal person authentication systems [43], [44]. In this paper, we incorporate the proposed quality measures in the score fusion stage to improve the performance of the speaker recognition system in various duration conditions. In short duration conditions, the linear score fusion strategy showed efficient performance with GMM and SVM based classification frameworks [24]. However, the i-vector based system (with GPLDA-based channel compensation) was reported to perform more efficiently than JFA and GMM–SVM based frameworks in short utterance conditions [7]. Here, we present a comparative performance study of i-vector and GMM–UBM systems on NIST corpora (Fig. 1). We observe that although the i-vector system performs better than GMM–UBM for long duration speech, the GMM–UBM system shows comparable or even better performance in short duration conditions [45], [46]. This observation inspires us to fuse the i-vector and GMM–UBM systems to develop a more accurate and reliable solution for practical ASV applications. We incorporate the estimated quality measures while blending the GMM–UBM and i-vector based ASV systems. Incorporating the quality measures yields not only considerable improvement in performance but also consistency across various duration conditions. The proposed systems show larger relative improvement in short duration conditions, which is the more relevant case in practice. A preliminary version of this work was presented in [45]. In this work, we conduct extensive analysis and experiments.
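A minimal sketch of quality-aided linear score fusion is given below. The functional form and weights are illustrative assumptions; in practice the fusion parameters would be learned on a development set (e.g., with logistic regression), and the paper's exact fusion rule may differ.

```python
import numpy as np

def quality_aided_fusion(s_gmm, s_ivec, q, w=(0.5, 0.5, 0.0, 0.0), bias=0.0):
    """Linear score fusion with a quality measure q as side information.

    The quality term modulates the contribution of each subsystem, so the
    fused score can lean on whichever system is more reliable for the
    estimated utterance quality. The weights are placeholders to be trained.
    """
    s_gmm, s_ivec, q = map(np.asarray, (s_gmm, s_ivec, q))
    return (w[0] * s_gmm + w[1] * s_ivec
            + w[2] * q * s_gmm + w[3] * q * s_ivec + bias)
```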

The rest of the paper is organized as follows. The theoretical background of the classical GMM–UBM and i-vector GPLDA systems is discussed in Section 2. An analysis of the intermediate subsystems under different duration variability conditions is presented in Section 3. Section 4 describes the proposed quality measures and the quality-aided fusion based system. Details of the experimental setup are provided in Section 5. The performance of the baseline GMM–UBM and i-vector GPLDA based ASV systems is compared, and results of the proposed quality-aided fusion system are reported, in Section 6. Finally, conclusions are drawn in Section 7.

Section snippets

Automatic speaker recognition system

Speaker recognition based on the Gaussian mixture model emerged as the most widely used fundamental approach with the introduction of the universal background model [47]. Subsequently, GMM supervector based SVM [48] and JFA [49] were introduced in ASV technology. Recent state-of-the-art speaker recognition concentrates on a compact representation of GMM supervectors, known as i-vectors [7]. This work considers an ASV system based on subspace modeling of i-vectors using PLDA [50]. This section
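As a reference point for the classical system, the sketch below shows GMM–UBM verification scoring as the average frame-level log-likelihood ratio between a speaker model and the UBM. MAP adaptation of the speaker model is omitted, and the use of scikit-learn is an assumption for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_ubm_score(speaker_gmm: GaussianMixture,
                  ubm: GaussianMixture,
                  test_features: np.ndarray) -> float:
    """Average per-frame log-likelihood ratio:
    (1/T) * sum_t [log p(x_t | speaker) - log p(x_t | UBM)]."""
    llr = (speaker_gmm.score_samples(test_features)
           - ubm.score_samples(test_features))
    return float(llr.mean())
```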

Analysis on BW statistics extraction procedure

Previous studies dealing with duration variability concentrated on the final performance metrics measured in terms of EER, DCF, etc. [18]. Some studies focused on the variability in the i-vector space [26]. In this work, however, we present a study of how duration variability affects the intermediate steps of an ASV system. The BW statistics represent the total information from the speech and are transformed into i-vectors for decision making. Since, in most modern ASV systems, BW statistics is an

Quality measures for speech segments

The observations in the previous section illustrate that the variability in the BW statistics is in some way associated with the duration of speech. The work in [51] also attempted to model the sparsity and variability to compensate for the degraded performance of ASV in short duration conditions. The work in [52] introduced an uncertainty measure computed from the i-vector posterior parameters to compensate for the duration variability effect. In our current work, we propose to apply dissimilarity metrics between the normalized BW statistics (NBS, Ñ_i) and
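The snippet below sketches two plausible dissimilarity metrics between the normalized zero-order statistics Ñ_i and the UBM weights w. The KL divergence and Euclidean distance shown are illustrative members of such a family of measures, not necessarily the paper's exact formulations.

```python
import numpy as np

def kl_quality(N_tilde: np.ndarray, w: np.ndarray, eps: float = 1e-12) -> float:
    """KL divergence of the normalized zero-order stats from the UBM weights."""
    return float(np.sum(N_tilde * np.log((N_tilde + eps) / (w + eps))))

def euclidean_quality(N_tilde: np.ndarray, w: np.ndarray) -> float:
    """Euclidean distance between the normalized stats and the UBM weights."""
    return float(np.linalg.norm(N_tilde - w))
```

Both measures shrink as the utterance gets longer and the normalized counts approach the UBM weights, which is exactly the duration-dependent behavior the proposed quality measures exploit.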

Experimental setup

Both the GMM–UBM and i-vector based systems use the same mel-frequency cepstral coefficients (MFCCs) [61] as front-end acoustic features. We extract MFCCs using a frame size of 20 ms and a frame shift of 10 ms, as in [62]. A Hamming window is used in the MFCC extraction process [63]. Non-speech frames are dropped using an energy-based speech activity detector (SAD) [64]. Finally, we perform cepstral mean and variance normalization (CMVN) to remove the convolutive channel effect [62]. 19 dimensional MFCC
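The front-end configuration described above can be approximated with off-the-shelf tools. The sketch below uses librosa, whose filterbank design and SAD differ in detail from the exact setup of [62], [64]; the 8 kHz sample rate, the energy threshold, and dropping c0 to obtain 19 coefficients are all illustrative assumptions.

```python
import numpy as np
import librosa

def extract_mfcc(wav_path: str, sr: int = 8000) -> np.ndarray:
    """Approximate front-end: 19-dim MFCCs, 20 ms frames, 10 ms shift,
    Hamming window, crude energy-based SAD, then per-utterance CMVN."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=20,
        n_fft=int(0.020 * sr), hop_length=int(0.010 * sr),
        window="hamming",
    )[1:]                                   # drop c0 (one common convention)
    mfcc = mfcc.T                           # (num_frames, 19)
    # Crude energy-based SAD: keep frames above a fraction of peak energy.
    frame_energy = librosa.feature.rms(
        y=y, frame_length=int(0.020 * sr),
        hop_length=int(0.010 * sr)).flatten()
    keep = frame_energy > 0.05 * frame_energy.max()
    mfcc = mfcc[keep[:len(mfcc)]]
    # Cepstral mean and variance normalization over the utterance.
    return (mfcc - mfcc.mean(axis=0)) / (mfcc.std(axis=0) + 1e-12)
```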

Baseline performance

Initially, we investigate the performance of the state-of-the-art i-vector and classical GMM–UBM based ASV systems under various duration conditions. The experiments are executed on the male subset of the NIST 2008 short2–short3 corpus. The performance is compared across eleven different duration conditions separately.

The results of the experiments are reported in Table 4. Fig. 6 exhibits a systematic comparative study. Table 4 depicts the relative performance improvement of the i-vector based

Conclusion and future scopes

In this work, we have introduced new quality measures for improving speaker recognition performance under short duration conditions. We derive the quality measures using the Baum–Welch sufficient statistics that are used for computing the i-vector representation. We demonstrate that the dissimilarity between the normalized zero-order Baum–Welch statistics and the weights of the universal background model (UBM) is associated with the speech duration. We formulate the quality measures based on the

Conflict of interest statement

The authors declare that they have no conflict of interest.

Acknowledgements

This work is partially supported by the Indian Space Research Organisation (ISRO), Government of India. The work of Md Sahidullah is supported by Region Grand Est, France. The authors would like to express their sincere thanks to the anonymous reviewers and the editors for their comments and suggestions, which greatly improved the quality and content of the work. We further thank Dr. Tomi Kinnunen (University of Eastern Finland) for his valuable comments on the earlier version of this work. Finally,

References (65)

  • M. Sahidullah et al., Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition, Speech Commun. (2012)

  • J.P. Campbell et al., Forensic speaker recognition, IEEE Signal Process. Mag. (2009)

  • ICICI bank introduces voice recognition for biometric authentication

  • Death of the password?

  • Interpol's new software will recognize criminals by their voices

  • J.P. Campbell, Speaker recognition: a tutorial, Proc. IEEE (1997)

  • N. Dehak et al., Front-end factor analysis for speaker verification, IEEE Trans. Audio Speech Lang. Process. (2011)

  • P. Matějka et al., Analysis of DNN approaches to speaker identification

  • D. Snyder et al., X-vectors: robust DNN embeddings for speaker recognition

  • A. Poddar et al., Speaker verification with short utterances: a review of challenges, trends and opportunities, IET Biometrics (2018)

  • Y.A. Solewicz et al., Estimated intra-speaker variability boundaries in forensic speaker recognition casework

  • J. Ming et al., Robust speaker recognition in noisy conditions, IEEE Trans. Audio Speech Lang. Process. (2007)

  • R. Saeidi et al., Feature extraction using power-law adjusted linear prediction with application to speaker recognition under severe vocal effort mismatch, IEEE/ACM Trans. Audio Speech Lang. Process. (2016)

  • M. McLaren et al., Exploring the role of phonetic bottleneck features for speaker and language recognition

  • S. Parthasarathy et al., A study of speaker verification performance with expressive speech

  • D. Wang et al., A robust DBN-vector based speaker verification system under channel mismatch conditions

  • V. Vestman et al., Time-varying autoregressions for speaker verification in reverberant conditions

  • A. Kanagasundaram et al., I-vector based speaker recognition on short utterances

  • M.I. Mandasari et al., Evaluation of i-vector speaker recognition systems for forensic application

  • A. Kanagasundaram et al., PLDA based speaker recognition on short utterances

  • A.K. Sarkar et al., Study of the effect of i-vector modeling on short and mismatch utterance duration for speaker verification

  • B.G. Fauve et al., Influence of task duration in text-independent speaker verification


Arnab Poddar received his MS (by research) degree in the area of speech processing and machine learning from the Department of Electronics & Electrical Communication Engineering, Indian Institute of Technology Kharagpur in 2018. He has worked on the research project entitled Reduction of False Acceptance and Rejection in Non-cooperative Automatic Speaker Recognition System, funded by the Indian Space Research Organisation (ISRO). Prior to that, he worked as a research project person on the project Development of Optical Character Recognition System for Printed Indian Languages in the Computer Vision and Pattern Recognition (CVPR) Unit, Indian Statistical Institute (ISI). He is currently pursuing a Ph.D. at the Indian Institute of Technology Kharagpur in the area of machine learning and computer vision. His research interests include speech & audio signal processing, image processing, and machine learning.

Md Sahidullah received his Ph.D. degree in the area of speech processing from the Department of Electronics & Electrical Communication Engineering, Indian Institute of Technology Kharagpur in 2015. Prior to that, he obtained the Bachelor of Engineering degree in Electronics and Communication Engineering from Vidyasagar University in 2004 and the Master of Engineering degree in Computer Science and Engineering (with specialization in Embedded Systems) from West Bengal University of Technology in 2006. In 2007–2008, he was with Cognizant Technology Solutions India Pvt. Ltd. In 2014–2017, he was a postdoctoral researcher with the School of Computing, University of Eastern Finland. In January 2018, he joined the MULTISPEECH team, Inria, France, as a postdoctoral researcher, where he currently holds a starting research position. His research interests include robust speaker recognition, voice activity detection, and spoofing countermeasures. He is also a co-organizer of two Automatic Speaker Verification Spoofing and Countermeasures Challenges: ASVspoof 2017 and ASVspoof 2019.

Goutam Saha received his B.Tech. and Ph.D. degrees from the Department of Electronics & Electrical Communication Engineering, Indian Institute of Technology (IIT) Kharagpur, India, in 1990 and 2000, respectively. In between, he served in industry for about four years and obtained a five-year fellowship from the Council of Scientific & Industrial Research, India. In 2002, he joined IIT Kharagpur as a faculty member, where he is currently serving as a Professor. His research interests include the analysis of audio and bio-signals.
