Digital Signal Processing

Volume 88, May 2019, Pages 66-79

Quality measures for speaker verification with short utterances

https://doi.org/10.1016/j.dsp.2019.01.023

Highlights

  • This work proposes new speech quality measures for quality-based score fusion in automatic speaker verification (ASV).

  • The proposed quality measures are formulated from the zero-order Baum–Welch statistics.

  • We fuse two popular ASV systems: Gaussian mixture model with universal background model (UBM) and i-vector.

  • The experiments are conducted on NIST SRE 2008 and 2010 for various duration conditions.

  • The proposed quality-metric-based ASV methods yield improved recognition performance.

Abstract

The performance of automatic speaker verification (ASV) systems degrades as the amount of speech available for enrollment and verification is reduced. Combining multiple systems based on different features and classifiers considerably reduces the speaker verification error rate with short utterances. This work attempts to incorporate supplementary information during the system combination process. We use the quality of the estimated model parameters as supplementary information. We introduce a class of novel quality measures formulated from the zero-order sufficient statistics used during the i-vector extraction process. We use the proposed quality measures as side information for combining ASV systems based on the Gaussian mixture model–universal background model (GMM–UBM) and the i-vector. The proposed methods demonstrate considerable improvement in speaker recognition performance on NIST SRE corpora, especially in short duration conditions. We also observe improvements over existing systems based on different duration-based quality measures.

Introduction

The automatic speaker verification (ASV) technology uses the characteristics of the human voice for the detection of individuals [1], [2]. The technology provides a low-cost biometric solution suitable for real-world applications such as banking [3], finance [4], and forensics [5]. Similar to other traditional pattern recognition applications, an ASV system includes three fundamental modules [1], [6]: an acoustic feature extraction unit that extracts relevant information from the speech signal in a compact manner, a modeling block to represent those features, and a scoring and decision scheme to distinguish between genuine speakers and impostors. The state-of-the-art ASV system uses i-vector technology, which represents a speech utterance with a single vector of fixed length using either the Gaussian mixture model–universal background model (GMM–UBM) [7] or deep neural network (DNN) technology [8]. More recently, DNN-based embeddings have been used for speaker recognition [9]. First, a DNN is trained in a supervised manner to classify different speakers with known labels. Then, the trained DNN is employed to find a fixed-dimensional representation, known as an x-vector [9], corresponding to a variable-length speech utterance.
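To make the embedding idea concrete, here is a minimal PyTorch sketch of the two-step recipe just described: a network is trained to classify speakers, and its pooled hidden representation is then reused as a fixed-dimensional utterance embedding. The architecture, layer sizes, and pooling are simplified illustrative assumptions, not the exact x-vector configuration of [9].

```python
# Toy DNN speaker embedding, x-vector style: train to classify speakers,
# then keep the pooled fixed-length representation as the embedding.
import torch
import torch.nn as nn

class ToySpeakerNet(nn.Module):
    def __init__(self, feat_dim=19, emb_dim=128, n_speakers=100):
        super().__init__()
        self.frame_layers = nn.Sequential(            # frame-level processing
            nn.Conv1d(feat_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.embedding = nn.Linear(2 * 256, emb_dim)  # after stats pooling
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, x):                             # x: (batch, feat_dim, frames)
        h = self.frame_layers(x)
        # Statistics pooling: concatenate mean and std over the time axis,
        # mapping a variable-length utterance to a fixed-length vector.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        emb = self.embedding(stats)                   # fixed-dimensional embedding
        return self.classifier(emb), emb

net = ToySpeakerNet()
utt = torch.randn(1, 19, 300)          # e.g., 300 frames of 19-dim features
logits, embedding = net(utt)           # after training, keep `embedding`
```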

Despite these recent technological advancements, mismatch issues remain a major concern for real-world applications [10]. The performance of an ASV system considerably degrades in the presence of mismatch due to intra-speaker variability caused by variations in speech duration [10], [11], background noise [12], vocal effort [13], spoken language [14], emotion [15], channel [16], room reverberation [17], etc. In this paper, we focus on one of the most important mismatch factors: speech duration, i.e., the amount of speech data used in enrollment and verification.

State-of-the-art ASV systems exhibit satisfactory performance with adequately long (2 minutes) speech data. However, a reduction in the amount of speech drastically degrades ASV performance [10], [12], [18], [19], [20]. The requirement of sufficiently long speech for training or testing, especially in the presence of large intersession variability, has limited the potential for widespread real-world deployment. An ASV system in the real world is naturally constrained in the amount of speech data. Though this requirement can be fulfilled during training in some special cases, it is not always possible to do the same in verification without sacrificing end-user convenience. In forensic applications, it is unlikely that sufficient data are available even for enrollment [10], [19]. Therefore, obtaining reliable performance with short duration speech is one of the most important requirements in ASV applications.

The performance of ASV systems degrades notably with the reduction of the amount of speech due to the lack of information available in the short utterance condition [19], [18], [21], [22]. In [7], it is reported that i-vector based ASV systems are less sensitive to limited duration utterances than support vector machine (SVM) and joint factor analysis (JFA) based systems. Still, the performance deteriorates considerably with limited duration utterances, as reported in [20], [18]. In [23], the duration variability problem is handled by extracting the duration pattern from automatic speech recognition prior to the modeling and scoring process. In [24], the short duration problem is approached by demonstrating the potential of fusing GMM–UBM and SVM based systems using logistic regression. The work in [25] attempted to model the duration variability as noise and also as a synthetic process. The work in [26] attempted to model the variability caused by short duration segments in the i-vector domain. In [27], [28], an i-vector based ASV system is calibrated for short duration using duration-based quality measures. The work in [29] attempted to improve short utterance speaker recognition by modeling speech unit classes.

The latest DNN-based speaker embedding approaches have shown promising results for speaker recognition with short utterances [9], [30]. Another recent work demonstrates that DNN-based i-vector mapping is useful for speaker recognition with short utterances [31]. Even though the DNN-based methods give good recognition accuracy, they require a massive amount of training data, careful selection of the network architecture, and related parameter tuning. In the current work, we aim to improve speaker recognition performance by efficiently combining two popular ASV systems based on the GMM–UBM and the i-vector representation, which require fewer tuning parameters and less training data than the DNN-based methods. Moreover, the GMM–UBM and i-vector methods are suitable for platforms with limited computational resources.

Research dealing with the effect of duration in speaker recognition has concentrated mostly on the consequences for classification performance, expressed in terms of equal error rate (EER) and minimum detection cost function (DCF), assuming that the speaker model parameters are estimated satisfactorily. However, the speaker models themselves are affected by duration variability in short duration conditions. The idea of a quality metric has been successfully applied in biometric authentication systems [32], [33]. Quality metrics have been employed to improve the efficiency of multi-modal biometric systems [34], [35], [36]. The work in [37] was motivated by the need to test claims that quality measures are predictive of matching performance; it evaluated this by quantifying the association between estimated quality metric values and observed matching results.

Quality metrics have also been successfully used in speech-based biometric systems [38], [39]. The work in [39] studied a frame-level quality measure, obtaining encouraging results. The work in [38] presented a conventional user-independent multilevel SVM-based score fusion, adapted for the inclusion of quality information in the fusion process. The work in [40] focused on quality measure based system fusion, with emphasis on noisy and short duration test conditions using the NIST 2012 database. Commonly used ASV systems, such as i-vector and GMM–UBM, do not include information about the quality of the estimated speaker models or about duration variability. The work documented in [41] analyzed several quality measures for speaker verification from the point of view of their utility in an authentication task, selecting quality measures derived from classic indicators such as the ITU P.563 estimator of subjective quality, signal-to-noise ratio, and the kurtosis of linear predictive coefficients. Moreover, [41] proposed a novel quality measure derived from what the authors called the universal background model likelihood (UBML). The work in [42] analyzed the factors that negatively impact biometric quality and also reviewed the overall framework for the challenges of biometric quality.

The work in [28] used the duration of speech segments to formulate quality metrics and subsequently utilized them for the calibration of recognition scores. However, duration-based quality metrics may not improve performance where the duration is fixed for enrollment, verification, or both. These duration-based quality measures also ignore the quality of the speaker-model estimation. The quality of the speaker-model parameters depends not only on duration and noise but also on the phonetic distribution, the intelligibility of the speech, etc. To develop a solution targeting the basic building blocks of an ASV system, we attempt to incorporate information about the duration variability that degrades the quality of the speaker models. The concept of quality may be defined as the degree of goodness of an element [39], [38], which, in our case, is the speaker model. We treat the Baum–Welch (BW) statistics not only as the source of speaker information but also as a source of information about the quality of the estimated speaker models.

The BW statistics, which represent the speech features in the intermediate step of the i-vector extraction process, are affected by duration variability. Consequently, the variability propagates into the subsequent representation, i.e., the i-vector. We hypothesize that the BW statistics can help to estimate the quality of the speaker-model parameters. We demonstrate through graphical analysis that the utterance duration is associated with dissimilarity measures between the intermediate statistics and the background model parameters. We propose to formulate this quality measure from the BW statistics and the universal background model (UBM) parameters.
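As a concrete illustration, the following Python sketch computes the zero-order BW statistics of an utterance against a UBM, together with their normalized version. The scikit-learn `GaussianMixture` UBM and the `features` matrix are assumed inputs; this is an illustrative sketch, not the paper's exact implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def zero_order_stats(ubm: GaussianMixture, features: np.ndarray) -> np.ndarray:
    """N_c = sum over frames of the posterior probability of UBM component c.

    `features` is a (num_frames x feat_dim) matrix of acoustic features.
    """
    posteriors = ubm.predict_proba(features)   # (num_frames, num_components)
    return posteriors.sum(axis=0)              # (num_components,)

def normalized_stats(ubm: GaussianMixture, features: np.ndarray) -> np.ndarray:
    """Normalized zero-order statistics: a probability vector that, for long
    and phonetically balanced speech, tends toward the UBM weights."""
    N = zero_order_stats(ubm, features)
    return N / N.sum()
```

For a long utterance, the normalized counts approach the UBM weights `ubm.weights_`; the gap between the two vectors is what the proposed quality measures quantify.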

The proposed quality measures can be incorporated as additional information in the ASV pipeline to improve system performance. Quality measures can be incorporated at four possible stages of an ASV system: feature extraction, speaker-model training, score computation, and fusion of scores [38]. Using quality measures in the score fusion stage is the most straightforward option and has been successfully applied in speech-, fingerprint-, and face-based multimodal person authentication systems [43], [44]. In this paper, we incorporate the proposed quality measures in the score fusion stage to improve the performance of the speaker recognition system in various duration conditions. In short duration conditions, the linear score fusion strategy showed efficient performance with GMM and SVM based classification frameworks [24]. However, the i-vector based system (with GPLDA-based channel compensation) was reported to perform more efficiently than JFA and GMM–SVM based frameworks in short utterance conditions [7]. Here, we present a comparative performance study of i-vector and GMM–UBM systems on NIST corpora (Fig. 1). We observe that although the i-vector system performs better than GMM–UBM for long duration speech, the GMM–UBM system shows comparable or even better performance in short duration conditions [45], [46]. This observation inspires us to fuse the i-vector and GMM–UBM systems to develop a more accurate and reliable solution for practical ASV applications. We incorporate the estimated quality measures while blending the GMM–UBM and i-vector based ASV systems. Incorporating the quality measures yields not only considerable improvement in performance but also consistency across various duration conditions. The proposed systems show larger relative improvement in short duration conditions, which is the more relevant case in practice. A preliminary version of this work was presented in [45]. In this work, we conduct extensive analysis and experiments.
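A minimal sketch of quality-aided linear score fusion is given below. The functional form and weights are illustrative assumptions; in practice the fusion parameters would be learned on a development set (e.g., with logistic regression), and the paper's exact fusion rule may differ.

```python
import numpy as np

def quality_aided_fusion(s_gmm, s_ivec, q, w=(0.5, 0.5, 0.0, 0.0), bias=0.0):
    """Linear score fusion with a quality measure q as side information.

    The quality term modulates the contribution of each subsystem, so the
    fused score can lean on whichever system is more reliable for the
    estimated utterance quality. The weights are placeholders to be trained.
    """
    s_gmm, s_ivec, q = map(np.asarray, (s_gmm, s_ivec, q))
    return (w[0] * s_gmm + w[1] * s_ivec
            + w[2] * q * s_gmm + w[3] * q * s_ivec + bias)
```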

The rest of the paper is organized as follows. The theoretical background of the classical GMM–UBM and i-vector GPLDA systems is discussed in Section 2. An analysis of the intermediate subsystems under different duration variability conditions is presented in Section 3. Section 4 describes the proposed quality measures and the quality-aided fusion based system. Details of the experimental setup are provided in Section 5. The performance of the baseline GMM–UBM and i-vector GPLDA based ASV systems is compared, and results of the proposed quality-aided fusion system are reported, in Section 6. Finally, conclusions are drawn in Section 7.

Section snippets

Automatic speaker recognition system

Speaker recognition based on the Gaussian mixture model emerged as the most widely used fundamental approach with the introduction of the universal background model [47]. Subsequently, GMM supervector based SVM [48] and JFA [49] were introduced in ASV technology. Recent state-of-the-art speaker recognition concentrates on a compact representation of GMM supervectors, known as i-vectors [7]. This work considers an ASV system based on subspace modeling of i-vectors using PLDA [50]. This section
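As a reference point for the classical system, the sketch below shows GMM–UBM verification scoring as the average frame-level log-likelihood ratio between a speaker model and the UBM. MAP adaptation of the speaker model is omitted, and the use of scikit-learn is an assumption for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_ubm_score(speaker_gmm: GaussianMixture,
                  ubm: GaussianMixture,
                  test_features: np.ndarray) -> float:
    """Average per-frame log-likelihood ratio:
    (1/T) * sum_t [log p(x_t | speaker) - log p(x_t | UBM)]."""
    llr = (speaker_gmm.score_samples(test_features)
           - ubm.score_samples(test_features))
    return float(llr.mean())
```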

Analysis on BW statistics extraction procedure

Previous studies dealing with duration variability concentrated on the final performance metrics measured in terms of EER, DCF, etc. [18]. Some studies focused on the variability in the i-vector space [26]. In this work, however, we present a study of how duration variability affects the intermediate steps of an ASV system. The BW statistics represent the total information from the speech and are transformed into i-vectors for decision making. Since, in most modern ASV systems, BW statistics is an

Quality measures for speech segments

The observations in the previous section illustrate that the variability in the BW statistics is in some way associated with the duration of speech. The work in [51] also attempted to model the sparsity and variability to compensate for the degraded performance of ASV in short duration conditions. The work in [52] introduced an uncertainty measure computed from the i-vector posterior parameters to compensate for the duration variability effect. In our current work, we propose to apply dissimilarity metrics between the normalized BW statistics (NBS, Ñ_i) and
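The snippet below sketches two plausible dissimilarity metrics between the normalized zero-order statistics Ñ_i and the UBM weights w. The KL divergence and Euclidean distance shown are illustrative members of such a family of measures, not necessarily the paper's exact formulations.

```python
import numpy as np

def kl_quality(N_tilde: np.ndarray, w: np.ndarray, eps: float = 1e-12) -> float:
    """KL divergence of the normalized zero-order stats from the UBM weights."""
    return float(np.sum(N_tilde * np.log((N_tilde + eps) / (w + eps))))

def euclidean_quality(N_tilde: np.ndarray, w: np.ndarray) -> float:
    """Euclidean distance between the normalized stats and the UBM weights."""
    return float(np.linalg.norm(N_tilde - w))
```

Both measures shrink as the utterance gets longer and the normalized counts approach the UBM weights, which is exactly the duration-dependent behavior the proposed quality measures exploit.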

Experimental setup

Both the GMM–UBM and i-vector based systems use the same mel-frequency cepstral coefficients (MFCCs) [61] as front-end acoustic features. We extract MFCCs using a frame size of 20 ms and a frame shift of 10 ms, as in [62]. A Hamming window is used in the MFCC extraction process [63]. Non-speech frames are dropped using an energy-based speech activity detector (SAD) [64]. Finally, we perform cepstral mean and variance normalization (CMVN) to remove the convolutive channel effect [62]. 19 dimensional MFCC
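The front-end configuration described above can be approximated with off-the-shelf tools. The sketch below uses librosa, whose filterbank design and SAD differ in detail from the exact setup of [62], [64]; the 8 kHz sample rate, the energy threshold, and dropping c0 to obtain 19 coefficients are all illustrative assumptions.

```python
import numpy as np
import librosa

def extract_mfcc(wav_path: str, sr: int = 8000) -> np.ndarray:
    """Approximate front-end: 19-dim MFCCs, 20 ms frames, 10 ms shift,
    Hamming window, crude energy-based SAD, then per-utterance CMVN."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=20,
        n_fft=int(0.020 * sr), hop_length=int(0.010 * sr),
        window="hamming",
    )[1:]                                   # drop c0 (one common convention)
    mfcc = mfcc.T                           # (num_frames, 19)
    # Crude energy-based SAD: keep frames above a fraction of peak energy.
    frame_energy = librosa.feature.rms(
        y=y, frame_length=int(0.020 * sr),
        hop_length=int(0.010 * sr)).flatten()
    keep = frame_energy > 0.05 * frame_energy.max()
    mfcc = mfcc[keep[:len(mfcc)]]
    # Cepstral mean and variance normalization over the utterance.
    return (mfcc - mfcc.mean(axis=0)) / (mfcc.std(axis=0) + 1e-12)
```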

Baseline performance

Initially, we investigate the performance of the state-of-the-art i-vector and classical GMM–UBM based ASV systems under various duration conditions. The experiments are executed on the male subset of the NIST 2008 short2–short3 corpus. The performance is compared across eleven different duration conditions separately.

The results of the experiments are reported in Table 4. Fig. 6 exhibits a systematic comparative study. Table 4 depicts the relative performance improvement of the i-vector based

Conclusion and future scopes

In this work, we have introduced new quality measures for improving speaker recognition performance under short duration conditions. We derive the quality measures using the Baum–Welch sufficient statistics that are used for computing the i-vector representation. We demonstrate that the dissimilarity between the normalized zero-order Baum–Welch statistics and the weights of the universal background model (UBM) is associated with the speech duration. We formulate the quality measures based on the

Conflict of interest statement

The authors declare that they have no conflict of interest.

Acknowledgements

This work is partially supported by the Indian Space Research Organisation (ISRO), Government of India. The work of Md Sahidullah is supported by Region Grand Est, France. The authors would like to express their sincere thanks to the anonymous reviewers and the editors for their comments and suggestions, which greatly improved the quality and content of the work. We further thank Dr. Tomi Kinnunen (University of Eastern Finland) for his valuable comments on the earlier version of this work. Finally,

References (65)

  • M. Sahidullah et al., Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition, Speech Commun. (2012)

  • J.P. Campbell et al., Forensic speaker recognition, IEEE Signal Process. Mag. (2009)

  • ICICI bank introduces voice recognition for biometric authentication

  • Death of the password?

  • Interpol's new software will recognize criminals by their voices

  • J.P. Campbell, Speaker recognition: a tutorial, Proc. IEEE (1997)

  • N. Dehak et al., Front-end factor analysis for speaker verification, IEEE Trans. Audio Speech Lang. Process. (2011)

  • P. Matějka et al., Analysis of DNN approaches to speaker identification

  • D. Snyder et al., X-vectors: robust DNN embeddings for speaker recognition

  • A. Poddar et al., Speaker verification with short utterances: a review of challenges, trends and opportunities, IET Biometrics (2018)

  • Y.A. Solewicz et al., Estimated intra-speaker variability boundaries in forensic speaker recognition casework

  • J. Ming et al., Robust speaker recognition in noisy conditions, IEEE Trans. Audio Speech Lang. Process. (2007)

  • R. Saeidi et al., Feature extraction using power-law adjusted linear prediction with application to speaker recognition under severe vocal effort mismatch, IEEE/ACM Trans. Audio Speech Lang. Process. (2016)

  • M. McLaren et al., Exploring the role of phonetic bottleneck features for speaker and language recognition

  • S. Parthasarathy et al., A study of speaker verification performance with expressive speech

  • D. Wang et al., A robust DBN-vector based speaker verification system under channel mismatch conditions

  • V. Vestman et al., Time-varying autoregressions for speaker verification in reverberant conditions

  • A. Kanagasundaram et al., I-vector based speaker recognition on short utterances

  • M.I. Mandasari et al., Evaluation of i-vector speaker recognition systems for forensic application

  • A. Kanagasundaram et al., PLDA based speaker recognition on short utterances

  • A.K. Sarkar et al., Study of the effect of i-vector modeling on short and mismatch utterance duration for speaker verification

  • B.G. Fauve et al., Influence of task duration in text-independent speaker verification


Arnab Poddar received his MS (by research) degree in the area of speech processing and machine learning from the Department of Electronics & Electrical Communication Engineering, Indian Institute of Technology Kharagpur in 2018. He has worked on the research project entitled Reduction of False Acceptance and Rejection in Non-cooperative Automatic Speaker Recognition System, funded by the Indian Space Research Organisation (ISRO). Prior to that, he worked as a research project person on the project Development of Optical Character Recognition System for Printed Indian Languages in the Computer Vision and Pattern Recognition (CVPR) Unit, Indian Statistical Institute (ISI). He is currently pursuing a Ph.D. at the Indian Institute of Technology Kharagpur in the area of machine learning and computer vision. His research interests include speech & audio signal processing, image processing, and machine learning.

Md Sahidullah received his Ph.D. degree in the area of speech processing from the Department of Electronics & Electrical Communication Engineering, Indian Institute of Technology Kharagpur in 2015. Prior to that, he obtained the Bachelor of Engineering degree in Electronics and Communication Engineering from Vidyasagar University in 2004 and the Master of Engineering degree in Computer Science and Engineering (with specialization in Embedded Systems) from West Bengal University of Technology in 2006. In 2007–2008, he was with Cognizant Technology Solutions India Pvt. Ltd. In 2014–2017, he was a postdoctoral researcher with the School of Computing, University of Eastern Finland. In January 2018, he joined the MULTISPEECH team, Inria, France, as a postdoctoral researcher, where he currently holds a starting research position. His research interests include robust speaker recognition, voice activity detection, and spoofing countermeasures. He is also a co-organizer of two Automatic Speaker Verification Spoofing and Countermeasures Challenges: ASVspoof 2017 and ASVspoof 2019.

Goutam Saha received his B.Tech. and Ph.D. degrees from the Department of Electronics & Electrical Communication Engineering, Indian Institute of Technology (IIT) Kharagpur, India, in 1990 and 2000, respectively. In between, he served in industry for about four years and obtained a five-year fellowship from the Council of Scientific & Industrial Research, India. In 2002, he joined IIT Kharagpur as a faculty member, where he is currently serving as a Professor. His research interests include the analysis of audio and bio-signals.
