A real-time trained system for robust speaker verification using relative space of anchor models

doi:10.1016/j.csl.2009.07.002

Computer Speech & Language

Volume 24, Issue 4, October 2010, Pages 545-561

https://doi.org/10.1016/j.csl.2009.07.002

Abstract

A real-time trained system for robust speaker verification is proposed. This system was developed using a relative space of reference speakers, also referred to as anchor models. The real-time training aspect of the system is based on this relative space’s intriguing features and properties. The relative space concept uses a relative speaker representation rather than an absolute one, by comparing the speaker to a set of well-trained reference speakers. The advantage of this approach is that instead of estimating the numerous parameters of an absolute model for a speaker, only a few parameters of a model relative to a number of anchor models are estimated. In order to optimize the performance of the proposed system, several techniques were assessed for possible implementation in various blocks of the system. As a result, the best performance was achieved when the normalized vectors’ mutual angle, together with the Minimum normalization method, was applied to speaker verification in conjunction with an orthogonal relative space of virtual reference speakers. In this case, an Equal Error Rate (EER) of 0.12% on 400 test samples of 100 speakers was obtained. In addition to assessment under normal conditions, the developed speaker verification system was also evaluated under abnormal conditions where noisy or telephonic speech sequence contamination was present. Experiments conducted in this case demonstrated that, in most cases, this system outperforms absolute-space-based systems even with shortened training speech sequences. Another major contribution of this research is the development of a more complex speaker verification system capable of tackling abnormal conditions more effectively. In this case, other interesting features of the relative space approach were employed. For this purpose, a novel enrichment method was developed to construct a relative space of anchor models trained to tackle noise. The results of the experiments conducted in this part of the research demonstrated an excellent ability of this approach to tackle abnormal conditions. Compared to an absolute-space-based system, applying this method in relative space led to lower speaker verification error rates in all cases, even at low SNR values.

Introduction

Speaker verification systems have various applications that involve identity authentication through the World Wide Web, telephone lines, and direct microphone speech data acquisition systems. For example, secure web- or telephone-banking can be facilitated efficiently using such systems. Furthermore, over the past few years, reliable security systems that protect the public at a large scale have become a matter of paramount importance. Another important application of speaker verification systems is crime investigation conducted by law enforcement organizations. In such applications, usually only short speech segments are available, and processing them with current systems often fails to yield reliable results. Another hurdle associated with this application is that speech segments are usually contaminated with noise, including channel noise or environmental noise. This further degrades the ability of current systems to produce correct output. These shortcomings of current verification systems have led many research groups to investigate the possibility of developing more effective systems. The main purpose of this research is to provide solutions to address these issues. These solutions employ novel techniques based on features extracted in the relative space.

Feature extraction is a primary phase in speaker verification systems. Features extracted from speech for use in a speaker verification system fall into two categories based on the space in which they are defined. One category includes features defined in an absolute (non-relative) space, while the other includes features defined in a relative space. For the first category, the representation of a speaker in the feature space is not related to any reference speaker. While there is a significant body of literature on features in the absolute space, very little research has investigated the properties of features extracted in the relative space. MFCC (Young, 1996), LPCC (Rabiner and Juang, 1993), wavelet coefficients (Avci and Akpolat, 2006), etc. are among the most prevalent speech features in absolute space. Recently, Campbell et al. used Maximum A Posteriori (MAP) adapted GMM mean supervectors as an absolute feature with a Support Vector Machine (SVM) as a discriminative model for speaker verification (Campbell et al., 2006). In an attempt to investigate the relative space approach and its application in verification systems, a number of groups have recently used such an approach to develop novel speaker verification systems. For features defined in a relative space, each speaker in the feature space is represented relative to some reference speakers. Given that the feature extraction phase in any verification system can be performed independently of its verification phase, the latter will not be affected by the choice of feature space. As such, features extracted in the relative space can be applied in conjunction with any other set of verification-phase techniques that are deemed more suitable.

Relative feature space was first developed for speaker adaptation in speech recognition. Merlin et al. proposed a new approach to speaker recognition and indexation systems, based on nondirectly-acoustic processing in the relative space (Merlin et al., 1999). In 2000, Kuhn et al. introduced the eigenvoices concept and represented each new speaker relative to eigenvoices (Kuhn et al., 2000). Eigenvoices were then used by Thyes et al. for speaker identification and verification (Thyes et al., 2000). Later, other researchers took a different approach, introducing the idea of a space of anchor models to represent enrolled speakers in verification systems and to verify a test speaker in a relative feature space (Mami and Charlet, 2002, 2003, 2006). The main concept of the anchor model space, which will be described in the methods, has been used in this research.

In this study, the effectiveness of applying various techniques involved in the development of a speaker verification system in a relative feature space of reference speakers has been assessed. As a result, the most effective techniques have been identified, and optimal values of the parameters involved in the system have been determined and used to develop a novel speaker verification system. The proposed system was developed using a novel relative space of reference speakers, also referred to as anchor models. The real-time training aspect of this speaker verification system is based on the relative space’s intriguing features and properties as described in the methods. Real-time training in speaker verification systems is often required in applications where acquisition of long segments of speech data for clients’ enrollment in the system is either not desirable for clients or not feasible. Examples of such applications include temporary security systems such as exhibition authentication systems, large-scale security systems involving biometric passports or national ID, and criminal identity determination associated with investigative procedures where long segments of speech data are not available at the crime scene. Real-time training in speaker verification systems is also desirable in applications where clients may not feel comfortable spending a relatively long time on enrollment in the security system, e.g., in online security systems. Another important contribution of this work is improving the accuracy of existing verification systems in the relative space by applying distance normalization in that space. Among the applied normalization techniques, the Minimum normalization method provided the highest accuracy. Furthermore, a simple clustering method was developed and implemented in the proposed systems. This method clusters the anchor models, which reduces the complexity of the relative space and thereby reduces the system’s response time and increases its accuracy. To further enhance the performance of the system, an orthogonalization method was employed on the coordinate vectors in the relative space. This has led to significant improvement in the discrimination ability of the relative space.

Under various circumstances, speech signals can be distorted or contaminated with noise. Under such abnormal conditions, existing speaker verification systems such as the ones proposed recently (Mami and Charlet, 2002, 2003, 2006) are not adequate. In this research, the developed speaker verification system was employed with data obtained under abnormal conditions where noisy or telephonic speech sequence contamination was present. Results obtained in this case confirmed that, in most cases, this system outperforms absolute-space-based systems even with shortened training speech sequences. Another major contribution of this research is the development of a more complex speaker verification system capable of tackling abnormal conditions more effectively. In this case, other interesting features of the relative space approach were employed. For this purpose, a novel enrichment method was employed to construct a relative space of anchor models trained to tackle noise. Applying this enrichment method has proven to be more effective for tackling noisy conditions.

Another major difficulty a speaker verification system needs to overcome is session variability, which frequently occurs in real-world applications. It refers to all the phenomena that cause two recordings of a given speaker to sound different from each other. Recently, new techniques of joint factor analysis and eigenchannels were introduced by Kenny et al. to deal with session variability in speaker recognition systems (Kenny et al., 2007). These techniques had a profound impact on the speaker verification field of research (Fauve et al., 2007, Matrouf et al., 2007). As discussed in Section 6, it is speculated that the relative space approach may also be able to tackle session variability efficiently.

This paper is organized as follows. The main concept of the relative space of anchor models is introduced in the following section. In Section 3, the primary and optional steps of a speaker verification system in such a space are described. Section 4 focuses on the details of the systems implemented in this study and the conditions under which the experiments were conducted. In Section 5, the conducted experiments and their observed results are presented and discussed. Finally, in Section 6, the proposed techniques and results are discussed and conclusions are drawn.

Section snippets

Relative space of anchor models

In order to extract relative features from a speech segment, the space containing these features should be initially defined. The main idea is to build an orthogonal space where each axis represents some general characteristics of the speakers’ voices. In fact, a speech representation space is defined, which contains useful and common information about voices of a diverse group of speakers. Generally, this supplementary information can complement the system’s available information for speaker
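As an illustration of what building an orthogonal space from a set of reference directions can look like, the sketch below applies classical Gram-Schmidt orthonormalization to a list of vectors. This is only a stand-in: the excerpt does not say which orthogonalization method the authors used, and the function name `gram_schmidt`, the tolerance `eps`, and the toy input vectors are all hypothetical.

```python
import math


def gram_schmidt(vectors, eps=1e-12):
    """Return an orthonormal basis spanning the given vectors.

    Assumption: Gram-Schmidt is used purely for illustration; the
    paper's actual orthogonalization method is not given in this excerpt.
    Vectors that are (numerically) linear combinations of earlier ones
    are dropped.
    """
    basis = []
    for v in vectors:
        w = list(v)
        # Remove the components of w that lie along the existing axes.
        for b in basis:
            proj = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - proj * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > eps:
            basis.append([wi / norm for wi in w])
    return basis


# Hypothetical reference directions; the third is the sum of the first two.
axes = gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [2.0, 1.0, 1.0]])
```

In this toy input the third vector adds no new direction, so only two orthonormal axes are kept, which hints at why an orthogonalized space can be more compact than the raw set of reference speakers.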

Enrolling system speakers in relative space

After constructing a relative space, relative feature vectors of speakers can be extracted. In this paper, this is referred to as enrolling or locating speakers in the relative feature space. To enroll or locate a speaker in the relative space, the speaker’s speech data is preprocessed and its absolute features are extracted. Scores of the speaker’s feature vectors under all reference speakers’ models, as well as under the GMM/UBM model, are evaluated. The GMM/UBM model is a GMM trained using
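The enrollment step described above can be sketched as follows. The sketch assumes, as is common in the anchor-model literature but not confirmed by this excerpt, that each coordinate of the relative vector is the average log-likelihood of the speaker's frames under one anchor GMM minus the average log-likelihood under the GMM/UBM. The diagonal-covariance model layout, the function names, and the toy models are hypothetical.

```python
import math


def diag_gauss_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_avg_loglik(frames, gmm):
    """Average per-frame log-likelihood under a GMM.

    gmm is a list of (weight, mean, var) components (hypothetical layout).
    """
    total = 0.0
    for x in frames:
        comp = [math.log(w) + diag_gauss_logpdf(x, m, v) for w, m, v in gmm]
        mx = max(comp)  # log-sum-exp for numerical stability
        total += mx + math.log(sum(math.exp(c - mx) for c in comp))
    return total / len(frames)


def relative_vector(frames, anchors, ubm):
    """One coordinate per anchor: anchor score minus UBM score (assumed)."""
    base = gmm_avg_loglik(frames, ubm)
    return [gmm_avg_loglik(frames, a) - base for a in anchors]


# Toy one-dimensional models: the first anchor coincides with the UBM.
ubm = [(1.0, [0.0], [1.0])]
anchors = [ubm, [(1.0, [1.0], [1.0])]]
vec = relative_vector([[0.0]], anchors, ubm)
```

With this layout, a speaker is described by as many numbers as there are anchor models, rather than by the full parameter set of a speaker-specific GMM.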

Speech databases

To test the proposed system in this study, a number of experiments were conducted. In these experiments, three speech databases were involved, which are (A) FARSDAT microphony (clean) (Bijankhan et al., 1994), (B) telephonized FARSDAT (Momtazi et al., 2007), and (C) NoiseX (Varga and Steeneken, 1993). FARSDAT microphony is a Persian speech database, which contains speech data with a sampling frequency of 22,050 Hz from 300 speakers. Each speaker uttered a number of sentences in two sessions.

Normal conditions

The speaker verification system in the relative space was first examined under normal conditions, where noise and transmission channel effects were not present. A data set with 10 s of training speech and 5 s of test speech from microphony (clean) FARSDAT was used in these experiments. In the first set of experiments, two different distance criteria, namely Euclidean distance and vectors’ mutual angles, were applied as distance metrics in the relative space. The results shown in Table 1 indicate that the
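The two distance criteria named above can be written down directly. The sketch below is a minimal illustration with hypothetical function names and toy vectors; it does not reproduce the paper's normalization steps.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two relative-space vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mutual_angle(u, v):
    """Angle in radians between two relative-space vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_uv = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_uv)


# Two toy vectors pointing in the same direction but with different lengths.
enrolled = [1.0, 1.0]
test = [2.0, 2.0]
d_euclid = euclidean_distance(enrolled, test)
d_angle = mutual_angle(enrolled, test)
```

The toy vectors differ only in length, so their Euclidean distance is nonzero while their mutual angle is essentially zero. The angle is insensitive to an overall scaling of the score vector, which is one way the two criteria can behave differently.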

Summary and conclusion

Robust and real-time trained speaker verification systems in a relative space of anchor models were proposed in this paper. This system was developed using a relative space of reference speakers also referred to as anchor models. The primary steps involved in this system include constructing the relative space, enrolling and locating system speakers and test samples in the space, and finally distance measuring and decision making for each test sample based on a predefined threshold. In this
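The final steps listed above, distance measuring followed by a threshold decision, and the Equal Error Rate used to report results, can be sketched as follows. The function names and the toy distance lists are hypothetical, and the EER here is approximated by scanning the observed distances as candidate thresholds; the paper's exact evaluation procedure is not given in this excerpt.

```python
def accept(distance, threshold):
    """Accept the claimed identity when the distance is small enough."""
    return distance <= threshold


def error_rates(genuine, impostor, threshold):
    """Return (false acceptance rate, false rejection rate)."""
    far = sum(accept(d, threshold) for d in impostor) / len(impostor)
    frr = sum(not accept(d, threshold) for d in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Approximate the EER by scanning observed distances as thresholds.

    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy distances: genuine trials should be closer than impostor trials.
genuine = [0.1, 0.2, 0.3, 0.6]
impostor = [0.4, 0.7, 0.8, 0.9]
eer, threshold = equal_error_rate(genuine, impostor)
```

Lowering the threshold trades false acceptances for false rejections; the EER is the operating point where the two error rates are equal, which makes it a convenient single-number summary.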

References (22)

R. Auckenthaler et al., Score normalization for text-independent speaker verification system, Digital Signal Processing (2000)
E. Avci et al., Speech recognition using a wavelet packet adaptive network based fuzzy inference system, Expert Systems with Applications (2006)
Y. Mami et al., Speaker recognition by location in the space of reference speakers, Speech Communication (2006)
Bijankhan, M., Seikhzadeghan, J., Roohani, M.R., Samareh, Y., Lucas, K., Tebyani, M., 1994. FARSDAT – the speech...
F. Bimbot et al., A tutorial on text-independent speaker verification, EURASIP Journal on Applied Signal Processing (2004)
W.M. Campbell et al., Support vector machines using GMM supervectors for speaker verification, IEEE Signal Processing Letters (2006)
B. Fauve et al., State-of-the-art performance in text-independent speaker verification through open-source software, IEEE Transactions on Audio, Speech, and Language Processing (2007)
P. Kenny et al., Joint factor analysis versus eigenchannels in speaker recognition, IEEE Transactions on Audio, Speech, and Language Processing (2007)
R. Kuhn et al., Rapid speaker adaptation in eigenvoice space, IEEE Transactions on Speech and Audio Processing (2000)
Li, K.P., Porter, J.E., 1988. Normalizations and selection of speech segments for speaker recognition scoring. In:...
Mami, Y., Charlet, D., 2002. Speaker identification by location in an optimal space of anchor models. In: International...
