Playback attack detection for text-dependent speaker verification over telephone channels

doi:10.1016/j.specom.2014.12.003

Speech Communication

Volume 67, March 2015, Pages 143-153

https://doi.org/10.1016/j.specom.2014.12.003

Highlights

• Spectral features and score normalization employed to detect playback attack.
• 4187 samples of 175 persons used for evaluation of algorithm.
• Algorithm is resistant to channel variations.
• EER of 0.6% obtained for playback detection.

Abstract

Playback attacks, in which a previously recorded passphrase is played back by an unauthorized person in order to gain access, are among the biggest threats to biometric speaker verification systems. This paper describes a playback attack detection (PAD) algorithm designed to protect text-dependent speaker verification systems against such spoofing attacks. It also describes the use of spectral landmarks and score normalization methods in the algorithm, and discusses the factors that affect its performance. The authors investigate two issues: (1) the extraction of PAD features that are robust against channel variations and (2) the robustness of the algorithm in adverse acoustic environments (e.g. office, street, cocktail party noise). The experiments are performed on a purpose-built speech corpus containing 4187 occurrences of a passphrase spoken by 175 speakers. The results show an equal error rate (EER) as low as 1.0%. These findings demonstrate that playback attacks can be detected effectively and should not be considered a significant argument against applications of text-dependent speaker verification.

Introduction

The task of biometric speaker verification is to accept or reject a speaker’s identity claim on the basis of a sample of the speaker’s voice. Automatic speaker verification over a telephone channel has already been a subject of research (Murthy et al., 1999; Kinnunen et al., 2012). Although such systems perform very well, reaching relatively low EERs in demanding testing scenarios (NIST, 2012), consumers and organizations still have doubts about them in high-security applications (e.g. e-banking). One of the prevailing arguments against voice biometry concerns common-passphrase text-dependent systems, in which the passphrase uttered by the speaker does not change from one login attempt to another. This makes it possible to break into such a system by playing back a recording obtained earlier with a microphone or any other eavesdropping method (e.g. malicious mobile software). This type of attack is called a playback attack and is available to anyone with minimal signal processing knowledge.

One solution to this problem is to use a text-prompted system, in which the user is asked to speak a randomly selected phrase at each access attempt. However, such systems are more sensitive to other types of attack, such as the concatenation of previously recorded digits (Lindberg and Blomberg, 1999), and, because the system cannot use lexical knowledge in its assessment, they achieve higher error rates than text-dependent solutions (Boves and den Os, 1998). The present work focuses on safeguarding text-dependent systems against playback attacks.

Several methods of playback attack detection have been described, although the literature on the subject is not extensive. The use of direct spectral features, such as the low-frequency ratio, was investigated in Villalba and Lleida (2011). The comparison of maps containing the highest peaks of the magnitude spectrum was described in Stevenson (2008). Another method, which makes use of a specific channel pattern, was presented in Wang et al. (2011). Although normalization of the similarity scores could dramatically increase the effectiveness of any of these methods, there is little existing research on the subject. Shang and Stevenson (2010) successfully used a relative similarity score, which reduced the EER from 11.94% to 6.81%. None of these authors analyzed the impact of noise in the attacking recording on detection performance, which is one of the motivations for the algorithm presented in this paper. One objective of this work was to achieve high noise robustness over a wide range of signal-to-noise ratios (SNR) computed from root mean square values. Another goal was to suit devices that require high-speed data processing and have limited memory resources, such as embedded systems, physical biometric locks or other small-scale consumer electronics.
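As a point of reference for the SNR mentioned above: when the SNR is computed from root mean square (RMS) values, it is commonly expressed in decibels as 20·log10(RMS of the signal / RMS of the noise). The sketch below assumes NumPy, and its function names (`rms`, `snr_db`, `mix_at_snr`) are hypothetical and do not come from the paper. It shows one way a noise recording could be scaled to a chosen SNR before being added to a speech signal; the paper does not state how its noisy recordings were actually produced.

```python
import numpy as np


def rms(x):
    """Root mean square value of a 1-D signal."""
    x = np.asarray(x, dtype=float)
    return float(np.sqrt(np.mean(x ** 2)))


def snr_db(signal, noise):
    """SNR in dB computed from the RMS values of signal and noise."""
    return 20.0 * np.log10(rms(signal) / rms(noise))


def mix_at_snr(signal, noise, target_snr_db):
    """Scale `noise` to the requested SNR and add it to `signal`.

    Illustrative helper only (an assumption, not the paper's procedure).
    `noise` is assumed to be at least as long as `signal`.
    """
    signal = np.asarray(signal, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(signal)]
    # Gain that brings the noise RMS to rms(signal) / 10^(SNR / 20).
    gain = rms(signal) / (rms(noise) * 10.0 ** (target_snr_db / 20.0))
    scaled_noise = gain * noise
    return signal + scaled_noise, scaled_noise
```

Sweeping `target_snr_db` over a range of values would give a set of test recordings for studying how detection performance changes with the noise level.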

The method described in this paper combines spectral features with score normalization to obtain a robust algorithm that can operate in adverse acoustic environments. The paper is organized as follows. Section 2 describes the core PAD algorithm. Section 3 describes the corpus recorded for the experiments. Section 4 presents the results of the evaluations and covers the method of score normalization. Section 5 draws conclusions and discusses future work.

To improve the clarity of the paper, verification scenarios are presented in Fig. 1 and the following terms are defined:

Target: a privileged user, owner of data protected by biometric security.
Impostor: an unauthorized person claiming to be the owner of protected data, who attacks the system by modifying a previously acquired recording of the privileged (Target) user.
Legitimate: a non-playback-based verification attempt by the target.
Fraud: a playback attack by an impostor.
Authentic recording: a recording of a successful target verification attempt acquired server side.
Eavesdropped recording: a recording intercepted by an impostor on the client-side of the telecommunication channel during a legitimate verification attempt.
Compromised recording: a recording intercepted by an impostor on the server side of the telecommunication channel during a legitimate target’s verification attempt, or a recording stolen from the server’s user database.
Playback recording: the eavesdropped recording played back by the impostor and received at the server side of the telecommunication channel.


PAD algorithm

In this section, the PAD algorithm is presented. The concept is based on the music recognition system presented in Wang (2003) and Ellis (2009). Wang’s algorithm compares recordings by the similarity of the local configuration of pairs of maxima extracted from the spectrograms of the verified and reference recordings. According to Wang (2003), the algorithm is computationally efficient and resistant to noise and channel distortions.
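The general idea of comparing recordings through pairs of spectrogram maxima can be sketched as follows. This is a minimal illustration in the spirit of landmark-based audio fingerprinting, not the authors' implementation. It assumes NumPy, and every function name and parameter value (`n_fft`, `hop`, `neighborhood`, `fan_out`, `max_dt`) is a placeholder chosen for the example.

```python
import numpy as np
from collections import Counter


def spectrogram(x, n_fft=256, hop=128):
    """Magnitude spectrogram (frequency bins x frames), Hann-windowed STFT."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T


def find_peaks_2d(spec, neighborhood=5, max_peaks=200):
    """Return (frame, bin) positions of the strongest local maxima."""
    n_bins, n_frames = spec.shape
    peaks = []
    for t in range(n_frames):
        for f in range(n_bins):
            f0, f1 = max(0, f - neighborhood), min(n_bins, f + neighborhood + 1)
            t0, t1 = max(0, t - neighborhood), min(n_frames, t + neighborhood + 1)
            if spec[f, t] > 0 and spec[f, t] == spec[f0:f1, t0:t1].max():
                peaks.append((spec[f, t], t, f))
    peaks.sort(reverse=True)  # strongest maxima first
    return sorted((t, f) for _, t, f in peaks[:max_peaks])


def landmarks(peaks, fan_out=3, max_dt=20):
    """Pair each peak with up to `fan_out` later peaks.

    Returns a dict mapping (f1, f2, dt) to the list of anchor frames t1.
    """
    table = {}
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break  # peaks are sorted by frame, so later ones are farther
            if dt > 0:
                table.setdefault((f1, f2, dt), []).append(t1)
                paired += 1
                if paired == fan_out:
                    break
    return table


def similarity(table_a, table_b):
    """Number of shared landmarks at the most common time offset."""
    offsets = Counter()
    for key, times_a in table_a.items():
        for ta in times_a:
            for tb in table_b.get(key, []):
                offsets[tb - ta] += 1
    return max(offsets.values()) if offsets else 0
```

In a copy-detection setting, a new verification attempt that shares many landmarks with a stored recording at one consistent time offset would be a candidate playback. The decision threshold would have to be tuned on real data.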

Corpus

An extensive search of the literature found no openly available speech corpus containing both authentic and playback recordings that could be used for comparative testing and benchmarking of the algorithm; the collections that did meet the required criteria were not immediately available because of proprietary rights. Therefore, a new corpus was created for the performance tests. For this purpose, recordings of 175 participants were collected over a telephone channel. The voice

Results of the experiment

Two kinds of tests were conducted to assess the performance of PAD: (1) legitimate – when the recording is new, and has not been previously used by the user to gain access to the biometric system, and (2) playback attack – when the recording is a modified copy of one of the recordings that were previously used by the speaker for biometric verification. PAD may result in two different types of errors – false detection (FD), and missed detection (MD). A false detection error occurs when PAD
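The relation between the two error types and the EER reported in the abstract can be illustrated with a short sketch. It rests on two assumptions not confirmed by the text above: that a higher PAD score means the attempt looks more like a previously used recording, and that a false detection is a legitimate attempt flagged as playback, while a missed detection is a playback attack that goes unflagged. The function names are hypothetical, and the threshold sweep is a simple approximation rather than the authors' evaluation procedure.

```python
import numpy as np


def error_rates(legit_scores, playback_scores, threshold):
    """FD and MD rates when scores >= threshold are flagged as playback."""
    legit = np.asarray(legit_scores, dtype=float)
    playback = np.asarray(playback_scores, dtype=float)
    fd = float(np.mean(legit >= threshold))    # legitimate flagged as playback
    md = float(np.mean(playback < threshold))  # playback attack not flagged
    return fd, md


def equal_error_rate(legit_scores, playback_scores):
    """Approximate EER by sweeping the threshold over all observed scores.

    Returns (eer, threshold) for the threshold where |FD - MD| is smallest;
    the EER estimate is the mean of FD and MD at that threshold.
    """
    thresholds = np.unique(np.concatenate([legit_scores, playback_scores]))
    best = None
    for th in thresholds:
        fd, md = error_rates(legit_scores, playback_scores, th)
        gap = abs(fd - md)
        if best is None or gap < best[0]:
            best = (gap, (fd + md) / 2.0, float(th))
    return best[1], best[2]
```

With finitely many scores, FD and MD rarely coincide exactly, which is why the sketch averages the two rates at the closest threshold instead of interpolating.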

Summary

This paper presented the PAD system. The idea behind the solution is based on the music recognition system proposed in Wang (2003) and Ellis (2009). PAD compares recordings on the basis of the similarity of robust spectral landmarks, which are reproducible and resistant to noise and channel variations.

The experiments evaluating the performance of PAD involved three attack types, which were described in detail in Section 3. An average EER value of 1.0% shows that the efficiency of the

References (20)

Larcher, A., et al., 2014. Text-dependent speaker verification: classifiers, databases and RSR2015. Speech Commun.
Matsui, T., et al., 1995. Likelihood normalization for speaker verification using a phoneme- and speaker-independent model. Speech Commun.
Ramos-Castro, D., et al., 2007. Speaker verification using speaker- and test-dependent fast score normalization. Pattern Recogn. Lett.
Ariyaeeinia, A.M., Sivakumaran, P., 1997. Analysis and comparison of score normalisation methods for text-dependent...
Barras, C., Gauvain, J., 2003. Feature and score normalization for speaker verification of cellular data. In: 2003 IEEE...
Boves, L., den Os, E., 1998. Speaker recognition in telecom applications. In: 1998 IEEE 4th Workshop on Interactive...
Ellis, D., 2009. Robust landmark-based audio fingerprinting....
Kinnunen, T., Wu, Z., Lee, K.A., Sedlak, F., Chng, E.S., Li, H., 2012. Vulnerability of speaker verification systems...
Lindberg, J., Blomberg, M., 1999. Vulnerability in speaker verification – a study of technical impostor techniques. In:...
Martin, A.F., Doddington, G.R., Kamm, T., Ordowski, M., Przybocki, M.A., 1997. The DET curve in assessment of detection...

There are more references available in the full text version of this article.


^☆: This work was supported by the Polish National Centre for Research and Development – Applied Research Program under Grant PBS1/B3/1/2012 titled “Biometric voice verification and identification”.
