Playback attack detection for text-dependent speaker verification over telephone channels☆
Introduction
The task of biometric speaker verification is to accept or reject a speaker’s identity claim based on a sample of the speaker’s voice. Telephone-based automatic speaker verification has already been a subject of research (Murthy et al., 1999; Kinnunen et al., 2012). Although such systems perform very well, reaching relatively low equal error rates (EERs) in demanding testing scenarios (NIST, 2012), consumers and organizations still have doubts about them in high-security applications (e.g. e-banking). One of the prevailing arguments against voice biometrics concerns common-passphrase text-dependent systems, in which the passphrase uttered by the speaker does not change from one login attempt to another. Such systems can be broken into by playing back a recording obtained earlier with a microphone or any other eavesdropping method (e.g. malicious mobile software). This type of attack is called a playback attack and is available to anyone with minimal signal processing knowledge.
One solution to this problem is a text-prompted system, in which the user is asked to speak a randomly selected phrase at each access attempt. However, such systems are more sensitive to other types of attacks, such as the concatenation of previously recorded digits (Lindberg and Blomberg, 1999). Because the system cannot use lexical knowledge in its assessment, they also achieve higher error rates than text-dependent solutions (Boves and den Os, 1998). The present work focuses on safeguarding text-dependent systems against playback attacks.
Several methods of playback attack detection have been described, although the literature on the subject is not extensive. Direct spectral features, such as the low frequency ratio, were investigated in Villalba and Lleida (2011). The comparison of maps containing the highest peaks of the magnitude spectrum was described in Stevenson (2008). Another method, which makes use of a specific channel pattern, was presented in Wang et al. (2011). Although normalizing the similarity scores can dramatically increase the effectiveness of any of these methods, there is little existing research on the subject. Shang and Stevenson (2010) successfully used a relative similarity score, which reduced the EER from 11.94% to 6.81%. None of these authors analyzed how noise present in the attacking recording affects detection performance, which is one of the motivations for the algorithm presented in this paper. One objective of this work was to achieve high noise robustness over a wide range of signal-to-noise ratios (SNRs) computed from root mean square values. Another goal was to make the method usable on devices which require high-speed data processing and have limited memory resources, such as embedded systems, physical biometric locks or other small-scale consumer electronics.
The method described in this paper uses both spectral features and score normalization to obtain a robust algorithm that addresses the issues of operating in an adverse acoustic environment, such as the noisy conditions mentioned above. The paper is organized as follows. Section 2 describes the core playback attack detection (PAD) algorithm. Section 3 describes the corpus recorded for use in the experiments. Section 4 presents the results of the conducted evaluations and covers the method of score normalization. Section 5 draws conclusions and discusses future work.
To improve the clarity of the paper, verification scenarios are presented in Fig. 1 and the following terms are defined:
Target: a privileged user, owner of data protected by biometric security.
Impostor: an unauthorized person claiming to be the owner of protected data, who attacks the system by modifying a previously acquired recording of the privileged (Target) user.
Legitimate: a non-playback-based verification attempt of the target.
Fraud: a playback attack by an impostor.
Authentic recording: a recording of a successful target verification attempt acquired server side.
Eavesdropped recording: a recording intercepted by an impostor on the client-side of the telecommunication channel during a legitimate verification attempt.
Compromised recording: a recording intercepted by an impostor on the server side of the telecommunication channel during a legitimate target’s verification attempt, or a recording stolen from the server’s user database.
Playback recording: the eavesdropped recording played back by the impostor and received at the server side of the telecommunication channel.
Section snippets
PAD algorithm
In this section, the PAD algorithm is presented. The concept of this solution is based on the music recognition system presented in Wang (2003) and Ellis (2009). Wang’s algorithm compares recordings by the similarity of the local configuration of maxima pairs extracted from the spectrograms of the verified and reference recordings. According to Wang (2003), the algorithm is computationally efficient and resistant to noise and channel distortions.
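The landmark idea described above can be illustrated with a minimal sketch. It is not the PAD implementation: it only shows the general Wang-style scheme of picking spectrogram maxima, pairing them into hashes, and counting hash matches that agree on a time offset. All function names and parameter values (`frame_len`, `hop`, `neighborhood`, `n_peaks`, `max_dt`, `max_df`, `fan_out`) are assumptions chosen for illustration.

```python
from collections import Counter

import numpy as np


def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram (frames x bins) computed with a Hann window."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))


def find_peaks(spec, neighborhood=5, n_peaks=60):
    """Return (frame, bin) positions of the strongest local maxima."""
    n_frames, n_bins = spec.shape
    peaks = []
    for t in range(n_frames):
        for f in range(n_bins):
            t0, t1 = max(0, t - neighborhood), min(n_frames, t + neighborhood + 1)
            f0, f1 = max(0, f - neighborhood), min(n_bins, f + neighborhood + 1)
            # A peak is a non-zero cell that dominates its neighborhood.
            if spec[t, f] > 0 and spec[t, f] == spec[t0:t1, f0:f1].max():
                peaks.append((spec[t, f], t, f))
    peaks.sort(reverse=True)                      # strongest first
    return sorted((t, f) for _, t, f in peaks[:n_peaks])


def landmark_hashes(peaks, max_dt=20, max_df=30, fan_out=3):
    """Pair each peak with a few later peaks: ((f1, f2, dt), anchor_time)."""
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:              # peaks are sorted by time
            dt = t2 - t1
            if dt > max_dt:
                break
            if dt > 0 and abs(f2 - f1) <= max_df:
                hashes.append(((f1, f2, dt), t1))
                paired += 1
                if paired == fan_out:
                    break
    return hashes


def fingerprint(signal):
    """Spectrogram -> peaks -> landmark hashes for one recording."""
    return landmark_hashes(find_peaks(spectrogram(np.asarray(signal, dtype=float))))


def similarity(hashes_a, hashes_b):
    """Number of matching hashes that share the most common time offset."""
    index = {}
    for h, t in hashes_a:
        index.setdefault(h, []).append(t)
    offsets = Counter(tb - ta for h, tb in hashes_b for ta in index.get(h, []))
    return max(offsets.values(), default=0)
```

In this sketch, a verified recording that yields a high `similarity` against a stored reference shares many time-aligned landmarks with it, which is the kind of evidence a playback detector could threshold. Storing only the hashes rather than the audio keeps the per-recording memory footprint small.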
Corpus
During an extensive search of the literature, no suitable speech corpus containing both authentic and playback recordings was found to be available for comparative testing and benchmarking of the algorithm. Moreover, the collections which met the required criteria were not immediately available due to proprietary rights. Therefore, a new corpus was created to perform the necessary performance tests. For this purpose, recordings of 175 participants were collected over a telephone channel. The voice
Results of the experiment
Two kinds of tests were conducted to assess the performance of PAD: (1) legitimate – when the recording is new, and has not been previously used by the user to gain access to the biometric system, and (2) playback attack – when the recording is a modified copy of one of the recordings that were previously used by the speaker for biometric verification. PAD may result in two different types of errors – false detection (FD), and missed detection (MD). A false detection error occurs when PAD
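The two error types can be related to a detection threshold with a short sketch. It rests on assumptions that are not stated in the text above: a higher score is taken to mean that the tested recording is more similar to a stored one, a false detection is taken to be a legitimate attempt flagged as playback, and a missed detection is taken to be a playback attack that is not flagged. The function names are hypothetical.

```python
import numpy as np


def error_rates(legit_scores, playback_scores, threshold):
    """Return (FD rate, MD rate) when scores >= threshold are flagged as playback."""
    legit = np.asarray(legit_scores, dtype=float)
    playback = np.asarray(playback_scores, dtype=float)
    fd = float(np.mean(legit >= threshold))     # assumed: legitimate flagged as playback
    md = float(np.mean(playback < threshold))   # assumed: playback attack not flagged
    return fd, md


def equal_error_rate(legit_scores, playback_scores):
    """Scan observed scores as thresholds; return (EER estimate, threshold)."""
    candidates = np.unique(np.concatenate([legit_scores, playback_scores]))
    best_gap, best = None, None
    for th in candidates:
        fd, md = error_rates(legit_scores, playback_scores, th)
        gap = abs(fd - md)
        if best_gap is None or gap < best_gap:
            best_gap, best = gap, (float((fd + md) / 2), float(th))
    return best
```

The EER estimate is the point where the two error rates are closest to each other; with few trials the rates may not cross exactly, so the sketch reports their mean at the closest threshold.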
Summary
In this paper, the PAD system was presented. The idea behind this solution is based on the music recognition system proposed in Wang (2003) and Ellis (2009). PAD compares recordings on the basis of the similarity of robust spectral landmarks, which are reproducible and resistant to noise and channel variations.
The experiments evaluating the performance of PAD involved three attack types, which were described in detail in Section 3. An average EER value of 1.0% shows that the efficiency of the
References (20)
- et al., 2014. Text-dependent speaker verification: classifiers, databases and RSR2015. Speech Commun.
- et al., 1995. Likelihood normalization for speaker verification using a phoneme- and speaker-independent model. Speech Commun.
- et al., 2007. Speaker verification using speaker- and test-dependent fast score normalization. Pattern Recogn. Lett.
- Ariyaeeinia, A.M., Sivakumaran, P., 1997. Analysis and comparison of score normalisation methods for text-dependent...
- Barras, C., Gauvain, J., 2003. Feature and score normalization for speaker verification of cellular data. In: 2003 IEEE...
- Boves, L., den Os, E., 1998. Speaker recognition in telecom applications. In: 1998 IEEE 4th Workshop on Interactive...
- Ellis, D., 2009. Robust landmark-based audio fingerprinting....
- Kinnunen, T., Wu, Z., Lee, K.A., Sedlak, F., Chng, E.S., Li, H., 2012. Vulnerability of speaker verification systems...
- Lindberg, J., Blomberg, M., 1999. Vulnerability in speaker verification – a study of technical impostor techniques. In:...
- Martin, A.F., Doddington, G.R., Kamm, T., Ordowski, M., Przybocki, M.A., 1997. The DET curve in assessment of detection...
☆ This work was supported by the Polish National Centre for Research and Development – Applied Research Program under Grant PBS1/B3/1/2012 titled “Biometric voice verification and identification”.