
Speech Communication, Volume 95, December 2017, Pages 1-15

Acoustical and perceptual study of voice disguise by age modification in speaker verification

https://doi.org/10.1016/j.specom.2017.10.002

Highlights

  1. We study the effects of voice disguise on speaker verification from acoustic and perceptual perspectives, using a corpus of 60 native Finnish speakers and the performance of an automatic speaker verification system.

  2. Acoustic analyses with statistical tests reveal the differences in fundamental frequency and formant frequencies between natural and disguised voices.

  3. The listening test with 70 subjects indicates the correspondence between perceptual and automatic speaker recognition evaluation.

Abstract

The task of speaker recognition is feasible when the speakers are co-operative or wish to be recognized. While modern automatic speaker verification (ASV) systems and some listeners are good at recognizing speakers from modal, unmodified speech, the task becomes notoriously difficult in situations of deliberate voice disguise when the speaker aims at masking his or her identity. We approach voice disguise from the perspective of acoustical and perceptual analysis using a self-collected corpus of 60 native Finnish speakers (31 female, 29 male) producing utterances in normal, intended young and intended old voice modes. The normal voices form a starting point and we are interested in studying how the two disguise modes impact the acoustical parameters and perceptual speaker similarity judgments.

First, we study the effect of disguise as a relative change in fundamental frequency (F0) and formant frequencies (F1 to F4) from modal to disguised utterances. Next, we investigate whether or not speaker comparisons that are deemed easy or difficult by a modern ASV system have a similar difficulty level for human listeners. Further, we study which listener-related factors, gathered from self-reported background information, may explain a particular listener's success or failure in speaker similarity assessment.

Our acoustic analysis reveals a systematic increase in the relative change in mean F0 for the intended young voices, while for the intended old voices the relative change is less prominent in most cases. Concerning the formants F1 through F4, 29% (for male) and 30% (for female) of the utterances did not exhibit a significant change in any formant value, while the remaining ∼70% of utterances had significant changes in at least one formant.

Our listening panel consists of 70 listeners, 32 native and 38 non-native, who listened to 24 utterance pairs selected using rankings produced by an ASV system. The results indicate that speaker pairs categorized as easy by our ASV system were also easy for the average listener. Similarly, the listeners made more errors in the difficult trials. The listening results indicate that target (same speaker) trials were more difficult for the non-native group, while the performance for the non-target pairs was similar for both native and non-native groups.

Introduction

The human voice carries individual characteristics that can be used to identify the speaker. In speaker recognition, the main focus of analysis is on who is speaking rather than what is being said. The human ability to recognize people by their voices is well known, especially in relation to familiar speakers (Schmidt-Nielsen and Stern, 1985). Moreover, technology-based speaker recognition has become increasingly common with the widespread use of personal hand-held devices for information access and daily communication. Nevertheless, whether performed by humans or automatic systems, the speaker recognition task can be challenging, as speech is subject to many variations induced by the speaker, the communication scenario and the transmission channel (Campbell, 1997, Hansen, Hasan, 2015, Kinnunen, Li, 2010). State-of-the-art automatic speaker verification (ASV) technology (Campbell, 1997, Kinnunen, Li, 2010) has advanced to deal with additive noise and channel variability, but the intrinsic, or speaker-based, variations of speech remain very challenging. According to Hansen and Hasan (2015), the variations in the speaker's voice characteristics can be affected by the scenario or by the task performed by the speaker, which may include vocal effort, emotion, physical condition and voluntary alterations of the voice.

Voluntary variations of speech can be induced either by electronic means, in which speech is purposefully modified by the use of voice transformation technology (Mohammadi, Kain, 2017, Stylianou, 2009, Clark, Foulkes, 2007), or by non-electronic means. Two cases of the latter can be identified. Firstly, the speaker may attempt to be identified as another person by means of mimicry or impersonation (González Hautamäki, Kinnunen, Hautamäki, Laukkanen, 2015, López, Riera, Assaneo, Eguía, Sigman, Trevisan, 2013, Panjwani, Prakash, 2014), as in voice acting or stand-up comedy. Secondly, in a more generic case that does not necessarily involve any specific target voice, the speaker adapts or transforms his or her voice with the aim of concealing his or her vocal identity. It is this broad form of variation, known as voice disguise, that forms the focus of our study. It may involve several variations in speaking style (Perrot, Aversano, Chollet, 2007, Rodman, Powell, 2000, San Segundo, Alves, Trinidad, 2013) and is a particularly relevant concern in forensics and audio surveillance, for example in the analysis of an armed robbery or a blackmailing call in which the perpetrator does not wish to be identified later.

Voice disguise may include one or several of the following modifications: a) forced modifications of the physical vocal cavities, such as pinched nose, pulled cheeks, the use of physical obstruction objects (e.g. helmet, face mask (Saeidi et al., 2016), handkerchief over the mouth, pencil or chewing gum (Zhang and Tan, 2008)); b) changes in the type of phonation, or modification of the sound source, e.g. imitating a speech defect, or a specific type of phonation such as a creaky, hoarse or falsetto voice (San Segundo et al., 2013); c) phonemic modification related to the change in pronunciation, e.g. adopting foreign accent sounds (Leemann and Kolly, 2015) or nasal speech; and d) prosody-related modifications in pitch or speech rate (Künzel, Gonzalez-Rodriguez, Ortega-García, 2004, Zhang, 2012). A visual example of a speaker’s voluntary modification of the voice is shown in Fig. 1, which presents spectrograms and F0 contours of the speaker’s own voice and two disguised voices.

Voice disguise is a complex problem that has attracted interest from different research communities. Previous studies on the topic enable one to identify three general perspectives: vulnerability analysis of ASV systems, effects on acoustic parameters and perceptual experiments. Vulnerability analysis mainly addresses voice disguise in terms of target speaker false rejections, and compares ASV system results with and without intentional voice modification. Acoustic analysis focuses on changes in the articulatory and voice source settings, which are most commonly measured through fundamental frequency (F0) and formant frequencies. Finally, perceptual evaluations study the performance of human listeners, usually in a controlled environment, in a speaker comparison task that includes disguised voices.

Our preliminary analysis of the effects of voice disguise on modern ASV systems was reported in González Hautamäki et al. (2016). The experiments indicated the vulnerability of our ASV systems in the presence of disguised voices when the speakers produced intended old and young voices. In terms of equal error rate (EER), the standard accuracy measure of biometric recognizers, we observed a 7-fold increase for intended old voices for male speakers and a 5-fold increase for female speakers. The increase in EER was even higher for the intended young voices: 11-fold for male and 6-fold for female speakers. An analysis of F0 histogram distributions for natural, intended old and intended young voices indicated a shift towards higher frequencies for some of the speakers. F0 values are expected to be higher for younger speakers, and for most of the speech segments F0 indeed increased for intended young voices; in the case of male speakers, it also increased for the intended old voice.
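
To make the EER figures above concrete, the following minimal sketch shows how an equal error rate can be estimated from ASV output scores; it is an illustration only, and the score distributions used in the example are hypothetical rather than the authors' data.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the equal error rate (EER) from two score arrays.

    The EER is the operating point at which the false rejection rate
    (genuine trials scored below the threshold) equals the false
    acceptance rate (impostor trials scored at or above it).
    """
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(frr - far))  # threshold where the two rates cross
    return (frr[idx] + far[idx]) / 2.0

# Hypothetical score distributions: disguise typically lowers the scores of
# genuine (target) trials, which raises the EER, as reported in the study.
rng = np.random.default_rng(0)
modal_genuine = rng.normal(2.0, 1.0, 1000)
disguised_genuine = rng.normal(0.5, 1.0, 1000)
impostor = rng.normal(-1.0, 1.0, 1000)
print(equal_error_rate(modal_genuine, impostor))      # low EER for modal speech
print(equal_error_rate(disguised_genuine, impostor))  # several-fold higher EER
```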

The present study seeks to proceed beyond the population level and the 'average' performance related to the EER metric. Its main objective is to gain a better understanding of the considerable performance loss of our ASV systems under voice disguise through a deeper investigation into the acoustics of disguised speech and an evaluation of the performance of human listeners. It does so by studying, for each speaker, the relative change in F0 and the differences in formants F1 through F4 caused by disguise. These acoustic features are affected, among many other factors, by biological ageing. Our study addresses a "simulated aging" process using young and old voice stereotypes, rather than biological ageing. In order to quantify the change in formant frequencies, we introduce a novel method that describes the joint change in all averaged formant values in terms of their direction of change (none, increase or decrease) instead of the raw formant measurements. This discrete, descriptive representation enables us to enumerate all the possible formant change patterns and to study their frequency of occurrence, in order to reveal whether any speaker-independent voice disguise strategies can be identified.
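
As an illustration of the direction-of-change encoding described above, the sketch below (our own illustration, not the authors' implementation; the paired t-test and the 0.05 significance level are assumptions) maps a speaker's modal and disguised utterance-averaged formant values to a four-symbol pattern and enumerates the 3^4 = 81 possible patterns, whose frequencies can then be counted across speakers.

```python
from collections import Counter
from itertools import product

from scipy.stats import ttest_rel

def formant_change_pattern(modal, disguised, alpha=0.05):
    """Encode the direction of change of F1-F4 as a four-symbol pattern.

    `modal` and `disguised` are arrays of shape (n_utterances, 4) holding
    per-utterance averaged formant values for one speaker and one disguise
    mode. Each formant is labelled '0' (no significant change),
    '+' (significant increase) or '-' (significant decrease).
    """
    pattern = []
    for k in range(4):
        t_stat, p_value = ttest_rel(disguised[:, k], modal[:, k])
        if p_value >= alpha:
            pattern.append("0")
        else:
            pattern.append("+" if t_stat > 0 else "-")
    return "".join(pattern)

# Enumerate all possible patterns and count how often each one occurs.
all_patterns = ["".join(p) for p in product("0+-", repeat=4)]  # 81 patterns
observed = Counter()  # e.g. observed[formant_change_pattern(m, d)] += 1 per speaker
```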

In addition to the acoustic analysis, we designed a perceptual experiment to benchmark human speaker verification accuracy under voice disguise. Our perceptual task includes two novel elements. First, the speech sample pairs, or trials, are selected using the results from the ASV systems implemented in our previous study (González Hautamäki et al., 2016); more specifically, we use the ASV system output to select easy, intermediate and difficult speaker pairs. The test includes trials with and without the presence of voice disguise as well as same-speaker and different-speaker cases. The second element is a comparison of the performance of native and non-native listeners, which is relevant in forensic settings such as voice line-ups, in which the listeners may be unfamiliar with the speaker's language. Previous studies confirm that the reliability of non-native listeners decreases in speaker recognition tasks (Eriksson, Llamas, Watt, 2010, Köster, Schiller, et al., 1997), which is why the results of non-native listeners in speaker comparison should be considered with caution. Although the accuracy of native vs. non-native listeners with normal voices has been addressed several times (e.g. by Kahn, Audibert, Rossato, Bonastre, 2011, Hautamäki, Kinnunen, Nosratighods, Lee, Ma, Li, 2010, Schwartz, Campbell, Shen, Sturim, Campbell, Richardson, Dunn, Granvill, 2011, Ramos, Franco-Pedroso, Gonzalez-Rodriguez, 2011), the authors are unaware of a previous study that compares the performance of native and non-native listeners on disguised voices for speaker recognition.
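
The trial-selection element can be sketched as follows; the even three-way split and the example scores are simplifications for illustration and do not reproduce the exact selection procedure used in the study.

```python
import numpy as np

def categorize_trials(scores, is_target):
    """Split trials into easy / intermediate / difficult thirds by ASV score.

    For target (same-speaker) trials a high score means an easy trial,
    whereas for non-target trials a low score means an easy trial, so the
    ranking is flipped before binning.
    """
    scores = np.asarray(scores, dtype=float)
    difficulty = -scores if is_target else scores   # larger value = harder trial
    order = np.argsort(difficulty)
    n = len(scores)
    labels = np.empty(n, dtype=object)
    labels[order[: n // 3]] = "easy"
    labels[order[n // 3 : 2 * n // 3]] = "intermediate"
    labels[order[2 * n // 3 :]] = "difficult"
    return labels

# Hypothetical target-trial scores under disguise: the lowest-scoring pairs
# would be selected as the 'difficult' trials of the listening test.
print(categorize_trials([3.1, 0.2, 1.8, -0.5, 2.4, 0.9], is_target=True))
```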

The dataset used for this study was collected by the authors and is the same as that used in our preliminary study (González Hautamäki et al., 2016). Our data consists of speech from 60 native Finnish speakers, 31 female and 29 male. We instructed the speakers not to sound like themselves by producing intended old and intended young voices in addition to their normal modal voices without disguise. The intended vocal age was chosen to define a disguise strategy that assumes the speakers share a common knowledge of how stereotypical old and young voices sound. In this setting, our experiments dealt with analyzing the effects of disguise on speaker verification accuracy. For our perceptual speaker comparison experiment, we recruited 70 listeners (32 native, 38 non-native), each of whom listened to the same set of 24 utterance pairs, with the trial order randomized for each listener.

The specific research questions that the present study seeks to answer are phrased as follows:

  • Q1.

    Is there a significant change in the F0 of female and male speakers when attempting voice disguise to sound older or younger? Does it increase or decrease?

  • Q2.

    Are there significant differences between the average of the first four formant frequencies of the natural and disguised voices of the female and male speakers?

  • Q3.

    Is there any speaker-independent disguise pattern that can be associated with the formant frequency variation between natural speech and disguised speech under the studied strategy?

  • Q4.

    Is listener performance affected by the presence of voice disguise in a similar way to the performance of our ASV systems?

  • Q5.

    Does knowledge of the speakers’ native language play a role in making more reliable perceptual speaker comparisons under modal voices and under disguise?

  • Q6.

    Is there a particular trial category or listener attribute that affects listener performance in the perceptual speaker recognition task?

Section snippets

Previous work on intentional voice modification and vocal ageing

Our study focuses on disguising one’s voice identity by means of a specific type of voice modification related to one’s perceptual age. Our primary interest is in identity disguise and its detrimental effects on the accuracy of speaker recognition, while age disguise merely serves as a shared and not too constrained task across our speakers. Given that our speakers are naïve, we do not necessarily expect them to produce particularly convincing old or young voice imitations. Nevertheless, in

Experimental data

The data collected for our study was first introduced in González Hautamäki et al. (2016). It consists of voice disguise as the only intentional modification of the speakers’ voices, as opposed to modifications that would involve measures such as physically obstructing one’s mouth or nostrils or the use of electronic (software or hardware) voice modifications as discussed by Rodman and Powell (2000). The main instruction given to the participants was to modify their voices to sound old

Acoustic analysis of the test material

To analyze the impact of voice disguise, we carried out an acoustical analysis using our test material. We studied the changes implied by voice disguise in F0 and formant frequencies F1 to F4. As mentioned above, these speech characteristics are also affected by biological ageing, which means that the speakers may attempt to produce a certain perceived age by modifying these primary voice parameters.
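
The F0 and formant measurements described here can be reproduced with Praat (Boersma and Weenink, 2015); below is a minimal sketch using the parselmouth Python interface to Praat as an assumed tool (it is not the authors' analysis script), with default analysis settings that would need to be adjusted per speaker and gender.

```python
import numpy as np
import parselmouth  # Python interface to Praat

def f0_and_formants(wav_path):
    """Return mean F0 and utterance-averaged F1-F4 (in Hz) for one recording."""
    snd = parselmouth.Sound(wav_path)

    # Mean F0 over voiced frames only (Praat's default pitch settings).
    pitch = snd.to_pitch()
    f0 = pitch.selected_array["frequency"]
    mean_f0 = float(np.mean(f0[f0 > 0]))

    # Formant tracks via Burg's method, averaged over all analysis frames.
    formants = snd.to_formant_burg(max_number_of_formants=5)
    mean_formants = []
    for k in range(1, 5):  # F1 .. F4
        vals = np.array([formants.get_value_at_time(k, t) for t in formants.xs()])
        mean_formants.append(float(np.nanmean(vals)))
    return mean_f0, mean_formants

# Relative change from modal to disguised voice, as analyzed in the study:
#   rel_change = (disguised_value - modal_value) / modal_value
```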

Perceptual speaker verification experiment

We have conducted a perceptual experiment in order to evaluate the performance of the listeners. This section details the experimental design and test results.

Discussion

This work presents a broad study into voice disguise effects with the use of acoustic and perceptual methodologies. Before concluding the study, we present an overview of the results obtained, together with our interpretation according to the research questions formulated at the end of Section 1.

Conclusions

Verifying the identity of speakers by means of short utterances that include voluntary variations of the voice is a very challenging task for both humans and state-of-the-art automatic speaker verification systems. Therefore, it is important to investigate how speakers manipulate their voices in order to avoid identification. Our case study addressed the impact of voice disguise when the speakers attempt to sound much older or younger than their actual age. To this end, we conducted an

Acknowledgments

This study was supported by the Academy of Finland (projects no. 253120, 283256 and 309629), the Finnish Scientific Advisory Board for Defense (MATINE) project no. 2500M-003, and the Nokia Foundation. The authors would like to thank Maria Bentz and Prof. Stefan Werner of UEF for their contribution to the planning and execution of the data collection. Finally, the authors would like to express their sincere thanks to the volunteer speakers and listeners who made this study possible.

References (49)

  • Boersma, P., Weenink, D., 2015. Praat: doing phonetics by computer [Computer program]. Version 5.4.09, retrieved 15...
  • Brookes, M., 2006. Voicebox: Speech processing toolbox for MATLAB. Software, available [January 2014] from...
  • J.P. Campbell. Speaker recognition: a tutorial. Proc. IEEE (1997)
  • D. Childers. Modern Spectrum Analysis (1978)
  • J. Clark et al. Identification of voices in disguised speech. Int. J. Speech, Lang. Law (2007)
  • V. Dellwo et al. How is individuality expressed in voice? An introduction to speech production and description for speaker classification
  • A. Dempster et al. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (1977)
  • W. Endres et al. Voice spectrograms as a function of age, voice disguise, and voice imitation. J. Acoust. Soc. Am. (1971)
  • A. Eriksson et al. The disguised voice: imitating accents or speech styles and impersonating individuals. Lang. Identities (2010)
  • Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D., Dahlgren, N., Zue, V., 1993. TIMIT Acoustic-phonetic...
  • R. González Hautamäki et al. Age-related voice disguise and its impact in speaker verification accuracy. Proc. Odyssey: The Speaker and Language Recognition Workshop (2016)
  • J.H. Hansen et al. Speaker recognition by machines and humans: a tutorial review. IEEE Signal Process. Mag. (2015)
  • J. Harrington et al. Age-related changes in fundamental frequency and formants: a longitudinal study of four speakers. Proc. Interspeech (2007)
  • V. Hautamäki et al. Approaching human listener accuracy with modern speaker verification. Proc. Interspeech, Makuhari, Japan (2010)