Elsevier

Speech Communication

Volume 56, January 2014, Pages 70-81

Native vs. non-native accent identification using Japanese spoken telephone numbers

https://doi.org/10.1016/j.specom.2013.07.010

Highlights

  • We recorded and analysed spoken Japanese telephone numbers.

  • Prosodic realisation was compared between native and non-native Japanese speakers.

  • Only native speakers realised the specific prosodic pattern for telephone numbers.

  • Prosodic differences were used in a foreign accent identification experiment.

  • A perceptual foreign accent identification experiment was also carried out.

Abstract

In forensic investigations, it would be helpful to be able to identify a speaker’s native language based on the sound of their speech. Previous research on foreign accent identification suggested that identification accuracy can be improved by using linguistic forms in which non-native characteristics are reflected. This study investigates how native and non-native speakers of Japanese differ in reading Japanese telephone numbers, which have a specific prosodic structure called a bipodic template. Spoken Japanese telephone numbers were recorded from native speakers and from Chinese and Korean learners of Japanese. Twelve utterances were obtained from each speaker, and their F0 contours were compared between native and non-native speakers. All native speakers realised the prosodic pattern of the bipodic template while reading the telephone numbers, whereas non-native speakers did not. The metric rhythm and segmental properties of the speech samples were also analysed, and a foreign accent identification experiment was carried out using six acoustic features. By applying a logistic regression analysis, this method yielded an 81.8% correct identification rate, slightly better than that achieved in other studies. Discrimination accuracy between native and non-native accents was better than 90%, although discrimination between the two non-native accents was less successful. A perceptual accent identification experiment was also conducted in order to compare automatic and human identification. The results revealed that human listeners discriminated between native and non-native speakers better than the automatic method did, but were worse at identifying the specific foreign accents.

Introduction

Globalisation has provided us with more opportunities to communicate with people from all around the world. This also leads to more chances to hear foreign accents. Investigation of foreign accents is important not only for second language (L2) acquisition research and language teaching, but also for technologies such as speech recognition, speaker recognition, and accent identification. The term “accent” can be defined as speech properties that indicate which country, or which part of a country, the speaker originates from. Accent identification is commonly used to identify a speaker’s mother dialect (D1) by using speech samples spoken in D1 or other dialects (D2). For foreign accent identification, a speaker’s first language (L1) is identified using speech in L2 or a later language. Applications of accent identification include preprocessing for automatic speech recognition and language support for L2 speakers. The performance of a speech recognition system can be improved by applying accent identification in advance and then using a dialect or language model in which the accent colour is taken into consideration (e.g., Brousseau and Fox, 1992, Blackburn et al., 1993, Arslan and Hansen, 1996, Fung and Kat, 1999). This is also useful for assisting L2 speakers when call routing is needed for emergency operators or in multi-lingual voice-controlled information retrieval systems (Muthusamy et al., 1994, Zissman, 1996). Furthermore, in forensic situations, when there is a possibility that obtained speech samples were spoken by a D2/L2 speaker, identifying the speaker’s accent, and consequently his/her nationality and/or hometown, can often lead to important clues with regard to the suspect.

A speech technology similar to accent identification is language identification. However, compared to language identification, in which the language spoken by a native speaker is identified, accent identification is considered to be a more challenging task. One reason is that the traits of a speaker’s D1/L1 are carried into D2/L2 speech in various ways. These traits, often called language transfers, may appear on the segmental level, for instance, the substitution of unfamiliar phonemes with similar sounds from the D1/L1, or on the supra-segmental (prosodic) level, e.g., erroneous word accents, clumsy rhythm, and inappropriate intonation. What makes accent identification more difficult is the fact that language transfer is not unique to one target dialect/language or speaker but depends on the speaker’s D1/L1, the language-typological distance between D1/L1 and D2/L2, and various individual factors. For example, different phonemic inventories and phonotactics will bring about different articulatory errors, and different accentuation systems will cause different prosodic problems. Also, the degree of language transfer is reported to depend on each speaker’s age of learning (or age of arrival), amount of exposure and interactive contact with native speakers (e.g., Flege, 1988, Flege and Fletcher, 1992), experience of learning other foreign dialects or languages (Mehlhorn, 2007, Wrembel, 2009), and the individual’s language talent (Markham, 1999), although several reports dispute the effects of the first two factors (Mackay and Fullana, 2009, Fullana and Mora, 2009).

Previous research on accent identification can be classified into three groups: that based on segmental and articulatory features (Arslan and Hansen, 1996, Kumpf and King, 1996, Teixeira et al., 1996, Berkling et al., 1998, Yanguas et al., 1998), that based on prosodic features (Itahashi and Yamashita, 1992, Itahashi and Tanaka, 1993, Hansen and Arslan, 1995, Mixdorff, 1996, Piat et al., 2008), and that based on both (Piat et al., 2008, Arslan and Hansen, 1997, Vieru-Dimulescu et al., 2007). Kumpf and King (1996) identified three accents of Australian English: Lebanese, Vietnamese, and native. They used a system based on a hidden Markov model (HMM) trained on 2000 sentences recorded from 16 speakers, and identified more than 50 utterances produced by 63 speakers using 12th-order mel-frequency cepstral coefficients (MFCC), log energy, and the deltas of both as the acoustic features. Their system achieved 85.3% correct pair-wise identification on average and 76.6% correct identification among the three accents. Similarly, Teixeira et al. (1996) identified six accents of English (Portuguese, Danish, German, British, Spanish, and Italian) using an HMM-based system. They used a speech corpus that contained 200 English isolated words, and calculated linear predictive coding (LPC) cepstra and their deltas as the acoustic features. Their system obtained a 65.5% correct identification rate. An example of using prosodic cues was described by Itahashi and Tanaka (1993). They analysed a Japanese passage read by speakers of 14 regional dialects, and extracted 19 acoustic parameters related to F0. A principal component analysis was performed on these parameters, and the results showed that the 14 dialects could be classified into six groups that approximately corresponded to the regions to which the dialects belong. Finally, Piat et al. (2008) carried out a study involving the identification of four accents (French, Italian, Greek, and Spanish) of English.
They compared the identification performance of their HMM-based system using 1-dimensional duration, 3-dimensional energy, 36-dimensional MFCC, and other prosodic features. The results showed that the MFCC yielded the highest identification rate of 82.9%, whereas duration and energy yielded rates of 67.1% and 68.6%, respectively. They thus concluded that MFCC provided a superior identification rate, although the computational cost was higher.

It is not easy to compare the identification results of the above studies, as they used different speech corpora and different comparison methods; however, these previous studies indicate that accent identification performance improves by using linguistic knowledge of the target languages effectively. This can be, for example, knowledge of linguistic forms for which non-native speakers saliently differ from native speakers, or knowledge of how to detect these linguistic forms in running speech. Blackburn et al. (1993) proposed a method for classifying non-native English accents using features related to phonological differences between the accents. They exploited knowledge of segmental differences among Arabic-accented, Mandarin-accented and Australian (native) English, and extracted features such as the phoneme duration of the sibilants, the voice onset time of the plosives, and the formant frequencies of the vowels. With their system, which was based on a neural network, 96% of Australian English, 35% of Arabic-accented, and 62% of Mandarin-accented male speech were correctly identified using voiced segments. Cleirigh and Vonwiller (1994) developed a phonological model of Australian English that included information on English syllable structure and the distribution of phonemes within a syllable. Berkling et al. (1998) applied this model to the identification of Vietnamese-accented and Lebanese-accented Australian English. They conducted two accent identification experiments, one using the linguistic model and the other not using it. When they incorporated the linguistic model into their system, the performance improved by 6–7% (84% for the English–Lebanese pair and 93% for the English–Vietnamese pair) compared to the system without the model (78% for the English–Lebanese pair and 86% for the English–Vietnamese pair).
Zissman (1996) built a speech corpus for testing accent identification (using the term “dialect identification”) systems for conversational Latin-American Spanish. He also built an accent identification system using HMM-based phoneme recognition. By applying N-gram language modeling, his system achieved an 84% correct discrimination rate between Cuban and Peruvian Spanish. Subsequently, Yanguas et al. (1998) used the same speech corpus and explored accent identification systems that exploit linguistic knowledge of variations in phoneme realisations. They concentrated on reducing the length of the speech samples needed for identification. By using the duration and energy of the fricative /s/ taken from read digits, their system yielded 72% accuracy in discriminating between Cuban and Peruvian Spanish.

In the present article, another example of using linguistic knowledge is introduced for identifying foreign accents of Japanese. Our method successfully discriminated between Japanese native and non-native accents, although it still needs improvement in discriminating among non-native accents. The study is motivated by the fact that not enough research has been done on foreign accent identification in the forensic context, although it has been noted that foreign accent has a detrimental effect on speaker identification tasks (Tate, 1979; reviewed in Hollien, 2002). It is important for forensic speech investigators to determine whether the target speech samples contain any accents, and if they do, what types of accents they are. What is important here is that the presence of two different accents in the speech samples strongly suggests that the samples come from two different people (Hollien, 2002, Rogers, 1998); forensic practitioners must be able to present objective and scientific evidence that the accent of the perpetrator is the same as that of the suspect before concluding that the two speech samples, the perpetrator’s and the suspect’s, were produced by the same speaker. Also, accents, whether regional or foreign, can provide speaker profile information. Forensic practitioners often use accents heard in speech samples as part of their assessment of the speaker’s identity (Kulshreshtha et al., 2012). Given these circumstances, research on accent identification is critical for forensic speech investigations, and the construction of a reliable accent identification system is of great value to those who work in this field.

When selecting the acoustic features for accent identification in forensic casework, we often face situations where we are constrained to use prosodic features because of the rather poor quality of the speech data. Generally speaking, prosodic features are more robust against background noise and transmission characteristics. Sometimes we are also able to extract segment-related frequency features. In the present article, we mainly analysed prosodic characteristics, although we also made some use of segmental characteristics. In the experiment, we focus on spoken telephone numbers (hereafter STN). Some studies have pointed out that language-dependent structures are highlighted in STN (Baumann and Trouvain, 2001, Katagiri, 2008). Japanese, too, has a specific prosodic pattern for STN. In the Japanese language education field, teachers do not spend enough time on pronunciation. According to Toki’s (2010) survey of Japanese textbooks, all 14 textbooks in his study covered the pronunciation of the vowels and consonants, vowel duration contrast, and syllabic consonants. Most of them also covered word accent and vowel devoicing. However, only a few of them contained information on intonation or rhythm. This implies that non-native Japanese speakers are not familiar with the prosody of Japanese STN; thus, STN may give rise to large differences between native and non-native speech.

Most Japanese regional dialects employ two-level (high and low) pitch accents with mora-timed rhythm (Vance, 2008). Each mora in a word is inherently associated with a specific pitch. In addition, Japanese has bimoraic metrical feet. In Japanese, a bimoraic foot plays an important role in accounting for various prosodic phenomena. Linguistic forms such as reduplicated mimetics (/kiɾakiɾa kiɾakiɾa/ “glitter”), clipped words (/ɾimokon/“remote control”), and STN all have a prosodic pattern based on bimoraic feet. In these linguistic forms, two bimoraic feet, i.e., four morae, are grouped together to produce a prosodic structure called a bipodic template (BT) (Poser, 1990, Tsujimura, 1996, Nasu, 2001). There are certain rules for reading Japanese STN. Numerical digits in Japanese are either one- or two-moraic as shown in Table 1; however, once subsumed into STN, one-moraic digits (/ni/ “2” and /ɡo/ “5”) are read with an elongated vowel (/niː/ and /ɡoː/), which means all Japanese digits are read as two-moraic in STN. Additionally, digits in Japanese STN are read in isolation, unlike in English and other European languages (e.g., “11” is read as “one one” and not as “double one;” and “62” is read as “six two” and not as “sixty-two”). Every two digits from the beginning of STN are phonologically grouped together to compose one BT. Accentuation occurs for every BT; accordingly, one accentual peak appears every two digits, i.e., every four morae. Three-digit numbers are read as two-one digit combinations. The first two digits are read in one BT, and the remaining one digit is read in another BT (Nasu, 2001).
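The reading rules described above can be sketched as a small function. This is a hypothetical illustration only, not part of the study's materials; the romanised digit readings are common dictionary forms assumed here.

```python
# Sketch of the Japanese STN reading rules described above (illustrative
# only): every digit gets a two-mora reading -- the one-mora digits
# /ni/ "2" and /go/ "5" are elongated -- and digits are grouped in
# pairs, each pair filling one bipodic template (BT).

# Romanised two-mora readings assumed for illustration.
STN_READINGS = {
    "0": "zero", "1": "ichi", "2": "nii", "3": "san", "4": "yon",
    "5": "goo", "6": "roku", "7": "nana", "8": "hachi", "9": "kyuu",
}

def read_stn(number: str) -> list:
    """Group an STN digit string into bipodic templates (two digits each).

    Non-digit characters (hyphens etc.) are ignored; a trailing odd digit
    occupies a BT of its own, as in the 2+1 reading of three-digit parts.
    """
    readings = [STN_READINGS[ch] for ch in number if ch.isdigit()]
    return [readings[i:i + 2] for i in range(0, len(readings), 2)]

print(read_stn("2517"))  # [['nii', 'goo'], ['ichi', 'nana']]
```

Each inner list corresponds to one BT and hence to one expected accentual peak in native speech.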

The purpose of the present study was, first of all, to show the differences in STN prosody realisation between native and non-native Japanese speakers, both quantitatively and qualitatively. Pitch contours were analysed for Japanese STN produced by speakers with three accents: Chinese, Korean, and native Japanese. The results revealed that all of the native Japanese speakers realised the prosodic pattern of the BT structure, whereas non-native speakers showed different prosodic patterns. The second objective of this study was to examine the extent to which these differences can be used for identifying foreign accents. An accent identification experiment was conducted in order to assess the usefulness of extracted acoustic features related to the prosodic pattern of Japanese STN and the frequency properties of certain segments. A small experiment on the performance of human listeners in identifying non-native accents of Japanese was also conducted.
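As a rough quantitative check of the BT pattern (one accentual peak per two digits), one could count prominent local maxima in an utterance's F0 contour and compare the count with the expected number of BTs. The sketch below uses synthetic F0 values and a crude prominence criterion; it is an assumed illustration, not the analysis procedure used in the study.

```python
def count_f0_peaks(f0, prominence=5.0):
    """Count local maxima in an F0 contour (values in Hz) that stand at
    least `prominence` Hz above the lowest value on each side (a crude
    prominence criterion; real contours would first need smoothing and
    removal of unvoiced gaps)."""
    peaks = 0
    for i in range(1, len(f0) - 1):
        if f0[i] > f0[i - 1] and f0[i] >= f0[i + 1]:
            if f0[i] - max(min(f0[:i]), min(f0[i + 1:])) >= prominence:
                peaks += 1
    return peaks

# A synthetic contour with four humps, as might arise from an 8-digit
# STN read with one accentual peak per BT (four BTs).
contour = [100, 140, 100, 145, 100, 150, 100, 135, 100]
print(count_f0_peaks(contour))  # 4
```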

Section snippets

Speech materials

Native Japanese, Chinese and Korean speakers participated in the recording sessions which lasted between 60 and 90 min. The six telephone numbers shown in Table 2 were recorded twice for each participant in an anechoic room at the National Research Institute of Police Science, Chiba, Japan. Twenty-six (14 female and 12 male) native Japanese speakers with a mean age of 24.3 years participated in the recording sessions. These speakers came from various regions of Japan, including the Tohoku, Kanto,

Speech materials

The same speech materials were used as in the above analysis. Each speaker uttered the six telephone numbers twice. The total number of speech samples was, again, 312 for the native Japanese speakers, 216 for the Chinese speakers of Japanese, and 108 for the Korean speakers of Japanese.

Acoustic features

Six acoustic features were extracted and used in the experiments: one feature related to F0, two features related to speech rhythm, and three features related to the frequency properties of certain segments. As
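A pipeline of this kind, a small per-utterance feature vector fed to a logistic regression classifier (as in the identification experiment described in the abstract), can be illustrated with a minimal, dependency-free sketch. The feature values and the two-feature, two-class setup below are synthetic placeholders, not the study's data or its actual six-feature model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=2000):
    """Fit logistic-regression weights (last element is the bias) by
    batch gradient descent on the log-loss."""
    n_feat = len(X[0])
    w = [0.0] * (n_feat + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_feat + 1)
        for x, t in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])
            err = p - t
            for j in range(n_feat):
                grad[j] += err * x[j]
            grad[-1] += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w

def predict(w, x):
    """Probability that an utterance is native-accented."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w[:-1], x)) + w[-1])

# Synthetic two-feature utterances: [F0 peaks per BT, a rhythm metric].
X = [[1.0, 0.2], [1.1, 0.25], [2.0, 0.6], [1.9, 0.7]]
y = [1, 1, 0, 0]  # 1 = native, 0 = non-native
w = train_logreg(X, y)
print(predict(w, [1.05, 0.22]))  # high probability: native-like input
```

For a three-accent task such as the one in this study, this binary model would be extended to multinomial logistic regression, for example via one-vs-rest classifiers.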

Speech materials

In order to investigate the accuracy of human accent identification, a small perceptual experiment was carried out in which listeners attempted to identify foreign accents for Japanese STN. A subset of the speech materials used in Experiment 1 was selected and used in this experiment. Taking into account the experimental burden on the listeners, three telephone numbers out of the previous six were selected (N1, N4, and N5), one for each type of numbering, to be tested in the experiment. It was

General discussion

In this study, the prosody of Japanese STN was analysed in order to investigate the differences between native and non-native Japanese speakers. As predicted, the native speakers realised the prosodic structure of BT, whereas non-native speakers did not follow the prosodic rule for Japanese STN. The Chinese speakers’ patterns had more peaks whereas Korean speakers’ patterns were much flatter than the native speakers’ patterns. Both Chinese and Korean speakers’ utterances had a smaller F0 range

Acknowledgments

Portions of this work were presented at ICPhS 2011 (K. Amino and T. Osanai, Realisation of the prosodic structure of spoken telephone numbers by native and non-native speakers of Japanese, in: Proc. International Congress of Phonetic Sciences, pp. 236–239, Hong Kong, August 2011) and the ASJ meeting in 2011 (K. Amino and T. Osanai, Identification of native and nonnative speech by using Japanese spoken telephone numbers, in: Proc. Autumn Meeting Acoust. Soc. Jpn., pp. 407–410, Matsue, September

References (62)

  • L.M. Arslan et al., Language accent classification in American English, Speech Commun. (1996)
  • F. Ramus et al., Correlates of linguistic rhythm in the speech signal, Cognition (1999)
  • Arslan, L.M., Hansen, J., 1997. Frequency characteristics of foreign accented speech. In: Proc. International...
  • Baumann, S., Trouvain, J., 2001. On the prosody of German telephone numbers. In: Proc. Eurospeech. pp....
  • M.G. Beaupre et al., An ingroup advantage for confidence in emotion recognition judgments: the moderating effect of familiarity with the expressions of outgroup members, Personal. Social Psychol. Bull. (2006)
  • Berkling, K., Zissman, M., Vonwiller, J., Cleirigh, C., 1998. Improving accent identification through knowledge of...
  • Blackburn, C.S., Vonwiller, J.P., King, R.W., 1993. Automatic accent classification using artificial neural networks....
  • B. Bloch, Studies in colloquial Japanese 4 – phonemics, Language (1950)
  • P. Boersma, Praat, a system for doing phonetics by computer, Glot Int. (2001)
  • M.B. Brewer, In-group bias in the minimal intergroup situation: a cognitive-motivational analysis, Psychol. Bull. (1979)
  • Brousseau, J., Fox, S.A., 1992. Dialect-dependent speech recognisers for Canadian and European French. In: Proc....
  • Cleirigh, C., Vonwiller, J., 1994. Accent identification with a view to assisting recognition. In: Proc. International...
  • H.A. Elfenbein et al., Is there an in-group advantage in emotion recognition?, Psychol. Bull. (2002)
  • J.M. Flege, Factors affecting degree of perceived foreign accent in English sentences, J. Acoust. Soc. Am. (1988)
  • J.M. Flege et al., Talker and listener effects on degree of perceived foreign accent, J. Acoust. Soc. Am. (1992)
  • N. Fullana et al., Production and perception of voicing contrasts in English word-final obstruents: assessing the effects of experience and starting age
  • Fung, P., Kat, L.W., 1999. Fast accent identification and accented speech recognition. In: Proc. International...
  • M. Hall et al., The WEKA data mining software: an update, SIGKDD Explor. (2009)
  • Hansen, J., Arslan, L.M., 1995. Foreign accent classification using source generator based prosodic features. In: Proc....
  • H. Hirano et al., Prosodic features and evaluation of Japanese sentences spoken by Chinese learners, IEICE Technical Report SP (2006)
  • H. Hirano et al., Analysis of prosodic features in native and non-native Japanese using generation process model of fundamental frequency contours, IEICE Technical Report SP (2006)
  • H. Hollien, Forensic Voice Identification (2002)
  • Itahashi, S., Tanaka, K., 1993. A method of classification among Japanese dialects. In: Proc. Eurospeech. pp....
  • Itahashi, S., Yamashita, T., 1992. A discrimination method between Japanese dialects. In: Proc. International...
  • C. Ito et al., The adaptation of Japanese loanwords into Korean, MIT Working Pap. Ling. (2006)
  • Japanese-Language Proficiency Test:...
  • H. Joh et al., Dictionary of Basic Phonetic Terms (2011)
  • Katagiri, K.L., 2008. Pitch accent realisation of the Japanese digits by Filipino learners of Japanese. In: Proc....
  • C.W. Kim, The vowel system of Korean, Language (1968)
  • H. Kindaichi et al., Dictionary of Japanese Accents (2001)
  • M. Koshimizu, Chinese