Elsevier

Journal of Phonetics

Volume 32, Issue 1, January 2004, Pages 111-140
Some acoustic cues for the perceptual categorization of American English regional dialects

https://doi.org/10.1016/S0095-4470(03)00009-3

Abstract

The perception of phonological differences between regional dialects of American English by naïve listeners has received little attention in the speech perception literature and is still a poorly understood problem. Two experiments were carried out using the TIMIT corpus of spoken sentences produced by talkers from a number of distinct dialect regions in the United States. In Experiment 1, acoustic analysis techniques identified several phonetic features that can be used to distinguish different dialects. In Experiment 2, recordings of the sentences were played back to naïve listeners who were asked to categorize talkers into one of six geographical dialect regions. Results showed that listeners are able to reliably categorize talkers using three broad dialect clusters (New England, South, North/West), but that they have more difficulty categorizing talkers into six smaller regions. Multiple regression analyses on the acoustic measures, the actual dialect affiliation of the talkers, and the categorization responses revealed that the listeners in this study made use of several reliable acoustic–phonetic properties of the dialects in categorizing the talkers. Taken together, the results of these two experiments confirm that naïve listeners have knowledge of phonological differences between dialects and can use this knowledge to categorize talkers by dialect.

Introduction

Studies of phonological differences in regional dialects of American English have focused primarily on the collection of phonological atlases, phonological descriptions of specific dialects, or the social aspects of attitudes towards certain dialects, such as perceived “correctness” and social stereotypes related to speakers of a given dialect (e.g., Giles, 1970; Labov, 1972; Labov, Ash, & Boberg, in press; Preston, 1993). Linguistic atlas projects have been in progress in the United States since the early 1900s with the goal of documenting regional phonological and lexical variation (e.g., Kurath, 1939). Labov conducted some of the first smaller-scale variation research in the 1960s in his well-known study on the use of [ɹ] in New York City department stores. This new research approach proved instrumental in shaping the current field of variationist research (Labov, 1972). At around the same time, Giles and his colleagues were also conducting attitude judgment research in Britain on native speaker attitudes towards varieties of English spoken in the British Isles. They asked participants to listen to recordings of speech and rate the talker on dimensions such as aesthetics and status (e.g., Giles, 1970).

While dialect geographers have typically focused more on lexical variation than on phonological variation, early linguistic atlas projects collected information about both lexical and phonological variation (e.g., Kurath, 1939). More recently, Labov et al. (in press) have been working on a complete phonological description of North American English, using data collected from telephone surveys of over 600 talkers across the United States and Canada. The recordings from these talkers were impressionistically transcribed and acoustic measurements of F1 and F2 were obtained for each of the vowels they selected to study. Based on the differences in vowel production, their Atlas of North American English identifies various levels of dialect boundaries that range from a basic North–South–West split to the division of New England into Eastern New England, Western New England, and New York City.

In another study, Thomas (2001) explored the vowel systems of close to 200 speakers of North American English using F1 and F2 acoustic measurements taken from recordings of read passages and spontaneous conversational speech materials. The bulk of the speakers in his study came from Ohio, North Carolina, or Texas, which allowed him to carry out a thorough analysis of the differences between these three regions, as well as provide a sense of the degree of variation within a given locale, such as a state or even a city. In his presentation and discussion of the vowel spaces of individual talkers, Thomas made two important contributions to the acoustic literature on variation. First, he provided a quantitative means of comparing individuals within and across dialect areas. Second, he used a single methodology in the analysis of every talker in his corpus, thus allowing for the direct comparison of materials that had previously been collected for disparate projects but had never been presented together in a standard format.

The collection of a large number of samples of spoken language has enabled researchers to study specific dialects. Similarly, the study of specific dialects has led to the collection of large spoken language corpora. In studying and describing individual varieties of American English, the focus has typically been on the vowel system of that variety (Docherty & Foulkes, 1999). The current shifts in the vowel systems of two regions in particular have received a great deal of attention in the past few decades: the Northern Cities vowel shift and the Southern vowel shift. The Northern Cities vowel shift is characterized by a “clockwise” rotation of the low vowels in the F1×F2 vowel space as shown on the left in Fig. 1. It has been found in such urban areas as Buffalo, Cleveland, Detroit, and Chicago (Labov, Yaeger, & Steiner, 1972; Labov, 1972). The Southern vowel shift, on the other hand, is characterized by the centralization of the high tense vowels and the lengthening of the high front lax vowels as shown on the right in Fig. 1. This shift is found more prominently in rural areas of the South, as opposed to the more urban populations that exhibit the Northern Cities vowel shift (Labov & Ash, 1997).

A third phenomenon involving vowels in American English that has received attention in the literature is the low-back merger in which /ɔ/ and /ɑ/ have merged to make homophones of such pairs as ‘caught’ and ‘cot’ or ‘Dawn’ and ‘Don.’ The quality of the merged vowel varies from talker to talker, and ranges from [ɒ] to [a]. This low-back merger is found in the Midland areas of the United States and much of the West (Wolfram & Schilling-Estes, 1998; Labov et al., in press). While vowels have been the primary focus of phonological dialect descriptions, consonantal phenomena like the postvocalic r-lessness found in New England, New York City, and some parts of the South, and the ‘greasy’ ∼ ‘greazy’ alternation found in the South have also been noted as dialect characteristics in discussions of phonological differences (Labov, 1972; Wolfram & Schilling-Estes, 1998).
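In acoustic terms, a merger of this kind shows up as the two vowel categories occupying the same region of the F1×F2 space. The sketch below is illustrative only and is not a measure used in this study. It assumes that a talker's /ɔ/ (‘caught’) and /ɑ/ (‘cot’) tokens have already been measured as (F1, F2) pairs in Hz. The function names and all formant values are hypothetical.

```python
from math import dist          # Euclidean distance (Python 3.8+)
from statistics import fmean   # floating-point mean (Python 3.8+)

def vowel_mean(tokens):
    """Return the mean (F1, F2) in Hz of a list of (F1, F2) token measurements."""
    return (fmean(f1 for f1, _ in tokens), fmean(f2 for _, f2 in tokens))

def low_back_distance(caught_tokens, cot_tokens):
    """Euclidean distance in Hz between the /ɔ/ and /ɑ/ category means.

    A distance near zero is consistent with the two vowels being merged.
    A large distance suggests that the talker keeps them distinct.
    """
    return dist(vowel_mean(caught_tokens), vowel_mean(cot_tokens))

# Hypothetical (F1, F2) values in Hz, made up for illustration only.
distinct_talker = {"caught": [(600, 900), (620, 920)], "cot": [(900, 1300), (920, 1320)]}
merged_talker = {"caught": [(750, 1150), (770, 1170)], "cot": [(760, 1160), (760, 1160)]}

for label, talker in [("distinct", distinct_talker), ("merged", merged_talker)]:
    d = low_back_distance(talker["caught"], talker["cot"])
    print(f"{label}: {d:.0f} Hz between 'caught' and 'cot' means")
```

Raw Hz distances ignore two things: the spread of tokens within each category and differences in vocal-tract size between talkers. For that reason, studies of mergers often normalize formants or use overlap-based measures such as the Pillai score. The simple distance here is only meant to make the idea of “the same region of the vowel space” concrete.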

In studying the perception of these phonological differences by naïve listeners, the methods that are typically used have been based on the representations of dialect variation that listeners have stored in memory, and not on direct behavioral responses to any speech stimuli. For example, Preston (1986, 1989) conducted a series of studies in which he asked undergraduates from various parts of the United States to complete a number of tasks, including drawing and labeling dialect regions on a map of the United States and ranking all 50 states on the “correctness” or the “pleasantness” of the English spoken there. Studies were conducted in Hawaii, southern Indiana, eastern Michigan, New York City, and western New York. The results indicated that naïve undergraduates cannot accurately replicate the dialect boundaries drawn by such variationist researchers as Labov.

A comparison of the composite maps of each participant group indicated that concepts of dialect variation are related, in part, to where a listener lives. In general, listeners defined more dialect regions in areas that were in closer geographic proximity to themselves than in areas that were farther away (Preston, 1986). Similarly, results of the ranking task for informants in southern Indiana indicated that “pleasantness” seemed to correspond to geographic proximity to Indiana, whereas “correctness” seemed to correspond more to highly familiar stereotypes of where “standard” English is spoken, with California and the North and Northeast regions receiving the highest rankings (Preston, 1989). The results of these perceptual dialectology studies may have been affected by the participants’ poor knowledge of United States geography. Preston admitted that using a map without state boundaries resulted in great confusion and that “for the time being, folk dialectology is confounded with folk geography” (1993, p. 335). Similar problems in perceptual dialectology research have been noted in Great Britain (Inoue, 1999; Wales, 1999).

In addition to the concern about participants’ geographic knowledge, one major criticism of this research strategy is that the participants are rarely, if ever, asked to listen to actual speech samples when completing these perceptual tasks. Instead, they are asked to make judgments based on their mental representations of dialect variation stored in long-term memory. It is therefore unclear whether or not the participants could reliably identify any given talker as being from a place that they rate as having “pleasant” or “correct” English.

In addition to Preston's (1986, 1989, 1993) perceptual dialectology work, a substantial body of literature exists that has examined attitude judgments based on listening to samples of spoken language. Lambert, Hodgson, Gardner, and Fillenbaum (1960) introduced a methodology known as the “matched-guise technique” in which the same talker produces utterances in multiple languages or dialects in order to control inter-talker variability while obtaining the desired inter-language or inter-dialect variability. Listeners are then typically asked to rate the talker on a series of scales related to intelligence, likeability, competence, trustworthiness, etc. Giles and his colleagues (Giles, 1970; Giles & Bourhis, 1973; Bourhis, Giles, & Lambert, 1975; Ryan & Carranza, 1975) have used this technique to explore linguistic attitudes of naïve listeners to a range of regional, ethnic, and social language varieties and have consistently found that talkers exhibiting features of less prestigious dialects receive lower ratings than talkers exhibiting features of more prestigious dialects.

More recently, Purnell, Idsardi, and Baugh (1999) assessed the ability of naïve listeners to identify the dialect of a talker, using a variation of the matched-guise technique. The male talker in the study by Purnell et al. left answering machine messages for landlords in various neighborhoods in the San Francisco area inquiring about apartments for rent using white, African-American, and Chicano guises. The results suggested that the landlords were able to identify the talker's dialect (or guise) from these short samples of speech: the number of returned calls for the African-American and Chicano guises increased as the minority population of the neighborhood increased.

One major criticism of matched-guise research is that participants are rarely asked to identify where they think the talkers are actually from. Interpretations of the results therefore are often based on the assumption that the listeners first correctly identified where the talker was from and then completed the attitude judgment portion of the task. The validity of this assumption, however, is questionable, particularly given other research in the domain of dialect perception. In his famous study of the linguistic attitudes of teachers in Chicago, Williams (1976) found that both white and African-American children who were identified as white received higher ratings on a number of language-related scales such as fluency, standard pronunciation, and sentence complexity than children who were identified as African-American. He concluded that the teachers may have made their attitude judgments based on their perception of the child's race, instead of the actual linguistic characteristics displayed by the child.

More recently, Niedzielski (1999) examined the ability of naïve listeners in Detroit to match synthetic vowel tokens to a target vowel spoken by a single female talker. One group of listeners was told that the talker was from Detroit. A second group of listeners was told that the talker was from Canada. Niedzielski found that those listeners who thought the talker was from Canada selected the actual matching vowels from the set of six alternatives. The listeners who were told that the talker was from Detroit, however, selected canonical vowels as the best match to the talker's utterances. That is, the perception of vowel quality was affected by the listeners’ beliefs about the talker. As in Williams (1976), where the teachers’ judgments were related to their perception of the race of the talkers, the listeners in Niedzielski's study were influenced by their stereotypes of Canadian vs. Detroit English. Taken together, these results suggest that an important area of research is the study of the identification of regional, ethnic, and social dialects by naïve listeners.

A small number of studies have been conducted that explicitly examine the ability of naïve listeners to identify where different talkers are from based on actual speech samples. In one of the earliest studies of its kind, Bush (1967) asked naïve listeners to identify the national origin of talkers from the United States, Great Britain, and India. She found that listeners could identify where the talkers were from with over 90% accuracy in a three-alternative forced-choice categorization task using read speech including nonsense words, real words, and sentences.

More recently, Preston (1993) asked naïve adult listeners in Michigan and Indiana to listen to short speech samples taken from interviews with middle-aged males and to assign each of the different talkers to one of nine cities, running north to south between Saginaw, Michigan, and Dothan, Alabama. Results of his study revealed that the listeners were only able to make a broad distinction between North and South. Preston noted that this perceptual boundary did not correspond to the boundaries drawn by these same listeners in the map-drawing task discussed above, suggesting that listeners’ perceptions of dialect variation when listening to actual speech samples differ significantly from their stored mental representations of dialect variation. It is also interesting to note that the boundary perceived by the Indiana residents was different from the boundary perceived by the Michigan residents. Preston (2002) suggested that the difference in categorization between the Michigan and Indiana listeners might be due to a different attentional focus. Specifically, he proposed that the Michigan listeners were making their identifications based on attitude judgments of relative “correctness,” whereas the Indiana listeners were making their identifications based on judgments of relative “pleasantness.” In any case, it was clear that the overall results of this identification task were similar across both groups, with slight inter-group variation due to where the listeners themselves were from.

Van Bezooijen and her colleagues (Van Bezooijen & Gooskens, 1999; Van Bezooijen & Ytsma, 1999) have examined perceptual dialect categorization of regional varieties of Dutch in the Netherlands and Belgium and English in the United Kingdom. Using speech samples taken from interviews with three male talkers from each of four regional varieties of Dutch, Van Bezooijen and Gooskens asked naïve Dutch listeners to identify the country, region, and province that each talker was from in a multi-level forced-choice perceptual categorization task. They found that Dutch listeners could correctly identify 60% of Dutch talkers by region of origin and 40% by province. Van Bezooijen and Ytsma observed similar results using read passages spoken by four female talkers representing each of six Dutch varieties. Their listeners identified 60% of the talkers by region and 35% by province. In the United Kingdom, Van Bezooijen and Gooskens replicated their Dutch study with three male talkers from each of five English dialects and found that listeners identified 88% of the English talkers by region and 52% by area, again using speech samples taken from interviews.

Williams, Garrett, and Coupland (1999) recorded two adolescent males from each of six regions of Wales and two adolescent male speakers of Received Pronunciation (RP) recounting personal narratives. Short samples of these utterances were played back to different groups of adolescents from each of the six regions who were asked to categorize each talker using an eight-alternative forced-choice task. The eight alternatives were the six regions of Wales, RP, and “don’t know.” Compared to the results reported by Van Bezooijen and Gooskens (1999) and Van Bezooijen and Ytsma (1999), overall performance measured in terms of accuracy was quite low. The average proportion correct categorization was only 30%. In considering the performance of the listener groups, Williams et al. found that listeners only correctly identified talkers from their own region about 45% of the time.

The recent studies reviewed above on dialect identification and categorization based on actual speech samples have all found that listeners can make judgments about where talkers are from with some degree of accuracy. Given the varied nature of the studies themselves and how performance was scored, it is difficult to make direct comparisons among them. However, it is clear that Purnell et al. (1999) were tapping into the ability of landlords to identify a talker's racial dialect based on an answering machine message, that the listeners in Preston's (1993) study could discriminate northern from southern American English talkers based on short narratives, and that the listeners in the Van Bezooijen and Gooskens (1999), Van Bezooijen and Ytsma (1999), and Williams et al. (1999) studies were able to categorize talkers by regional dialect with somewhat variable, although consistently above-chance, performance. Taken together, these findings suggest that naïve listeners are aware of phonological differences between dialects and can make reliable judgments based on this information in the speech signal.
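Whether a score is “above chance” depends on how many response alternatives the task offers. A listener who guesses uniformly among n options is correct with probability 1/n. The sketch below computes that baseline for the forced-choice tasks described above, together with an exact one-sided binomial test. The function names and the trial counts in the final line are hypothetical. The sketch also assumes independent trials and uniform guessing, which ignores response biases such as avoiding the “don’t know” option in Williams et al. (1999).

```python
from math import comb

def chance_level(n_alternatives: int) -> float:
    """Expected proportion correct when guessing uniformly among n options."""
    return 1.0 / n_alternatives

def binomial_upper_tail(n_trials: int, k_correct: int, p: float) -> float:
    """P(X >= k_correct) for X ~ Binomial(n_trials, p).

    This is the one-sided exact binomial test: the probability of getting
    at least k_correct responses right by guessing alone.
    """
    return sum(
        comb(n_trials, i) * p**i * (1 - p) ** (n_trials - i)
        for i in range(k_correct, n_trials + 1)
    )

# Numbers of response alternatives as described in the text.
tasks = {
    "Bush (1967), three countries": 3,
    "Williams et al. (1999), six regions + RP + don't know": 8,
    "Six-region categorization task": 6,
}
for name, n in tasks.items():
    print(f"{name}: chance = {chance_level(n):.1%}")

# Hypothetical example: 30 correct out of 100 trials in an 8-alternative task.
print(binomial_upper_tail(100, 30, chance_level(8)))
```

The sketch uses the exact binomial sum rather than a normal approximation because trial counts in listening experiments can be small. `scipy.stats.binomtest` provides the same test, but the standard-library version keeps the sketch self-contained.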

The present set of experiments extended this line of dialect categorization research in two ways. First, through the use of read speech materials that were identical across all of the talkers, we measured the acoustic–phonetic properties associated with the talkers from each of the dialects of American English included in our study. Second, using playback techniques, we carried out a perceptual categorization study of regional varieties of American English with American listeners that was similar to the research done by Williams et al. (1999). In addition to replicating their categorization results, we also conducted several detailed analyses of our perceptual data to investigate the nature of the errors made by our listeners and to measure the acoustic–phonetic properties that the listeners were relying on in making their categorization judgments.

The present experiments were designed to determine how naïve American listeners categorize talkers by regional dialect of American English and to identify the acoustic cues that are used by the listeners in making categorization judgments about where a talker is from. Wolfram and Schilling-Estes claim that “phonological patterns can be diagnostic of regional and social differences, and a person who has a good ear for dialects can often pinpoint a talker's general regional and social affiliation with considerable accuracy based solely on phonology” (1998, p. 67). However, there is little, if any, experimental evidence available in the published literature to show that listeners have this detailed knowledge of variation in phonological patterns at all, or that they can use this knowledge reliably as a diagnostic for regional identification. Thus, the primary goal of the present research was to investigate dialect variation in both speech production and speech perception. Experiment 1 was carried out to measure which acoustic–phonetic cues were available in the speech signal to identify where a talker was from. Experiment 2 was designed to investigate how listeners categorize talkers by dialect and to describe which acoustic–phonetic cues are used by the listeners in making these perceptual categorization judgments.

Talkers

Sixty-six talkers were selected from the TIMIT Acoustic–Phonetic Continuous Speech Corpus (Fisher, Doddington, & Goudie-Marshall, 1986; Zue, Seneff, & Glass, 1990). The TIMIT corpus consists of audio recordings of 630 talkers reading ten sentences each. The corpus includes 438 males and 192 females. The talkers were each assigned one of eight regional labels to indicate their dialect: New England, North, North Midland, South Midland, South, West, New York City, or Army Brat. While this corpus

Stimulus materials

The two sentences spoken by the 66 male talkers in Experiment 1 were also used in this study. In addition, a third novel sentence was selected from the eight other sentences available on the TIMIT corpus for each talker. A different novel sentence was selected for each of the 66 talkers, so that no sentence would ever be repeated during the course of the experiment. As in Experiment 1, all three sentences for each talker were reproduced in separate sound files that were segmented to include

General discussion

The acoustic analyses performed in the first experiment confirmed that the talkers selected from each dialect region reliably produced phonological differences that can be measured acoustically. As expected, r-lessness was a good predictor of New England talkers, centralized /oʊ/ offglides were good predictors of Northern talkers, /u/ fronting was a good predictor of South Midland talkers, and the presence of a voiced fricative in ‘greasy’ was a good predictor of Southern talkers. Unexpectedly,

Conclusions

The results of the first experiment using acoustic measurement techniques provide further evidence that phonological differences do exist between regional dialects of American English and that the dialect affiliation of the talkers can be predicted to some extent by well-defined acoustic–phonetic differences in speech, even short samples of read speech, such as sentences. The results of the second perceptual categorization experiment replicated previous work by Williams et al. (1999) which

Acknowledgements

This work was supported by the NIH-NIDCD R01 research grant DC00111 and the NIH-NIDCD T32 training grant DC00012 to Indiana University. We would like to thank Caitlin Dillon for her assistance in selecting the talkers for this project, Luis Hernandez for his technical advice and support, Kenneth de Jong for suggesting some of the measures for the first experiment, Robert Nosofsky for his help in conducting the clustering analyses in the second experiment, and James Harnsberger, Allyson Carter,

References (61)

  • Byrd, D. (1992). Sex, dialects, and reduction. In Proceedings of the International Conference on Spoken Language Processing (ICSLP’92).
  • Carver, C. M. (1987). American regional dialects: A word geography.
  • Clopper, C. G., & Pisoni, D. B. (2002). Effects of talker variability on perceptual learning using a dialect...
  • Clopper, C. G., & Pisoni, D. B. (submitted). Homebodies and army brats: some effects of early linguistic experience and...
  • Corter, J. E. (1995). ADDTREE/P Program for Fitting Additive...
  • Crane, L. B. The social stratification of /ai/ in Tuscaloosa, Alabama.
  • Davis, L. M., et al. (1992). Is there a Midland dialect?—again. American Speech.
  • Docherty, G. J., et al. Derby and Newcastle: Instrumental phonetics and variationist studies.
  • Dorman, M. F., et al. (1977). Stop-consonant recognition: Release bursts and formant transitions as functionally equivalent, context-dependent cues. Perception & Psychophysics.
  • Fisher, W. M., Doddington, G. R., & Goudie-Marshall, K. M. (1986). The DARPA speech recognition research database:...
  • Giles, H. (1970). Evaluative reactions to accents. Educational Review.
  • Giles, H., et al. (1973). Dialect perception revisited. Quarterly Journal of Speech.
  • Hagiwara, R. (1997). Dialect variation and formant frequency: The American English vowels revisited. Journal of the Acoustical Society of America.
  • Hillenbrand, J., et al. (1995). Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America.
  • Inoue, F. Subjective dialect division in Great Britain.
  • Johnson, E. (1991). Yet again: The midland dialect. American Speech.
  • Keating, P., et al. (1992). Phonetic analyses of the TIMIT corpus of American English. In Proceedings of the International Conference on Spoken Language Processing (ICSLP’92).
  • Krapp, G. P. (1925). The English language in America.
  • Kurath, H. (Ed.) (1939). The linguistic atlas of New England. Providence: Brown University...
  • Kurath, H., et al. (1961). The pronunciation of English in the Atlantic States.