Cognition

Volume 149, April 2016, Pages 31–39

Automatic analysis of slips of the tongue: Insights into the cognitive architecture of speech production

https://doi.org/10.1016/j.cognition.2016.01.002

Highlights

  • We develop a high-precision computational approach for measuring elicited speech.

  • This allows analysis of a dataset far exceeding the size of previous speech error studies.

  • Our method illuminates properties of errors previously impossible to reliably observe.

Abstract

Traces of the cognitive mechanisms underlying speaking can be found within subtle variations in how we pronounce sounds. While speech errors have traditionally been seen as categorical substitutions of one sound for another, acoustic/articulatory analyses show they partially reflect the intended sound. When “pig” is mispronounced as “big,” the resulting /b/ sound differs from correct productions of “big,” moving towards intended “pig”—revealing the role of graded sound representations in speech production. Investigating the origins of such phenomena requires detailed estimation of speech sound distributions; this has been hampered by reliance on subjective, labor-intensive manual annotation. Computational methods can address these issues by providing for objective, automatic measurements. We develop a novel high-precision computational approach, based on a set of machine learning algorithms, for measurement of elicited speech. The algorithms are trained on existing manually labeled data to detect and locate linguistically relevant acoustic properties with high accuracy. Our approach is robust, is designed to handle mis-productions, and overall matches the performance of expert coders. It allows us to analyze a very large dataset of speech errors (containing far more errors than the total in the existing literature), illuminating properties of speech sound distributions previously impossible to reliably observe. We argue that this provides novel evidence that two sources both contribute to deviations in speech errors: planning processes specifying the targets of articulation and articulatory processes specifying the motor movements that execute this plan. These findings illustrate how a much richer picture of speech provides an opportunity to gain novel insights into language processing.

Introduction

The acoustic and articulatory properties of speech vary from moment to moment; if you repeat a word several times, no two instances will be precisely the same. Hidden within this variation are traces of the cognitive processes underlying language production. For example, when repeatedly producing a word, you will tend to slightly reduce its duration—reflecting (in part) the ease of retrieving the word from long term memory (Kahn and Arnold, 2012, Lam and Watson, 2010). Such effects can also be found at the level of individual speech sounds within a word. One such effect can be observed in bilingual speakers’ pronunciations of second language speech sounds. Such sounds are more accented when speakers have recently produced a word in their native language, relative to cases where the same speaker has just produced sounds in the second language (Balukas and Koops, 2015, Goldrick et al., 2014, Olson, 2013). This suggests that the difficulty of retrieving words and sounds when switching languages can modulate how sounds are articulated.

Here, we focus on one source of evidence that has played a key role in theories of language production: speech errors (Fromkin, 1971, et seq.). Errors involving the mis-production of sounds (“pig” mispronounced as “big”) reveal the graded influence of intended productions on articulation. Errors simultaneously reflect acoustic/articulatory properties of both the target and error outcome (Frisch and Wright, 2002, Goldrick et al., 2011, Goldrick and Blumstein, 2006, Goldstein et al., 2007, McMillan and Corley, 2010, McMillan et al., 2009, Pouplier, 2007, Pouplier, 2008). Such effects are consistent with theories of language production incorporating continuous, distributed mental representations in the cognitive process underlying the planning (Dell, 1986, Goldrick and Blumstein, 2006, Plaut and Shallice, 1993, Smolensky et al., 2014) and articulation of speech sounds (Goldstein et al., 2007, Saltzman and Munhall, 1989). According to these theoretical perspectives, articulation reflects subtle, gradient variation in the representational structures and cognitive processes underlying speech (e.g., variation in the degree to which the native language is activated can yield graded changes in the degree of accent in non-native speech; partial activation of target sounds can influence how errors are articulated).

While studies of phonetic variation have provided a rich source of information about language processing, most researchers have relied on manual annotation to obtain accurate data. This approach suffers from two critical flaws. First, it is highly resource intensive; a single experiment in our lab (Goldrick et al., 2011) required over 3000 person-hours for analysis. With respect to speech error studies (as discussed below), this has prevented researchers from obtaining the data required to reliably evaluate different hypotheses. Second, this approach is fundamentally subjective: manual labels reflect the judgments of annotators. This presents a barrier to replication.

Recent studies have aimed to address these issues through computational methods that automatically measure acoustic properties of speech (e.g., Gahl et al., 2012, Labov et al., 2013, Yuan and Liberman, 2014). These methods eliminate subjective judgments while enormously reducing the resources required for analysis. Although this has provided great advances in studies of phonetic variation, existing methods do not provide a comprehensive solution. They have not provided the fine granularity of measurement necessary to reliably measure differences at the level of individual speech sounds (specifically, consonant sounds). Furthermore, these existing methods require a complete transcription of the observed speech prior to phonetic analysis. This is a major burden, particularly for paradigms that are designed to produce tremendous variation in production (e.g., speech errors).

In this work, we propose a novel computational framework for automatic analysis of speech appropriate for evaluating hypotheses relating to the phonetics of speech errors. This is based on a set of algorithms in machine learning (Keshet et al., 2007, McAllester et al., 2010, Sonderegger and Keshet, 2012). Our automatic approach matches the performance of expert manual coders and outperforms algorithms used in the existing psycholinguistic literature. The analyses reveal novel properties of the phonetics of speech errors. Furthermore, we show (via a power analysis) that reliable investigation of the properties of individual speech sounds requires datasets larger than those used in previous work. These findings show how automatic analysis creates an opportunity to gain a much richer, objective, and replicable picture of acoustic variation in speech.

One key source of evidence for the structure of the cognitive mechanisms underlying language production is speech errors (Fromkin, 1971). Sound substitution errors (e.g., intending to say bet, but producing pet; written as bet → pet) have been studied in the laboratory by asking participants to rapidly produce artificial tongue twisters composed of syllables with alternating contrasting sounds (pet bet bet pet; Wilshire, 1999). Based on transcriptions of speech, it was long assumed that such errors reflect the categorical substitution of one sound for another (Dell, 1986, Fromkin, 1971, Shattuck-Hufnagel and Klatt, 1979). However, more recent quantitative analyses of the phonetic (acoustic/articulatory) properties of errors have revealed that errors systematically differ from corresponding correct productions—a deviation that reflects properties of the intended sound (Frisch and Wright, 2002, Goldrick and Blumstein, 2006, Goldrick et al., 2011, Goldstein et al., 2007, McMillan and Corley, 2010, McMillan et al., 2009, Pouplier, 2007, Pouplier, 2008). For example, an important acoustic cue to the distinction between words like pet and bet is voice onset time (VOT), the time between the release of airflow (e.g., opening the lips) and the onset of periodic vibration of the vocal folds (Lisker & Abramson, 1964). In English, voiceless sounds like /p/ have relatively long VOTs whereas voiced sounds like /b/ have short VOTs (Lisker & Abramson, 1964). In a bet → pet error, the resulting /p/ sound is distinct from correct productions of the same sound (pet → pet). The error /p/ tends to have a shorter VOT—which makes it more similar to the intended sound /b/. The complementary pattern is found for errors like pet → bet; the error /b/ tends to have a longer VOT than the corresponding sound in bet → bet. Note that similar effects are found in non-errorful speech when a competitor word is explicitly primed (e.g., priming top while reading the word cop aloud yields a blend of /t/ and /k/ articulations; Yuen, Davis, Brysbaert, & Rastle, 2010).
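
To make the measure concrete, the sketch below computes VOT from two landmark times (the release burst and the onset of voicing). The timestamps are invented for illustration and are not measurements from this study.

```python
# Minimal sketch of the VOT definition: the interval (in ms) from the release
# burst to the onset of vocal fold vibration. All timestamps are hypothetical.

def voice_onset_time(burst_ms: float, voicing_onset_ms: float) -> float:
    """Return VOT in milliseconds (positive when voicing follows the burst)."""
    return voicing_onset_ms - burst_ms

# English word-initial stops: long-lag VOT for voiceless /p/, short-lag for /b/.
print(voice_onset_time(burst_ms=120.0, voicing_onset_ms=185.0))  # 65.0 ms, /p/-like
print(voice_onset_time(burst_ms=120.0, voicing_onset_ms=128.0))  # 8.0 ms, /b/-like
```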

These deviations have been attributed to one of two distinct types of cognitive processes that underlie the production of speech: (i) planning processes that construct a relatively abstract specification of the targets of articulation; or (ii) articulatory processes that specify the specific motor movements that execute this plan. To illustrate this division, when producing pet, planning processes might specify that the initial sound is /p/ but not the precise timing of the associated lip movements; these would be specified during articulatory processing. Below, we outline how different theories have proposed that deviations of errors from correct productions arise at each level of processing.

Within planning processes, many theories of speech production assume that representations are patterns of activation over simple processing units (Dell, 1986). For example, the contrast between big and pig is represented by graded patterns of activation over units representing speech segments /p/ and /b/. While this type of representation can express arbitrarily varying combinations of /p/ and /b/, theories typically incorporate mechanisms that constrain the patterns of activation. These mechanisms force planning processes to select relatively discrete representations for production (e.g., primarily activating /p/, with little activation of /b/). A variety of mechanisms have been proposed to account for this, including: boosting activation of one representation relative to alternatives (e.g., Dell, 1986); lateral inhibition that reduces the activation of alternative representations (see Dell & O’Seaghdha, 1994, for a review); and attractors over distributed representations (e.g., Goldrick and Chu, 2014, Plaut and Shallice, 1993, Smolensky et al., 2014). However, these constraints on activation are typically not categorical; while one unit may be highly active, others may remain partially active. This has been proposed as one possible mechanism for producing deviations in speech errors. If the specification of the intended target sound remains partially active, the phonetic properties of the error could be distorted towards the intended target (Goldrick and Blumstein, 2006, Goldrick and Chu, 2014, Smolensky et al., 2014). For example, in bet → pet, the speech plan could specify the target is 0.9 /p/ and 0.1 /b/—resulting in articulations that combine properties of both sounds.
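
One simple way to cash out this idea, sketched below, is to treat the planned VOT as an activation-weighted blend of canonical values. The linear blending rule and the canonical VOTs are illustrative assumptions, not a claim about any specific cited model.

```python
# Sketch: partial activation of the intended sound shifts the planned phonetics
# of an error. The linear blend and the canonical VOT values are assumptions.

CANONICAL_VOT_MS = {"p": 70.0, "b": 10.0}  # rough English word-initial values

def planned_vot(activations: dict[str, float]) -> float:
    """Activation-weighted blend of canonical VOTs for the competing segments."""
    total = sum(activations.values())
    return sum(act / total * CANONICAL_VOT_MS[seg] for seg, act in activations.items())

# A bet -> pet error planned as 0.9 /p/ + 0.1 /b/: shorter VOT than a clean /p/.
print(planned_vot({"p": 0.9, "b": 0.1}))  # ~64 ms, pulled towards the intended /b/
print(planned_vot({"p": 1.0, "b": 0.0}))  # 70.0 ms, canonical /p/
```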

Articulatory processes could provide an additional source of distortions in speech errors. Such processes specify the continuous, coordinated dynamics of articulator movements that execute the speech plan (Saltzman & Munhall, 1989). Tongue twisters require speakers to rhythmically alternate different configurations of speech gestures (e.g., altering the relative timing of lip opening and glottal movement for /p/ vs. /b/). Research across a variety of domains of action has suggested that alternating different movements is inherently less dynamically stable than repeating synchronous actions. When participants are asked to perform alternating movements under varying response speeds, they spontaneously shift from successful alternation to synchronized movements at fast rates (Haken, Peper, Beek, & Daffertshofer, 1996). If speech errors in tongue twisters reflected, in part, a similar process—a destabilization of articulation of alternating movements under fast rates—we might expect a similar pattern to emerge. The synchronous production of previously alternating sounds would manifest as a blend of properties of the error and the intended target, providing a second possible mechanism for producing deviations in speech errors (Goldstein et al., 2007, Pouplier, 2007).
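
The dynamical intuition can be illustrated with the classic Haken–Kelso–Bunz relative-phase potential, a simpler precursor of the framework in the work cited above; the parameter values below are purely illustrative and are not taken from any cited study.

```python
# Illustrative sketch using the Haken-Kelso-Bunz potential
# V(phi) = -a*cos(phi) - b*cos(2*phi) over relative phase phi.
# The ratio b/a is assumed to shrink as movement rate increases; anti-phase
# coordination (phi = pi, i.e. alternation) stays stable only while b/a > 1/4.

def antiphase_stable(a: float, b: float) -> bool:
    """Anti-phase is a local minimum of V iff V''(pi) = -a + 4*b > 0."""
    return -a + 4 * b > 0

for b in (1.0, 0.5, 0.3, 0.2, 0.1):  # decreasing b/a ~ increasing movement rate
    print(f"b/a = {b:.1f}: alternation stable? {antiphase_stable(1.0, b)}")
# At small b/a only in-phase (synchronized) movement remains stable, mirroring
# the shift from alternation to synchronization described above.
```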

Evaluating these two approaches to deviations in errors has been hampered by the relative paucity of phonetic data. For example, studies arguing for an articulatory locus of deviations have often induced errors using repeating sequences (pet bet pet bet; e.g., Goldstein et al., 2007). In contrast, studies arguing for a planning locus have often used twisters where the order of pairs of syllables switches within a twister (pet bet bet pet; e.g., Goldrick & Blumstein, 2006). Transcription studies suggest that the difference between these two twister types exerts a significant influence on processing (Croot et al., 2010, Wilshire, 1999). Across studies, relative to twisters using repeated syllable sequences, twisters that switch the order of syllables result in higher error rates at the points where the syllable order switches (i.e., the first and third positions in a sequence; pet bet bet pet). While multiple transcription studies have examined this issue, phonetic studies have not. This likely reflects the high cost of analyzing phonetic data; comparison of syllable orders within items and participants requires collecting twice the amount of data as any single paradigm. The consequence of this methodological divergence has, as yet, gone unexamined. Developing a more efficient means of gathering phonetic data could allow us to bridge results across these two types of studies.

The paucity of phonetic data has also constrained the types of measures that can be examined. While many studies have examined shifts in the mean properties of errors vs. correct productions (e.g., the typical size of deviations away from canonical /b/, towards the intended /p/), the processing conditions that give rise to speech errors might also influence other distributional properties of productions—in particular, errors might exhibit a different degree of variability than correct productions. The difficulty of production processing may influence phonetic variability. For example, Heisler, Goffman, and Younger (2010) found that children produced novel sound sequences with higher articulatory variability when the strings were not paired with a lexical referent. If participants learned that the sequence was the label for an object, articulation became less variable. However, previous work has not examined whether the processing difficulties that give rise to speech errors in adults might also influence the variability of articulation. This likely reflects the high cost of analyzing phonetic data; the proper assessment of the variability of errors relative to correct productions requires a large number of observations from a substantial number of participants. Each participant must produce a significant number of observations within each condition in order for us to reliably assess the distributional properties of their errors and correct productions. Then, to assess whether such distributional properties are reliably different, we must compare measures across a number of participants. Analyses of this type therefore require a substantial decrease in the cost of gathering phonetic data.
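
The toy simulation below illustrates this point (it is not the power analysis reported later in the paper): with few tokens per participant, per-participant standard deviations are themselves noisy estimates, so a true variability difference between errors and correct productions is detected much less often. All sample sizes and effect sizes are invented for illustration.

```python
import numpy as np
from scipy import stats

# Toy simulation of why assessing variability differences requires many tokens
# per participant and many participants. All counts and effect sizes are
# illustrative assumptions, not values from this study.

rng = np.random.default_rng(0)

def power_for_sd_difference(n_participants=30, n_tokens=20,
                            sd_correct=10.0, sd_error=12.0,
                            n_sims=500, alpha=0.05):
    hits = 0
    for _ in range(n_sims):
        sd_c = np.empty(n_participants)
        sd_e = np.empty(n_participants)
        for p in range(n_participants):
            # Per-participant sample SDs of VOT (ms) for correct vs. error tokens.
            sd_c[p] = rng.normal(60.0, sd_correct, n_tokens).std(ddof=1)
            sd_e[p] = rng.normal(60.0, sd_error, n_tokens).std(ddof=1)
        # Paired comparison of per-participant SDs across the sample.
        if stats.ttest_rel(sd_e, sd_c).pvalue < alpha:
            hits += 1
    return hits / n_sims

print(power_for_sd_difference(n_tokens=5))    # few tokens per cell: markedly lower power
print(power_for_sd_difference(n_tokens=40))   # many tokens per cell: high power
```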

To address the problems associated with limited amounts of speech data, we propose a new approach to the automatic analysis of speech error data. We formulate the general problem as speech sound measurement in studies where speech is elicited by a prompt specifying a target utterance. The objective is to take recordings of such utterances and output an accurate measurement of specific, linguistically relevant dimensions of the acoustic signal (phonetic parameters). As outlined in Fig. 1, we approach this problem by first identifying the relevant regions of the acoustic signal for phonetic analysis, and then automatically measuring some linguistically relevant acoustic properties (Section 2 provides full implementation details).
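
The sketch below gives the overall shape of this two-stage analysis. Every name and value in it is a hypothetical placeholder rather than the interface of any released software, and the two stages are stubbed out; they correspond to the alignment and measurement algorithms described below.

```python
from typing import NamedTuple

# Schematic of the two-stage pipeline in Fig. 1: align the signal at the phoneme
# level, then measure phonetic parameters within the aligned windows. All names
# and values are hypothetical placeholders.

class Segment(NamedTuple):
    phoneme: str
    start_ms: float
    end_ms: float

def align(wav_path: str, target: str) -> list[Segment]:
    """Stage 1 (stub): phoneme-level alignment of the best candidate transcription."""
    return [Segment("p", 500.0, 610.0), Segment("eh", 610.0, 750.0), Segment("t", 750.0, 840.0)]

def measure_vot(wav_path: str, window: tuple[float, float]) -> float:
    """Stage 2 (stub): VOT measurement restricted to the aligned stop window."""
    return 62.0  # placeholder value in ms

def analyze(wav_path: str, target: str) -> float:
    segments = align(wav_path, target)
    initial_stop = segments[0]
    return measure_vot(wav_path, (initial_stop.start_ms, initial_stop.end_ms))

print(analyze("trial_001.wav", "pet"))  # 62.0 (placeholder)
```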

Each of the boxes in Fig. 1 corresponds to a learning algorithm that was specially designed to solve the task of phonetic parameter measurement and to minimize the error in the algorithm’s prediction for these measures. Unlike problems of classification or regression where the input is a fixed length feature vector and the output is a single bit (such as “grammatical” vs. “not grammatical”) or a real number (such as a pitch value), respectively, the input to each of the tasks represented by the boxes in Fig. 1 is a structured object (e.g., a variable length acoustic signal), as is the output (e.g., an alignment between phoneme sequences and regions of the acoustic signal; phonetic parameter measurement for particular regions of the acoustic signal).

Structured prediction refers to machine learning models that predict relational information with structure, such as an output composed of multiple interrelated parts. A structured prediction algorithm maps the input object along with the target output object into a feature vector space. The algorithm then amounts to a classifier in this vector space, which is trained to predict the target output object. The classifiers used in this work are based on the large margin concept, meaning that they are trained to separate the target output object from all other output objects with some confidence, called the margin. This allows the trained model to account for perturbations of the feature vectors due to noise in the speech signal (Crammer, Dekel, Keshet, Shalev-Shwartz, & Singer, 2006). The classifier’s confidence can be used to identify noisy input to the classifier or poor classification results, as detailed in Section 2.4.
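
A minimal sketch of large-margin structured prediction in this spirit appears below: a linear score over a joint feature map, argmax inference over candidate outputs, and a cost-augmented update that pushes the target output above alternatives by a margin. The toy feature map, cost function, and task are assumptions for illustration, not the paper's alignment or VOT models.

```python
import numpy as np

# Toy large-margin structured prediction: a linear score w . phi(x, y), argmax
# inference over candidate outputs, and a margin-rescaled (cost-augmented)
# update. The feature map and cost below are illustrative assumptions only.

def predict(w, x, candidates, phi):
    """Inference: return the candidate output with the highest score under w."""
    return max(candidates, key=lambda y: w @ phi(x, y))

def margin_update(w, x, y_true, candidates, phi, cost, lr=0.1):
    """One cost-augmented update: separate y_true from the worst violator."""
    y_hat = max(candidates, key=lambda y: w @ phi(x, y) + cost(y_true, y))
    if y_hat != y_true:
        w = w + lr * (phi(x, y_true) - phi(x, y_hat))
    return w

# Toy task: pick the frame index of an "event" in a short signal.
phi = lambda x, y: np.array([x[y[0]], y[0] / len(x)])   # joint features of input and output
cost = lambda y_t, y_p: abs(y_t[0] - y_p[0])            # task loss: frame error
x, y_true = np.array([0.1, 0.2, 0.9, 0.8, 0.1]), (2,)
candidates = [(i,) for i in range(len(x))]
w = np.zeros(2)
for _ in range(10):
    w = margin_update(w, x, y_true, candidates, phi, cost)
print(predict(w, x, candidates, phi))  # (2,): the correct frame
```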

In contrast to the standard approaches that have been developed for binary classification, each structured prediction task is distinctive: it has a unique evaluation metric, its own set of feature functions, and in many cases requires a non-standard procedure for predicting the target object from an input object, given a set of trained parameters. An overview of our approach is provided in Fig. 1. The first box in Fig. 1 is a structured prediction algorithm (Keshet et al., 2007, McAllester et al., 2010) that automatically aligns the transcription of the utterance at the level of individual speech sounds (phonemes) with the corresponding portions of the recorded acoustic signal. In contrast to standard existing approaches (Gahl et al., 2012, Labov et al., 2013, Yuan and Liberman, 2014), this transcription is dynamically generated: the target utterance is used to generate several possible transcriptions, allowing for deletion or addition of syllables and mispronunciation of key segments. This eliminates a substantial manual step required by previous approaches. The transcription that best aligns with the recorded acoustics is then used to determine analysis windows for measurement of phonetic parameters. Our state-of-the-art phoneme alignment algorithm has two advantages relative to existing approaches (Brugnara et al., 1993, Hosom, 2009): it was designed to minimize the predicted error in the alignment (McAllester et al., 2010); and it extends the representation of the speech acoustics so as to capture temporal regularities in the signal which correlate highly with phoneme boundaries (Keshet et al., 2007). This allows the algorithm to achieve significantly higher accuracy than competing approaches on a standard benchmark (see Section 5 of McAllester et al., 2010).
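
The sketch below illustrates the candidate-transcription idea for a voicing-contrast twister: each key initial consonant may surface as either member of its pair, and an alignment scorer (stubbed out here, standing in for the structured alignment model) then selects the best-fitting candidate. The helper names are hypothetical, and syllable additions and deletions are omitted for brevity.

```python
from itertools import product

# Sketch of dynamically generated transcriptions for a voicing-contrast twister.
# Each key initial consonant may be produced as intended or voicing-swapped; the
# alignment scorer is a stub standing in for the structured alignment model.

VOICING_PAIRS = {"p": "b", "b": "p", "t": "d", "d": "t", "k": "g", "g": "k"}

def candidate_transcriptions(target_syllables):
    """All combinations of intended vs. voicing-swapped initial consonants."""
    per_syllable = [[(onset, rhyme), (VOICING_PAIRS[onset], rhyme)]
                    for onset, rhyme in target_syllables]
    return [list(combo) for combo in product(*per_syllable)]

def alignment_score(wav_path, transcription):
    return 0.0  # stub: the alignment model scores each candidate against the audio

def best_transcription(wav_path, target_syllables):
    return max(candidate_transcriptions(target_syllables),
               key=lambda t: alignment_score(wav_path, t))

target = [("p", "et"), ("b", "et"), ("b", "et"), ("p", "et")]   # "pet bet bet pet"
print(len(candidate_transcriptions(target)))                    # 16 candidates
```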

As noted above, an important acoustic cue that we focus on in our analysis of speech errors is VOT. The second box in Fig. 1 is a structured prediction algorithm for measurement of VOT (Sonderegger & Keshet, 2012; see Ryant, Yuan, & Liberman, 2013, for an alternative approach). Many standard approaches measure parameters based on pre-programmed rules developed in consultation with expert annotators (Boyce et al., 2010, Hansen et al., 2010, Prathosh et al., 2014, Stouten and van Hamme, 2009). In contrast, the algorithm utilized here was designed to minimize the error in the predicted measurement and uses a novel feature set. This feature set was designed to represent the acoustic signal with a time resolution of 1 ms (based on a processing window of 5 ms), as opposed to the 10 ms resolution (reflecting a window of 20–25 ms) of the standard feature sets used in automatic speech recognition. By allowing us to measure rapidly changing, short-duration acoustic features, this feature set captures properties relating to the critical phonetic parameters of our analysis (e.g., those associated with consonant contrasts). Previous research has shown this algorithm can achieve high accuracy, near that of human inter-annotator reliability. For example, using a VOT dataset collected in our laboratory, the algorithm’s measurements had a correlation of r = 0.992 with human annotations, compared with r = 0.987 between two human annotators (Sonderegger & Keshet, 2012). A comparison of the VOT algorithm used in our experiments to most of the available automatic methods on four different benchmarks is detailed in Section VII of Sonderegger and Keshet (2012).
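
To illustrate the difference in time resolution, the sketch below frames a signal with a 5 ms window advanced in 1 ms steps and computes a single crude descriptor (short-time energy). The descriptor and the synthetic signal are illustrative assumptions; the actual feature set is considerably richer.

```python
import numpy as np

# Fine-grained framing: 5 ms analysis windows advanced in 1 ms steps, versus the
# ~20-25 ms windows / 10 ms steps typical of speech recognition front ends. The
# descriptor (short-time energy) and the synthetic signal are illustrative only.

def frame_features(signal, sample_rate, win_ms=5.0, hop_ms=1.0):
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - win) // hop)
    return np.array([np.sum(signal[i * hop:i * hop + win] ** 2)   # short-time energy
                     for i in range(n_frames)])

sr = 16000
t = np.arange(0, 0.05, 1 / sr)                    # 50 ms of synthetic signal
sig = np.sin(2 * np.pi * 150 * t) * (t > 0.02)    # "voicing" switching on at 20 ms
energy = frame_features(sig, sr)
print(len(energy), int(np.argmax(energy > 0.1)))  # ~1 frame per ms; energy rises in the
                                                  # first window overlapping the 20 ms onset
```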

This yields acoustic data from speech recordings without requiring human intervention at any intermediate analysis steps. Software implementing each stage of processing is publicly available (https://github.com/jkeshet/tongue_twisters), allowing any laboratory to replicate the analysis procedures on novel data. We used this approach to examine—in a single experiment—the VOT of over 68,000 syllables. In comparison, the amount of data examined across all existing studies (through 2011) is less than 43,000 syllables in total (Frisch and Wright, 2002, Goldrick and Blumstein, 2006, Goldrick et al., 2011, Goldstein et al., 2007, McMillan and Corley, 2010, McMillan et al., 2009, Pouplier, 2003: Experiment 2; 2007, 2008). This amount of data allowed us to examine two issues unaddressed in previous work: whether the distinct types of tongue twisters utilized in previous work yield distinct phonetic effects in speech errors; and whether speech errors exhibit differences in variability as well as mean phonetic properties relative to correct productions.

Participants

Thirty-four native English speakers (21 women) from the Northwestern University community participated. These individuals reported no history of speech or language impairment. They received financial compensation or course credit.

Materials

Tongue twisters were composed of syllables with initial consonants contrasting in voicing (e.g., post-boast). Forty-eight pairs of syllables were selected, evenly distributed across labial (/p/, /b/), alveolar (/t/, /d/) and velar (/k/, /g/) place of articulation. For

Results

Across participants, the mean overall accuracy was 89.9% (estimated 95% confidence interval [87.7%, 91.8%]). As has been observed in previous studies, there was considerable variation across individuals (range: 74.3–97.2%). Table 1 provides a breakdown of accuracy by place of articulation and voicing.

We first establish that the automatic approach replicates standard

Discussion

Previous research on phonetic variation in speech production has been hamstrung by the reliance on subjective, labor-intensive manual annotation. Our approach provides a fully replicable method for rapidly analyzing acoustic data. Applying this to speech errors, our results provide further evidence against the traditional claim that speech errors are categorical substitutions of one sound for another (Dell, 1986, Fromkin, 1971, Shattuck-Hufnagel and Klatt, 1979). Our ability to analyze large

Acknowledgments

Supported by National Science Foundation Grant BCS0846147 and National Institutes of Health Grant HD077140. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF or the NIH. Thanks to the Northwestern SoundLab and Jennifer Culbertson for helpful discussion and comments.

References (50)

  • C.T. McMillan et al.

    Cascading influences on the production of speech: Evidence from articulation

    Cognition

    (2010)
  • D.J. Olson

    Bilingual language switching and selection at the phonetic level: Asymmetrical transfer in VOT production

    Journal of Phonetics

    (2013)
  • M. Pouplier

    The role of a coda consonant as error trigger in repetition tasks

    Journal of Phonetics

    (2008)
  • D.A. Rosenbaum et al.

    The parameter remapping effect in human performance: Evidence from tongue twisters and finger fumblers

    Journal of Memory and Language

    (1986)
  • S. Shattuck-Hufnagel et al.

    The limited use of distinctive features and markedness in speech production: evidence from speech error data

    Journal of Verbal Learning and Verbal Behavior

    (1979)
  • V. Stouten et al.

    Automatic voice onset time estimation from reassignment spectra

    Speech Communication

    (2009)
  • J. Yuan et al.

    F0 declination in Mandarin broadcast news speech

    Speech Communication

    (2014)
  • C. Balukas et al.

    Spanish–English bilingual voice onset time in spontaneous code-switching

    International Journal of Bilingualism

    (2015)
  • Boyce, S., Fell, H., MacAuslan, J., & Wilde, L. (2010). A platform for automated acoustic analysis for assistive...
  • K. Crammer et al.

    Online passive-aggressive algorithms

    Journal of Machine Learning Research

    (2006)
  • K. Croot et al.

    Prosodic structure and tongue twister errors

    (2010)

  • G.S. Dell

    A spreading-activation theory of retrieval in sentence production

    Psychological Review

    (1986)
  • G.S. Dell et al.

    Inhibition in interactive activation models of linguistic selection and sequencing

    (1994)

  • V.A. Fromkin

    The non-anomalous nature of anomalous utterances

    Language

    (1971)
  • Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., & Pallett, D. S. (1993). DARPA TIMIT acoustic-phonetic...