Neural correlates of the processing of co-speech gestures
Introduction
Meaningful hand movements (i.e. gestures) are an integral part of everyday communication. It seems that once people become involved in a conversation, they inevitably start to move their hands to illustrate certain contents of speech. Some of these co-speech gestures bear a formal relationship to the contents of speech and have therefore been termed iconic in the literature (McNeill, 1992). For example, a speaker might form a precision grip and make a quick turning movement when uttering a sentence such as: “I tightened the screw”. When producing such an iconic gesture, the speaker transmits meaning in two channels simultaneously. Previous research has shown that listeners make use of additional gesture information in comprehending language (Alibali et al., 1997; Beattie and Shovelton, 1999, 2002). Furthermore, progress has been made in specifying the temporal characteristics of gesture–speech interaction in comprehension (Holle and Gunter, 2007; Kelly et al., 2004; Özyürek et al., 2007; Wu and Coulson, 2007). The present study investigates the neural systems involved in the interaction of gesture and speech in comprehension.
Iconic gestures are a special subcategory of gestures (McNeill, 1992). It is helpful to consider what distinguishes iconic gestures from other gesture types (e.g. pantomime, emblems, pointing). Iconic gestures and pantomime have in common that they often illustrate actions. A crucial difference, however, is that there is no speech during pantomime, whereas iconic gestures are mostly produced in combination with speech (McNeill, 1992, 2000, 2005). Probably related to this difference is the fact that iconic gestures are less elaborate in their form than pantomimed movements. Iconic gestures are actions recruited in the context of another domain (i.e. speech, cf. Willems et al., 2006); their timing is therefore deeply intertwined with the timing of speech (for more on this, see below). Because they are produced alongside speech, the time available for producing an iconic gesture is limited. Accordingly, pantomimed actions can give very detailed (or even exaggerated) descriptions, whereas iconic gestures tend to be much more casual in form and are often abstractions of the performed actions. Furthermore, in contrast to iconic gestures, a sequence of pantomimed movements can be joined together to create sentence-like constructions (Rose, 2006). Iconic gestures differ from another subcategory of gesture, called emblems, in their degree of conventionalization. Emblems, which are meaningful hand postures such as the victory sign, are so conventionalized in their form that they can be effortlessly understood in the absence of speech (Gunter and Bach, 2004). In comparison, iconic gestures are much less conventionalized. Studies investigating iconic gesture production typically find a great degree of interindividual variability in the form of these gestures (e.g. Kita and Özyürek, 2003). Nevertheless, iconic gestures contain additional information that is not found in speech. In the example described above, only the gesture gives an indication of the size of the screw (probably a rather small screw, because a precision grip was used). In a series of previously conducted ERP experiments, we have shown that such additional gesture information can modulate how the two word meanings of lexically ambiguous words (e.g. ball) are processed in a sentence context (Holle and Gunter, 2007). Thus, iconic gestures interact with speech during online language comprehension. Not much is known, however, about which brain areas are involved when gesture and speech interact.
The interaction of iconic gestures and speech in comprehension can be approached from at least two different perspectives. First, given that many iconic gestures constitute re-enacted actions, one can adopt an action perspective. From this point of view, an empirical question is the extent to which the processing of iconic gestures recruits the brain network associated with action comprehension. Based on the finding that areas F5 and PF of the macaque brain contain neurons that fire both during the observation and the execution of goal-directed hand movements (Gallese et al., 1996, 2002; Umiltà et al., 2001), it has been proposed that these so-called mirror neurons form the neural circuitry for action understanding (Rizzolatti et al., 2001). Although direct evidence (via single-cell recording) for mirror neurons in the human brain is still lacking, there is a substantial body of indirect evidence that a similar system exists in humans as well (for recent overviews, see Binkofski and Buccino, 2006; Iacoboni and Dapretto, 2006; Molnar-Szakacs et al., 2006). In particular, the inferior frontal gyrus (IFG), including the adjacent ventral premotor cortex, and the inferior parietal lobule (IPL) have been suggested as the core components of the putative human mirror neuron system (MNS) (Rizzolatti and Craighero, 2004). According to a recent theoretical suggestion, the human MNS is able to determine the goal of observed actions by means of an observation–execution matching process (for a more detailed description, see Iacoboni, 2005; Iacoboni and Wilson, 2006). Because many iconic gestures are re-enacted actions, it is plausible that the MNS also participates in the processing of such gestures.
Second, one can adopt a multimodal perspective on iconic gesture comprehension. As has been argued above, iconic gestures show little conventionalization, i.e. there is no “gestionary” that can be accessed for their meaning. Instead, the meaning of iconic gestures has to be generated online on the basis of gesture form and the co-speech context in which the gesture is observed (Feyereisen et al., 1988; McNeill, 1992, 2005). Thus, comprehending a co-speech iconic gesture is a process that requires a listener to integrate auditory and visual information. Within the multimodal view on iconic gestures, a further distinction can be made between local and global gesture–speech integration (see Willems et al., 2006). Because co-speech gestures are embedded in spoken utterances that unfold over time, one can investigate the integration processes between gesture and speech both at a local level (i.e. the integration of temporally synchronized gesture and speech units) and at a global sentence level (i.e. how greater meaning ensembles are assembled from smaller sequentially processed meaningful units such as words and gestures).
Local integration refers to the combination of simultaneously perceived gestural and spoken information. Previous research indicates that the temporal relationship between gesture and speech in production is not arbitrary (McNeill, 1992, Morrel-Samuels and Krauss, 1992). Instead, speakers tend to produce the peak effort of a gesture, the so-called stroke, simultaneously with the relevant speech segment (Levelt et al., 1985, Nobe, 2000). This stroke–speech synchrony might be an important cue for listeners in comprehension, because it can signal to which speech unit a gesture belongs. Returning to the example given previously, the speaker uttering the sentence “He tightened the screw” might produce the gesture stroke simultaneously with the verb of the sentence. In this example, local integration would refer to the interaction between the simultaneously conveyed visual information (i.e. the turning-movement gesture) and auditory information (the word tightened).
Although related to such local processes, the global integration of gesture and speech is a more complex phenomenon. Understanding a gesture-supported sentence relies not only on the comprehension of all individual constituents (i.e. words and gestures), but also on a comprehension of how the constituents are related to one another (i.e. who is doing what to whom; cf. Grodzinsky and Friederici, 2006). This relational process requires integrating information over time. The multimodal aspect in this integration over time is the extent to which the process recruits similar or different brain areas depending on whether the to-be-integrated information is a spoken word or a gesture. Thus, local integration refers to an instantaneous integration across modalities, whereas global integration describes an integration over time, with modality as a moderating variable. Whereas interactions at the global level can be examined in an epoch-related analysis, an analysis of gesture–speech interactions at the local level can only be performed in an event-related design. More precisely, in order to investigate how gesture and speech interact at the local level, one first has to objectively identify the point in time at which gesture and speech start to interact. As will be outlined below, the gating paradigm may be used to determine such a time point.
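To make the distinction between epoch-related and event-related modelling concrete, a minimal sketch is given below. It is an illustration only and not the analysis pipeline of the present study: the use of the nilearn function make_first_level_design_matrix, the repetition time, and all onsets and durations are assumptions chosen for demonstration. An epoch regressor models each video over its full duration, whereas an event regressor models an instantaneous event at the point in time where gesture and speech are assumed to start interacting.

import numpy as np
import pandas as pd
from nilearn.glm.first_level import make_first_level_design_matrix

tr = 2.0                               # repetition time in seconds (assumed value)
frame_times = np.arange(0, 200, tr)    # acquisition times of the functional scans

# Epoch-related modelling: each video is modelled over its full duration,
# as in a global, sentence-level analysis.
epoch_events = pd.DataFrame({
    "onset":      [10.0, 60.0, 110.0],   # video onsets (illustrative)
    "duration":   [8.0, 8.0, 8.0],       # full video length (illustrative)
    "trial_type": ["video"] * 3,
})

# Event-related modelling: a point event is placed at the moment where gesture
# and speech start to interact (e.g. as determined in a gating study).
point_events = pd.DataFrame({
    "onset":      [13.5, 63.5, 113.5],   # identified interaction points (illustrative)
    "duration":   [0.0, 0.0, 0.0],       # modelled as instantaneous events
    "trial_type": ["integration"] * 3,
})

design_epoch = make_first_level_design_matrix(frame_times, epoch_events, hrf_model="glover")
design_event = make_first_level_design_matrix(frame_times, point_events, hrf_model="glover")
print(design_epoch.columns.tolist())
print(design_event.columns.tolist())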
Willems et al. (2006) investigated the neural correlates of gesture–speech interaction on a global sentence level. In this experiment, subjects watched videos in which an initial sentence part (e.g. The items that he on the shopping list) was followed by one of four possible continuations: (1) a correct condition, where both gesture and speech matched the initial sentence context (e.g. saying wrote while producing a writing gesture), (2) a gesture mismatch (e.g. saying wrote while producing a hitting gesture), (3) a speech mismatch (e.g. saying hit, gesturing writing) and (4) a double mismatch (saying hit, gesturing hitting). In the statistical analysis, the complete length of the videos was modeled as an epoch. When contrasted with the correct condition, only the mid- to anterior portion of the left IFG (BA 45/47) was consistently activated in all three mismatch conditions. On the basis of this finding, Willems and co-workers suggested that the integration of semantic information into a previous sentence context (regardless of whether the to-be-integrated information had been conveyed by gesture or speech) is supported by the left IFG.
Whereas the Willems study investigated the interaction of gesture and speech at a global level, it is an open issue which brain areas are involved in local gesture–speech interactions. One candidate area might be the superior temporal sulcus (STS). There is a substantial amount of literature supporting the notion of the STS as an important integration site for temporally synchronized audiovisual stimuli (Beauchamp, 2005). For example, the STS seems to be involved in the integration of lip movements and speech sounds (Calvert et al., 2000; Wright et al., 2003). Furthermore, Skipper et al. (2005) observed that the activation in the posterior STS elicited by the observation of talking faces is modulated by the number of visually distinguishable phonemes. In an experiment by Sekiyama et al. (2003), the left posterior STS was found to be particularly involved in the McGurk effect, i.e. the fusion of an auditory /ba/ and a visual /ga/ into a perceived /da/. Whereas in these examples visual and auditory information can be mapped onto each other on the basis of their form, there is evidence that the STS is also involved in more complex mapping processes at a higher semantic level, such as the integration of pictures of animals and their corresponding sounds (Beauchamp et al., 2004b). Saygin et al. (2003) have reported that patients with lesions in the posterior STS are impaired in their ability to associate a picture (e.g. a cow) with a corresponding sound (e.g. moo).
On the basis of these results, it is not unreasonable to assume that the STS is also involved in the multimodal interactions between gesture and speech. The integration of iconic gestures and speech during comprehension has some similarities with the integration of pictures and their associated sounds, as investigated, for instance, by Beauchamp et al. (2004b). In both cases, there is a temporal synchrony between auditory and visual information. In the audiovisual condition of the Beauchamp study, the pictures and the corresponding sounds were presented simultaneously. Likewise, as introduced above, the stroke of a gesture tends to coincide with the relevant speech unit. Another similarity is that, for both stimulus types, it is not the forms of gesture and speech (or of pictures and sounds) that are integrated, but the interpretations of the respective forms. That is, in both cases the integration is said to occur on a semantic-conceptual level. The stimuli used in the Beauchamp study required participants to identify the depicted visual object (e.g. a telephone). On the basis of their world knowledge, participants could then activate a number of possible sounds associated with the perceived visual object and decide whether the currently perceived sound matched one of these expectations. Similarly, an iconic gesture first has to be processed unimodally to some extent before it can be associated with the co-expressive speech unit. Thus, the integration of pictures and sounds and the integration of gesture and speech have in common that the unimodal information first has to be processed and semantically interpreted individually to some extent, before an interaction between auditory and visual information can occur. However, the two audiovisual interaction types differ in complexity. In the Beauchamp study, the semantic relationship between auditory and visual information was fixed: the visual object was always presented with the sound that such an object typically creates. In contrast, the semantic relationship between iconic gestures and speech is not fixed. A sentence such as During the game, he returned the ball can be accompanied by a gesture that depicts the form of the ball, or by a gesture that focuses on the returning motion. Moreover, the gesture might primarily depict the trajectory of the ball’s movement, the manner (rolling, sliding, ...) or a combination of trajectory and manner. Finally, the gesture can depict the scene from a character viewpoint (i.e. that of the person returning the ball) or from an observer viewpoint. How the gesture is related to speech is not defined a priori, but has to be detected by the listener on an ad hoc basis. Thus, the comprehension of iconic gestures requires complex semantic interactions between gestural and auditory information. So far, no studies have investigated whether the STS also houses these complex multimodal processes underlying co-speech iconic gesture comprehension.
The present experiment aimed to locate brain areas involved in the processing of co-speech iconic gestures. As described above, one can approach the comprehension of iconic gestures from a multimodal perspective. Investigating the putative multimodal integration sites for gesture and speech would entail an experimental design with a gesture-only, a speech-only and a bimodal gesture + speech condition, as suggested, for instance, by Calvert and Thesen (2004). The problem with such a manipulation, however, is that it neglects the one-sided dependency between the two information channels. Whereas understanding speech does not depend on gesture, iconic gestures are dependent upon the accompanying speech in that these gestures are only distinctly meaningful in their co-speech context. It is generally agreed that the meaning of decontextualized iconic gestures is very imprecise (e.g. Cassell et al., 1999; Krauss et al., 1991). Thus, when presenting a gesture-only condition to participants, one runs a great risk of inducing artefactual processing strategies. As McNeill has stated: “It is profoundly an error to think of gesture as a code or ‘body language’, separate from spoken language. [...] It makes no more sense to treat gestures in isolation from speech than to read a book by looking at the ‘g’s.” (McNeill, 2005, p. 4). An independent group of researchers led by Robert Krauss has likewise concluded that decontextualized iconic gestures convey little meaning to the listener and that the relationship between auditory, visual and audiovisual information is not well captured by a linear model (Krauss et al., 1991, Experiment 3).
Rather than adopting a strict multisensory perspective, the present study approaches the comprehension of co-speech iconic gestures by means of a disambiguation paradigm, where lexically ambiguous sentences (e.g. Sie berührte die Maus, She touched the mouse) are accompanied either by disambiguating iconic gestures or meaningless grooming movements. Such a disambiguation paradigm has several advantages. First, it has some external validity. Holler and Beattie (2003) have shown that speakers spontaneously produce a substantial amount of iconic gestures when asked to explain the different word meanings of a homonym. Second, in a disambiguation paradigm, the iconic gestures are not removed from their co-speech context, which excludes the possibility of a gesture-only condition inducing artefactual processing strategies. Third, the influence of the speech channel, which is certainly the channel with the highest information content, is perfectly controlled for, because the sentences are physically identical in the critical experimental conditions.
Thus, all of the observed differences in a disambiguation paradigm can only be due to the accompanying hand movement (i.e. the main effect) or to the interaction between the hand movement and the spoken sentence. The challenge in interpreting the results is to determine which one – main effect or interaction – actually caused an observed activation difference. One can think of the present study as an exploratory study in the evolving field of co-speech gesture comprehension. It identifies regions possibly involved in the interaction between iconic gestures and speech in a paradigm with high external validity.
In the present experiment, only the gestures, but not the meaningless grooming movements, bias the interpretation of the sentence, resulting in a disambiguation of the homonym. That is, only in the case of gesture is there an interaction between the visually and the auditorily conveyed information. On the basis of the literature, we hypothesized that the processing of co-speech gestures would elicit greater levels of activation in the STS than the processing of the meaningless co-speech grooming movements.
To elucidate the role of the left IFG (i.e. BA 44, 45 and 47) in local gesture–speech interactions, we additionally included a manipulation of word meaning frequency in the present study. All sentences could be interpreted either in terms of a more frequent dominant meaning (e.g. the animal meaning of mouse) or in terms of the less frequent subordinate meaning (e.g. the computer device meaning of mouse). Because previous studies have shown that the processing of lexically low-frequency words recruits the left IFG more strongly than the processing of high-frequency words (Fiebach et al., 2002; Fiez et al., 1999; Joubert et al., 2004), we hypothesized that the processing of subordinate gestures would elicit greater levels of activation in the left IFG than the processing of dominant gestures. Alternatively, if the left IFG (and in particular its anterior inferior portion) is the site of multimodal gesture–speech interactions not only at the global level, as suggested by Willems et al. (2006), but also at the local level, greater levels of activation for gestures as compared to grooming should be observed in this region.
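These two hypotheses can be expressed as simple contrast vectors over the experimental conditions. The following sketch is purely illustrative and rests on assumptions that go beyond the design description given here: it presumes one regressor per condition, ordered as grooming, dominant gesture and subordinate gesture.

import numpy as np

# Assumed ordering of the condition regressors (illustrative, not the authors' design matrix).
conditions = ["grooming", "dominant_gesture", "subordinate_gesture"]

# Hypothesis 1 (STS): co-speech gestures elicit more activation than grooming movements.
gesture_vs_grooming = np.array([-2.0, 1.0, 1.0])

# Hypothesis 2 (left IFG): subordinate gestures elicit more activation than dominant gestures.
subordinate_vs_dominant = np.array([0.0, -1.0, 1.0])

for label, weights in [("gesture > grooming", gesture_vs_grooming),
                       ("subordinate > dominant", subordinate_vs_dominant)]:
    print(label, dict(zip(conditions, weights)))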
Section snippets
Participants
Seventeen native speakers of German (10 females), age 21–30 (mean age 25.7, SD = 2.8) participated in this experiment after giving informed written consent following the guidelines of the Ethics committee of the University of Leipzig. All participants were right-handed (mean laterality coefficient 92.7, SD = 11.3, Oldfield, 1971) and had normal or corrected-to-normal vision. None reported any known hearing deficits.
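Laterality coefficients of this kind are conventionally derived from the Edinburgh inventory as LQ = 100 × (R − L) / (R + L), where R and L denote the number of right-hand and left-hand preferences across the inventory items, so that values close to +100 indicate consistent right-handedness.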
Homonyms
The present study is based on a set of unbalanced German homonyms (for a description
Behavioral results
Accuracy of responses and reaction times were recorded during the functional measurement. Here, we first report the accuracy data, followed by the reaction time data.
In general, participants reliably selected the intended target word after both the dominant as well as the subordinate gesture videos (dominant: 91.6% correct, subordinate: 88.2% correct, see Fig. 1). Differences in the performance of participants were analyzed in a repeated-measures ANOVA with the dependent variable performance
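For illustration, an accuracy analysis of this kind could be set up as a repeated-measures ANOVA along the following lines. This is a minimal sketch rather than the study's actual analysis code: the single within-subject factor, its levels and all data values are invented for demonstration, and the statsmodels package is assumed to be available.

import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Invented per-subject accuracy data: one observation per subject and gesture condition.
data = pd.DataFrame({
    "subject":  [1, 1, 2, 2, 3, 3],
    "gesture":  ["dominant", "subordinate"] * 3,
    "accuracy": [0.93, 0.89, 0.90, 0.86, 0.92, 0.90],
})

# Repeated-measures ANOVA with accuracy as the dependent variable and
# gesture type (dominant vs. subordinate) as a within-subject factor.
result = AnovaRM(data, depvar="accuracy", subject="subject", within=["gesture"]).fit()
print(result.anova_table)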
Discussion
The present study investigated the neural correlates of the processing of co-speech gestures. Sentences containing an unbalanced ambiguous word were accompanied by either a meaningless grooming movement, a gesture supporting the more frequent dominant meaning or a gesture supporting the less frequent subordinate meaning. We had two specific hypotheses in mind when designing this experiment. First, we expected that the STS would be more involved in the processing of co-speech gestures than in the processing of the meaningless grooming movements.
Conclusion
The present study investigated the neural correlates of co-speech gesture processing. The processing of speech accompanied by meaningful hand movements reliably activated the left posterior STS, possibly reflecting the multimodal semantic interaction between a gesture and its co-expressive speech unit. The processing of co-speech gestures additionally elicited a fronto-parietal system of activations in classical human mirror neuron system brain areas. The mirror neuron system is suggested to be
Acknowledgments
We are grateful to Angela Friederici, who kindly supported the research described here; to Christian Obermeier, who carried out large parts of the gating study; and to Karsten Müller, who was of great help during the analysis of the data.
References (76)
- Social perception from visual cues: role of the STS region. Trends Cogn. Sci. (2000)
- See me, hear me, touch me: multisensory integration in lateral occipital–temporal cortex. Curr. Opin. Neurobiol. (2005)
- Integration of auditory and visual information about objects in superior temporal sulcus. Neuron (2004)
- The role of ventral premotor cortex in action execution and action understanding. J. Physiol. (Paris) (2006)
- Multisensory integration: methodological approaches and emerging principles in the human brain. J. Physiol. (Paris) (2004)
- Evidence from functional magnetic resonance imaging of crossmodal binding in the human heteromodal cortex. Curr. Biol. (2000)
- Visual motion and the human brain: what has neuroimaging told us? Acta Psychol. (Amst.) (2001)
- Effects of lexicality, frequency, and spelling-to-sound consistency on the functional anatomy of reading. Neuron (1999)
- The role of the dorsal stream for gesture production. NeuroImage (2006)
- Event-related fMRI: characterizing differential responses. NeuroImage (1998)
- Should we reject the expertise hypothesis? Cognition
- Neuroimaging of syntax and syntactic processing. Curr. Opin. Neurobiol.
- Communicating hands: ERPs elicited by meaningful symbolic hand postures. Neurosci. Lett.
- Neural mechanisms of imitation. Curr. Opin. Neurobiol.
- Beyond a single area: motor control and language within a neural architecture encompassing Broca’s area. Cortex
- Neural correlates of lexical and sublexical processes in reading. Brain Lang.
- Neural correlates of bimodal speech and gesture comprehension. Brain Lang.
- What does cross-linguistic variation in semantic coordination of speech and gesture reveal? Evidence for an interface representation of spatial thinking and speaking. J. Mem. Lang.
- Pointing and voicing in deictic expressions. J. Mem. Lang.
- LIPSIA — a new software system for the evaluation of functional magnetic resonance images of the human brain. Comput. Med. Imaging Graph.
- Observing complex action sequences: the role of the fronto-parietal mirror neuron system. NeuroImage
- The assessment and analysis of handedness: the Edinburgh inventory. Neuropsychologia
- Motor and cognitive functions of the ventral premotor cortex. Curr. Opin. Neurobiol.
- Auditory–visual speech perception examined by fMRI and PET. Neurosci. Res.
- Listening to talking faces: motor cortical activation during speech perception. NeuroImage
- Sentence Comprehension
- I know what you are doing: a neurophysiological study. Neuron
- Integration of letters and speech sounds in the human brain. Neuron
- A general statistical analysis for fMRI data. NeuroImage
- How iconic gestures enhance communication: an ERP study. Brain Lang.
- Assessing knowledge conveyed in gesture: do teachers have the upper hand? J. Educ. Psychol.
- Is working memory still working? Eur. Psychol.
- Do iconic hand gestures really contribute anything to the semantic information conveyed by speech? An experimental investigation. Semiotica
- An experimental investigation of some properties of individual iconic gestures that mediate their communicative power. Br. J. Psychol.
- Unraveling multisensory integration: patchy organization within human STS multisensory cortex. Nat. Neurosci.
- Speech–gesture mismatches: evidence for one underlying representation of linguistic and nonlinguistic information. Pragmat. Cogn.
- Aspects of the theory of syntax
- The kinetic occipital region in human visual cortex. Cereb. Cortex