Emotion in voice matters: Neural correlates of emotional prosody perception

https://doi.org/10.1016/j.ijpsycho.2013.06.025

Highlights

  • N1 activation reflected early emotion detection in parietal regions of both hemispheres.

  • P2 showed variations in topographic activation for emotion conditions.

  • Greater N3 amplitudes in frontal sites reflected more complex cognitive processing.

  • This study supports Schirmer and Kotz's (2006) model of vocal emotion perception.

Abstract

The ability to perceive emotions is imperative for successful interpersonal functioning. The present study examined the neural characteristics of emotional prosody perception with an exploratory event-related potential (ERP) analysis. Participants were 59 healthy individuals who completed a discrimination task presenting 120 semantically neutral word pairs from five prosody conditions (happy/happy, angry/angry, neutral/neutral, angry/happy, happy/angry). The task required participants to determine whether the words in each pair were spoken in the same or different emotional prosody. Reflecting an initial processing stage, the word 1 N1 component had its greatest amplitude in parietal regions of the hemispheres and was largest for emotional compared with neutral stimuli, indicating detection of emotion features. A second processing stage, represented by the word 1 P2, showed similar topographic effects; however, amplitude was largest for happy prosody in the left hemisphere and for angry prosody in the right, illustrating differentiation of emotions. At the third processing stage, word 1 N3 amplitude was largest in frontal regions, indicating that later cognitive processing occurs in the frontal cortex. The N3 was largest for happy prosody, which was also recognised least accurately compared with angry and neutral. The present results support Schirmer and Kotz's (2006) model of vocal emotion perception: the ERP components reflected its three primary stages of emotional prosody perception while controlling for semantic influence.

Introduction

The perception of emotion is essential for effective communication and social functioning (Mitchell and Ross, 2008, Plutchik, 1993). Emotion is portrayed through facial expressions and through patterns of gesture and speech during social interaction. The ability to perceive emotions from these cues is facilitated by multiple sensory channels, including visual and auditory (Adolphs et al., 2002). A great deal of research has focused on the neural mechanisms underpinning facial expressions; however, speech and vocal emotion have received comparatively little investigation. This may stem from the conceptual and methodological difficulties associated with operationalising and empirically observing vocal emotion (Scherer, 1986).

Speech draws on both semantic and paralinguistic components to provide meaning (Mitchell and Ross, 2008). Semantic content encompasses the ‘what’ of expression, whereas paralinguistic content involves ‘how’ a phrase is expressed. Paralinguistic features such as pitch, duration, and intensity constitute what is commonly known as prosody (Aziz-Zadeh et al., 2010, Mitchell and Ross, 2008, Rood et al., 2009, Schirmer and Kotz, 2006). Prosody performs a critical linguistic function in speech, providing information about the grammatical structure of utterances. It is also naturally influenced by affective states and, consequently, provides important cues that facilitate emotion recognition (Truong and van Leeuwen, 2007). Emotional prosody is modified by physiological alterations in arousal, including changes in heart rate, blood flow, and subglottal and laryngeal muscle tension (Schirmer and Kotz, 2006). The ability to produce and interpret affective prosody is thus essential for social functions such as asserting power, equality, and esteem, as well as avoiding unwanted advances or danger and displaying intimacy, affection, and affiliation (Mitchell and Ross, 2008).

Emotion perception has been attributed to two neural processing streams in a model proposed by Tucker et al. (1995) and further advanced by Phillips et al. (2003). According to this model, a ventral stream is involved in the processing of sensory information and the early, automatic identification of emotional state; this stream is mediated by the amygdala, insula, ventral striatum, ventral anterior cingulate gyrus, and ventral prefrontal cortex. A dorsal stream involving the hippocampus, dorsal anterior cingulate cortex, and dorsal prefrontal cortex is associated with the controlled, effortful processing of complex social cognitions and behaviours. Support for the involvement of these neural pathways in processing emotion, including affective prosody, comes from human lesion, neuroimaging, and animal studies (Cosgrove and Rauch, 1995, Phillips et al., 1998, Reiman et al., 1997, Tucker et al., 1995). Further theorising has implicated structures in the right hemisphere, which are posited to have a role in processing affective information both in general (Adolphs et al., 2002) and in prosody (Bowers et al., 1987), particularly the processing of negatively-valenced emotions via early, automatic orientation to threat (Davidson, 1992).

Schirmer and Kotz (2006) used fMRI and ERP evidence to develop a model of vocal emotion perception that mirrors Tucker et al.'s (1995) and Phillips et al.'s (2003) distinction between automatic and controlled/effortful processing of emotional stimuli. Using event-related potentials (ERPs), Schirmer and Kotz (2006) delineated the temporal processing stages that organise the perceptual and cognitive mechanisms underlying vocal emotion recognition. Stage one of the model is the sensory processing stage, whereby the physical parameters of the signal (e.g. pitch, volume) are detected by sensory neurons and conveyed to the central nervous system. This stage occurs prior to emotion perception and is best represented by the N1. The auditory N1 is thought to be predominant in the right temporal lobe, based on a connectivity analysis of cerebral activation by Ethofer et al. (2006b) which suggests that the detection of prosodic features occurs in right-hemispheric acoustic regions. The second stage, whereby emotionally significant prosodic cues are integrated (Schirmer and Kotz, 2006), occurs predominantly in the posterior superior temporal sulcus (Wildgruber et al., 2009) and represents automatic emotion perception. This process is thought to occur around 200 ms post-stimulus onset and is therefore represented by the P2 ERP component. Finally, the third stage of the model involves the cognitive aspects of semantic integration and contextual evaluation (i.e. controlled emotion perception) and encompasses processes such as inhibition, conflict, and working memory (Schirmer and Kotz, 2006). Stage three is believed to elicit a large, late negativity, which Schirmer and Kotz (2006) identify as the N3 ERP component and which can be sourced to the dorsolateral prefrontal cortex.

Schirmer and Kotz's (2006) model represents a major advance in conceptualising the neural processes underpinning affective prosody; however, empirical evidence is still lacking. Specifically, studies of auditory emotion perception cited in support of the model have not always isolated prosody from semantic content. The ‘what’ and ‘how’ of speech are independent phenomena that interact to produce meaning. Semantic information renders stimuli more complex, meaning that activation to emotional stimuli may reflect not only the portrayed emotion but also the processing of verbal information. This complexity may particularly affect the N3 component, which reflects contextual evaluation, such that larger N3 amplitudes may reflect semantic content rather than emotional prosody itself.

The present study aimed to identify the neural processes involved in the automatic and cognitively controlled discrimination of emotion in prosody using an ERP analysis while controlling for semantic content. In doing so, it endeavoured to evaluate the vocal emotion perception model of Schirmer and Kotz (2006) without this potential confound. Two approaches were taken to minimise the engagement of verbal processing. First, semantic content was minimised by using a standardised set of emotionally neutral nouns for all facets of the study. Second, a “same-different” discrimination task was employed rather than the labelling paradigm used by Schirmer and Kotz (2006), because identification is more demanding on short-term memory and word-retrieval capabilities than discrimination (Charbonneau et al., 2003, Pell, 2006, Pell and Leonard, 2003, Tompkins and Flowers, 1985). These demands have implications for future research with clinical populations who have memory deficits. The present study also aimed to confirm previous behavioural research suggesting that accuracy is not equal across emotions: neutral words were the most accurately recognised, and angry words were more accurately recognised than happy words (Dimoska et al., 2010). Consequently, the study examined neutral, happy, and angry tones.
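As a concrete illustration of this design, the sketch below builds a balanced trial list for the same-different discrimination task described above. It assumes an equal split of the 120 word pairs across the five prosody conditions and a hypothetical list of neutral nouns; the function name, field names, and word-sampling scheme are illustrative and not taken from the paper.

    import random

    # Prosody pair conditions from the study design: three "same" and two
    # "different" pairings of happy, angry, and neutral prosody.
    CONDITIONS = [
        ("happy", "happy"),
        ("angry", "angry"),
        ("neutral", "neutral"),
        ("angry", "happy"),
        ("happy", "angry"),
    ]

    def build_trial_list(words, n_trials=120, seed=0):
        """Assign semantically neutral words to prosody-pair conditions.

        `words` is a list of neutral nouns (the actual stimulus set is not
        reproduced here); each trial presents one word twice, spoken with the
        prosody of the first and second condition element respectively.
        """
        rng = random.Random(seed)
        per_condition = n_trials // len(CONDITIONS)  # assumed equal split: 24 per condition
        trials = []
        for word1_prosody, word2_prosody in CONDITIONS:
            for _ in range(per_condition):
                trials.append({
                    "word": rng.choice(words),
                    "word1_prosody": word1_prosody,
                    "word2_prosody": word2_prosody,
                    "correct_response": "same" if word1_prosody == word2_prosody else "different",
                })
        rng.shuffle(trials)  # randomise presentation order
        return trials

    # Example usage with placeholder stimuli:
    # trials = build_trial_list(["table", "window", "paper"])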

It is proposed that ERP waveform amplitudes will reflect the processes described by Schirmer and Kotz (2006). These effects should be evident in response to the first word of each stimulus pair (word 1). Participants' behavioural responses to the same-different task will indicate whether the emotions are being accurately discriminated. The comparison of “same” with “different” trials can only be explored in the analysis of the second word in each stimulus pair (word 2), as the complex controlled processing necessary for discrimination can only occur after the second word has been presented.

Specifically, it is hypothesised that the word 1 N1 will have greater amplitude over the hemispheres than over the midline, reflecting activation of the temporal regions, especially in the right hemisphere. Further, it is hypothesised that there will be no difference in word 1 N1 amplitude between emotional and neutral conditions, because the N1 reflects pre-emotion processing. It is hypothesised that the word 1 P2 will have greater amplitude over parietal than frontal sites and over the hemispheres than the midline, reflecting activation of the posterior superior temporal sulcus; activation is expected to be larger for emotional than neutral stimuli, reflecting automatic emotion perception. It is also hypothesised that word 1 N3 amplitude will be greater over frontal than parietal regions (reflecting activation of the dorsolateral prefrontal cortex), larger for emotional than neutral stimuli (reflecting greater contextual and cognitive evaluation), and larger for angry than happy stimuli (reflecting cognitive evaluation and response preparation to threat). The word 2 N3 is expected to be larger for different than same word pairs, reflecting controlled emotion perception. In line with theoretical accounts of right-hemisphere involvement, it is hypothesised that right-hemisphere activation for both the P2 and the N3 will be greater than left-hemisphere activation for emotional words, with no difference predicted for neutral words. It is further hypothesised that, within emotional speech, angry words will elicit greater right-hemisphere activation than happy words. In accordance with past research (Dimoska et al., 2010), it is expected that participants will show greater accuracy for different than same word pairs and for angry than happy words.
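The behavioural predictions above reduce to simple accuracy contrasts. A minimal sketch is given below, assuming trial records shaped like those produced in the earlier sketch plus a hypothetical 'response' field holding each participant's answer; none of these names come from the paper.

    from collections import defaultdict

    def accuracy_by_factor(trials):
        """Proportion correct split by pair type and by word 1 emotion.

        `trials` is a list of dicts with hypothetical fields: 'correct_response'
        ('same'/'different'), 'response' (the participant's answer), and
        'word1_prosody' ('happy'/'angry'/'neutral').
        """
        counts = defaultdict(lambda: [0, 0])  # key -> [n_correct, n_total]
        for t in trials:
            correct = int(t["response"] == t["correct_response"])
            for key in (("pair", t["correct_response"]), ("emotion", t["word1_prosody"])):
                counts[key][0] += correct
                counts[key][1] += 1
        return {key: n_correct / n_total
                for key, (n_correct, n_total) in counts.items()}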

Section snippets

Participants

Sixty-one healthy first-year psychology students (35 females) aged 17 to 32 years (M = 19.28, SD = 2.80) participated in the experiment to fulfil a course requirement. The procedure was explained and written consent was obtained in accordance with the University of New South Wales Human Research Ethics Committee (UNSW HREC). Participants were required to complete a detailed demographic screening questionnaire, and only those who spoke English as a first language and had normal or corrected-to-normal

ERP analysis

The results of this study are based on the average peak amplitude values for each condition, elicited from nine electrode sites (FC3, Fz, FC4, C3, Cz, C4, P3, Pz, P4). Table 2 lists the abbreviations used to summarise the effects throughout the Results section.
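For illustration, a minimal sketch of this amplitude extraction follows. The electrode labels come from the text, but the component latency windows and peak polarities are assumptions (the exact windows used in the study are not given in this snippet), and the sketch measures peaks on the trial-averaged waveform rather than on single trials.

    import numpy as np

    # The nine scalp sites analysed in the study.
    ELECTRODES = ["FC3", "Fz", "FC4", "C3", "Cz", "C4", "P3", "Pz", "P4"]

    # Assumed component latency windows (seconds post word onset) and polarities;
    # these values are illustrative, not taken from the paper.
    WINDOWS = {
        "N1": (0.08, 0.15, "neg"),
        "P2": (0.15, 0.25, "pos"),
        "N3": (0.27, 0.50, "neg"),
    }

    def peak_amplitudes(epochs, times):
        """Peak amplitude per component and electrode for one condition.

        `epochs` is an array of shape (n_trials, n_channels, n_times) holding
        baseline-corrected single-trial EEG, with channels ordered as in
        ELECTRODES; `times` is the matching time vector in seconds.
        """
        evoked = epochs.mean(axis=0)  # average over trials -> (n_channels, n_times)
        results = {}
        for component, (tmin, tmax, polarity) in WINDOWS.items():
            mask = (times >= tmin) & (times <= tmax)
            window = evoked[:, mask]
            peaks = window.min(axis=1) if polarity == "neg" else window.max(axis=1)
            results[component] = dict(zip(ELECTRODES, peaks))
        return results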

Discussion

This study aimed to examine ERP correlates of affective prosodic processing in order to advance the functional model of vocal emotion perception proposed by Schirmer and Kotz (2006). It did so by examining three ERP components, the N1, the P2, and the N3, elicited via an affective prosody auditory discrimination task that controlled for semantic content. The results are discussed in order of the hypotheses, paralleling the consecutive stages of emotional prosody perception.


Acknowledgements

JAR is supported by an Australian National Health and Medical Research Council (NHMRC) Postdoctoral Fellowship (Clinical Training; APP1013796).

References (32)

  • K.P. Truong et al. Automatic discrimination between laughter and speech. Speech Communication (2007)

  • R. Adolphs et al. Impaired recognition of social emotions following amygdala damage. Journal of Cognitive Neuroscience (2002)

  • L. Aziz-Zadeh et al. Common premotor regions for the perception and production of prosody and correlations with empathy and prosodic ability. PLoS One (2010)

  • J. Borod et al. Emotional processing deficits in individuals with unilateral brain damage. Applied Neuropsychology (2002)

  • R.J. Davidson. Emotion and affective style: hemispheric substrates. Psychological Science (1992)

  • A. Dimoska et al. Recognizing vocal expressions of emotion in patients with social skills deficits following traumatic brain injury. Journal of the International Neuropsychological Society (2010)