Speech Communication

Volume 52, Issue 6, June 2010, Pages 555-564
Prosody off the top of the head: Prosodic contrasts can be discriminated by head motion

https://doi.org/10.1016/j.specom.2010.02.006

Abstract

The current study investigated people’s ability to discriminate prosody-related head and face motion from videos showing only the upper face of the speaker saying the same sentence with different prosody. The first two experiments used a visual–visual matching task. These videos were either fully textured (Experiment 1) or showed only the outline of the speaker’s head (Experiment 2). Participants were presented with two stimulus pairs of silent videos, with the task of selecting the pair that had the same prosody. The overall results of the visual–visual matching experiments showed that people could discriminate same- from different-prosody sentences with a high degree of accuracy. Similar levels of discrimination performance were obtained for the fully textured (containing rigid and non-rigid motion) and the outline only (rigid motion only) videos. Good visual–visual matching performance shows that people are sensitive to the underlying factor that determined whether the movements were the same or not, i.e., the production of prosody. However, auditory–visual matching provides a more direct test of people’s sensitivity to how head and face motion relates to spoken prosody. Experiments 3 (with fully textured videos) and 4 (with outline only videos) employed a cross-modal matching task that required participants to match auditory with visual tokens that had the same prosody. As with the previous experiments, participants performed this discrimination very well, and no decline in performance was observed for the outline only videos. This result supports the proposal that rigid head motion provides an important visual cue to prosody.

Introduction

Prosody is a broad term used to describe variations in the auditory speech signal that correspond to changes in the perception of pitch, loudness and duration. Of its many functions, prosody can indicate general speaker characteristics (such as age, sex, emotional and physiological states), assist in organising an incoming signal into meaningful units for understanding, and convey information extending beyond that provided by sentence syntax, grammar, and the symbolic content of speech sounds (Nooteboom, 1997). Prosody has primarily been studied in terms of how it affects spoken word and sentence recognition (see Cutler et al. (1997) for a review). Yet it is clear that, when available, visual information about speech production (visual speech) also plays a role in speech perception (Benoît and Le Goff, 1998, Davis and Kim, 2004, Sumby and Pollack, 1954, Summerfield, 1992). The current study examined the visual correlates of prosody, and specifically investigated whether such signals are reliable enough to drive perceptual discrimination across multiple productions in a situation where visual speech information is only available from the upper half of the speaker’s face (the rationale for examining these visual cues is outlined below).

This study examined two types of prosodic contrasts: prosodic focus and sentence mode. Prosodic focus describes the situation where a word is made perceptually more salient than the other words in a sentence, and is used to emphasise newness, importance, or to disambiguate a particular item within the sentence. The focussed item in such an utterance can be thought of as having ‘narrow focus’, since the point of informational focus has been narrowed down to that particular item (Bolinger, 1972). This narrow focus contrasts with ‘broad focused’ statements that contain no explicit point of informational focus. The prosodic type we have called ‘sentence mode’ refers to changes in an utterance that signal particular sentence phrasings, such as questions or statements. By mimicking the syntactic content of a declarative statement, ‘echoic questions’ can be phrased without the use of an interrogative pronoun. That is, echoic questions contain the same word content as a statement, yet imply a level of uncertainty by varying suprasegmental acoustic features (Bolinger, 1989). The acoustic properties associated with prosodic focus and sentence mode have been intensively studied and well described. In short, narrow focused renditions (relative to broad focused ones) tend to be articulated with a higher fundamental frequency (F0), have greater intensity, and longer syllable duration (Krahmer and Swerts, 2001). The different sentence modes typically vary in the following ways: statements can be characterised as having a steadily falling F0 contour and ending with a sharp and definite fall signalling finality, while the opposite pattern is observed for echoic questions (i.e., these have a rising F0 on the final syllable, see Fig. 1). Statements also tend to have slightly shorter final syllable durations, and steeper final falls in intensity compared to the same utterances phrased as questions (Eady and Cooper, 1986).
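
To make these acoustic descriptions concrete, the sketch below extracts the correlates just listed (mean F0, the direction of the final F0 movement, intensity and duration) from a recording. It is only an illustration: it assumes the librosa library is available, and the file names are hypothetical placeholders rather than the stimuli used in this study.

import numpy as np
import librosa

def prosodic_profile(wav_path, sr=16000):
    # Load the utterance and extract an F0 contour with probabilistic YIN;
    # unvoiced frames are returned as NaN.
    y, sr = librosa.load(wav_path, sr=sr)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C6"), sr=sr)
    voiced_f0 = f0[~np.isnan(f0)]

    # Frame-level intensity (RMS energy) and overall utterance duration.
    rms = librosa.feature.rms(y=y)[0]
    duration = len(y) / sr

    # Slope of F0 over the final voiced frames: positive values suggest a
    # final rise (question-like), negative values a final fall (statement-like).
    tail = voiced_f0[-20:]
    final_slope = np.polyfit(np.arange(len(tail)), tail, 1)[0]

    return {"mean_f0_hz": float(np.mean(voiced_f0)),
            "final_f0_slope": float(final_slope),
            "mean_rms": float(np.mean(rms)),
            "duration_s": duration}

# Hypothetical recordings of the same sentence produced as a statement and as
# an echoic question; the question should show a rising final F0 and a longer
# final portion, in line with Eady and Cooper (1986).
print(prosodic_profile("statement.wav"))
print(prosodic_profile("question.wav"))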

Although prosody has been studied largely in terms of its impact on acoustic properties, there is a burgeoning literature that has investigated how prosody manifests in visual speech. Following on from these studies, the current investigation examined people’s sensitivity to prosodic contrasts likely signalled by visual speech; in this study, however, we restricted visual speech information to the upper half of the speaker’s head and face. The motivation for concentrating on signals from the upper face and head motion was to determine whether visual signals not directly related to speech articulation could support prosody-related judgements. The proposition that visual prosodic cues can be obtained from peri-oral regions has received tentative support in the finding that people presented with extended monologues spent a considerable amount of time (65%) looking at the eyes and upper half of the speaker’s face (Vatikiotis-Bateson et al., 1998). Furthermore, even when noise was added to the auditory signal, perceivers still looked at the upper face approximately half the time, although it might be expected that gaze would shift to the mouth and jaw regions under these conditions. The maintenance of gaze towards the upper face suggests that other speech-related information, such as prosody, is obtained from these regions.

More direct evidence for the role of the upper face in providing prosodic cues comes from the study by Lansing and McConkie (1999). These authors ascertained where people looked when explicitly trying to decide on lexical content, prosodic focus, or sentence mode. To do this, participants’ gaze direction was tracked while they viewed visual only presentations of two-word sentences and identified what was said, which word was narrow focused, or whether the sentence was a statement or question. It was found that the pattern of how long people looked at the upper, middle and lower parts of the face changed depending on the type of judgement being made. For judgements of prosodic focus and sentence mode, people looked longer at the middle and upper areas of the face, whereas they looked longer at the lower face when deciding upon what was said. However, eye-gaze patterns do not necessarily provide a complete picture of all the visual information that can be processed, i.e., such behavioural observations do not index information processed in peripheral vision. Indeed, it appears that motion information can be accurately processed in the periphery (Lappin et al., 2009, McKee and Nakayama, 1984), thus movements associated with speech production (such as jaw, lip and mouth motion) do not need to be visually fixated in order to be accurately processed (Paré et al., 2003). To be certain as to what signals were in fact available to perceivers, Lansing and McConkie (1999) restricted face motion signals to particular facial regions (either full face or only lower face motion). When presented with either full face motion or lower face motion, the identification of word content and sentence focus exceeded 95% correct, indicating that motion information in the lower half of the face was sufficient to perform such tasks. However, when identifying sentence mode, performance markedly declined when upper face motion was unavailable in comparison to full facial motion. This latter result suggests that visual information from the upper face may be important for the accurate perception of sentence mode, and raises the question of which face or head signals convey this information.

Since many auditory and visual speech properties originate from the same temporal process (i.e., speech production), it is clear why visual speech linked to the motion of the articulators is closely related to acoustics. However, it is less clear why the visible regions of the face beyond the mouth and jaw need be linked to speech acoustics. It is intriguing therefore that correlations have been found between different types of head and face motion and the change in acoustics as a sentence is uttered. The auditory property most studied is F0, with changes in this measure related to movements of the eyebrows (Cavé et al., 1996, Guaïtella et al., 2009) and rigid head motion (Burnham et al., 2007, Ishi et al., 2007, Yehia et al., 2002). In general, it has been found that a significant positive correlation exists between F0 and face and head motion. For example, Cavé et al. (1996) observed the non-rigid eyebrow movements of 10 speakers across various conversational settings, and found that rising F0 patterned with eyebrow movements. However, these movements did not occur for every change in F0, suggesting that the coupling was functional, rather than an automatic uncontrolled consequence of articulation. Similarly, a functional relationship was suggested between variation in F0 and rigid head motion in Yehia et al. (2002). These results indicate that prosody may be signalled both by non-rigid eyebrow movements and rigid head motion.
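
The analyses described above essentially quantify how closely the F0 track follows head movement over the voiced portions of an utterance. A minimal sketch of that kind of correlation is given below, using synthetic per-frame signals as stand-ins for a measured F0 contour and a rigid head-pitch trace (both arrays are fabricated for illustration, not data from the studies cited).

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_frames = 200

# Stand-in signals sampled at a common frame rate: a slowly drifting head-pitch
# rotation (degrees) and an F0 contour (Hz) loosely coupled to it, with NaN
# marking unvoiced frames.
head_pitch = np.cumsum(rng.normal(0.0, 0.1, n_frames))
f0 = 120 + 10 * head_pitch + rng.normal(0.0, 2.0, n_frames)
f0[rng.random(n_frames) < 0.2] = np.nan

# Correlate only over voiced frames, as the cited production studies do.
voiced = ~np.isnan(f0)
r, p = pearsonr(f0[voiced], head_pitch[voiced])
print(f"F0 vs head pitch: r = {r:.2f}, p = {p:.3g}")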

More recent studies have attempted to identify the nature of the visual speech signals that co-occur with prosodic focus, and to examine whether visual prosody can be identified in perception experiments. They include naturalistic production studies that have examined the motion information produced in conjunction with the auditory signal (with particular focus on either oral or peri-oral areas), and manipulation studies that have examined how changes in auditory signals affect the way prosody is perceived. In reviewing these studies, we will concentrate on results pertinent to peri-oral signals related to prosody.

Building on the work of Dohen and Lœvenbruck (2005), which showed that the production of focal syllables involved significantly larger lip areas, Dohen et al. (2006) investigated whether movements beyond the oral area (eyebrow and head movements) might also be associated with the production of prosodic focus. This study used motion capture to measure lower face movements (lip opening, lip spreading and jaw motion) as well as head and eyebrow movements in five French speakers. A relationship was found between eyebrow motion (rising) and the production of prosodic focus for three of the five speakers, and a relationship between head nods and focus production for one speaker. Scarborough et al. (2009), who also used motion capture, examined the visual correlates of lexical (using reiterant syllable-based versions of words) and phrasal stress in three male speakers of Southern Californian English. For lexical stress, it was found that there was greater head motion for the stressed syllables, but no differences in eyebrow movement. For phrasal stress, every measure (including eyebrow measures) distinguished stressed from unstressed words.

Dohen and Lœvenbruck (2005) and Scarborough et al. (2009) conducted additional perception studies to determine if observers were sensitive to visual prosody. Dohen and Lœvenbruck showed that when participants were presented with soundless videos of speakers uttering a sentence that had narrow focus on the subject, verb or object phrase (or broad focus), they could successfully identify the focused constituent at better than chance levels. Scarborough et al. also showed that when participants were presented with three stimuli to decide which received stress (with an additional “no stress” option), lexical and phrasal stress in visual only presentation could both be perceived at better than chance levels. Likewise, a similar study by Srinivasan and Massaro (2003) showed that participants could identify whether a sentence presented in a soundless video was a statement or an echoic question, indicating that visual speech alone is capable of conveying information relating to speech mode. It should be noted that all of the above perceptual studies presented the speaker’s whole head, so it is not possible to separate the effect of visual cues directly related to speech production (mouth and jaw) from those signalled by such things as eyebrows and head motion.

In a study similar in concept to that of Lansing and McConkie (1999), Swerts and Krahmer (2008) used monotonic renditions of broad focused auditory statements paired with narrow focused visual speech tokens and asked perceivers to identify which word within the utterance received prosodic focus. In the conditions most relevant to the current study, Swerts and Krahmer presented participants with full face videos, videos restricted to the upper face, or videos restricted to the lower face. It was found that performance varied across viewing condition: performance on the videos showing only the upper face was equal to that in the full face condition (77.3% correct) and significantly better than in the lower face condition (51.4% correct, which was itself better than the chance level of 25%). Swerts and Krahmer interpreted these findings as showing that the upper face has greater cue value for phrasal prominence.

In sum, it has been shown that information from the upper face can provide visual cues for prosodic focus and sentence mode. What remains to be determined is the type of information from this area that provides prosodic cues. Given that Scarborough et al. (2009) showed that lexical stress was associated with greater head motion but not with eyebrow movements, it would seem that rigid head motion might be the principal cue. The following experiments investigated whether rigid head motion when separated from other upper face cues (e.g., eyebrow motion, brow shape and texture information) is capable of conveying prosodic information.


Experiments 1 and 2 (visual–visual matching)

The first pair of experiments gauged perceivers’ sensitivity to prosody-related visual cues from the speaker’s upper face by using a visual–visual discrimination task (adopting the method used in Davis and Kim (2006)). The aim of these experiments was to determine whether visual signals related to prosody can be used to drive reliable perceptual discrimination. That is, if there are visual cues from the upper head and face that consistently signal prosody, then participants should be able

Materials

The materials consisted of 10 non-expressive sentences drawn from the IEEE (1969) list that describe fairly mundane events of minimal emotive content (see Appendix). Audio-visual recordings were made of two age-matched native male speakers of Australian English in a well-lit, sound-attenuated room using a digital video camera (25 fps). Audio was recorded at 44.1 kHz, 16-bit stereo with an externally connected lapel microphone.

Each sentence was recorded in three speech conditions: a broad focused

Results and discussion

Mean percentages of correct responses for the textured (Experiment 1) and outline videos (Experiment 2) are shown in Table 2. The first thing to note is that performance was considerably better than that expected by chance (i.e., 50%). Indeed, high levels of accuracy were observed across both stimulus presentation conditions. A series of one-sample t-tests indicated that across both textured and outline conditions, the percentage of correct responses significantly differed from chance (the

Experiments 3 and 4 (auditory–visual matching)

Experiments 3 and 4 followed the basic design of the earlier experiments (using a 2AFC discrimination task) except that these experiments used auditory–visual stimulus pairs, and required participants to select the pair in which the prosody of the auditory and visual stimuli matched. Experiment 3 used full-colour videos displaying a textured image, and Experiment 4 used outline videos (see Fig. 2).

Participants

The same participants who completed Experiments 1 and 2 also took part in Experiments 3 and 4, respectively. The order in which participants completed the experimental tasks was counter-balanced (i.e., some took part in the auditory–visual task first, before completing the visual–visual matching task, and vice versa). There was a break (of several minutes) in between sessions in order to minimise potential order and exposure effects.

Materials and procedure

The stimuli were the same as those outlined in Section 3.1.

Results and discussion

Mean percentages of correct responses for both the textured and outline presentation conditions are shown in Table 3. As can be seen, performance was better than would be expected by chance. A series of one-sample t-tests confirmed that the percentage of correct responses in both stimulus presentation conditions differed significantly from chance (see Table 3 for t-test values).
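
The chance-level comparisons reported here (and in Experiments 1 and 2) amount to one-sample t-tests of per-participant percent-correct scores against the 50% chance level of the two-alternative task. A minimal sketch of that test follows, using illustrative scores rather than the data in Table 3.

from scipy.stats import ttest_1samp

# Illustrative per-participant percent-correct scores (not the data from
# Table 3); chance in the two-alternative matching task is 50%.
percent_correct = [85.0, 92.5, 78.0, 88.0, 95.0, 81.5, 90.0, 84.0]
result = ttest_1samp(percent_correct, popmean=50.0)
print(f"t({len(percent_correct) - 1}) = {result.statistic:.2f}, p = {result.pvalue:.4f}")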

A 2 × 3 mixed repeated measures ANOVA was conducted to determine whether task performance differed across the stimulus presentation

General discussion

The current study focussed on visual cues available from the speaker’s upper face and head. This was done not because such cues are likely to be stronger than lower face ones (indeed, previous studies have reported that visual cues from the lip, mouth and jaw are capable of signalling prosody, see Dohen et al., 2004; Erickson et al., 1998), but because such cues are not so directly tied to articulation, and as such, may represent a behaviour that has been reinforced because it achieves better

Acknowledgements

The authors wish to thank Bronson Harry for his patient assistance with the recording of audio-visual stimuli, and two anonymous reviewers and the guest editors for their helpful suggestions to improve the manuscript. The first author also wishes to acknowledge the support of the School of Psychology, University of Western Sydney and MARCS Auditory Laboratories for providing generous financial support. The second and third authors acknowledge support from Australian Research Council (DP0666857

References (35)

  • Bolinger, D., 1989. Intonation and its Uses.
  • Burnham, D., et al., 2007. Rigid vs non-rigid face and head motion in phone and tone perception. Interspeech.
  • Cavé, C., Guaïtella, I., Bertrand, R., Santi, S., Harley, F., Espesser, R., 1996. About the relationship between...
  • Cutler, A., et al., 1997. Prosody in the comprehension of spoken language: a literature review. Language and Speech.
  • Davis, C., et al., 2004. Audiovisual interactions with intact clearly audible speech. Quart. J. Exp. Psychol.
  • Dohen, M., et al., 2005. Audiovisual production and perception of contrastive focus in French: a multispeaker study. Interspeech.
  • Dohen, M., et al., 2006. Visual correlates of prosodic contrastive focus in French: description and inter-speaker variability. Speech Prosody.