
Image and Vision Computing

Volume 21, Issues 13–14, 1 December 2003, Pages 1125-1133

Reactive memories: an interactive talking-head

https://doi.org/10.1016/j.imavis.2003.08.009

Abstract

We demonstrate a novel method for producing a synthetic talking-head. The method is based on earlier work in which the behaviour of a synthetic individual is generated by reference to a probabilistic model of interactive behaviour within the visual domain—such models are learnt automatically from typical interactions. We extend this work into a combined visual and auditory domain and employ a state-of-the-art facial appearance model. The result is a real-time synthetic talking-head that responds appropriately and with realistic timing to simple forms of greeting.

Introduction

The screen-based ‘talking-head’ is a powerful device for mediating interaction between humans and machines, enabling a form of interaction that mimics direct communication between humans [2], [5], [8], [16], [17], [18], [19], [22]. The experience of realism is further enhanced when the computer is equipped with visual and auditory senses with which to perceive the user [1], [3], [6], [9], [10], [12], [20], [21], [23], [26], [27], [28]. In this symmetric situation, both the human and synthetic head can see and be seen, and can hear and be heard.

Of paramount importance within face-to-face conversation is, of course, the content of what is said. However, the sequence and timing of accompanying facial expressions are also important; mistimed or inappropriate expressions may convey unintended meaning and can therefore be disruptive. It is reasonable to suppose that the same requirements apply to human interaction with a synthetic talking-head.

An approach that begins to meet these requirements is proposed in Refs. [12], [14]. Their idea is based on the common notion of a state space, in which each vector represents the instantaneous configuration of a participant in an interaction. Such vectors are the end-point of a perceptual process within the computer, sensing the human party, and the start-point for a graphical process generating the synthetic individual. An interaction can then be thought of as a pathway through the joint configuration space of the human and synthetic parties. The range of possible interactions is represented as a stochastic process over the joint configuration space, which is learnt through observation of real interactions captured on video. Johnson [13] modelled the profiles of two people shaking hands. Jebara and Pentland [12] modelled head and hand gestures. In both cases, the models were used to drive a synthetic individual in response to past joint behaviour.

In the current paper, we construct a simple synthetic talking-head by adopting the same approach together with combined modelling of speech and facial expression. The principal objective has been to produce a reactive head in which speech utterances and facial expressions are both appropriate and timely. We handle only a few kinds of simple verbal interactions. Nevertheless, the resulting system is indicative of a new kind of medium that could complement the photograph in an album or the home video with a reactive icon of a familiar person—hence the title of the paper.


Representing facial appearance and sound

Configurations of the talking-head over a sampling interval (typically around 0.05 s) are represented by the parameters of a facial appearance model, based on that proposed in Ref. [4], combined with the principal components of spectral coefficients from the corresponding sound fragment. These choices were determined by the need for an internal representation that was both concise, to facilitate construction of a stochastic process model, and capable of being mapped back onto realistic images.
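To make this concrete, the following is a minimal sketch of how such a per-frame representation might be assembled. The spectral-coefficient computation, the number of retained coefficients, and all function names are illustrative choices rather than details taken from the paper.

```python
# Sketch: per-frame representation = appearance-model parameters + PCA of
# spectral coefficients from the matching ~0.05 s sound fragment.
import numpy as np
from numpy.fft import rfft

def spectral_coefficients(frame_samples, n_coeffs=128):
    """Log-magnitude spectrum of one windowed sound fragment (illustrative)."""
    spectrum = np.abs(rfft(frame_samples * np.hanning(len(frame_samples))))
    return np.log(spectrum[:n_coeffs] + 1e-8)

def fit_audio_pca(spectra, n_components=70):
    """PCA over spectral coefficients gathered from the training fragments."""
    mean = spectra.mean(axis=0)
    _, _, vt = np.linalg.svd(spectra - mean, full_matrices=False)
    return mean, vt[:n_components]          # mean and principal directions

def encode_frame(appearance_params, frame_samples, audio_mean, audio_basis):
    """Concatenate appearance-model parameters with projected sound coefficients."""
    audio_params = audio_basis @ (spectral_coefficients(frame_samples) - audio_mean)
    return np.concatenate([appearance_params, audio_params])
```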

Representing interactions

A separate face model and sound model are built for each speaker. In our experiments, we use 15 training sequences of the same pair of individuals, with video recorded at 15 frames per second and sound at 11 kHz. The video is re-sampled to 21.53 frames per second to match the rate at which sound frames occur. Facial appearance is encoded in 14 parameters and each sound frame is encoded in 70 parameters.

An interaction is represented by the joint behaviour of the two individuals.
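As an illustration of this joint representation, the sketch below stacks both participants' face and sound parameters into a single vector per time-step. The nearest-neighbour video resampling is an assumption; the paper only states that the video is re-sampled to match the sound-frame rate.

```python
# Sketch: one joint observation per time-step for the two speakers,
# with video frames resampled to the sound-frame rate beforehand.
import numpy as np

VIDEO_FPS = 15.0
SOUND_FRAME_RATE = 21.53     # sound frames per second (11 kHz audio)

def resample_video(face_params, n_sound_frames):
    """Pick, for each sound frame, the temporally nearest video frame."""
    t_sound = np.arange(n_sound_frames) / SOUND_FRAME_RATE
    idx = np.clip(np.round(t_sound * VIDEO_FPS).astype(int),
                  0, len(face_params) - 1)
    return face_params[idx]

def joint_sequence(face_a, sound_a, face_b, sound_b):
    """Joint behaviour: both speakers' face (14-D) and sound (70-D) parameters
    stacked into one vector per time-step, shape (n, 2 * (14 + 70))."""
    n = min(len(sound_a), len(sound_b))
    fa = resample_video(face_a, n)
    fb = resample_video(face_b, n)
    return np.hstack([fa, sound_a[:n], fb, sound_b[:n]])
```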

Generating a response

Unfortunately, the behaviour model is not directly generative, in the sense that it cannot easily be used to produce a sample interactive behaviour. Although the information required is contained in the representation, it is not easily extracted. To rectify this, a Markov chain is superimposed on top of the behaviour prototypes, with transitions defining the ways in which behaviours in the neighbourhood of prototypes may evolve between time-steps. The probability of a transition is estimated from the frequency of the corresponding transitions observed in the training sequences.
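A hedged sketch of such a chain is given below, assuming that the prototypes come from a vector quantisation of the joint vectors and that transition probabilities are estimated by counting transitions between prototype neighbourhoods in the training sequences; the smoothing and the sampling step are illustrative details, not taken from the paper.

```python
# Sketch: Markov chain over behaviour prototypes, estimated from training
# sequences and used to sample the next prototype when generating a response.
import numpy as np

def nearest_prototype(x, prototypes):
    """Index of the prototype closest to the joint vector x."""
    return int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))

def estimate_transitions(sequences, prototypes):
    """Count prototype-to-prototype transitions between successive time-steps."""
    k = len(prototypes)
    counts = np.ones((k, k))            # add-one smoothing so every row is proper
    for seq in sequences:               # seq: array of joint vectors over time
        labels = [nearest_prototype(x, prototypes) for x in seq]
        for a, b in zip(labels[:-1], labels[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def sample_next(current_label, transition_probs, rng=np.random.default_rng()):
    """Generate behaviour by sampling the next prototype from the chain."""
    return rng.choice(len(transition_probs), p=transition_probs[current_label])
```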

Behaviour filter

The Markov chain encapsulates all behaviours seen in the training sequences. Given a large and varied training data set, all the basic action/reaction pairs should be modelled. Any correct behaviour sequence should therefore be able to traverse the Markov chain along its maximum-likelihood path while keeping the error E(F_t, T_t^H) below a fixed threshold.

This behaviour threshold (BT) can be used as a filter for correct behaviour: any incorrect behaviour is detected when the error E(F_t, T_t^H) rises above the threshold.
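The following sketch illustrates one plausible form of such a filter, assuming that a prototype-matching residual stands in for E(F_t, T_t^H) and that BT is tuned on the training data; the numerical threshold and the scoring heuristic are arbitrary stand-ins.

```python
# Sketch of the behaviour filter: each incoming joint vector is matched to its
# most likely prototype given the previous state; if the residual error exceeds
# the behaviour threshold (BT), the observed behaviour is rejected as incorrect.
import numpy as np

BEHAVIOUR_THRESHOLD = 2.5    # BT, assumed to be tuned on the training data

def filter_behaviour(sequence, prototypes, transition_probs):
    """Return True while the sequence can follow the Markov chain within BT."""
    prev = None
    for x in sequence:
        dists = np.linalg.norm(prototypes - x, axis=1)
        if prev is not None:
            # favour prototypes reachable from the previous state (heuristic)
            dists = dists - np.log(transition_probs[prev] + 1e-12)
        label = int(np.argmin(dists))
        error = np.linalg.norm(prototypes[label] - x)   # stand-in for E(F_t, T_t^H)
        if error > BEHAVIOUR_THRESHOLD:
            return False        # incorrect behaviour detected
        prev = label
    return True
```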

Results

We show results from a set of experiments in which the system is trained on simple interactions involving the greetings ‘Hello’, ‘Hi’ and ‘How do you do?’, with associated responses and facial expressions on both sides (the utterances used are shown in Table 1).

Training data is acquired from a pair of cameras and microphone headsets attached directly to the workstations at which the two participants in an interaction are seated. Results can be seen in Figs. 13 and 14.

The timing of the response …

Conclusion

The overall framework linking speech with video within a reactive system has been demonstrated. The real-time synthetic talking-head responds appropriately and with correct timing to simple greetings, with variations in facial expression and intonation.

The approach presented is not intended to deal with anything but simple forms of interaction of the kind shown. Broader use of English would introduce a new dimension of complexity that is beyond its scope.

References (29)

  • N. Johnson et al., Learning the distribution of object trajectories for event recognition, Image and Vision Computing (1996)
  • M. Reiss et al., Storing temporal sequences, Neural Networks (1991)
  • R. Bowden et al., Jeremiah: the face of computer vision (2002)
  • N.M. Brooke et al., Computer graphics animations of talking faces based on stochastic models (1994)
  • J. Cassell et al., Animated conversation: rule-based generation of facial expression, gesture and spoken intonation for multiple conversational agents (1994)
  • T.F. Cootes et al., Active appearance models (1998)
  • E. Cosatto et al., Sample-based synthesis of photo-realistic talking heads (1998)
  • B. De Carolis et al., Behavior planning for a reflexive agent (2001)
  • V.E. Devin, An interactive talking-head, PhD Thesis, School of Computer Studies, University of Leeds, October...
  • H.P. Graf et al., Face analysis for the synthesis of photo-realistic talking heads (2000)
  • O. Hasegawa et al., Realtime synthesis of moving human-like agent in response to user's moving image (1992)
  • O. Hasegawa et al., Real-time parallel and cooperative recognition of facial images for an interactive visual human interface (1994)
  • M. Isard et al., Contour tracking by stochastic propagation of conditional density (1996)
  • T. Jebara et al., Action reaction learning: automatic visual analysis and synthesis of interactive behaviour (1999)