
Speech Communication

Volume 65, November–December 2014, Pages 50-66

Turn-taking, feedback and joint attention in situated human–robot interaction

https://doi.org/10.1016/j.specom.2014.05.005

Highlights

  • We present an experiment where a robot gives route instructions to a human.

  • Humans respond to turn-taking cues (such as filled pauses) in the robot’s speech.

  • Humans can utilize the robot’s gaze for turn-management and disambiguation.

  • Task progression and uncertainty can be inferred from the users’ acknowledgements.

Abstract

In this paper, we present a study where a robot instructs a human on how to draw a route on a map. The human and robot are seated face-to-face with the map placed on the table between them. The user’s and the robot’s gaze can thus serve several simultaneous functions: as cues to joint attention, turn-taking, level of understanding and task progression. We have compared this face-to-face setting with a setting where the robot employs a random gaze behaviour, as well as a voice-only setting where the robot is hidden behind a paper board. In addition to this, we have also manipulated turn-taking cues such as completeness and filled pauses in the robot’s speech. By analysing the participants’ subjective rating, task completion, verbal responses, gaze behaviour, and drawing activity, we show that the users indeed benefit from the robot’s gaze when talking about landmarks, and that the robot’s verbal and gaze behaviour has a strong effect on the users’ turn-taking behaviour. We also present an analysis of the users’ gaze and lexical and prosodic realisation of feedback after the robot instructions, and show that these cues reveal whether the user has yet executed the previous instruction, as well as the user’s level of uncertainty.

Introduction

Conversation can be described as a joint activity between two or more participants, and the ease of conversation relies on a close coordination of actions between them (cf. Clark, 1996). Much research has been devoted to identifying the behaviours that speakers attend to in order to achieve this fine-grained synchronisation. Firstly, any kind of interaction has to somehow manage the coordination of turn-taking. Since it is difficult to speak and listen at the same time, interlocutors take turns speaking, and this turn-taking has to be coordinated (Sacks et al., 1974). Many studies have shown that turn-taking is a complex process in which a number of different verbal and non-verbal behaviours, including gaze, gestures, prosody, syntax and semantics, influence the probability of a speaker change (e.g., Duncan, 1972, Kendon, 1967, Koiso et al., 1998). Secondly, in addition to the coordination of verbal actions, many types of dialogues also include the coordination of task-oriented non-verbal actions. For example, if the interaction involves instructions that need to be carried out, the instruction-giver needs to attend to the instruction-follower’s task progression and level of understanding in order to decide on a future course of action. Thus, when speaking, humans continually evaluate how the listener perceives and reacts to what they say and adjust their future behaviour to accommodate this feedback. Thirdly, speakers also have to coordinate their joint focus of attention. Joint attention is fundamental to efficient communication: it allows people to interpret and predict each other’s actions and prepare reactions to them. For example, joint attention facilitates simpler referring expressions (such as pronouns) by circumscribing a subdomain of possible referents. Thus, speakers need to keep track of the current focus of attention in the discourse (Grosz and Sidner, 1986). In the case of situated face-to-face interaction, this entails keeping track of possible referents in the verbal interaction as well as in the shared visual scene (Velichkovsky, 1995).

Until recently, most computational models of spoken dialogue have neglected the physical space in which the interaction takes place, and employed a very simplistic model of turn-taking and feedback, where participants take turns with noticeable pauses in between. While these assumptions simplify processing, they fail to account for the complex coordination of actions in human–human interaction outlined above. However, researchers have now started to develop more fine-grained models of dialogue processing (Schlangen and Skantze, 2011), which, for example, make it possible for the system to give more timely feedback (e.g. Meena et al., 2013). There are also recent studies on how to model the situation in which the interaction takes place, in order to manage several users talking to the system at the same time (Bohus and Horvitz, 2010, Al Moubayed et al., 2013), as well as references to objects in the shared visual scene (Kennington et al., 2013).

These advances in incremental processing and situated interaction will allow future conversational systems to be endowed with more human-like models for turn-taking, feedback and joint attention. However, as conversational systems become more human-like, it is not clear to what extent users will pick up on behavioural cues and respond to the system in the same way as they would with a human interlocutor. In the present study we address this question. We present an experiment where a robot instructs a human on how to draw a route on a map, similar to a Map Task (Anderson et al., 1991), as shown in Fig. 1. The human and robot are seated face-to-face with a large printed map on the table between them. In addition, the user has a digital version of the map presented on a screen and is given the task of drawing, with a digital pen, the route that the robot describes. However, the landmarks on the user’s screen are blurred and therefore the user also needs to look at the large map in order to identify the landmarks. This map thereby constitutes a target of joint attention.

A schematic illustration of how speech and gaze could be used in this setting for coordinating turn-taking, task execution and attention (according to studies on human–human interaction) is shown in Fig. 2. In the first part of the robot’s instruction, the robot makes an ambiguous reference to a landmark (“the tower”), but since the referring expression is accompanied by a glance towards the landmark on the map, the user can disambiguate it. At the end of the first part, the robot (for some reason) needs to make a pause. Since the execution of the instruction is dependent on the second part of the instruction, the robot produces turn-holding cues (e.g., gazes down and/or produces a filled pause) that inhibit the user from starting to draw and/or taking the turn. After the second part, the robot instead produces turn-yielding cues (e.g., gazes up and/or produces a syntactically complete phrase) which encourage the user to react. After executing the instruction, the user gives an acknowledgement (“yeah”) that informs the robot that the instruction has been understood and executed. The user’s and the robot’s gaze can thus serve several simultaneous functions: as cues to disambiguate which landmarks are currently under discussion, but also as cues to turn-taking, level of understanding and task progression.
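To make this coordination scheme concrete, the following sketch shows how the cues described above could be combined in a simple rule-based generator. It is a minimal, hypothetical illustration in Python, not the system used in the study; the segment attributes, cue labels and example utterances are assumptions made purely for clarity.

    # Hypothetical sketch of the turn-taking cue logic illustrated in Fig. 2.
    # Not the actual system used in the study; all names are illustrative.
    from dataclasses import dataclass

    @dataclass
    class InstructionSegment:
        text: str           # the spoken part of the instruction
        is_complete: bool   # is the phrase syntactically complete?
        more_to_come: bool  # does another segment follow after the pause?

    def plan_pause_cues(segment: InstructionSegment) -> dict:
        """Choose gaze and speech cues for the pause that follows a segment."""
        if segment.more_to_come or not segment.is_complete:
            # Turn-holding: look down at the map and fill the pause,
            # discouraging the user from drawing or taking the turn.
            return {"gaze": "down_at_map", "filler": "ehm", "invite_user_action": False}
        # Turn-yielding: look up at the user and leave the pause unfilled,
        # inviting the user to execute the instruction and acknowledge it.
        return {"gaze": "up_at_user", "filler": None, "invite_user_action": True}

    # The two-part instruction sketched in Fig. 2 (wording invented here).
    first_part = InstructionSegment("continue towards the tower", is_complete=True, more_to_come=True)
    second_part = InstructionSegment("and then turn left", is_complete=True, more_to_come=False)

    print(plan_pause_cues(first_part))   # turn-holding cues during the mid-instruction pause
    print(plan_pause_cues(second_part))  # turn-yielding cues; expect drawing and a "yeah"

Under these assumptions, the first segment triggers gaze-down and a filled pause (turn-holding), while the second triggers gaze-up and silence (turn-yielding), mirroring the schematic behaviour described above.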

In this study, we pose the questions: Will humans pick up on and produce these coordination cues, even though they are talking to a robot? If so, will this improve the interaction, and if so, how? To answer these questions, we have systematically manipulated the way the robot produces turn-taking cues. We have also compared the face-to-face setting described above with a setting where the robot employs a random gaze behaviour, as well as a voice-only setting where the robot is hidden behind a paper board. This way, we can explore what the contributions of a face-to-face setting really are, and whether they can be explained by the robot’s gaze behaviour or by the presence of a face per se. A data collection with 24 subjects interacting with the system resulted in a rich multimodal corpus. We then analysed a wide range of measures: the users’ verbal feedback responses (including prosody and lexical choice), gaze behaviour, drawing activity, subjective ratings and objective measures of task success.

The article is structured as follows. We start by reviewing previous research related to joint attention, turn-taking and feedback in human–human and human–machine interaction in Section 2. The review of these areas ends with five specific research questions that we have addressed in this study. We then describe the experimental setup, data collection and analysis in Section 3. The results from the analysis are presented in Section 4, together with a discussion of how the results answer our five research questions. We then end with conclusions and a discussion of future work in Section 5.

Section snippets

Joint attention

In situated interaction, speakers naturally look at the objects which are under discussion. The speaker’s gaze can therefore be used by the listener as a cue to the speaker’s current focus of attention. Speakers seem to be aware of this fact, since they naturally use deictic expressions accompanied by a glance towards the object that is being referred to (Clark and Krych, 2004). In the same way, listeners naturally look at the referent during speech comprehension (Allopenna et al., 1998), and

Human–robot Map Task data

In order to address the questions outlined above, we collected a multimodal corpus of human–robot Map Task interactions. Map Task is a well-established experimental paradigm for collecting data on human–human dialogue (Anderson et al., 1991). Typically, an instruction-giver has a map with landmarks and a route, and is given the task of describing this route to an instruction-follower, who has a similar map but without the route drawn on it. In a previous study, Skantze (2012) used this paradigm

Joint attention

We first address Question 1: Can the user utilize the gaze of a back-projected robot face in order to disambiguate ambiguous referring expressions in an ongoing dialogue?

By comparing the main conditions in Group A (Face/Consistent vs. NoFace) with Group B (Face/Random vs. NoFace), and measuring what effect they have on the users’ subjective ratings, task completion and drawing activity, we can investigate whether the users utilised the gaze of the robot in order to disambiguate ambiguous

Conclusions and future work

In this study, we have investigated to what extent humans produce and respond to human-like coordination cues in face-to-face interaction with a robot. We did this by conducting an experiment where a robot gives instructions to a human in a Map Task-like scenario. By manipulating the robot’s gaze behaviour and whether the user could see the robot or not, we were able to investigate how the face-to-face setting affects the interaction. By manipulating the robot’s turn-taking cues during pauses,

Acknowledgements

Gabriel Skantze is supported by the Swedish Research Council (VR) project Incremental processing in multimodal conversational systems (2011-6237). Anna Hjalmarsson is supported by the Swedish Research Council (VR) project Classifying and deploying pauses for flow control in conversational systems (2011-6152). Catharine Oertel is supported by GetHomeSafe (EU 7th Framework STREP 288667).

References (68)

  • M. Staudte et al., Investigating joint attention mechanisms through spoken human–robot interaction. Cognition (2011)

  • N. Ward et al., A study in responsiveness in spoken dialog. Int. J. Hum. Comput. Stud. (2003)

  • S. Al Moubayed et al., The Furhat back-projected humanoid head – lip reading, gaze and multiparty interaction. Int. J. Humanoid Rob. (2013)

  • Allen, J.F., Core, M., 1997. Draft of DAMSL: Dialog act Markup in Several Layers. Unpublished...

  • J. Allwood et al., On the semantics and pragmatics of linguistic feedback. J. Semantics (1992)

  • A. Anderson et al., The HCRC map task corpus. Lang. Speech (1991)

  • S. Baron-Cohen, The eye direction detector (EDD) and the shared attention mechanism (SAM): two cases for evolutionary psychology

  • J. Bavelas et al., Listener responses as a collaborative process: the role of gaze. J. Commun. (2002)

  • P. Boersma, Praat, a system for doing phonetics by computer. Glot Int. (2001)

  • Bohus, D., Horvitz, E., 2010. Facilitating Multiparty Dialog with Gaze, Gesture, and Speech. In: Proc. ICMI’10. Beijing,...

  • J.D. Boucher et al., I reach faster when I see you look: gaze effects in human–human and human–robot face-to-face cooperation. Front. Neurorobotics (2012)

  • Boye, J., 2007. Dialogue management for automatic troubleshooting and other problem-solving applications. In:...

  • Boye, J., Fredriksson, M., Götze, J., Gustafson, J., Königsmann, J., 2012. Walk this Way: Spatial Grounding for City...

  • E. Boyle et al., The effects of visibility on dialogue and performance in a cooperative problem solving task. Lang. Speech (1994)

  • Buschmeier, H., Kopp, S., 2011. Towards conversational agents that attend to and adapt to communicative user feedback....

  • Buschmeier, H., Baumann, T., Dosch, B., Kopp, S., Schlangen, D., 2012. Combining incremental language generation and...

  • Cathcart, N., Carletta, J., Klein, E., 2003. A shallow model of backchannel continuers in spoken dialogue. In: 10th...

  • H.H. Clark, Using Language (1996)

  • H.H. Clark et al., Definite reference and mutual knowledge

  • S. Duncan, Some signals and rules for taking speaking turns in conversations. J. Pers. Soc. Psychol. (1972)

  • J. Edlund et al., MushyPeek – a framework for online investigation of audiovisual dialogue phenomena. Lang. Speech (2009)

  • B.J. Grosz et al., Attention, intentions, and the structure of discourse. Comput. Linguist. (1986)

  • M. Hall et al., The WEKA data mining software: an update. SIGKDD Explor. (2009)

  • Hjalmarsson, A., Oertel, C., 2012. Gaze direction as a back-channel inviting cue in dialogue. In: Proc. of the IVA 2012...