Turn-taking, feedback and joint attention in situated human–robot interaction
Introduction
Conversation can be described as a joint activity between two or more participants, and the ease of conversation relies on a close coordination of actions between them (cf. Clark, 1996). Much research has been devoted to identifying the behaviours that speakers attend to in order to achieve this fine-grained synchronisation. Firstly, any kind of interaction has to somehow manage the coordination of turn-taking. Since it is difficult to speak and listen at the same time, interlocutors take turns speaking, and this turn-taking has to be coordinated (Sacks et al., 1974). Many studies have shown that turn-taking is a complex process in which a number of different verbal and non-verbal behaviours, including gaze, gestures, prosody, syntax and semantics, influence the probability of a speaker change (e.g., Duncan, 1972, Kendon, 1967, Koiso et al., 1998). Secondly, in addition to the coordination of verbal actions, many types of dialogue also involve the coordination of task-oriented non-verbal actions. For example, if the interaction involves instructions that need to be carried out, the instruction-giver needs to attend to the instruction-follower’s task progression and level of understanding in order to decide on a future course of action. Thus, when speaking, humans continually evaluate how the listener perceives and reacts to what they say, and adjust their future behaviour to accommodate this feedback. Thirdly, speakers also have to coordinate their joint focus of attention. Joint attention is fundamental to efficient communication: it allows people to interpret and predict each other’s actions and prepare reactions to them. For example, joint attention facilitates simpler referring expressions (such as pronouns) by circumscribing a subdomain of possible referents. Thus, speakers need to keep track of the current focus of attention in the discourse (Grosz and Sidner, 1986).
In the case of situated face-to-face interaction, this entails keeping track of possible referents in the verbal interaction as well as in the shared visual scene (Velichkovsky, 1995).
Until recently, most computational models of spoken dialogue have neglected the physical space in which the interaction takes place and employed a very simplistic model of turn-taking and feedback, where participants take turns speaking with noticeable pauses in between. While these assumptions simplify processing, they fail to account for the complex coordination of actions in human–human interaction outlined above. However, researchers have now started to develop more fine-grained models of dialogue processing (Schlangen and Skantze, 2011), which, for example, make it possible for the system to give more timely feedback (e.g., Meena et al., 2013). There are also recent studies on how to model the situation in which the interaction takes place, in order to manage several users talking to the system at the same time (Bohus and Horvitz, 2010, Al Moubayed et al., 2013), as well as references to objects in the shared visual scene (Kennington et al., 2013).
These advances in incremental processing and situated interaction will allow future conversational systems to be endowed with more human-like models for turn-taking, feedback and joint attention. However, as conversational systems become more human-like, it is not clear to what extent users will pick up on behavioural cues and respond to the system in the same way as they would with a human interlocutor. In the present study we address this question. We present an experiment where a robot instructs a human on how to draw a route on a map, similar to a Map Task (Anderson et al., 1991), as shown in Fig. 1. The human and robot are placed face-to-face with a large printed map on the table between them. In addition, the user has a digital version of the map presented on a screen and is given the task of drawing, with a digital pen, the route that the robot describes. However, the landmarks on the user’s screen are blurred, and therefore the user also needs to look at the large map in order to identify the landmarks. This map thereby constitutes a target of joint attention.
A schematic illustration of how speech and gaze could be used in this setting for coordinating turn-taking, task execution and attention (according to studies on human–human interaction) is shown in Fig. 2. In the first part of the robot’s instruction, the robot makes an ambiguous reference to a landmark (“the tower”), but since the referring expression is accompanied by a glance towards the landmark on the map, the user can disambiguate it. At the end of the first part, the robot (for some reason) needs to make a pause. Since the execution of the instruction is dependent on the second part of the instruction, the robot produces turn-holding cues (e.g., gazes down and/or produces a filled pause) that inhibit the user from starting to draw and/or taking the turn. After the second part, the robot instead produces turn-yielding cues (e.g., gazes up and/or produces a syntactically complete phrase) which encourage the user to react. After executing the instruction, the user gives an acknowledgement (“yeah”) that informs the robot that the instruction has been understood and executed. The user’s and the robot’s gaze can thus serve several simultaneous functions: as cues to disambiguate which landmarks are currently under discussion, but also as cues to turn-taking, level of understanding and task progression.
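The gating logic of this schematic can be summarised in code. The following is a minimal sketch for illustration only, not the system's actual implementation; the cue names and function are invented here:

```python
def expected_user_action(robot_gaze: str,
                         utterance_complete: bool,
                         filled_pause: bool) -> str:
    """Predict the user's reaction from the robot's turn-taking cues.

    Turn-holding cues (gaze down, filled pause, syntactically incomplete
    phrase) should inhibit the user from drawing or taking the turn;
    turn-yielding cues (gaze up, complete phrase) should encourage the
    user to act and acknowledge.
    """
    holding = (robot_gaze == "down") or filled_pause or not utterance_complete
    if holding:
        return "wait"  # user withholds drawing and the turn
    return "act"       # user draws and/or gives feedback ("yeah")

# Mid-instruction pause with turn-holding cues:
expected_user_action("down", False, True)   # -> "wait"
# End of instruction with turn-yielding cues:
expected_user_action("up", True, False)     # -> "act"
```

The point of the sketch is that any single holding cue suffices to inhibit the user, mirroring the claim that gaze, filled pauses and syntactic completeness each independently influence turn-taking.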
In this study, we pose the following questions: Will humans pick up on and produce these coordination cues, even though they are talking to a robot? If so, will this improve the interaction, and if so, how? To answer these questions, we have systematically manipulated the way the robot produces turn-taking cues. We have also compared the face-to-face setting described above with a setting where the robot employs a random gaze behaviour, as well as a voice-only setting where the robot is hidden behind a paper board. This way, we can explore what the contributions of a face-to-face setting really are, and whether they can be explained by the robot’s gaze behaviour or the presence of a face per se. A data collection with 24 subjects interacting with the system has resulted in a rich multimodal corpus. We have then analysed a wide range of measures: the users’ verbal feedback responses (including prosody and lexical choice), gaze behaviour, drawing activity, subjective ratings and objective measures of task success.
The article is structured as follows. We start by reviewing previous research related to joint attention, turn-taking and feedback in human–human and human–machine interaction in Section 2. The review of these areas ends with five specific research questions that we have addressed in this study. We then describe the experimental setup, data collection and analysis in Section 3. The results from the analysis are presented in Section 4, together with a discussion of how the results answer our five research questions. We end with conclusions and a discussion of future work in Section 5.
Joint attention
In situated interaction, speakers naturally look at the objects which are under discussion. The speaker’s gaze can therefore be used by the listener as a cue to the speaker’s current focus of attention. Speakers seem to be aware of this fact, since they naturally use deictic expressions accompanied by a glance towards the object that is being referred to (Clark and Krych, 2004). In the same way, listeners naturally look at the referent during speech comprehension (Allopenna et al., 1998), and
Human–robot Map Task data
In order to address the questions outlined above, we collected a multimodal corpus of human–robot Map Task interactions. Map Task is a well-established experimental paradigm for collecting data on human–human dialogue (Anderson et al., 1991). Typically, an instruction-giver has a map with landmarks and a route, and is given the task of describing this route to an instruction-follower, who has a similar map but without the route drawn on it. In a previous study, Skantze (2012) used this paradigm
Joint attention
We first address Question 1: Can the user utilize the gaze of a back-projected robot face in order to disambiguate ambiguous referring expressions in an ongoing dialogue?
By comparing the main conditions in Group A (Face/Consistent vs. NoFace) with Group B (Face/Random vs. NoFace), and measuring what effect they have on the users’ subjective ratings, task completion and drawing activity, we can investigate whether the users utilised the gaze of the robot in order to disambiguate ambiguous
Conclusions and future work
In this study, we have investigated to what extent humans produce and respond to human-like coordination cues in face-to-face interaction with a robot. We did this by conducting an experiment where a robot gives instructions to a human in a Map Task-like scenario. By manipulating the robot’s gaze behaviour and whether the user could see the robot or not, we were able to investigate how the face-to-face setting affects the interaction. By manipulating the robot’s turn-taking cues during pauses,
Acknowledgements
Gabriel Skantze is supported by the Swedish Research Council (VR) project Incremental processing in multimodal conversational systems (2011-6237). Anna Hjalmarsson is supported by the Swedish Research Council (VR) project Classifying and deploying pauses for flow control in conversational systems (2011-6152). Catharine Oertel is supported by GetHomeSafe (EU 7th Framework STREP 288667).
References (68)
- Allopenna et al. (1998). Tracking the time course of spoken word recognition using eye movements: evidence for continuous mapping models. J. Mem. Lang.
- Clark and Krych (2004). Speaking while monitoring addressees for understanding. J. Mem. Lang.
- Benefits and challenges of real-time uncertainty detection and adaptation in a spoken dialogue computer tutor. Speech Commun. (2011)
- Turn-taking cues in task-oriented dialogue. Comput. Speech Lang. (2011)
- Pauses, gaps and overlaps in conversations. J. Phonetics (2010)
- The additive effect of turn-taking cues in human and synthetic voice. Speech Commun. (2011)
- Kendon (1967). Some functions of gaze direction in social interaction. Acta Psychol.
- Infants understand the referential nature of human gaze but not robot gaze. J. Exp. Child Psychol. (2013)
- Understanding by addressees and overhearers. Cogn. Psychol. (1989)
- Towards incremental speech generation in conversational systems. Comput. Speech Lang. (2013)
- Investigating joint attention mechanisms through spoken human–robot interaction. Cognition
- A study in responsiveness in spoken dialog. Int. J. Hum. Comput. Stud.
- Al Moubayed et al. (2013). The Furhat back-projected humanoid head – lip reading, gaze and multiparty interaction. Int. J. Humanoid Rob.
- On the semantics and pragmatics of linguistic feedback. J. Semantics
- Anderson et al. (1991). The HCRC Map Task corpus. Lang. Speech
- The eye direction detector (EDD) and the shared attention mechanism (SAM): two cases for evolutionary psychology
- Listener responses as a collaborative process: the role of gaze. J. Commun.
- Praat, a system for doing phonetics by computer. Glot Int.
- I reach faster when I see you look: gaze effects in human–human and human–robot face-to-face cooperation. Front. Neurorobotics
- The effects of visibility on dialogue and performance in a cooperative problem solving task. Lang. Speech
- Clark (1996). Using Language
- Definite reference and mutual knowledge
- Duncan (1972). Some signals and rules for taking speaking turns in conversations. J. Pers. Soc. Psychol.
- MushyPeek – a framework for online investigation of audiovisual dialogue phenomena. Lang. Speech
- Grosz and Sidner (1986). Attention, intentions, and the structure of discourse. Comput. Linguist.
- The WEKA data mining software: an update. SIGKDD Explor.