Elsevier

Speech Communication

Volume 53, Issue 1, January 2011, Pages 23-35
The additive effect of turn-taking cues in human and synthetic voice

https://doi.org/10.1016/j.specom.2010.08.003

Abstract

A previous line of research suggests that interlocutors identify appropriate places to speak by cues in the behaviour of the preceding speaker. If used in combination, these cues have an additive effect on listeners’ turn-taking attempts. The present study further explores these findings by examining the effect of such turn-taking cues experimentally. The objective is to investigate the possibilities of generating turn-taking cues with a synthetic voice. Thus, in addition to stimuli realized with a human voice, the experiment included dialogues where one of the speakers is replaced with a synthetic voice. The turn-taking cues investigated include intonation, phrase-final lengthening, semantic completeness, stereotyped lexical expressions and non-lexical speech production phenomena such as lexical repetitions, breathing and lip-smacks. The results show that the turn-taking cues realized with a synthetic voice affect the judgements in a similar way to the corresponding human version, and there is no difference in reaction times between these two conditions. Furthermore, the results support Duncan’s findings: the more turn-taking cues with the same pragmatic function, turn-yielding or turn-holding, the higher the agreement among subjects on the expected outcome. In addition, the number of turn-taking cues affects the reaction times for these decisions: the more cues, the faster the reaction time.

Research highlights

► The more turn-taking cues, the higher the inter-annotator agreement.
► The more turn-taking cues, the faster the judgment reaction time.
► No differences were found between synthetic and human voice.

Introduction

At the Department of Speech, Music and Hearing, KTH, we currently do research in the area of human-like dialogue systems. The motivation is to allow users to interact with a system in a way that is similar to interacting with a human dialogue partner (cf. Edlund et al., 2008). One crucial aspect of these systems is to control the flow of dialogue contributions between the system and the user. Very few dialogue systems use sophisticated methods to manage turn-taking. These systems are generally poor both at detecting the ends of users’ turns and at generating appropriate turn-taking behaviour to help users discriminate momentary pauses from ends of turns. A frequently used strategy is to interpret long silences as ends of turns. Whereas silence is an explicit, unambiguous indication that a speaker is momentarily not vocalizing, it is a crude detector of ends of turns because pause length within turns varies. For dialogue systems in English, the silence threshold for end-of-turn detection has been reported to range between 0.5 and 1 s (Ferrer et al., 2002). Yet, analyses of spontaneous dialogue in French show that silences within turns (pauses) may be longer than 1 s (cf. Campione and Veronis, 2002). Moreover, Weilhammer and Rabold (2003) found that the mean duration of silences between turns (gaps) in spontaneous face-to-face conversation in American English was 380 ms, which is shorter than 0.5 s. Consequently, if we use a silence threshold (0.5–1 s) to detect ends of turns, we end up with a system that has a longer mean response time than humans, but which still risks interrupting its users.
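The trade-off described above can be made concrete with a minimal sketch of a silence-threshold detector. The function names and the list of within-turn pause durations are hypothetical and chosen only for illustration; the only values taken from the text are the 0.5–1 s threshold range and the 380 ms mean gap reported by Weilhammer and Rabold (2003).

```python
# Minimal sketch of silence-threshold end-of-turn detection.
# All names and the example pause durations are illustrative assumptions.

HUMAN_MEAN_GAP_S = 0.38  # mean between-turn gap cited in the text (380 ms)

# Hypothetical within-turn pause durations in seconds (not data from the study).
EXAMPLE_PAUSES_S = [0.2, 0.4, 0.6, 0.9, 1.2]


def is_end_of_turn(silence_s: float, threshold_s: float) -> bool:
    """Treat a silence as an end of turn once it reaches the threshold."""
    return silence_s >= threshold_s


def interrupted_pauses(within_turn_pauses_s, threshold_s: float) -> int:
    """Count within-turn pauses that the detector would mistake for turn ends."""
    return sum(1 for p in within_turn_pauses_s if is_end_of_turn(p, threshold_s))


def summarize(threshold_s: float, pauses_s=EXAMPLE_PAUSES_S) -> dict:
    """Report both failure modes of a fixed threshold."""
    return {
        "threshold_s": threshold_s,
        # The system cannot respond before the threshold has elapsed.
        "slower_than_human_gap": threshold_s > HUMAN_MEAN_GAP_S,
        "interrupted_pauses": interrupted_pauses(pauses_s, threshold_s),
    }


if __name__ == "__main__":
    for threshold in (0.5, 1.0):
        print(summarize(threshold))
```

With these example pauses, both ends of the reported threshold range respond more slowly than the human mean gap while still cutting into at least one within-turn pause, which is the problem the paragraph points out.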

Apart from using silence for end-of-turn detection in spoken dialogue systems, one frequent strategy is to signal turn-taking artificially, as for example in push-to-talk systems, where the user takes and maintains the turn explicitly by pushing a button. However, while push-to-talk has been shown to be an efficient strategy for improving task completion, the extra element of pushing a button appears to affect the way users interact with the system. For example, Fernández et al. (2007) found that, compared to free turn-taking, push-to-talk resulted in longer turns and less positive feedback. Allowing users to interact freely without artificial devices such as a button may not be necessary for building successful spoken dialogue systems, but it is crucial if we want to build dialogue systems that interact with their users in a human-like manner.

Humans generate speech incrementally and on-line as the dialogue progresses, using information from several different sources (Kilger and Finkler, 1995). We start to plan new contributions before the other person has stopped speaking. When starting to speak, we typically do not have a complete plan of what to say, yet we manage to integrate information from different sources in parallel. Occasionally we need to hesitate and revise our speech as we go along. As a consequence, speech is not generated at a regular, constant pace of vocalized segments, but in streams of fragments of varying sizes (Butterworth, 1975). These irregularities in pause duration and turn length suggest that interlocutors cannot use silence duration to discriminate momentary pauses from ends of turns. An early theory of turn-taking suggests that speakers identify appropriate places to speak by attending to various behavioural cues or signals in the message of the preceding speaker (cf. Duncan, 1972; Duncan and Fiske, 1977). According to Duncan (1972, p. 283): “The proposed turn-taking mechanism is mediated through signals composed of clear-cut behavioural cues, considered to be perceived as discrete”. Duncan explored such turn-taking cues in a corpus of face-to-face dialogues in American English. Correlation analyses of these data show that the number of available turn-yielding signals is linearly correlated with listeners’ turn-taking attempts. When several signals are used in combination, there appears to be an additive effect. However, when speakers employed signals to suppress such attempts, the number of turn-taking attempts radically decreased, regardless of the number of turn-yielding signals.

The present study further explores Duncan’s hypothesis by examining the effect of turn-taking cues experimentally. The objective is to investigate the possibilities of generating turn-taking cues with a synthetic voice. Whereas the focus is on how to communicate appropriate places for interlocutors to take the turn, the results also have implications for end of turn detection. The experiment is set up as a game, designed to extract judgements based on first intuition rather than afterthought. The stimuli were dyadic dialogues played to the listeners as continuous dialogue segments. The motivation behind this setup was to present the dialogue segments in chronological order, which is how they are perceived in their original setting. The experimental design allows us to collect data from naïve users in a controlled experimental setting.

Section snippets

Previous work

The aim of the present study is to explore the effect of various behaviours that regulate the flow of interaction in dialogue. The existence of such cues is based on the assumption that listeners attend to interactional cues in dialogue. Such cues are verbal and non-verbal behaviours that pragmatically affect the conversation. For example, Clark (2002) claims that dialogue phenomena such as repeats, repairs, fillers and prolonged syllables are strategies used by speakers to synchronize their

Method

The aim of this work is to investigate experimentally how turn-taking cues form a complex signal and affect listeners’ expectations of turn-taking behaviour in dialogue. The cues are investigated in a perception experiment where subjects listen to dyadic dialogues in chronological order and try to anticipate whether a token will be followed by a speaker change or not. In line with Duncan’s findings, our hypothesis is that, the more turn-taking cues with a particular pragmatic function −

Data preparation

As pointed out by Oliveira and Freitas (2008), manipulating dialogues off-line and analyzing these out of context can be problematic since this may result in stimuli that never would occur in a real dialogue setting. To tackle this problem, the experiment was designed to allow subjects to follow longer dialogue segments chronologically.

Experiment

The experiment included four dialogue segments from four different dialogues. The segments were between 116 and 166 s long. The dialogues were dyadic dialogues with three different speakers, one male and two female. The male speaker (S1) participated in all four dialogues and the two female speakers (S2 and S3) in two dialogues each. In the experiment, the recording stops playing immediately after a target inter-pausal unit (IPU), allowing the subjects to make a judgement. Each subject listened to two human–human

Results

This section analyzes the effects of individual as well as combined sets of turn-taking cues. First, we present results on the individual cues. The motive is to investigate whether these behaviours affect the subjects’ judgements as hypothesised.
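As an illustration of how agreement among subjects could be related to the number of cues, the following sketch groups trials by cue count and averages the share of subjects who gave the majority response. The data layout, the labels "change" and "hold", the function name and the majority-share measure are all assumptions made for this example; they are not the paper's actual analysis.

```python
from collections import defaultdict


def agreement_by_cue_count(trials):
    """Mean majority-share agreement per number of cues.

    trials: iterable of (n_cues, judgements), where judgements is a list of
    the hypothetical labels "change" (speaker change expected) or "hold".
    """
    grouped = defaultdict(list)
    for n_cues, judgements in trials:
        if not judgements:
            continue  # skip trials without any responses
        majority = max(judgements.count("change"), judgements.count("hold"))
        grouped[n_cues].append(majority / len(judgements))
    # Average the per-trial agreement within each cue-count group.
    return {n: sum(v) / len(v) for n, v in sorted(grouped.items())}


if __name__ == "__main__":
    # Made-up judgements for illustration only, not data from the study.
    example_trials = [
        (1, ["change", "hold", "change", "hold"]),
        (1, ["hold", "hold", "hold", "change"]),
        (3, ["change", "change", "change", "change"]),
    ]
    print(agreement_by_cue_count(example_trials))
```

Under the hypothesis stated in the abstract, the agreement values would be expected to rise with the cue count; the example data are constructed to show that pattern and carry no empirical weight.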

Conclusion

Duncan (1972) has previously shown that a number of verbal and non-verbal behaviours affect turn taking in dialogue. If used in combination, the number of turn-taking cues is linearly correlated with listeners’ turn-taking attempts. The present study further explores these findings by examining the effect of such turn-taking cues experimentally. The objective of the present study was to investigate the possibilities of generating turn-taking cues with a synthetic voice. In order to explore this

Acknowledgements

This research was carried out at the Centre for Speech Technology, KTH. The research is also supported by the Swedish Research Council Project #2007-6431, GENDIAL. Many thanks to Rolf Carlson, Jens Edlund, Joakim Gustafson, Mattias Heldner, Julia Hirschberg and Gabriel Skantze for valuable comments and help with the annotation of data. Many thanks also to the reviewers for valuable comments that helped to improve the paper.

References (38)

  • Clark, H., 2002. Speaking in time. Speech Commun.
  • Cutler, A., et al. On the analysis of prosodic turn-taking cues.
  • de Ruiter, J.P., et al., 2006. Projecting the end of a speaker’s turn: a cognitive cornerstone of conversation. Language.
  • Duncan, S., et al., 1977. Face-to-Face Interaction: Research, Methods and Theory.
  • Duncan, S., 1972. Some signals and rules for taking speaking turns in conversations. Journal of Personality and Social Psychology.
  • Edlund, J., et al., 2005. Exploring prosody in interaction control. Phonetica.
  • Fernández, R., Schlangen, D., Lucht, T., 2007. Push-to-talk ain’t always bad! Comparing different interactivity...
  • Ferrer, L., Shriberg, E., Stolcke, A., 2002. Is the speaker done yet? Faster and more accurate end-of-utterance...
  • Ferrer, L., Shriberg, E., Stolcke, A., 2003. A prosody-based approach to end-of-utterance detection that does not...