Computer Speech & Language

Volume 34, Issue 1, November 2015, Pages 275-291

Conversational system for information navigation based on POMDP with user focus tracking

https://doi.org/10.1016/j.csl.2015.01.003

Highlights

  • We address a spoken dialogue system which conducts information navigation.

  • We formulate the problem of dialogue management as module selection with a POMDP.

  • The reward function of the POMDP is defined by the quality of interaction.

  • The POMDP tracks the user's focus of attention to select appropriate actions.

  • The proposed model outperformed conventional systems that do not use focus information.

Abstract

We address a spoken dialogue system which conducts information navigation in the style of small talk. The system uses Web news articles as an information source, and the user can receive information about the news of the day through interaction. The goal and procedure of this kind of dialogue are not well defined. An empirical approach based on a partially observable Markov decision process (POMDP) has recently been widely used for dialogue management, but it assumes a definite task goal and information slots, which does not hold in our application system. In this work, we formulate the problem of dialogue management as a selection of modules and optimize it with a POMDP by tracking the dialogue state and focus of attention. The POMDP-based dialogue manager receives a user intention that is classified by a spoken language understanding (SLU) component based on logistic regression (LR). The manager also receives a user focus that is detected by the SLU component based on conditional random fields (CRFs). These dialogue states are used for selecting appropriate modules by a policy function, which is optimized by reinforcement learning. The reward function is defined by the quality of interaction to encourage long interaction of information navigation with users. The module which responds to user queries is based on a similarity of predicate-argument (P-A) structures that are automatically defined from a domain corpus. It allows for flexible response generation even if the system cannot find information exactly matching the user query. The system also proactively presents information by following the user focus and retrieving a news article based on the similarity measure, even if the user does not make any utterance. Experimental evaluations with real dialogue sessions demonstrate that the proposed system outperformed the conventional rule-based system in terms of dialogue state tracking and action selection. The effect of focus detection in the POMDP framework is also confirmed.

Introduction

In the past decades, a large number of spoken dialogue systems have been investigated. Many systems are now deployed in the real world, most typically as smartphone applications, which interact with a diversity of users. In the future, interactive robots will be deployed as communication partners for users. However, a large majority of current applications, such as weather information systems (Zue et al., 2000) and train information systems (Aust et al., 1995, Lamel et al., 2002), are based on a specific task description which includes a definite task goal and necessary slots, such as place and date, for task completion. Users are required to share and follow these concepts; they need to have a clear task goal and specify it according to the system's capability. Some recent systems incorporate general question-answering capability, but it is usually limited to factoid questions such as “when” or “how tall”, or pre-defined templates such as “what is your name?”. When users ask something beyond the system's capability, the system replies “I can’t answer the question”, or turns to Web search and returns the retrieved list on the display. This kind of dialogue is not natural in interaction with humanoid robots since people want to converse with them beyond simple commands. A user-friendly conversational system should not reply with “I can’t answer the question” even if it cannot find a result exactly matching the user query. Instead, it should present relevant information according to the user's intention and preference. Moreover, robots do not have a display to present documents, so they must make a concise verbal reply.

The goal of this work is a conversational system, using speech only, that can engage in information navigation. By information navigation, we do not assume a specific task goal, but only a domain such as sports or travel. The system should present relevant information even if the user request is not necessarily clear and there is no result matching the user query. Moreover, the system can occasionally present potentially useful information even without the user's explicit request by following the dialogue context. In this work, we design and develop a news navigation system that uses Web news articles as a knowledge source and presents information based on the user's preferences and queries.

There have been several studies in this direction (Kawahara, 2009), but there is no clear principle or established methodology to design and implement casual conversation systems. Dialogue management in this kind of system has usually been designed in a heuristic manner and often based on simple rules (Bratman et al., 1988, Lucas, 2000, Bohus et al., 2003). The Companions project (Catizone et al., 2008, Cavazza et al., 2010) designed conversational agents that would engage elderly users in sustained conversations based on rules. Misu and Kawahara (2007) developed a Kyoto navigation system that conducts question-answering and proactive presentation by defining a topic structure based on Wikipedia articles. The information state approach to dialogue management (Traum and Larsson, 2003, Kronlid and Lager, 2007) allows for dialogue control that puts a topic on hold and returns to it later. WikiTalk (Wilcock, 2012, Wilcock and Jokinen, 2013) is a dialogue system that talks about topics in Wikipedia. It works on a pre-defined scenario represented as an automaton, but it forces users to follow the system's scenario. Moreover, developers need to implement a new scenario for each new domain or task. A data-driven approach based on phrase-based statistical machine translation (SMT) (Ritter et al., 2011) tries to train response generation from micro-blog data. This approach enables the system to output a variety of responses, but it does not track any user intention or dialogue state to fulfil what the user wants to know.

In recent years, machine learning, particularly reinforcement learning (RL), has been investigated for dialogue management. Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs) are the most successful and are now widely used to model and train dialogue managers (Roy et al., 2000, Levin et al., 2000, Williams and Young, 2007, Young et al., 2010, Yoshino et al., 2013b). However, the conventional scheme assumes that the task and dialogue goal are clearly stated and readily encoded in the RL reward function. This is not true for the casual conversation and information navigation addressed in this work.

Some previous work has tackled this problem. Pan et al. (2012) designed a spoken document retrieval system whose goal is satisfaction of the user's information need, and defined rewards by using the structure of the target document set. This is possible only for well-defined document search problems; the strategy requires a structure over the document set and a definition of user demand satisfaction. Shibata et al. (2014) developed a conversational chatting system, which asks users to give an evaluation at the end of each dialogue session to define rewards for reinforcement learning. Meguro et al. (2010) proposed a listening dialogue system in which levels of satisfaction were annotated in the logs of dialogue sessions to train a discriminative model. These approaches require costly input from users or developers, who provide evaluation and supervision labels. In contrast to these approaches, we present a framework in which the reward is defined by the quality of system actions and also encourages long interactions. Moreover, the user focus is tracked to select appropriate actions, which are rewarded more.
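To make this contrast concrete, the following is a minimal sketch of the kind of per-turn reward implied by this framework; the function signature and numeric values are illustrative assumptions, not the reward actually used in Section 5.

```python
# Hypothetical per-turn reward in the spirit described above (illustrative
# values only; the paper's actual reward definition is given in Section 5).
def turn_reward(action_appropriate: bool,
                follows_user_focus: bool,
                ended_dialogue_early: bool) -> float:
    reward = 1.0 if action_appropriate else -1.0  # quality of the system action
    if follows_user_focus:
        reward += 0.5                             # focus-aware actions are rewarded more
    if ended_dialogue_early:
        reward -= 2.0                             # discourage cutting the interaction short
    return reward
```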

Descriptions of the proposed conversational information navigation system are provided in Section 2. In Section 3, details of the dialogue modules based on the predicate-argument (P-A) structure are explained. In Section 4, we describe the spoken language understanding (SLU) modules based on logistic regression (LR) and conditional random fields (CRFs). In Section 5, we give a brief explanation of POMDP and its extension by incorporating user focus. Experimental evaluations of the proposed POMDP-based system with dialogue sessions are reported in Section 6.

Section snippets

Task of information navigation

Information navigation does not assume a designed task and goal, but provides useful information according to the user's interests. When the user's demands are not clear, the system clarifies them through interaction. The system presents relevant information even if there is no result exactly matching the user query. Moreover, the system presents potentially useful information even when the user does not make any explicit request.

In natural human–human conversations,

Presentation of relevant information based on P-A structure

In this section, we describe flexible matching of P-A structure on which the proposed question answering (QA) and proactive presentation (PP) modules are based (Yoshino et al., 2011). Text of news articles and user utterances are parsed to extract a P-A structure (an example is shown in Fig. 3). A P-A structure represents a sentence with a predicate, arguments and their semantic role labels (Johansson and Nugues, 2008, Hajič et al., 2009, Matsubayashi et al., 2012). We used the Japanese text
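As a rough illustration of this kind of matching, the sketch below scores the overlap between the P-A structure of a user query and that of a candidate article sentence; the dictionary representation and the weights are assumptions for illustration, not the paper's exact similarity measure.

```python
# Illustrative sketch (not the paper's exact formulation): score the similarity
# between the P-A structure of a user query and that of a candidate news
# sentence by a weighted overlap of the predicate and role-labeled arguments.

def pa_similarity(query_pa, article_pa, predicate_weight=2.0, argument_weight=1.0):
    """Weighted overlap between two P-A structures.

    Each structure is a dict such as
    {"predicate": "win", "arguments": {"A0": "Tigers", "A1": "pennant"}}.
    The weights are hypothetical tuning parameters.
    """
    score = 0.0
    if query_pa["predicate"] == article_pa["predicate"]:
        score += predicate_weight
    for role, filler in query_pa["arguments"].items():
        if article_pa["arguments"].get(role) == filler:
            score += argument_weight
    return score

def retrieve_best_match(query_pa, candidate_pas):
    """Return the candidate P-A structure that best matches the query."""
    return max(candidate_pas, key=lambda pa: pa_similarity(query_pa, pa))
```

Because partial overlaps still score above zero, the best-matching sentence can be returned even when no candidate matches the query exactly, which is the behaviour described above.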

Spoken language understanding (SLU)

In this section, we present the spoken language understanding (SLU) components of our system. These components detect the user's focus and intention and provide them to the dialogue manager. The SLU modules are formulated with statistical models to give likelihoods, which are used in the POMDP.
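A minimal sketch of how such components could be built with off-the-shelf tools is shown below; the features, label names, and training examples are invented for illustration and are not the models described in the paper, which are trained on an annotated domain corpus.

```python
# Illustrative SLU sketch: intention classification with logistic regression
# and per-word focus detection with a linear-chain CRF (toy data only).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import sklearn_crfsuite

# --- User intention classifier (LR); labels are hypothetical ---------------
utterances = ["tell me today's baseball news", "who won the game"]
intentions = ["topic request", "question"]

intention_clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
intention_clf.fit(utterances, intentions)
# predict_proba gives a confidence score usable as the observation probability.
print(intention_clf.predict_proba(["who won the game"]))

# --- User focus detector (CRF); IOB-style labels mark the focus words ------
def token_features(tokens, i):
    return {"word": tokens[i], "is_first": i == 0,
            "prev": tokens[i - 1] if i > 0 else "<BOS>"}

train_tokens = [["who", "won", "the", "game"]]
train_labels = [["O", "O", "O", "B-FOCUS"]]

X = [[token_features(sent, i) for i in range(len(sent))] for sent in train_tokens]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, train_labels)
print(crf.predict_marginals(X))  # per-token label probabilities
```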

Dialogue management for information navigation

The POMDP-based statistical dialogue management is formulated as below. The random variables involved at a dialogue turn t are as follows (a minimal belief-update sketch follows the list):

  • s_t^i ∈ I_s: user state

    User intention.

  • a_t^k ∈ K: system action

    Module that the system selects.

  • o_t^j ∈ I_s: observation

    Observed user state, including ASR and intention analysis errors.

  • P(o|s): observation probability

    Output of SLU with its confidence score, which is defined in Eqs. (10), (12).

  • P(s_{t+1}^j | s_t^i, â_t^k): state transition probability

    Model to predict the next user state s_{t+1}^j
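The belief-update sketch referenced above is given below; the uniform toy distributions stand in for the transition and observation models, which the system estimates from SLU outputs and dialogue data, so this is only an illustration of how the variables fit together.

```python
# Minimal POMDP belief-update sketch for the variables defined above:
#   b_{t+1}(s') ∝ P(o|s') * sum_s P(s'|s, a) * b_t(s)
# The uniform toy distributions below are placeholders, not the paper's models.
import numpy as np

def update_belief(belief, action, observation, transition, observation_prob):
    """belief[s] = b_t(s); transition[a][s, s'] = P(s'|s,a); observation_prob[s, o] = P(o|s)."""
    predicted = transition[action].T @ belief                # sum_s P(s'|s,a) b_t(s)
    updated = observation_prob[:, observation] * predicted   # multiply by P(o|s')
    return updated / updated.sum()

n_states, n_actions, n_obs = 4, 3, 4
belief = np.full(n_states, 1.0 / n_states)
transition = np.full((n_actions, n_states, n_states), 1.0 / n_states)
observation_prob = np.full((n_states, n_obs), 1.0 / n_obs)

belief = update_belief(belief, action=0, observation=2,
                       transition=transition, observation_prob=observation_prob)
print(belief)  # the policy maps this belief (and the user focus) to a module
```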

Experimental evaluations

For evaluation of the system, we collected an additional 626 utterances (12 users, 24 dialogues; 2 dialogues by each user) with the proposed dialogue system with speech input (Yoshino et al., 2013a). There were 58 cases regarded as no request (NR), in which the user did not say anything for longer than 5 seconds. The gold standard was annotated by two annotators. The agreement for the user states was 0.958 and Cohen's kappa (Carletta, 1996) was 0.932. The agreement for the system actions was 0.944 and
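For reference, raw agreement and Cohen's kappa of this kind can be computed as in the sketch below; the label sequences are invented for illustration, not the actual annotations reported above.

```python
# Illustrative computation of raw agreement and Cohen's kappa between two
# annotators (invented labels, not the evaluation data reported above).
from sklearn.metrics import cohen_kappa_score

annotator1 = ["question", "topic request", "silence", "question", "question"]
annotator2 = ["question", "topic request", "silence", "question", "topic request"]

agreement = sum(a == b for a, b in zip(annotator1, annotator2)) / len(annotator1)
kappa = cohen_kappa_score(annotator1, annotator2)
print(f"agreement = {agreement:.3f}, kappa = {kappa:.3f}")
```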

Conclusions

We have designed and implemented a spoken dialogue system for information navigation of Web news articles updated day-by-day. The system presents relevant information according to the user's interest. We have introduced a user focus detection model, and developed a POMDP framework which tracks the user focus to select the appropriate action module of the dialogue system. In the experimental evaluations, the proposed dialogue management approach determines the state of the user more accurately

References (43)

  • M. Cavazza et al.

    How was your day?: a companion ECA

  • B.J. Grosz et al.

    Attention, intentions, and the structure of discourse

    Comput. Linguist.

    (1986)
  • J. Hajič et al.

The CoNLL-2009 shared task: syntactic and semantic dependencies in multiple languages

  • Z.S. Harris

    Methods in Structural Linguistics

    (1951)
  • R. Johansson et al.

Dependency-based syntactic–semantic analysis with PropBank and NomBank

  • T. Kawahara

    New perspectives on spoken language understanding: does machine need to fully understand speech?

  • D. Kawahara et al.

    A fully-lexicalized probabilistic model for Japanese syntactic and case structure analysis

  • K. Komatani et al.

    Flexible mixed-initiative dialogue management using concept-level confidence measures of speech recognizer output

  • F. Kronlid et al.

    Implementing the information-state update approach to dialogue management in a slightly extended SCXML

  • E. Levin et al.

    A stochastic model of human–machine interaction for learning dialog strategies

    IEEE Trans. Speech Audio Process.

    (2000)
  • B. Lucas

    VoiceXML

    Commun. ACM

    (2000)

    This paper has been recommended for acceptance by R.K. Moore.
