Conversational system for information navigation based on POMDP with user focus tracking☆
Introduction
In the past decades, a large number of spoken dialogue systems have been investigated. Many systems are now deployed in the real world, most typically as smartphone applications, which interact with a diversity of users. In the future, interactive robots will be deployed as communication partners for users. However, a large majority of current applications, such as weather information systems (Zue et al., 2000) and train information systems (Aust et al., 1995, Lamel et al., 2002), are based on a specific task description which includes a definite task goal and necessary slots, such as place and date, for task completion. Users are required to share and follow these concepts; they need to have a clear task goal and specify it according to the system's capability. Some recent systems incorporate general question-answering capability, but it is usually limited to factoid questions such as “when” or “how tall”, or pre-defined templates such as “what is your name?”. When users ask something beyond the system's capability, the system replies “I can’t answer the question”, or turns to Web search and returns the retrieval list on the display. This kind of dialogue is not natural in interaction with humanoid robots, since people want to converse with them rather than only issue simple commands. A user-friendly conversational system should not reply with “I can’t answer the question” even if it cannot find a result exactly matching the user query. Instead, it should present relevant information according to the user's intention and preference. Moreover, robots do not have a display to present a document. They must make a concise verbal reply.
The goal of this work is a conversational system that uses speech only and can engage in information navigation. By information navigation, we do not assume a specific task goal, but assume a domain such as sports or travel. The system should present relevant information even if the user request is not necessarily clear and no result matches the user query. Moreover, the system can occasionally present potentially useful information even without the user's explicit request by following the dialogue context. In this work, we design and develop a news navigation system that uses Web news articles as a knowledge source and presents information based on the users’ preferences and queries.
There are several studies in this direction (Kawahara, 2009), but there is neither a clear principle nor an established methodology for designing and implementing casual conversation systems. Dialogue management for this kind of system has usually been done heuristically and often based on simple rules (Bratman et al., 1988, Lucas, 2000, Bohus et al., 2003). The Companions project (Catizone et al., 2008, Cavazza et al., 1630) designed conversational agents that would engage elderly users in sustained conversations based on rules. Misu and Kawahara (2007) developed a Kyoto navigation system that conducts question-answering and proactive presentation by defining a topic structure based on Wikipedia articles. The information state approach to dialogue management (Traum and Larsson, 2003, Kronlid and Lager, 2007) allows dialogue control to put a topic on hold and return to it later. WikiTalk (Wilcock, 2012, Wilcock and Jokinen, 2013) is a dialogue system that talks about topics in Wikipedia. This system works on a pre-defined scenario represented as an automaton, which forces users to follow the system scenario. Moreover, developers need to implement a new scenario for each new domain or task. A data-driven approach based on phrase-based statistical machine translation (SMT) (Ritter et al., 2011) tries to train response generation from micro-blog data. This approach enables the system to output a variety of responses, but it does not track any user intention or dialogue state to provide what the user wants to know.
In the past years, machine learning, particularly reinforcement learning (RL), has been investigated for dialogue management. Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs) are the most successful and are now widely used to model and train dialogue managers (Roy et al., 2000, Levin et al., 2000, Williams and Young, 2007, Young et al., 2010, Yoshino et al., 2013b). However, the conventional scheme assumes that the task and dialogue goal are clearly stated and readily encoded in the RL reward function. This is not true in casual conversation or information navigation addressed in this work.
Some previous work has tackled this problem. Pan et al. (2012) designed a spoken document retrieval system whose goal is to satisfy the user's information need, and defined rewards by using the structure of the target document set. This is possible only for well-defined document search problems, because the strategy requires a structure of the document set and a definition of user demand satisfaction. Shibata et al. (2014) developed a conversational chatting system. It asks users to give an evaluation at the end of each dialogue session to define rewards for reinforcement learning. Meguro et al. (2010) proposed a listening dialogue system. In their work, levels of satisfaction were annotated in the log of dialogue sessions to train a discriminative model. These approaches require costly input from users or developers, who provide evaluation and supervision labels. In contrast to these approaches, we present a framework in which reward is defined for the quality of system actions and also for encouraging long interactions. Moreover, user focus is tracked so that the system takes appropriate actions, which receive higher rewards.
The proposed conversational information navigation system is described in Section 2. In Section 3, details of dialogue modules based on the predicate-argument (P-A) structure are explained. In Section 4, we describe spoken language understanding (SLU) modules based on logistic regression (LR) and conditional random fields (CRFs). In Section 5, we give a brief explanation of POMDP and its extension by incorporating user focus. Experimental evaluations of the proposed POMDP-based system with dialogue sessions are reported in Section 6.
Task of information navigation
Information navigation does not assume a designed task and goal, but provides useful information according to the users’ interest. When the user demands are not clear, the system clarifies them through interaction. The system presents relevant information even if no result exactly matches the user query. Moreover, the system presents potentially useful information even when the user does not make any explicit request.
In natural human–human conversations,
Presentation of relevant information based on P-A structure
In this section, we describe flexible matching of P-A structure on which the proposed question answering (QA) and proactive presentation (PP) modules are based (Yoshino et al., 2011). Text of news articles and user utterances are parsed to extract a P-A structure (an example is shown in Fig. 3). A P-A structure represents a sentence with a predicate, arguments and their semantic role labels (Johansson and Nugues, 2008, Hajič et al., 2009, Matsubayashi et al., 2012). We used the Japanese text
Spoken language understanding (SLU)
In this section, we present the spoken language understanding (SLU) components of our system. They detect the user's focus and intention and provide them to the dialogue manager. The SLU modules are formulated with a statistical model that gives likelihoods, which are used in the POMDP.
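To illustrate how a statistical SLU module can hand likelihoods rather than a single hard label to the dialogue manager, the following is a minimal sketch of a multinomial logistic regression (softmax) scorer over user intentions. The feature names, intention labels, and weights are hypothetical and chosen only for illustration; they are not the features or parameters of the system described here.

```python
import math
from typing import Dict

def intention_posteriors(
    features: Dict[str, float],
    weights: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Return a probability for each intention label via softmax.

    `weights[label][feature]` is a hypothetical learned weight; features
    that a label has no weight for contribute zero to its score.
    """
    scores = {
        label: sum(w.get(name, 0.0) * value for name, value in features.items())
        for label, w in weights.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}

# Hypothetical weights for two intentions and two binary features.
example_weights = {
    "question": {"has_wh_word": 2.0},
    "no_request": {"is_silence": 3.0},
}
posteriors = intention_posteriors({"has_wh_word": 1.0}, example_weights)
print(posteriors)
```

The returned distribution, not just its argmax, is what a belief-tracking dialogue manager can consume as a confidence score for each candidate user state.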
Dialogue management for information navigation
The POMDP-based statistical dialogue management is formulated as below. The random variables involved at a dialogue turn t are as follows:
- s ∈ Is: user state. User intention.
- a ∈ K: system action. Module that the system selects.
- o ∈ Is: observation. Observed user state, including ASR and intention analysis errors.
- P(o|s): observation probability. Output of SLU with its confidence score, which is defined in Eqs. (10), (12).
- State transition probability. Model to predict the next user state
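The variables above combine in the standard POMDP belief update, b'(s') ∝ P(o|s') Σ_s P(s'|s, a) b(s). The following is a minimal sketch of that generic update, not the extended user-focus model of this paper; the state labels, the action name, and all probability values are hypothetical.

```python
from typing import Dict, Tuple

Belief = Dict[str, float]

def update_belief(
    belief: Belief,
    action: str,
    observation: str,
    transition: Dict[Tuple[str, str], Dict[str, float]],
    observation_prob: Dict[str, Dict[str, float]],
) -> Belief:
    """Generic POMDP belief update.

    transition[(s, a)][s_next] plays the role of P(s'|s, a), and
    observation_prob[s][o] plays the role of P(o|s).
    """
    unnormalized = {}
    for s_next in belief:
        # Prediction step: probability of reaching s_next from the old belief.
        predicted = sum(
            transition[(s, action)].get(s_next, 0.0) * prob
            for s, prob in belief.items()
        )
        # Correction step: weight by how well s_next explains the observation.
        unnormalized[s_next] = observation_prob[s_next].get(observation, 0.0) * predicted
    total = sum(unnormalized.values())
    if total == 0.0:
        # The observation is impossible under the model; keep the old belief.
        return dict(belief)
    return {s: value / total for s, value in unnormalized.items()}

# Hypothetical two-state example.
transition = {
    ("question", "answer"): {"question": 0.7, "no_request": 0.3},
    ("no_request", "answer"): {"question": 0.4, "no_request": 0.6},
}
observation_prob = {
    "question": {"question": 0.8, "no_request": 0.2},
    "no_request": {"question": 0.1, "no_request": 0.9},
}
prior = {"question": 0.5, "no_request": 0.5}
posterior = update_belief(prior, "answer", "question", transition, observation_prob)
print(posterior)
```

Keeping a distribution over user states, instead of committing to the top SLU hypothesis, is what lets the dialogue manager stay robust to ASR and intention analysis errors.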
Experimental evaluations
For evaluation of the system, we collected an additional 626 utterances (12 users, 24 dialogues; 2 dialogues by each user) with the proposed dialogue system with speech input (Yoshino et al., 2013a). There are 58 cases regarded as no request (NR), in which a user did not say anything for longer than 5 seconds. The gold standard was annotated by two annotators. The agreement for the user states was 0.958 and Cohen's kappa (Carletta, 1996) was 0.932. The agreement for the system actions was 0.944 and
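For readers unfamiliar with the two agreement figures reported for the annotation, the following is a minimal sketch of how raw agreement and Cohen's kappa are computed from two annotators' label sequences. The label sequences below are made up for illustration and are not the annotation data of this study.

```python
from collections import Counter
from typing import Sequence

def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical annotations of six utterances.
annotator_1 = ["QA", "QA", "NR", "PP", "QA", "NR"]
annotator_2 = ["QA", "QA", "NR", "QA", "QA", "NR"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.7
```

In this toy example the raw agreement is 5/6, but kappa is lower because part of that agreement would be expected by chance given how often each annotator uses each label.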
Conclusions
We have designed and implemented a spoken dialogue system for information navigation of Web news articles updated day-by-day. The system presents relevant information according to the user's interest. We have introduced a user focus detection model, and developed a POMDP framework which tracks the user focus to select the appropriate action module of the dialogue system. In the experimental evaluations, the proposed dialogue management approach determines the state of the user more accurately
References
- The Philips automatic train timetable information system. Speech Commun. (1995)
- Incorporating discourse features into confidence scoring of intention recognition results in spoken dialogue systems. Speech Commun. (2006)
- User evaluation of the Mask Kiosk. Speech Commun. (2002)
- Partially observable Markov decision processes for spoken dialog systems. Comput. Speech Lang. (2007)
- The hidden information state model: a practical framework for POMDP-based spoken dialogue management. Comput. Speech Lang. (2010)
- Ravenclaw dialog management using hierarchical task decomposition and an expectation agenda
- An ε-optimal grid-based algorithm for partially observable Markov decision processes
- Plans and resource-bounded practical reasoning. Comput. Intell. (1988)
- Assessing agreement on classification tasks: the kappa statistic. Comput. Linguist. (1996)
- Information extraction tools and methods for understanding dialogue in a companion
- How was your day?: a companion ECA
- Attention, intentions, and the structure of discourse. Comput. Linguist.
- The CoNLL-2009 shared task: syntactic and semantic dependencies in multiple languages
- Methods in Structural Linguistics
- Dependency-based syntactic–semantic analysis with PropBank and NomBank
- New perspectives on spoken language understanding: does machine need to fully understand speech?
- A fully-lexicalized probabilistic model for Japanese syntactic and case structure analysis
- Flexible mixed-initiative dialogue management using concept-level confidence measures of speech recognizer output
- Implementing the information-state update approach to dialogue management in a slightly extended SCXML
- A stochastic model of human–machine interaction for learning dialog strategies. IEEE Trans. Speech Audio Process.
- VoiceXML. Commun. ACM
☆ This paper has been recommended for acceptance by R.K. Moore.