Dealing with complex queries in decision-support systems
Introduction
Under uncertainty, a modern and useful decision-theoretic model is the influence diagram [17]. It consists of an acyclic directed graph with associated probabilities and utilities, respectively modeling the uncertainties and preferences tied in with the stated problem. Nowadays this probabilistic graphical model is frequently adopted as a basis for constructing decision-support systems (DSSs). The results of evaluating an influence diagram are decision tables containing the optimal decision alternatives, policies or responses. Thus, for every decision, there is an associated decision table with the best alternative, i.e. the alternative with the maximum expected utility for every combination of relevant variables (usually called attributes within this context) that are observable before the decision is made. The evaluation algorithm determines which of the observable variables are relevant. These variables are outcomes of random variables and/or other past decisions.
A decision table may have millions of rows and typically more than twenty columns, leading to enormous data sets for storage and analysis. Expert DSS users demand such an analysis mainly on two grounds. First, DSS decision tables provide the best decision-making recommendations. However, experts may find such recommendations hard to accept if they come without any explanation of why the proposed decisions are optimal. Unexplained responses are not good enough for expert users, since DSSs operate on a model that is an approximation of the real world. The importance of explanations has been reported in the literature, see e.g. [9], [12], [13]. For example, in health-care problems, which usually involve difficult trade-offs between treatment benefits and risks, practitioners may use decision tables to determine the best patient treatment recommendations. For this purpose, they need to understand the underlying reasons or implicit rules.
In medical DSSs, clinical practice guidelines assemble the relevant knowledge gathered through literature review, meta-analysis, expert consensus, etc., and operationalize this information as informal text documents. This makes the gathered information difficult to interpret automatically and the decision-making process hard to guide. Shiffman and Greenes [19] propose translating guideline knowledge into decision table-based rule sets. Shiffman [18] proposes augmenting decision tables by layers, storing collateral information in slots at various levels beneath the logic layer of the conventional decision table. This information relates to table cells, rows and columns. It may include how tests are performed, the benefits/risks of the recommended strategies, costs, literature citations, etc., to help understand the domain. All these decision tables are different from ours. Our knowledge base is the model (influence diagram) and its evaluation, stored in the decision tables. The model (graph with probabilistic dependencies and probability and utility information) is built from clinical practice guidelines, data and expert input. Also, there is no uncertainty in clinical guidelines, whereas influence diagrams are based on subjective probabilities and utilities, and support learning and reasoning with uncertainty and preferences.
In [6] we introduced KBM2L lists to find explanations. The main idea stems from how computers manage multidimensional matrices: computer memory stores and manages these matrices as linear arrays, and each position is a function of the order chosen for the matrix dimensions. KBM2L lists are new list-based structures that optimize this order by putting equal responses in consecutive positions, yielding the target explanations and simultaneously achieving compact storage. These lists implicitly include the probability and utility models, they are simple, and have no added complex layers.
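The effect of the dimension order on compactness can be illustrated with a minimal sketch. It is not the KBM2L implementation from [6]: the function names (`offset`, `linearize`, `compact`), the two-attribute toy table and the choice of representing each run by its last offset are all assumptions made for illustration.

```python
from itertools import product

def offset(index, sizes):
    """Linear position of `index` when attributes are ordered as in `sizes`."""
    pos = 0
    for value, size in zip(index, sizes):
        pos = pos * size + value
    return pos

def linearize(response, sizes, base):
    """Lay the table out as a linear array for the attribute order `base`."""
    base_sizes = [sizes[a] for a in base]
    total = 1
    for s in base_sizes:
        total *= s
    cells = [None] * total
    for index in product(*(range(s) for s in base_sizes)):
        cells[offset(index, base_sizes)] = response(dict(zip(base, index)))
    return cells

def compact(cells):
    """Merge consecutive equal responses into (last_offset, response) runs."""
    runs = []
    for pos, resp in enumerate(cells):
        if runs and runs[-1][1] == resp:
            runs[-1] = (pos, resp)   # extend the current run
        else:
            runs.append((pos, resp)) # start a new run
    return runs

# Hypothetical toy table: the response depends only on attribute "a".
sizes = {"a": 2, "b": 3}
toy_response = lambda cfg: "treat" if cfg["a"] == 1 else "wait"

runs_ab = compact(linearize(toy_response, sizes, ["a", "b"]))
runs_ba = compact(linearize(toy_response, sizes, ["b", "a"]))
```

With the order (a, b) the six cells collapse into two runs, whereas the order (b, a) interleaves the responses and leaves six runs. The shorter list also reads as an explanation: only attribute a matters for the response.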
Not only do expert users employ decision tables as a knowledge base (KB) for explanations; they also query the DSS about which is the best recommendation for a given set of attributes in different ways. This is the second reason for decision table analysis. In a typical session, experts interact with DSSs to:
- (A) formulate a query in the KB domain;
- (B) translate the query into the KB formalism;
- (C) implement the response retrieval;
- (D) build the response efficiently;
- (E) communicate the response(s) and/or suggest improvements, and wait for user feedback.
For (A) and (B), we distinguish between two groups of queries (closed/open) depending on whether or not the whole set of attributes is instantiated. A closed query is a specific and well-defined query entered by users who know all the attribute information. An open query is less specific, as it includes attribute values that are undefined, either because they are hard or expensive to obtain or because they are unreliable. Martinez et al. [15] give a similar classification for GIS (geographical information systems), although they focus on efficient data updating and access from a physical point of view (merely as a database), rather than from a logical point of view (as a KB).
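The closed/open distinction can be sketched as a simple check on which attributes a query instantiates. This is only an illustration: the function name `classify` and the dictionary representation of a query are assumptions, and the attribute names are borrowed from the ht decision table discussed later in this Introduction.

```python
# Attributes of the ht decision table (cs, bd, hc, hp), as named in the text.
ATTRIBUTES = ("cs", "bd", "hc", "hp")

def classify(query, attributes=ATTRIBUTES):
    """Return ('closed', []) if every attribute is instantiated,
    otherwise ('open', <list of undefined attributes>)."""
    missing = [a for a in attributes if query.get(a) is None]
    return ("closed", []) if not missing else ("open", missing)

# A fully instantiated query and one that only fixes the clinical stage.
full_query = {"cs": "I", "bd": "Yes", "hc": "Low.Grade", "hp": "Present"}
partial_query = {"cs": "II2"}

kind_full, _ = classify(full_query)
kind_partial, undefined = classify(partial_query)
```

Here `kind_full` is "closed", while `kind_partial` is "open" with bd, hc and hp reported as undefined.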
(C) to (E) may be troublesome, especially for open queries, due to imprecise response retrieval failing to satisfy users. Additionally, the DSS may not include the whole decision table, because an exhaustive evaluation of the decision-making problem can be too costly. In this case there will be no response at all. Worse still, both situations could apply at the same time, demanding a methodology to undertake tasks (C)–(E) dealing with ambiguity and ignorance about the response.
Let us illustrate these ideas with the following clinical problem. It is a real health-care decision-making problem regarding the optimal treatment of non-Hodgkin lymphoma of the stomach.
Primary gastric non-Hodgkin lymphoma, gastric NHL for short, is a relatively rare disorder, accounting for about 5% of gastric tumors. This disorder is caused by a chronic infection by the Helicobacter pylori bacterium [5]. Treatment consists of a combination of antibiotics, chemotherapy, radiotherapy and surgery.
A number of influence diagrams have been constructed and validated [14]. These models are only meant to be used for patients with histologically confirmed gastric NHL. We have taken the most complex version with three decision nodes. This influence diagram is shown in Fig. 1, and is briefly discussed in the following. The first of the decision nodes, helicobacter-treatment (ht), corresponds to the decision to prescribe antibiotics against H. pylori. The second decision concerns carrying out surgery (s). The possibilities are either curative surgery, involving the complete removal of the stomach and locoregional tumor mass; palliative surgery, i.e. partial removal of the stomach and tumor; or no surgery. The last decision, ct-rt-schedule (ctrts), is concerned with the selection of chemotherapy (Chemo), radiotherapy (Radio), chemotherapy followed by radiotherapy (Ch.Next.Rad), or none.
The influence diagram model consists of 17 chance nodes (ellipses), one value node (diamond), three decision nodes (rectangles) and 42 arcs. Nodes to the left of the decision nodes (see Fig. 1) concern pretreatment information. Nodes to the right of the decision nodes are posttreatment nodes. Variables with their associated domains are listed in Table 1. See [14] for further details on the model. Bielza et al. [1] detail the use of KBM2L lists to gain a better understanding of the treatment basis of the gastric NHL model.
The gastric NHL influence diagram evaluation outputs three decision tables, one for each decision variable, each containing the optimal treatment for each combination of attributes in the tables.
Let us take the first decision table concerning the ht decision. It contains four attributes (cs, bd, hc, and hp), and the expected utility of each treatment alternative ht = No/Yes. To illustrate likely user queries, suppose a user queries the DSS about patients with the following configurations:
We will look at all the discussed queries in this paper. In the first case, the query is closed since the four attributes are instantiated. The question is about a patient who has a good histological classification (hc = Low.Grade), a favorable prognosis (cs = I), the H. pylori bacterium (hp = Present), and a big tumor (bd = Yes). Unless this query corresponds precisely to an unsolved part of the problem, the response should be easy to retrieve.
In the second case, the query is open because the doctor has not yet performed a biopsy to ascertain the histological classification (hc). This could perhaps be due to the high cost of the biopsy.
In the third case, the query is even more open, specifying only a medium clinical stage (cs = II2) for the patient. Even so, the user may want to find out which treatment such patients should receive. Responses are no longer expected to be easy to retrieve: many attribute configurations match the query, and users will be dissatisfied if the retrieved responses differ from one another or are unknown. Strategies are therefore needed to ensure user satisfaction. One possibility is to reorder the table to provide more precise answers. Another is to apply more sophisticated prediction procedures that infer the unknown responses from (somehow) close known responses, or to have the user intervene at some steps to reduce response uncertainty.
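Why open queries are harder can be seen in a small sketch that expands a query into every matching configuration and groups the retrieved responses. Everything here is hypothetical: the attribute domains are cut down to two values each (values such as High.Grade and Absent are assumed, the real domains are in Table 1), the table entries are invented and are not results of the gastric NHL model, and the names `expand` and `answer` are illustrative.

```python
from itertools import product

BASE = ("cs", "bd", "hc", "hp")
DOMAINS = {  # hypothetical, reduced domains
    "cs": ["I", "II2"],
    "bd": ["Yes", "No"],
    "hc": ["Low.Grade", "High.Grade"],
    "hp": ["Present", "Absent"],
}

def expand(query):
    """All full configurations compatible with a (possibly open) query."""
    choices = [[query[a]] if a in query else DOMAINS[a] for a in BASE]
    return list(product(*choices))

def answer(query, table):
    """Group matching configurations by response; unknown cells go under None."""
    found = {}
    for config in expand(query):
        found.setdefault(table.get(config), []).append(config)
    return found

# Invented, partially evaluated table for the ht decision (not model output).
TOY_TABLE = {
    ("I", "Yes", "Low.Grade", "Present"): "Yes",
    ("I", "Yes", "High.Grade", "Present"): "No",
}

closed = answer({"cs": "I", "bd": "Yes", "hc": "Low.Grade", "hp": "Present"}, TOY_TABLE)
open_q = answer({"cs": "I", "bd": "Yes", "hp": "Present"}, TOY_TABLE)
vague = answer({"cs": "II2"}, TOY_TABLE)
```

In this toy setting the closed query yields a single response, the query with hc undefined yields two conflicting responses, and the query fixing only cs matches eight configurations whose responses are all unknown. These are the ambiguity and ignorance situations that the strategies above are meant to handle.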
In this paper, we propose a query system based on the KBM2L framework to deal with these complex situations. Unlike database management systems that operate with facts, DSSs must provide explanations besides efficiently retrieving the query response information [10]. Thus, our KBM2L framework provides not only an efficient and satisfactory query response retrieval but also an informed response explanation. It is not our aim to develop clinical practice guidelines, but to provide a DSS with a user interface capable of performing complex queries involving more than just accessing a clinical protocol database or document.
The paper is organised as follows. Section 2 outlines the technique of KBM2L lists. Section 3 describes the query complexity and shows how to deal with a closed query. Section 4 tackles less specific and more complex open queries. The proposal combines decision tables that have been compacted using KBM2L lists with learning, information access and information retrieval processes. We give several examples applied to the non-Hodgkin lymphoma problem. Section 5 contains the conclusions and suggests further research.
Basics
A decision table output by evaluating an influence diagram is a set of attributes that determines the optimal policy. Besides all the attribute configurations, a decision table includes the response or optimal alternative associated with each configuration. A base is defined as a vector with elements equal to the attributes in a specific order. Given a base, an index is a vector whose elements are the attribute values, interpreted as the coordinates with respect to that base. With a fixed order
Complexity of queries
Queries are stated as attribute instantiations. Therefore, they are related to the KBM2L index and employ multidimensional point access methods [21]. The DSS is expected to return a response stating the optimal policy using a small subset of the KB. However, an added difficulty is that the optimal policy may be unknown.
Let us explain this point in further detail. As mentioned earlier, the exhaustive evaluation of the decision-making problem may be too costly in terms of time and memory
Open queries
We have seen that the expert is an agent that queries the DSS about the optimal policy for the decision-making problem. Expert and DSS enter into a dialogue consisting of queries, responses and explanations. For closed queries, the expert receives definite and accurate responses. Responses to open queries are not so straightforward due to expert imprecision. Not all attributes are instantiated. Possible reasons are the unreliability of some attribute values, missing knowledge, high retrieval
Conclusions and further research
A decision model builds on guidelines, probabilities, utilities, probabilistic relationships, among other sources of information. Decision tables are the result of evaluating a decision model, taking into account that information. Their extraordinarily large size motivated us to analyse them. The aim was to save memory space and, more interestingly, retrieve knowledge (to understand DSS suggestions). In our previous paper we managed to achieve both aims. Moreover, by analysing the items—groups
Acknowledgments
Research partially supported by grants from the Spanish Ministry of Science and Innovation (TIN2007-62626 and Consolider Ingenio 2010-CSD2007-00018). Thanks to Peter Lucas for valuable support with the medical problem. We are also grateful to the referees for their valuable remarks that have definitely helped to improve the manuscript.
References (21)
- et al. Explaining clinical decisions by extracting regularity patterns. Decision Support Systems (2008)
- et al. A list-based compact representation for large decision tables management. European Journal of Operational Research (2005)
- et al. The effects of structural characteristics of explanations on use of a DSS. Decision Support Systems (2006)
- et al. A PGM framework for recursive modeling of players in simple sequential Bayesian games. International Journal of Approximate Reasoning (2010)
- et al. Processing partially specified queries over high-dimensional databases. Data & Knowledge Engineering (2007)
- A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering (1996)
- et al. A Bayesian method for the induction of probabilistic networks from data. Machine Learning (1992)
- et al. Evidence propagation and value of evidence on influence diagrams. Operations Research (1998)
- et al. Helicobacter pylori gastritis and primary gastric non-Hodgkin's lymphomas. Journal of Clinical Pathology (1994)
- et al. Computers and Intractability: A Guide to the Theory of NP-Completeness (1979)
Concha Bielza received the M.S. degree in mathematics from Complutense University of Madrid, Madrid, Spain, in 1989 and the Ph.D. degree in computer science from the Universidad Politécnica de Madrid, Madrid, in 1996. She is currently a Full Professor of statistics and operations research with the Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid. Her research interests are primarily in the areas of probabilistic graphical models, decision analysis, metaheuristics for optimization, data mining, classification models, and real applications, like biomedicine, bioinformatics and neuroscience. Her research has appeared in journals like Management Science, Computers and Operations Research, Statistics and Computing, the European Journal of Operational Research, Decision-Support Systems, Naval Research Logistics, the Journal of the Operational Research Society, Medical Decision Making, Methods of Information in Medicine, IEEE Transactions on SMC, International Journal of Systems Science, Bioinformatics, Briefings in Bioinformatics, Journal of Statistical Software, Journal of Heuristics, Intelligent Data Analysis, Developmental Neurobiology, Neuroinformatics, IEEE Transactions on Signal Processing, and Expert Systems with Applications as well as chapters of many books.
Juan A. Fernandez del Pozo received his MS degree in Computer Science in 1999 and PhD in Computer Science in 2006 from Universidad Politécnica de Madrid (UPM), Madrid (Spain). He is currently Associate Professor of Statistics and Operations Research at the School of Computer Science and a member of the Computational Intelligence Group at the UPM. His research interests include decision analysis and intelligent decision-support systems based on influence diagrams and Bayesian networks that perform knowledge acquisition in huge decision tables, knowledge discovery and data mining on models' outputs for explanation synthesis and sensitivity analysis. He is also interested in optimization based on evolutionary algorithms and classification models. He is collaborating with several Spanish Foundations in modeling the service quality and life quality on social service environments. His articles have appeared in various academic journals including Springer Lecture Notes in Computer Science, Journal of Operational Research, Computers & Operations Research, Decision-Support Systems, Expert Systems with Applications, and Medical Decision Making. His teaching interests include Statistics, Decision-Support Systems and Operations Research.