INDUCTION AS KNOWLEDGE INTEGRATION

by

Benjamin Douglas Smith

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)

December 1995

Copyright 1996 Benjamin Douglas Smith

This dissertation, written by Benjamin Douglas Smith under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY. Date: December 18, 1995.

Dedication

To my wife Christina, for all her love, support, and patience.

Acknowledgments

This work benefited from discussions with several people, particularly Haym Hirsh, who helped me to shape this work in its early stages. I would also like to thank the other members of my guidance committee, Yolanda Gil, Shankhar Rajamoney, Jean-Luc Gaudiot, and my advisor Paul Rosenbloom. The Soar group at USC provided many useful comments throughout the evolution of this work, and made the graduate program fun as well. Thanks are also due to my parents for their support and their assurances that there was indeed a light at the end of the tunnel. Finally, I would not be where I am today without my wife Christina, who persevered through the ups and downs of my graduate school career, and provided me with love and encouragement throughout.
This work was partially supported by the National Aeronautics and Space Administration (NASA Ames Research Center) under cooperative agreement number NCC 2-538, and by the Information Systems Office of the Advanced Research Projects Agency (ARPA/ISO) and the Naval Command, Control and Ocean Surveillance Center RDT&E Division (NRaD) under contract number N66001-95-C-6013.

Contents

Dedication
Acknowledgments
List Of Tables
List Of Figures
Abstract

1 Introduction
  1.1 Research Goals
  1.2 Organization of the Dissertation

2 KII
  2.1 Knowledge Representation
  2.2 Operations
    2.2.1 Translators
      2.2.1.1 Examples of Translators
      2.2.1.2 Independent and Dependent Knowledge Translation
    2.2.2 Integration
    2.2.3 Enumeration
    2.2.4 Solution-set Queries
    2.2.5 Incremental versus Batch Processing
  2.3 KII Solves an Induction Task
    2.3.1 Hypothesis Space
    2.3.2 Instance Space
    2.3.3 Available Knowledge
    2.3.4 Translators
    2.3.5 Integration and Enumeration

3 Set Representations
  3.1 Grammars as Set Representations
  3.2 Closure of C and P under Integration
  3.3 Computability of Induction
    3.3.1 Closure Properties of Set Operations
    3.3.2 Computability of Solution Set
      3.3.2.1 Expressiveness Bounds on (C × C) ∩ P
      3.3.2.2 Expressiveness Bounds on C and P

4 RS-KII
  4.1 The Regular Set Representation
    4.1.1 Definition of DFAs
    4.1.2 Definitions of Set Operations
      4.1.2.1 Union
      4.1.2.2 Intersection
      4.1.2.3 Complement
      4.1.2.4 Cartesian Product
      4.1.2.5 Projection
      4.1.2.6 Transitive Closure
    4.1.3 DFA Implementation
      4.1.3.1 Recursive DFAs
      4.1.3.2 Primitive DFAs
  4.2 RS-KII Operator Implementations
    4.2.1 Integration
      4.2.1.1 Intersection
      4.2.1.2 Union
      4.2.1.3 Minimizing after Integration
    4.2.2 Enumeration
      4.2.2.1 The Basic Branch-and-Bound Algorithm
      4.2.2.2 Branch-and-Bound with Partially Ordered Hypotheses
      4.2.2.3 Fully Modified Branch-and-Bound Algorithm
      4.2.2.4 Efficient Selection and Pruning

5 RS-KII and AQ-11
  5.1 AQ-11 Algorithm
    5.1.1 The Hypothesis Space
    5.1.2 Instance Space
    5.1.3 The Search Algorithm
  5.2 Knowledge Sources and Translators
    5.2.1 Examples
    5.2.2 Lexicographic Evaluation Function
      5.2.2.1 Constructing P
    5.2.3 Domain Theory
  5.3 An Induction Task
    5.3.1 Iris Task
      5.3.1.1 Translators for Task Knowledge
      5.3.1.2 Results of Learning
    5.3.2 The CUP Task
      5.3.2.1 Results of Learning
  5.4 Complexity Analysis
    5.4.1 Complexity of AQ-11
      5.4.1.1 Cost of AQ-11's Main Loop
      5.4.1.2 Cost of LearnTerm
      5.4.1.3 Total Time Complexity for AQ-11
    5.4.2 Complexity of RS-KII when Emulating AQ-11
  5.5 Summary

6 RS-KII and IVSM
  6.1 The IVSM and CEA Algorithms
    6.1.1 The Candidate Elimination Algorithm
    6.1.2 Incremental Version Space Merging
    6.1.3 Convex Set Representation
    6.1.4 Equivalence of CS-KII and IVSM
  6.2 Subsumption of IVSM by RS-KII
    6.2.1 Expressiveness of Regular and Convex Sets
    6.2.2 Spaces for which RS-KII Subsumes CS-KII
      6.2.2.1 Conjunctive Feature Languages
      6.2.2.2 Conjunctive Languages where RS-KII Subsumes CS-KII
    6.2.3 RS-KII Translators for IVSM Knowledge
      6.2.3.1 Noise-free Examples
      6.2.3.2 Noisy Examples with Bounded Inconsistency
      6.2.3.3 Domain Theory
  6.3 Complexity of Set Operations
    6.3.1 Complexity of Regular Set Intersection
    6.3.2 Complexity of Regular Set Enumeration
    6.3.3 Complexity of Convex Set Intersection
    6.3.4 Complexity of Enumerating Convex Sets
    6.3.5 Complexity Comparison
      6.3.5.1 Equating Regular and Convex Sets
      6.3.5.2 Tree Structured Hierarchies
      6.3.5.3 Lattice Structured Features
  6.4 Exponential Behavior in CS-KII and RS-KII
    6.4.1 Haussler's Task
    6.4.2 Performance of CS-KII and RS-KII
      6.4.2.1 RS-KII's Performance on Haussler's Task
    6.4.3 Complexity Analysis
    6.4.4 Summary
  6.5 Discussion

7 Related Work
  7.1 IVSM
  7.2 Grendel
  7.3 Bayesian Paradigms
  7.4 Declarative Specification of Biases
  7.5 PAC Learning

8 Future Work
  8.1 Long Term Vision
  8.2 Immediate Issues

9 Conclusions

Reference List

List Of Tables

2.1 Examples.
2.2 Translated Examples.
3.1 Language Families.
3.2 Closure under Union and Intersection.
3.3 Closure Properties of Languages under Union and Intersection.
3.4 Closure Under Operations Needed to Compute the Solution Set.
3.5 Summary of Expressiveness Bounds.
5.1 Parameters for VL1.
5.2 Examples for the CUP Task.
6.1 Translations of Haussler's Examples.

List Of Figures

2.1 Translators.
2.2 Computing the Deductive Closure of (C, P).
3.1 Projection Defined as a Homomorphism.
4.1 Definition of Union.
4.2 Definition of Intersection.
4.3 Definition of Complement.
4.4 Regular Grammar for {shuffle(x, y) | x, y ∈ (a|b)*$ s.t. x ≤ y}.
4.5 Definition of Cartesian Product.
4.6 NDFA for Projection (First).
4.7 DFA for Projection (First).
4.8 DFA for Projection (Second).
4.9 Intersection Implementation.
4.10 Union Implementation.
4.11 Branch-and-Bound where ≤ is a Total Order.
4.12 Branch-and-Bound where ≤ is a Partial Order.
4.13 Parameters to BranchAndBound-2.
4.14 Branch-and-Bound That Returns n Solutions Also in A.
4.15 Parameters to BranchAndBound-3 for Implementing Enumerate.
5.1 The VL1 Concept Description Language.
5.2 The Outer Loop of the AQ-11 Algorithm.
5.3 The Inner Loop of the AQ-11 Algorithm.
5.4 Example Translator for VL1.
5.5 Regular Expression for the Set of VL1 Hypotheses Covering an Instance.
5.6 Mapping Function from Hypotheses onto Strings.
5.7 Translator for the LEF.
5.8 DFA Recognizing {shuffle(x, y) ∈ (0|1)^2k | x ≤ y}.
5.9 Moore Machine for the Mapping.
5.10 CUP Domain Theory.
5.11 Grammar for VL1 Hypotheses Satisfying the CUP Theory Bias.
5.12 Grammar for Instantiated VL1 Language.
6.1 Classify Instances Against the Version Space.
6.2 RS-KII Translator for Noise-free Examples.
6.3 RS-KII Translator for Examples with Bounded Inconsistency.
6.4 RS-KII Translator for an Overspecial Domain Theory.
6.5 Intersection Implementation.
6.6 Explicit DFA for the Intersection of Two DFAs.
6.7 Intersection of Convex Sets, Phase One.
6.8 Haussler's Negative Examples.
6.9 DFA for C0, the Version Space Consistent with p0.
6.10 DFA for C1, the Version Space Consistent with n1.
6.11 DFA for C2, the Version Space Consistent with n2.
6.12 DFA for C3, the Version Space Consistent with n3.
6.13 DFA for C0 ∩ C1.
6.14 DFA for C0 ∩ C1 After Empty Test.
6.15 DFA for C0 ∩ C1 ∩ C2.
6.16 DFA for C0 ∩ C1 ∩ C2 After Empty Test.
6.17 DFA for C0 ∩ C1 ∩ C2 ∩ C3.
6.18 DFA for C0 ∩ C1 ∩ C2 ∩ C3 After Empty Test.
6.19 DFA for C0 ∩ C1 ∩ ... After Empty Test.
6.20 DFA for C ∩ C′.

Abstract

Accuracy and efficiency are the two main evaluation criteria for induction algorithms. One of the most powerful ways to improve performance along these dimensions is by integrating additional knowledge into the induction process. However, integrating knowledge that differs significantly from the knowledge already used by the algorithm usually requires rewriting the algorithm. This dissertation presents KII, a Knowledge Integration framework for Induction, that provides a straightforward method for integrating knowledge into induction, and provides new insights into the effects of knowledge on the accuracy and complexity of induction.

The idea behind KII is to express all knowledge uniformly as constraints and preferences on hypotheses. Knowledge is integrated by conjoining constraints and disjoining preferences. A hypothesis is induced from the integrated knowledge by finding a hypothesis consistent with all of the constraints and maximally preferred by the preferences. Theoretically, just about any knowledge can be expressed in this manner. In practice, the constraint and preference languages determine both the knowledge that can be expressed and the complexity of identifying a consistent hypothesis. RS-KII, an instantiation of KII based on a very expressive set representation, is described. RS-KII can utilize the knowledge of at least two disparate induction algorithms, AQ-11 and CEA ("version spaces"), in addition to knowledge neither algorithm can utilize. It seems likely that RS-KII can utilize knowledge from other induction algorithms, as well as novel kinds of knowledge, but this is left for future work. RS-KII's complexity is comparable to these algorithms when using only the knowledge of a given algorithm, and in some cases RS-KII's complexity is dramatically superior. KII also provides new insights into the effects of knowledge on induction that are used to derive classes of knowledge for which induction is not computable.

Chapter 1

Introduction

Induction is the process of going beyond what is known in order to reach a conclusion that is not deductively implied by the knowledge.
One form of induction commonly studied in machine learning is classifier learning (e.g., [Mitchell, 1982, Breiman et al., 1984, Quinlan, 1986, Pazzani and Kibler, 1992]). In this form of induction, the objective is to learn an operational description of a classifier that maps objects from some predefined universe of instances into classes. The induction process is provided with knowledge about the classes, but there is usually not enough knowledge to derive the classifier deductively. It is necessary to go beyond what is known in order to induce the classifier.

The desired classifier is usually referred to as the target concept, and a classifier is often referred to as a hypothesis. The classifier induced from the knowledge is not necessarily the target concept. In order to go beyond what is known deductively, it is necessary to make a number of assumptions and guesses, and these are not always correct. The induced classifier may classify some instances incorrectly with respect to the target concept. The degree to which the induced classifier makes the same classifications as the target concept is a measure of its accuracy. Hopefully, the induced hypothesis is fairly accurate, and ideally it is equivalent to the target concept.

The more knowledge there is to induce from, the more accurate the induced classifier can be. The assumptions and guesses made in the inductive process can be based on a broader base of knowledge, so there is less opportunity for making an incorrect decision. In the extreme case, there is enough knowledge to deduce the classifier, in which case no mistakes are made, and the induced classifier exactly matches the target concept.

An ideal induction algorithm would be able to utilize all of the available knowledge in order to maximize the accuracy of the induced classifier. In practice, most existing algorithms are written with certain kinds of knowledge in mind. If some of the available knowledge is not of these types, then the algorithm cannot use it, or can only use it with difficulty. One way an algorithm can make use of such knowledge is to express it in terms of the kinds of knowledge already used by the algorithm. For instance, Quinlan suggested casting knowledge as pseudo-examples and providing the pseudo-examples to the induction algorithm as input [Quinlan, 1990]. However, it is not always possible to express new knowledge in terms of the existing knowledge, in which case the knowledge still cannot be used. If the first approach fails, a second approach is to rewrite the algorithm to make use of the new knowledge.

Rewriting an algorithm to utilize a new kind of knowledge is difficult. It also fails to solve the underlying problem. If yet another kind of knowledge is made available, the algorithm may have to be modified once again. Existing induction algorithms therefore fall short of the ideal algorithm. Existing algorithms each use varying ranges of knowledge, but none can utilize all of the available knowledge in every learning scenario. The accuracy of the hypotheses induced by these algorithms is less than it could be since not all of the knowledge is being utilized. The methods mentioned above for integrating additional knowledge into the induction algorithm can be used to extend the algorithm, but these methods are of limited practicality.
1.1 Research Goals

The goal of this dissertation is to construct an induction algorithm that can utilize arbitrary knowledge, and therefore maximize the accuracy of the learned hypothesis. Utilizing all of the available knowledge may either increase or decrease the computational complexity compared to using less knowledge. The induction process is guided by the knowledge, so additional knowledge may reduce the amount of computational effort. However, there is more knowledge to utilize, and the cost of utilizing that knowledge may overwhelm any computational benefits it provides. Thus there is a potential trade-off between accuracy and computational complexity. An algorithm that utilizes all of the knowledge may be prohibitively costly, or even uncomputable. It may be necessary to accept a reduction in accuracy in order to reduce the computational complexity. The second goal of this research is to investigate the ways in which expressiveness (breadth of utilizable knowledge) can be exchanged for computational cost.

The idea is to express all knowledge about the target concept uniformly in terms of constraints and preferences on the hypothesis space. Knowledge is integrated by conjoining the constraints and computing the union of the preferences. The integrated constraints and preferences specify a constrained optimization problem (COP). Solutions to the COP are the hypotheses identified by the knowledge as possible target concepts. The knowledge cannot discriminate further among these hypotheses, so one is selected arbitrarily as the induced hypothesis. This idea is embodied in KII, a Knowledge Integration framework for Induction. KII is described in Chapter 2.

Theoretically, any knowledge that can be expressed in terms of constraints and preferences can be utilized by KII. However, it may be computationally intractable to solve the COP resulting from these constraints and preferences. It may even be undecidable whether a given hypothesis is a solution to the COP. The constraint and preference languages determine what constraints and preferences can be expressed, and the computational complexity of solving the resulting COP. Thus, these languages provide a way of trading expressiveness for computability. Each choice of languages yields an operationalization of KII that makes a different trade-off between expressiveness and complexity.

KII provides a method for integrating arbitrary knowledge into induction and formalizes the trade-offs that can be made between expressiveness and complexity in terms of the constraint and preference languages. This relationship is used to determine the most expressive languages for which induction is computable. These analyses, and the space of possible constraint and preference languages, appear in Chapter 3.

KII can generate algorithms at practical and interesting trade-off points. One such trade-off is demonstrated by RS-KII, an instantiation of KII with expressive constraint and preference languages described in Chapter 4. RS-KII is expressive, utilizing the knowledge used by at least two disparate induction algorithms, AQ-11 [Michalski, 1978] and, for certain hypothesis spaces, the Candidate Elimination Algorithm (CEA) [Mitchell, 1982]. In conjunction with this knowledge, RS-KII can also utilize knowledge that these algorithms cannot, such as a domain theory and noisy examples with bounded inconsistency [Hirsh, 1990].
It seems likely that RS-KII can utilize the knowledge of other induction algorithms as well, although this is left for future work. This raises the possibility of combining knowledge from several induction algorithms in order to form hybrid algorithms (as in adding a domain theory to AQ-11). RS-KII's apparent expressiveness also raises the possibility of improving the accuracy of the induction by utilizing additional knowledge.

RS-KII is expressive, but it can also be computationally complex. In the worst case it is exponential in the number of knowledge fragments it utilizes. However, when RS-KII utilizes only the knowledge of AQ-11, it has a computational complexity that is slightly higher than that of AQ-11, but still polynomial in the number of examples. When RS-KII uses only the knowledge used by CEA, RS-KII's worst-case complexity is the same as that of CEA, but there are problems where RS-KII's complexity is O(n^2) when the complexity of CEA is O(2^n). Emulations of these algorithms by RS-KII, and analyses of RS-KII's complexity with respect to AQ-11 and CEA, are discussed in Chapter 5 (AQ-11) and Chapter 6 (CEA).

1.2 Organization of the Dissertation

The remainder of this dissertation is organized as follows. The KII framework and an illustration of KII solving an induction problem are discussed in Chapter 2. Chapter 3 lays out the space of possible set representations and identifies the most expressive representations for which induction is computable. RS-KII, an instantiation of KII based on the regular set representation, is described in Chapter 4. Emulations of AQ-11 and IVSM by RS-KII are described in Chapter 5 and Chapter 6, respectively. These chapters demonstrate that RS-KII can utilize the knowledge used by these algorithms, and that RS-KII can also utilize this knowledge while using knowledge these algorithms cannot. RS-KII's computational complexity when utilizing this knowledge is compared to the complexity of the original algorithms. Related work is discussed in Chapter 7. Future work is discussed in Chapter 8, and conclusions appear in Chapter 9.

Chapter 2

KII

KII is a Knowledge Integration framework for Induction. It can integrate arbitrary knowledge into induction, and provides a foundation for understanding the effects of knowledge on the complexity and accuracy of induction. Knowledge comprises examples, biases, domain theories, meta-knowledge, implicit biases such as those found in the search control of induction algorithms, and any other information that would justify the selection of one hypothesis over another as the induced hypothesis.

Knowledge can be defined more precisely as the examples plus all of the biases. Mitchell [Mitchell, 1980] defines bias to be any factor that influences the selection of one hypothesis over another as the target concept, other than strict consistency with the examples. The distinction between biases and examples is somewhat arbitrary, since they both influence selection of the target concept. More recent definitions consider consistency with the examples as a form of bias [Gordon and desJardins, 1995]. The latter definition of bias is used here. The terms knowledge and bias will be used interchangeably.

By definition, the biases are the only factors that determine which hypothesis is selected as the target concept. Induction is then a matter of determining which hypotheses are identified (preferred for selection) by the biases.
The biases may prefer several hypotheses equally. In this case, the best that can be done is to select one of the hypotheses at random, since there are no other factors upon which to base a selection of one hypothesis over another. If there were such a factor, it would be one of the biases by definition.

The hypotheses identified by the biases are termed the deductive closure of the knowledge, since these are the hypotheses deductively implied by the biases. Inducing a hypothesis by selecting a concept from the deductive closure of the knowledge may seem more like deduction than induction. However, there is still plenty of room for inductive leaps within this formulation. Inductive leaps come from two main sources. One inductive leap occurs when there is more than one hypothesis in the deductive closure, so that one of them must be selected arbitrarily as the induced hypothesis. The second place where inductive leaps occur is in the knowledge itself. The knowledge can include unsupported biases, assumptions, and guesses, that is, inductive leaps. Even if there is only one hypothesis in the deductive closure of the knowledge, it may not be the target concept if the knowledge is not firmly grounded in fact.

In the knowledge integration paradigm, a hypothesis is induced from a collection of knowledge (biases) as follows. Each bias is represented explicitly as an expression in some representation language. An expression representing the biases collectively is constructed by composing the expressions for each individual bias. The composite expression is then used to generate the hypotheses identified by the biases. Ideally, the composite expression is the set of hypotheses identified by the biases, but realistically it may be necessary to do some processing in order to extract the identified hypotheses.

One example of an algorithm in the knowledge integration paradigm is incremental version space merging (IVSM) [Hirsh, 1990]. Biases are translated into version spaces [Mitchell, 1982] consisting of the hypotheses identified by the bias (i.e., consistent with the knowledge). The version spaces for each of the biases are intersected to yield a new version space consistent with all of the knowledge. This composite version space is the set of hypotheses identified by the biases. A hypothesis can be selected arbitrarily from the composite version space as the target concept.

The version space representation for biases in IVSM allows the identified hypotheses to be easily extracted, but it is somewhat limited in the knowledge it can represent. It can only utilize biases that reject hypotheses from consideration, but it cannot utilize knowledge that prefers one hypothesis over another. Also, version spaces are represented as convex sets, and the expressiveness of this representation further limits the kinds of knowledge that can be represented. KII can utilize both constraint knowledge and preference knowledge, and allows a range of representation languages. These are discussed further in Section 2.1.

KII consists of three operations: translation, integration, and enumeration. The first step is translation. The form in which knowledge occurs in the learning scenario is usually not the form in which KII represents knowledge. Translators [Cohen, 1992] convert knowledge into the form used by KII. Once translated, the knowledge is then integrated into a composite representation from which the deductive closure can be computed.
At any time, one or more hypotheses can be enumerated from the deductive closure of the integrated knowledge. A hypothesis is induced from a collection of knowledge by translating the individual knowledge fragments (biases) in the collection, integrating the fragments together, and enumerating one of the hypotheses identified by the biases. Other information about the set of hypotheses identified by the biases is also relevant to induction, such as whether the set is empty, or whether it contains a single hypothesis. KII provides this information via queries. These are discussed further in Section 2.2.4.

The rest of this chapter is organized as follows. KII's knowledge representation is described in Section 2.1. The translation, integration, enumeration, and query operations are described in Section 2.2. An example of KII solving an induction task is given in Section 2.3.

2.1 Knowledge Representation

KII's knowledge representation is subject to a number of constraints. First, it must be expressive. Knowledge that cannot be expressed cannot be utilized, and the goal of KII is to utilize all of the available knowledge. Second, the representation must be composable. This facilitates integration by allowing a collection of knowledge fragments to be represented as a composition of individual fragments. Finally, the representation must be operational. That is, it must be possible both to enumerate hypotheses from the deductive closure of the integrated knowledge and to answer queries about the deductive closure.

KII satisfies these three criteria on the representation language by expressing knowledge in terms of constraints and preferences on the hypothesis space. The constraints correspond to representational biases, and the preferences correspond to procedural biases. For instance, a positive example might be expressed as a constraint that hypotheses in the deductive closure must cover the example. A negative example can be expressed as a constraint requiring hypotheses in the deductive closure not to cover the example. An information gain metric, used by such algorithms as ID3 [Quinlan, 1986] and FOCL [Pazzani and Kibler, 1992], would be expressed as a preference for hypotheses with a higher information gain. There is more than one way to express a given knowledge fragment in terms of constraints and preferences. These issues are discussed further in Section 2.2.1.

Each fragment of knowledge is translated into constraints and preferences. The constraints and preferences of a knowledge fragment are represented in KII by a tuple of three sets, (H, C, P), where H is the hypothesis space, C is the set of hypotheses in H that satisfy the constraints, and P is a set of hypothesis pairs, (a, b), such that a is less preferred than b (a < b). When the hypothesis space is obvious, H can be omitted from the tuple.

The biases encoded by tuple (H, C, P) identify the most preferred hypotheses among those satisfying the constraints. Hypotheses that do not satisfy the constraints cannot be the target concept, so these are eliminated. Among the remaining hypotheses, P indicates that some of these hypotheses should be selected over others. The less preferred hypotheses are eliminated, leaving a set of hypotheses that satisfy the constraints and are preferred by the preferences. The knowledge cannot discriminate further among these hypotheses. This is the set of hypotheses identified by the biases in (H, C, P).
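To make the tuple semantics concrete, the following is a minimal extensional Python sketch (not from the dissertation; the names identified_set and the toy interval space are illustrative). It computes the hypotheses identified by an (H, C, P) tuple by brute force, which is feasible only for small finite hypothesis spaces; the set representations discussed below exist precisely to avoid such explicit enumeration.

    # Extensional sketch of a KII (H, C, P) tuple. Illustrative only:
    # RS-KII uses regular sets (DFAs), not explicit Python sets.

    def identified_set(H, C, P):
        """Hypotheses satisfying the constraints (members of C) that are
        undominated under P, where (a, b) in P means a is less preferred
        than b."""
        return {h for h in C
                if not any((h, h2) in P for h2 in C)}

    # Toy hypothesis space: integer intervals [lo, hi] over 0..3.
    H = {(lo, hi) for lo in range(4) for hi in range(4) if lo <= hi}

    # Constraint: the hypothesis must cover the positive example 2.
    C = {(lo, hi) for (lo, hi) in H if lo <= 2 <= hi}

    # Preference: a is less preferred than b when a is strictly wider.
    P = {(a, b) for a in H for b in H
         if (a[1] - a[0]) > (b[1] - b[0])}

    print(identified_set(H, C, P))   # -> {(2, 2)}

Equation 2.1 below expresses the same computation purely in terms of set operations, which is what allows non-extensional set representations to implement it.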
The set of hypotheses identified by the biases is also referred to as the deductive closure of the knowledge in (H, C, P). Specifically, the deductive closure of (H, C, P) is the set {h ∈ C | ∀h′ ∈ C. (h, h′) ∉ P}. This is equivalent to Equation 2.1, below. In this equation, complements are taken with respect to the universe H and written H − X, and first projects a set of pairs onto its first elements. For example, first({(x1, y1), (x2, y2), ...}) returns the set {x1, x2, ...}.

    (H − first((C × C) ∩ P)) ∩ C    (2.1)

The deductive closure of an (H, C, P) tuple can also be thought of as the solution set of a constrained optimization problem (COP), where the domain of the COP is the hypothesis space (H), the constraints are specified by C, and the optimization criteria are specified by P. For this reason, an (H, C, P) tuple is also called a COP, and the deductive closure of (H, C, P) is also referred to as the solution set.

The (H, C, P) representation is expressive and composable, but since KII does not specify particular set representations for H, C, and P, it is not operational. It is expressive, since just about any knowledge relevant to induction can be expressed in terms of constraints and preferences. Integration is defined for every pair of (H, C, P) tuples, so the representation is also composable. Constraints are integrated by conjoining them into a single conjunctive constraint, and preferences are integrated by computing their union. Specifically, the integration of (H, C1, P1) and (H, C2, P2) is (H, C1 ∩ C2, P1 ∪ P2).

In order for the (H, C, P) representation to be operational, there must be an algorithm for enumerating hypotheses from the deductive closure of (H, C, P). The deductive closure of (H, C, P) is always defined in terms of set operations over H, C, and P, but since KII does not specify any particular set representations, the set operations are not operational. The (H, C, P) representation is operationalized by providing set representations for H, C, and P. In theory, any knowledge can be represented in terms of constraints and preferences, but in practice the set representation determines the constraints and preferences that can be represented, and the complexity of integrating knowledge and extracting hypotheses from the deductive closure of the knowledge.

Specifying a set representation operationalizes KII, and produces an induction algorithm (or family of algorithms) at a particular level of complexity and expressiveness. This provides a useful handle for investigating the effects of knowledge on induction, by relating these effects to properties of set representations. Set representations and their properties are discussed further in Chapter 3. An instantiation of KII based on the regular-set representation is discussed in Chapter 4.

2.2 Operations

This section formally describes the operations provided by KII. The four families of operations are translators for expressing knowledge as (H, C, P) tuples, an integrator for integrating (H, C, P) tuples, an enumerator for enumerating hypotheses from the deductive closure of an (H, C, P) tuple, and queries about properties of the deductive closure.

An induction task consists of an unknown target concept and knowledge from which the target concept is to be induced. Knowledge provided by the induction task is expressed in some representation, but probably not the COP representation expected by KII. Translators convert knowledge from its task representation into COPs.
COPs for each knowledge fragment are integrated by KII's integrate operator into a composite COP representing all of the knowledge. A hypothesis is induced by selecting a hypothesis arbitrarily from the deductive closure of the integrated knowledge. This is achieved by the enumerate operator. The deductive closure of the integrated knowledge is exactly the solution set to the COP resulting from integrating all the translated knowledge. The enumerate operator returns a requested number of hypotheses from the solution set of this COP. Also relevant to induction are certain properties of the solution set, such as whether the solution set is empty, whether the induced hypothesis is the only one in the solution set, and whether a given hypothesis is in the solution set. These properties, and others, are determined by solution-set queries. The remainder of this section discusses the KII operations in more detail.

2.2.1 Translators

An induction task consists of an unknown target concept and some collection of knowledge from which the target concept is to be induced. In order for KII to use this knowledge, it must first be translated from whatever form it has in the task (its naturalistic representation [Rosenbloom et al., 1993]) into the representation expected by KII, namely a COP. This operation is performed by translators [Cohen, 1992].

A translator converts knowledge from some naturalistic representation into the constraints and preferences that make up a COP. The translators are highly task-dependent. They depend on the kind of knowledge being translated, the naturalistic representation of that knowledge, and the hypothesis space. The hypothesis space is a necessary input to the translator since the output of the translator consists of constraints and preferences over the hypothesis space. It is necessary to know the domain being constrained in order to generate meaningful constraints, and likewise for preferences.

The dependence of the translator on the hypothesis space means that all translators for a given induction task must take the same hypothesis space as input, and output constraints and preferences over the same domain. Effectively, all of the knowledge fragments are dependent on the hypothesis space bias, which is also a knowledge fragment. These restrictions mean that each induction task will usually need its own set of translators. However, this does not mean that reuse is impossible. If there is some knowledge used by several tasks, and the knowledge has the same naturalistic representation in all of the tasks, and the tasks use the same hypothesis space, then the same translator can be used for that knowledge in all of the tasks.

2.2.1.1 Examples of Translators

Specifications of translators for common knowledge sources are shown below. In the following, H is the hypothesis space, pos is a positive example, neg is a negative example, and T is a domain theory. A domain theory is a collection of Horn-clause rules that explain why an instance is an example of the target concept.

• PosExample(H, pos) → (H, C, {}) where C = {h ∈ H | h covers pos}
(e.g., FOCL [Pazzani and Kibler, 1992], AQ11 [Michalski, 1978], CN2 [Clark and Niblett, 1989], CEA [Mitchell, 1982])

Every noise-free positive example is covered by the target concept. No hypothesis that fails to cover a noise-free positive example can be the target concept. A noise-free positive example is therefore translated as a constraint that is only satisfied by hypotheses that cover the example.
This bias is known as strict consistency with the examples, and is used by many existing algorithms.

• NegExample(H, neg) → (H, C, {}) where C = {h ∈ H | h does not cover neg}
(e.g., FOCL, AQ11, CN2, CEA)

The target concept does not cover noise-free negative examples. The translator for noise-free negative examples is similar to the translator for noise-free positive examples, except that the constraint is satisfied by hypotheses that do not cover the example.

• NoisyExamples(H, examples) → (H, H, P) where P = {(x, y) ∈ H × H | x is consistent with fewer of the examples than y}

In the real world, examples can contain errors, or noise. Demanding strict consistency with noisy examples could cause the target concept to be rejected, or there may be no hypotheses consistent with all of the examples. One simple approach is to assume that the examples are mostly correct, and therefore hypotheses that are consistent with most of the examples are more preferred than those consistent with only a few examples. It is difficult to find a translation for individual examples that would yield this preference when their translations are integrated. Instead, all of the examples are translated collectively into a single set of preferences. This translation is somewhat naive, but provides a simple illustration of how preferences are used, and how noisy examples can be handled. More sophisticated translators for noisy examples are discussed in Section 6.2.3.2.

• CorrectDomainTheory(H, T) → (H, C, {}) where C = {h ∈ H | h covers exactly those instances explained by T}
(e.g., EBL [DeJong and Mooney, 1986, Mitchell et al., 1986])

A complete and correct domain theory exactly describes the target concept. The theory is translated into a constraint that is satisfied only by the target concept, which is described by the theory. There is very little induction here, since the target concept has been provided. This kind of theory is more often used in speed-up learning, where the objective is not to learn a new classifier, but to make a known classifier more efficient (e.g., [DeJong and Mooney, 1986]).

• OverSpecialDomainTheory(H, T) → (H, C, {}) where C = {h ∈ H | h is a generalization of domain theory T}
(e.g., IOE [Flann and Dietterich, 1989])

An overspecial domain theory can only describe why some of the instances covered by the target concept are examples of the target. The theory is a specialization of the target concept. The theory is translated as a constraint satisfied by all generalizations of the concept described by the theory. See Section 5.2.3 and Section 6.2.3.3 for more detailed domain theory translators.

A given fragment of knowledge can have several translations, depending on how the knowledge is to be used. One can think of the intended use as knowledge that is provided implicitly to the translator. For example, there are at least two translations for domain theories, depending on what assumptions are made about the correctness of the theory. Likewise, examples also have different translations, depending on whether the examples are assumed to be noisy or error free. Executable sketches of two of these translators appear below.
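As a concrete illustration, here is a minimal Python sketch (assumed code, not from the dissertation) of the PosExample and NoisyExamples specifications above, using the extensional tuples from Section 2.1. The covers predicate and the interval hypothesis language are stand-ins for whatever semantics a task defines.

    # Hedged sketch of two translator specifications. "covers" is a
    # task-specific predicate; integer intervals serve as a toy
    # hypothesis language. These helpers are illustrative, not RS-KII code.

    def covers(h, instance):
        lo, hi = h
        return lo <= instance <= hi

    def pos_example(H, pos):
        """PosExample: constraint satisfied only by hypotheses covering pos."""
        C = {h for h in H if covers(h, pos)}
        return (H, C, set())                 # no preferences

    def noisy_examples(H, examples):
        """NoisyExamples: prefer hypotheses consistent with more examples.
        examples is a list of (instance, label) pairs, translated
        collectively rather than one example at a time."""
        def score(h):
            return sum(covers(h, x) == label for x, label in examples)
        P = {(a, b) for a in H for b in H if score(a) < score(b)}
        return (H, set(H), P)                # no constraints: C = H

    H = {(lo, hi) for lo in range(4) for hi in range(4) if lo <= hi}
    _, C, _ = pos_example(H, pos=1)
    print(sorted(C))  # all intervals [lo, hi] with lo <= 1 <= hi

Note how noisy_examples takes the whole example collection as a single input; this is exactly the dependent-unit behavior discussed in the next subsection.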
2.2.1.2 Independent and Dependent Knowledge Translation

A collection of knowledge fragments that can be translated independently of each other are said to be independent. The only input to a translator for an independent fragment is the fragment itself. There are no truly independent knowledge fragments in KII, since each fragment is translated into constraints and preferences over the hypothesis space, so both the fragment and the hypothesis space are necessary inputs to the translator. However, if each fragment is paired with the hypothesis space to form a new unit, then these units can all be translated independently.

Knowledge fragments that can only be translated in conjunction with each other are said to be dependent. This can occur either because the knowledge fragments are inherently dependent, or because the set representations for C and P cannot express the constraints and preferences for the individual fragments, but can express the constraints and preferences imposed by the fragments collectively. A translator for a collection of dependent knowledge fragments takes the whole collection as input and produces a single (H, C, P) tuple. The inputs are termed a dependent unit.

Ideally, the dependencies among the knowledge are minimal, so that each dependent unit consists of only a few knowledge fragments. Every knowledge fragment is dependent on the hypothesis space, so there is at least one unit per knowledge fragment, each of which contains the fragment and the hypothesis space. Additional dependencies can decrease the number of units and increase the number of fragments in each, until at maximum dependence there is a single unit containing all of the knowledge.

Independence among the knowledge leads to greater flexibility in deciding what knowledge to utilize, and ensures that knowledge integration occurs within KII's integration operator and not within the translators. When a collection of knowledge fragments is dependent, it is not possible to translate only some of the fragments in the collection (unless they participate in other dependent units as well). It is often an all or none proposition. This constrains the choices of what knowledge to utilize. The greater the independence among the knowledge, the smaller the dependent units tend to be, and the fewer constraints there are. At maximum independence, each unit consists of a knowledge fragment and the hypothesis space, so individual fragments can be utilized or omitted as desired.

When the knowledge fragments are independent, they can be translated and integrated incrementally. However, this does not mean that induction in KII is necessarily incremental. In order to induce a hypothesis from a body of knowledge in KII, the knowledge is translated and integrated, and a hypothesis is enumerated from the deductive closure of the knowledge. The translation and integration steps may be incremental when the knowledge is independent, but the enumeration step may not be. The enumeration algorithm may have to start over when new knowledge is integrated, instead of picking up where it left off. This subject is discussed further in Section 2.2.5.

Independence among the knowledge also ensures that the integration work occurs within KII, and not within the translators. Independent knowledge fragments are translated independently, and the (H, C, P) tuples for the individual fragments are composed by the Integrate operator into a single (H, C, P) tuple. The integration work occurs within KII. Collections of dependent knowledge fragments are translated directly into a single (H, C, P) tuple. The integration of the knowledge occurs within the translator, and not within KII.
When most of the induction process occurs outside of KII, the power of KII cannot be brought to bear in facilitating the integration of new knowledge. In the extreme case there is a single translator for all of the knowledge, and all of the integration takes place within that translator. The translator is effectively a special-purpose integration algorithm replacing KII's integration operator. In order to utilize additional knowledge, a new translator has to be constructed. KII's knowledge integration capabilities are circumvented.

Independence of the knowledge is clearly desirable, but not always easy to obtain. Some knowledge is inherently dependent, and must always be translated as a unit. Other knowledge is dependent in one representation, but independent in another. Independence tends to increase with the expressiveness of the C and P representations, since it is more likely that translations of individual knowledge fragments can be expressed. However, complexity also increases with expressiveness, so it may be necessary to trade some independence among the knowledge for improved complexity.

2.2.2 Integration

Two COPs can be integrated, yielding a new COP whose solution set is the deductive closure of the knowledge represented by the two COPs being integrated. The integration operation is defined as follows. Both COPs must have the same hypothesis space, since it makes no sense to integrate COPs whose constraints and preferences are over entirely different universes of hypotheses.

    Integrate((H, C1, P1), (H, C2, P2)) = (H, C1 ∩ C2, P1 ∪ P2)    (2.2)

The C and P sets of each COP represent reject knowledge and preference knowledge respectively. If a hypothesis is rejected by either COP, then it cannot be the target concept. Thus the constraints are conjoined, which corresponds to intersecting the C sets. Preference knowledge, by contrast, is unioned. If one COP has no preference between hypotheses a and b, but the other COP prefers a to b, then a should be preferred over b. This corresponds to computing the union of the preferences in the two P sets.
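In the extensional sketch used above, Equation 2.2 is essentially a one-liner. The following illustrative code (again assumed, not from the dissertation) also shows how integration folds incrementally over a list of translated fragments.

    # Equation 2.2 over extensional tuples: conjoin constraints,
    # union preferences. Both arguments must share the hypothesis space.

    def integrate(cop1, cop2):
        H1, C1, P1 = cop1
        H2, C2, P2 = cop2
        assert H1 == H2, "COPs must share the same hypothesis space"
        return (H1, C1 & C2, P1 | P2)

    def integrate_all(H, cops):
        """Fold a list of translated fragments into one composite COP."""
        result = (H, set(H), set())   # the empty bias: no knowledge yet
        for cop in cops:
            result = integrate(result, cop)
        return result

Because integrate is associative and commutative, fragments can be integrated in any order as they arrive, which is what makes the translation and integration steps incremental.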
The tuple (H, C, P) is the result of integrating the tuples generated by the translators. C is the intersection of several constraint sets, and P is a union of preference sets. Equation 2.3 assumes that P is transitively closed and a partial order. However, neither of these conditions is guaranteed by the integration operator. The preference sets produced by the translators must be transitively closed partial orders, but these conditions are not preserved by union. If P1 has the preference a < b and P2 has the preference b < a, then P1 ∪ P2 contains a cycle (a < b and b < a). A partial ordering is antisymmetric and irreflexive by definition, so the union is not a partial ordering. Transitive closure is not preserved by union either: if a < b and c < d are in P1, and b < c is in P2, then P1 ∪ P2 will not contain a < d, even though this is in the transitive closure.

The lack of transitive closure is dealt with by transitively closing P prior to calling Enumerate. The implementation of the closure operation depends on the set representation for P. Cycles in P are more problematic. Currently, KII simply assumes that the preferences are consistent, so that P has no cycles. If P does contain cycles, then hypotheses participating in the cycles are all dominated, and not part of the solution set. In other words, hypotheses about which there is conflicting knowledge are rejected, but the remaining hypotheses are considered normally for membership in the solution set.

Contradictions can also occur among the constraint knowledge. If the constraints are mutually exclusive, then C is empty, and so is the solution set. KII tolerates such conflicts, but does not have any sophisticated strategy for dealing with them. The solution space simply collapses, and it is up to the user to try a different set of knowledge. A more sophisticated approach would be to detect mutually exclusive constraints and cycles in the preferences, either as part of integration or prior to enumeration, and resolve the conflicts according to a conflict resolution strategy. This is an area for future research.

2.2.4 Solution-set Queries

Solution-set queries return information about the deductive closure of an (H, C, P) tuple. There are four solution-set queries, based on those Hirsh proposed for version spaces [Hirsh, 1992]. As discussed earlier, version spaces represent deductive closures of knowledge, so these queries should also be appropriate for KII. The four queries are defined as follows. In these definitions, (H, C, P) is a COP, and h is some hypothesis in H.

• Member(h, (H, C, P)) → Boolean
  Returns true if h is a member of the solution set to (H, C, P), and false otherwise. Member(h, (H, C, P)) is equivalent to h ∈ C and ({h} × C) ∩ P = ∅.

• Empty((H, C, P)) → Boolean
  Returns true if the solution set to (H, C, P) is empty, and false otherwise.

• Unique((H, C, P)) → Boolean
  Returns true if the solution set to (H, C, P) contains exactly one member, and false otherwise.

• Subset((H, C, P), A) → Boolean
  A is a set of hypotheses. Returns true if SolutionSet((H, C, P)) ⊆ A, and false otherwise. Subset((H, C, P), A) ⇔ SolutionSet((H, C, P)) ∩ complement(A) = ∅.

The queries take (H, C, P) as an argument instead of the solution set itself, because it may be possible to answer the query without computing the entire solution set, which could yield significant computational savings. Taking (H, C, P) as an argument also allows the queries to be implemented in terms of Enumerate, which facilitates both implementation of the queries and analysis of their computational complexity. The computational complexity of the queries is essentially the complexity of enumerating one or two hypotheses. Whether enumerating a few hypotheses is significantly cheaper than computing the entire solution set depends on the set representation and on the COP itself. The computational complexity of enumeration as a function of set representations is discussed in Chapter 3, and the question of whether enumeration is cheaper than computing the whole solution set is discussed in Chapter 4. The queries are defined in terms of Enumerate as follows.

• Member(h, (H, C, P)) ⇔ Enumerate((H, C, P), {h}, 1) ≠ ∅
• Empty((H, C, P)) ⇔ Enumerate((H, C, P), H, 1) = ∅
• Unique((H, C, P)) ⇔ |Enumerate((H, C, P), H, 2)| = 1
• Subset((H, C, P), A) ⇔ Enumerate((H, C, P), complement(A), 1) = ∅
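Continuing the extensional sketch above, the closure step and the four queries look as follows. The names are ours; empty(), unique(), and subset() take the hypothesis space explicitly since the extensional stand-in has no implicit universe.

    def transitively_close(p):
        # Re-establish transitive closure after integration: repeatedly add
        # (x, z) whenever (x, y) and (y, z) are both present.
        closed = set(p)
        while True:
            new = {(x, z) for (x, y) in closed for (y2, z) in closed if y == y2}
            if new <= closed:
                return closed
            closed |= new

    def member(h, cop):
        return len(enumerate_cop(cop, {h}, 1)) != 0

    def empty(cop, hspace):
        return len(enumerate_cop(cop, hspace, 1)) == 0

    def unique(cop, hspace):
        return len(enumerate_cop(cop, hspace, 2)) == 1

    def subset(cop, a, hspace):
        # SolutionSet is a subset of A iff nothing can be enumerated from the
        # complement of A (computed here against the explicit hypothesis space).
        return len(enumerate_cop(cop, hspace - a, 1)) == 0

Note that with the solutions() sketch given earlier, hypotheses on a preference cycle dominate one another and drop out of the solution set, matching the conflict behavior described above.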
It is conjectured that these four queries plus the enumeration operator are sufficient for the vast majority of induction tasks. Most existing induction algorithms involve only the enumeration operator and perhaps an Empty or Unique query. The candidate elimination algorithm [Mitchell, 1982] and incremental version space merging (IVSM) [Hirsh, 1990] use all four queries, but do not use the enumeration operator.

2.2.5 Incremental versus Batch Processing

In an induction algorithm, a hypothesis is induced from all of the knowledge integrated so far. Inducing the hypothesis involves some amount of work. When new knowledge is integrated, the old induced hypothesis may no longer be valid in light of the new knowledge, and a new hypothesis should be induced. In an incremental algorithm, the work done in inducing a hypothesis from the old knowledge can be applied towards inducing the new hypothesis. For example, in the candidate elimination algorithm [Mitchell, 1982], the version space represents all of the hypotheses consistent with the examples. Any of these hypotheses can be selected as the induced hypothesis. The work done in computing the version space is conserved in computing a new version space consistent with the old examples plus a new example: the new version space is computed by removing inconsistent hypotheses from the old version space, and does not need to be computed from scratch.

In a batch algorithm, the old work cannot be applied to inducing a new hypothesis; the algorithm must start over again from scratch. For example, in AQ-11 [Michalski, 1978], the hypothesis is found by a beam search guided by an information-theoretic measure, which is derived from the examples. If a new example is added, the metric changes, and the search must start from the beginning with the new metric.

KII is neither inherently batch nor inherently incremental. Rather, individual instantiations of KII can be incremental or batch, depending on the set representation and the translators. These factors can be manipulated and analyzed, which makes KII a potentially useful tool for experimenting with these factors, and may suggest ways of making algorithms more incremental.

A hypothesis is induced from a collection of knowledge by translating the fragments into (H, C, P) tuples, integrating the tuples into a single (H, C, P) tuple, and then enumerating a hypothesis from the solution set of this composite tuple.
When new knowledge is added, it is translated into (H, C′, P′), integrated with the old (H, C, P) tuple to yield (H, C ∩ C′, P ∪ P′), and a new hypothesis is enumerated from (H, C ∩ C′, P ∪ P′). There are two places where work done in processing the old knowledge can be saved or lost when new knowledge becomes available. The work done in enumerating a hypothesis from the solution set of the (H, C, P) tuple can be applied to enumerating a hypothesis from the solution set of (H, C ∩ C′, P ∪ P′). Work done in integrating the old knowledge can either be saved or lost in integrating the new knowledge fragment, (H, C′, P′).

Whether any of the work involved in enumerating a hypothesis from the solution set of (H, C, P) can be applied to enumerating a hypothesis from the solution set of (H, C ∩ C′, P ∪ P′) depends largely on the set representations for C and P, and the extent of the changes introduced by C′ and P′. As an extreme example, if P and P′ are both empty, then the solution set to (H, C, P) is just C, and the solution set to (H, C ∩ C′, P ∪ P′) is just C ∩ C′. A hypothesis can be enumerated from C ∩ C′ by continuing the enumeration of C until a hypothesis is found that is also in C′. There is no need to start the enumeration of C over from the beginning, since any hypothesis not in C is not in C ∩ C′ either. All of the previous work is saved.

The second place where KII can either save or lose previous work is in integrating knowledge. Work is lost only if the translation of the existing knowledge depends on the new knowledge. This usually occurs in translators where one of the inputs takes all knowledge of a given type (e.g., all the examples). When a new knowledge fragment of this type is added, the old fragments must be retranslated to take the new fragment into account. However, the old translation has already been integrated. If this translation is not retracted before the new translation is integrated, then the old and new translations may conflict. Since KII has no facility for retracting translations, the only way to retract the old translation is to re-integrate all of the knowledge from scratch, omitting the offending translation. The previous integration work is lost, as is any enumeration work.

Not all translators of this type necessarily require the old translation to be retracted when new knowledge arrives. Consider a translator, tran(E), that takes as input the set of all available examples. Let E be a collection of examples, and e a new example. Furthermore, let tran(E) be (H, C, P) and tran(E ∪ {e}) be (H, C*, P*). If (H, C, P) has already been integrated, then it usually must be retracted before integrating (H, C*, P*). However, consider what happens if (H, C*, P*) can be expressed as (H, C ∩ Ce, P ∪ Pe), where (H, Ce, Pe) is derived from e (and possibly E). The translator can simply output (H, Ce, Pe). This will be integrated with the previously integrated (H, C, P), producing (H, C*, P*), which is the correct translation of E ∪ {e}. There is no need to retract (H, C, P) first.
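The "P empty" case above can be illustrated with a toy Python sketch (all names and data hypothetical): a lazy enumeration of C is simply filtered by each newly integrated constraint set, so hypotheses already rejected are never revisited and the enumeration resumes rather than restarts.

    c_list = ['h1', 'h2', 'h3', 'h4']    # some enumeration order for C
    c_prime = {'h3', 'h4'}               # constraint set C' of a new fragment
    filters = []                         # grows as new knowledge is integrated

    def surviving(hypotheses):
        for h in hypotheses:
            if all(h in f for f in filters):
                yield h

    gen = surviving(iter(c_list))
    first_h = next(gen)        # 'h1': induced from C alone
    filters.append(c_prime)    # new knowledge arrives: C becomes C intersect C'
    second_h = next(gen)       # 'h3': enumeration resumes; 'h1' is not re-examined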
2.3 KII Solves an Induction Task

An example of how KII can solve a simple induction task is given below. Sets have been represented extensionally in this example for illustrative purposes. This is not the only possible set representation, and is generally a very poor one. The space of possible set representations is investigated more fully in Chapter 3.

2.3.1 Hypothesis Space

The target concept is a member of a hypothesis space in which hypotheses are described by conjunctive feature vectors. There are three features: size, color, and shape. The values for these features are size ∈ {small, large, any-size}, color ∈ {black, white, any-color}, and shape ∈ {circle, rectangle, any-shape}. A hypothesis is some assignment of values to features, for a total of 27 distinct hypotheses. Hypotheses are described as 3-tuples from size × color × shape. For shorthand identification, a value is specified by the first character of its name, except for the "any-" values, which are represented by a "?". So the hypothesis (any-size, white, circle) would be written as (?, w, c) or just ?wc.

2.3.2 Instance Space

Instances are the "ground" hypotheses in the hypothesis space. An instance is a tuple (size, color, shape) where size ∈ {small, large}, color ∈ {black, white}, and shape ∈ {circle, rectangle}.

A hypothesis covers an instance if the instance is a member of the hypothesis. Recall that a hypothesis is a subset of the instance space, as described in some language. In this learning scenario, an instance is a member of (covered by) all hypotheses that are more general than the instance. Specifically, an instance (z, c, s) is covered by every hypothesis in the set {z, any-size} × {c, any-color} × {s, any-shape}.

2.3.3 Available Knowledge

The available knowledge consists of three examples (classified instances), and an assumption that accuracy increases with generality. There are three examples, two positive and one negative, as shown in Table 2.1. The target concept is (s, ?, ?). That is, size = small, and color and shape are irrelevant.

    identifier   class   example
    e1           +       (small, white, circle)
    e2           +       (small, black, circle)
    e3           -       (large, white, rectangle)

    Table 2.1: Examples.

2.3.4 Translators

The first step is to translate the knowledge into constraints and preferences. Three translators are constructed, one for each type of knowledge: the positive examples, the negative examples, and the generality preference. These translators are shown in Figure 2.1. As with all translators in KII, each of these translators takes the hypothesis space, H, as one of its inputs. Since the hypothesis space is understood, (H, C, P) tuples will generally be referred to as just (C, P) tuples for the remainder of this illustration.

    • TranPosExample(H, (z, c, s)) → (C, {}) where
        C = {x ∈ H | x covers (z, c, s)}
          = {z, any-size} × {c, any-color} × {s, any-shape}

    • TranNegExample(H, (z, c, s)) → (C, {}) where
        C = {x ∈ H | x does not cover (z, c, s)}
          = complement of {z, any-size} × {c, any-color} × {s, any-shape}

    • TranPreferGeneral(H) → (H, P) where
        P = {(x, y) ∈ H × H | x is more specific than y}
          = {(sbr, ?br), (sbr, s?r), (sbr, ??r), (swr, ?wr), ...}

    Figure 2.1: Translators.

The examples are translated in this scenario under the assumption that they are correct. Under this assumption, hypotheses that do not cover all of the positive examples, or that cover any of the negative examples, can be eliminated from consideration. Beyond these constraints, examples do not express preferences for certain hypotheses over others. Thus, positive examples are translated into (C, P) pairs in which P is empty and C is satisfied only by hypotheses that cover the example. Negative examples are translated similarly, except that C is satisfied by hypotheses that do not cover the example.
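The translators of Figure 2.1 are direct to sketch for an extensional hypothesis space. In the following Python sketch (names ours), hypotheses and instances are (size, color, shape) tuples of one-character values.

    ANY = '?'

    def covers(hypothesis, instance):
        # A conjunctive feature vector covers an instance when every feature
        # value matches exactly or is the "any" value.
        return all(h == v or h == ANY for h, v in zip(hypothesis, instance))

    def tran_pos_example(hspace, example):
        # TranPosExample: keep exactly the hypotheses covering the example.
        return ({h for h in hspace if covers(h, example)}, set())

    def tran_neg_example(hspace, example):
        # TranNegExample: keep exactly the hypotheses not covering the example.
        return ({h for h in hspace if not covers(h, example)}, set())

    def more_specific(x, y):
        # x is more specific than y if y is x with some values replaced by "any".
        return x != y and all(a == b or b == ANY for a, b in zip(x, y))

    def tran_prefer_general(hspace):
        # TranPreferGeneral: rejects nothing (C = H); prefers the more general y.
        return (set(hspace), {(x, y) for x in hspace for y in hspace
                              if more_specific(x, y)})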
The translated examples are shown in Table 2.2. The hypothesis space is omitted for brevity.

The assumption that general hypotheses are more accurate is represented as a preference. Specifically, the assumption is translated into a (C, P) pair where C is all of H (it rejects nothing), and P = {(x, y) ∈ H × H | x is more specific than y}, where H is the hypothesis space. Hypothesis x is more specific than hypothesis y if x is equivalent to y except that some of the values in y have been replaced by "any" values. For example, (s, w, r) is more specific than (?, w, r), but there is no ordering between (?, w, c) and (s, w, r). There are too many preferences to list explicitly, but the first few are shown in the definition of the TranPreferGeneral translator in Figure 2.1.

    Example    Class   C
    e1: swc    +       swc, sw?, s?c, s??, ?wc, ?w?, ??c, ???
    e2: sbc    +       sbc, sb?, s?c, s??, ?bc, ?b?, ??c, ???
    e3: lwr    -       swc, swr, sw?, sbc, sbr, sb?, s?c, s?r, s??,
                       lwc, lbc, lbr, lb?, l?c, ?wc, ?bc, ?br, ?b?, ??c

    Table 2.2: Translated Examples.

2.3.5 Integration and Enumeration

The COPs are integrated into a single COP, (C, P). C is the intersection of the C sets for each knowledge fragment, and P is the union of preferences for each knowledge fragment. This is shown in the following equation. The elements of each (C, P) pair are subscripted with the knowledge fragment from which the pair was translated: e1, e2, and e3 for the examples, and mg for the "more general" preference. In the following, (C1, P1) ⊗ (C2, P2) is shorthand for Integrate((C1, P1), (C2, P2)).

    (C, P) = (Ce1, {}) ⊗ (Ce2, {}) ⊗ (Ce3, {}) ⊗ (H, Pmg)
           = (Ce1 ∩ Ce2 ∩ Ce3 ∩ H, {} ∪ {} ∪ {} ∪ Pmg)
           = ({s??, ??c, s?c}, {(sbr, ?br), (sbr, s?r), ...})

A hypothesis is induced by making a call to Enumerate((C, P), H, 1), which returns a hypothesis selected arbitrarily from the deductive closure of (C, P). The deductive closure consists of the undominated elements of C with respect to the dominance relation P. One way to compute this set is to partially order C according to P and find the elements at the top of the order. C contains three elements: (s, ?, ?), (?, ?, c), and (s, ?, c). P prefers both (s, ?, ?) and (?, ?, c) to (s, ?, c), but there is no preference ordering between (s, ?, ?) and (?, ?, c). The deductive closure therefore contains the undominated elements of C, namely (s, ?, ?) and (?, ?, c). Figure 2.2 illustrates this computation; the arrows indicate the partial ordering imposed by P, and the highlighted hypotheses are the elements of C.

    Figure 2.2: Computing the Deductive Closure of (C, P).

Enumerate((C, P), H, 1) returns one hypothesis selected arbitrarily from the deductive closure of (C, P) as the induced hypothesis. If it selects (s, ?, ?), then it has correctly induced the target concept. However, it may just as easily select (?, ?, c), the other element of the deductive closure. The selection of a hypothesis is an inductive leap, and not guaranteed to be correct. The amount of ambiguity in the leap can be reduced by utilizing more examples or other knowledge, thus reducing the number of hypotheses in the deductive closure.
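Chaining the earlier sketches together reproduces this computation end to end (assuming the integrate and solutions functions of Section 2.2 and the translator sketches of Section 2.3.4):

    from itertools import product

    H = set(product('sl?', 'wb?', 'cr?'))        # all 27 hypotheses

    e1 = tran_pos_example(H, ('s', 'w', 'c'))    # (small, white, circle), +
    e2 = tran_pos_example(H, ('s', 'b', 'c'))    # (small, black, circle), +
    e3 = tran_neg_example(H, ('l', 'w', 'r'))    # (large, white, rectangle), -
    mg = tran_prefer_general(H)

    cop = integrate(integrate(integrate(e1, e2), e3), mg)
    print(sorted(solutions(cop)))
    # [('?', '?', 'c'), ('s', '?', '?')]  -- the two undominated hypotheses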
Additional information about the deductive closure of (C, P) that is relevant to induction can be obtained through the solution-set queries. For example, we may wish to know whether or not the induced hypothesis was the only one in the deductive closure. This and other queries are shown below, applied to the (C, P) pair whose deductive closure is {(s, ?, ?), (?, ?, c)}.

• Unique((C, P)) = false
• Empty((C, P)) = false
• Member((?, ?, c), (C, P)) = true

Chapter 3
Set Representations

The KII framework discussed in Chapter 2 is not operational. Knowledge in KII is represented as (C, P) pairs, where C is a set of hypotheses and P is a set of hypothesis pairs. The integration, enumeration, and query operators are all defined in terms of set operations on (C, P) pairs. KII does not specify any particular set representation for C and P; in fact, these sets may be arbitrarily expressive. This expressiveness allows KII to utilize any knowledge that can be expressed in terms of constraints and preferences. However, arbitrarily expressive sets are not operational, and so neither is KII.

In order to operationalize KII, operational set representations must be provided for C and P. These representations determine which C and P sets can be expressed, and therefore the kinds of knowledge that KII can utilize. The more expressive the representation, the more knowledge KII can use. However, set operations in more expressive representations tend to have higher computational complexities. Since KII's integration, enumeration, and query operators are defined in terms of set operations, their computational complexity is determined by the set representation.

KII can be operationalized with many different set representations. The expressiveness of the representation, and the computational complexity of its set operations, determine the knowledge that the operationalization can utilize, and the cost of its integration, enumeration, and query operations. If expressiveness is at a premium, one could select a very expressive set representation in exchange for increased complexity. If speed is of the essence, an inexpressive yet low-complexity representation may be more appropriate. There may also be a representation that blends the best of both worlds: expressive enough to represent most knowledge, yet with a reasonable computational cost.

In order to understand the space of operationalizations, it is necessary to understand the space of possible set representations. Section 3.1 casts the space of set representations in terms of the well-understood space of grammars.

The set representations for C and P should also be closed under integration, although this is not strictly necessary. This condition is discussed in Section 3.2, and languages that are closed under integration are identified.

The complexity of induction also restricts the choice of set representations for C and P. Operationalizations of KII with expressive representations tend to have high induction costs, and for some representations, induction may even be undecidable. Clearly, such representations cannot be used to operationalize KII. The most expressive representations for which induction is decidable within the KII framework are identified in Section 3.3.

3.1 Grammars as Set Representations

Every computable set is the language of some grammar. Similarly, every computable set representation is equivalent to some family of grammars. These families include, but are not limited to, the families of the Chomsky hierarchy [Chomsky, 1959]: regular, context free, context sensitive, and recursively enumerable.
Table 3.1 lists the language families in the Chomsky hierarchy in addition to other important language families. The list is in order of decreasing expressiveness, with the most expressive languages at the top and the least expressive at the bottom. The complexity of set operations generally increases with the expressiveness of the language. Each family in the list properly contains all of the families below it. Every computable set representation either corresponds to one of the families in Table 3.1, or is subsumed by one of them. Mapping set representations onto families of grammars clarifies the space of possible representations, and makes it possible to use results from automata and formal language theory to analyze relevant properties of a set representation, such as expressiveness, and the decidability and computational complexity of set operations.

    Recursively Enumerable (r.e.)
    Recursive
    Context Sensitive (CSL)
    Context Free (CFL)
    Deterministic Context Free (DCFL)
    Regular
    Finite

    Table 3.1: Language Families.

3.2 Closure of C and P under Integration

The integration operator is defined in terms of intersection and union, as shown below.

    Integrate((C1, P1), (C2, P2)) = (C1 ∩ C2, P1 ∪ P2)    (3.1)

The representation for C must be closed under intersection, and the representation for P must be closed under union. This ensures that the (C, P) pair resulting from integration can be expressed in these representations. Table 3.2 indicates which of the languages in Table 3.1 are closed under intersection and union. Some languages are not closed under intersection, but are closed under intersection with regular languages. Intersection with a regular language is denoted as ∩R.

    Operation   Regular   DCFL   CFL   CSL   recursive   r.e.
    ∩           ✓                       ✓     ✓           ✓
    ∩R          ✓         ✓      ✓      ✓     ✓           ✓
    ∪           ✓                ✓      ✓     ✓           ✓

    Table 3.2: Closure under Union and Intersection.

3.3 Computability of Induction

In order to induce a hypothesis, or to answer the solution-set queries, it is necessary to enumerate one or more hypotheses from the deductive closure, or solution set, of (C, P). This is only possible if the solution set is at most recursively enumerable. Otherwise, the solution set cannot be enumerated by any Turing machine, and therefore the enumeration and query operations are uncomputable. Whether or not the solution set is recursively enumerable depends on the set representations for C and P. Recall that the solution set is computed from C and P according to the following equation:

    complement(first((C × C) ∩ P)) ∩ C    (3.2)

A set representation is a class of grammars, and a specific set corresponds to a grammar in the class. Closing the set representations for C and P under Equation 3.2 must yield at most the recursively enumerable languages. Otherwise, the representations can express C and P sets for which the solution set is not recursively enumerable.

It is possible to determine the most expressive set representations for C and P that will yield a recursively enumerable solution set for every C and P set expressible in those representations. The closure properties of the set operations in Equation 3.2 can be used to express the solution set representation as a function of the representations for C and P. Inverting this function yields the most expressive representations for which the solution set is recursively enumerable.
3.3.1 Closure Properties of Set Operations

The closure properties of intersection and complement are well known for most language classes, although it is an open problem whether the context sensitive languages are closed under complementation [Hopcroft and Ullman, 1979]. These properties are summarized in Table 3.3. Cartesian product and projection (first) present more of a problem. A Cartesian product can be represented a number of different ways, each with different closure properties and expressive power. The representations differ in the way that a pair, (x, y), in the Cartesian product is mapped onto a string in the grammar that represents the product. Projection is the inverse of Cartesian product, and its definition depends on the mapping used to represent Cartesian products.

    Operation    Regular   DCFL   CFL   CSL   recursive   r.e.
    ∩            ✓                       ✓     ✓           ✓
    complement   ✓         ✓             ?     ✓

    Table 3.3: Closure Properties of Languages under Intersection and Complement.

The most straightforward mapping is concatenation. That is, a tuple (x, y) in the Cartesian product is represented as the string xy in the grammar for the product. Under this mapping, the set A × B is represented as A·B. Other mappings interleave the symbols in xy in various ways in order to represent the pair (x, y). For example, the symbols in x and y could be strictly alternated, or shuffled, so that (x, y) is represented as x1y1x2y2...xnyn, where x = x1x2...xn and y = y1y2...yn. An interleaving maintains the order of the symbols in each of the component strings, so that if xi comes before xi+1 in x, then xi comes before xi+1 in the interleaving as well. However, there may be an arbitrary number of symbols from y between xi and xi+1.

For a given Cartesian product, A × B, not all mappings of A × B can be represented in the same language class. For example, context free grammars are closed under concatenation, but not under shuffling [Hopcroft and Ullman, 1979]. If A and B are context free grammars, then under the concatenation mapping, A × B is represented as AB, which is also a context free grammar. Under the shuffle mapping, A × B is represented by the shuffle of A and B, which could be a context sensitive grammar. Whether a language class is closed under Cartesian product thus depends on the mapping. The expressive power of various mappings is discussed in more detail in Section 4.1.2.4.

Although the closure properties of languages under Cartesian product depend on the mapping, the closure properties of languages under projection are fixed. Let X be some subset of A × B. Regardless of the mapping, X will be represented as some interleaving of the strings in A and B. Let ΣA be the alphabet for A and let ΣB be the alphabet for B. The string in X corresponding to (x, y) is some interleaving of the symbols in x and y; that is, (x, y) maps onto a string in (ΣA ∪ ΣB)*. Call this string w. Projecting w onto its first element extracts x from the interleaved string. If ΣA and ΣB are disjoint, this can be done by deleting from the string the symbols that belong to ΣB. Mappings preserve the order of the symbols in x and y, so the remaining symbols in the string are the symbols of x in the correct order.
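As a small concrete illustration of the interleaving just described, the following Python sketch (ours) computes the shuffle of two strings; the tagged variant used in Chapter 4 differs only in prefixing each symbol with an identifier.

    def shuffle_pair(x, y):
        # Interleave x and y symbol by symbol; once the shorter string is
        # exhausted, the tail of the longer string follows unchanged.
        n = min(len(x), len(y))
        return ''.join(a + b for a, b in zip(x, y)) + x[n:] + y[n:]

    assert shuffle_pair('abc', 'xyz') == 'axbycz'
    assert shuffle_pair('abcd', 'xy') == 'axbycd'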
If the alphabets of A and B are not disjoint, they can be made disjoint by applying appropriate homomorphisms [Hopcroft and Ullman, 1979]. Applying a homomorphism, h, to a language, L, replaces each symbol in the alphabet of L with a string from some other alphabet (each symbol can be replaced by a different string). The strings in h(L) are the same as in L, except that the symbols in each string have been substituted according to h. For example, if h replaces a by 1 and replaces b by 2, then h({ab, bb}) is {12, 22}.

Let ΣA be the alphabet of A and let ΣB be the alphabet of B. If ΣA and ΣB are not disjoint, define homomorphisms h1 and h2 such that h1 maps symbols in ΣA into corresponding symbols in ΣA′, and h2 maps symbols in ΣB to corresponding symbols in ΣB′. The alphabets ΣA′ and ΣB′ are selected so as to be disjoint. The Cartesian product of A and B is defined to be some interleaving of the strings in h1(A) and h2(B). In the case of the solution set formula, the only Cartesian product is C × C, which is represented by some interleaving of the strings in h1(C) and h2(C).

Let X be a subset of h1(A) × h2(B), where h1 and h2 are defined as described above. The projection first(X) is computed by erasing from strings in X the symbols that belong to ΣB′, where ΣB′ is the alphabet of h2(B). This is accomplished by a homomorphism that maps every symbol in ΣB′ to the empty string, ε. The resulting strings are members of Σ*A′, where ΣA′ is the alphabet of h1(A). However, strings in first(X) should be composed of symbols from ΣA, not symbols from ΣA′. A second homomorphism is applied that maps symbols in ΣA′ to corresponding symbols in ΣA; this is the inverse of h1. The last two homomorphisms can be accomplished by a single homomorphism, h′, that maps symbols in ΣB′ onto ε, and maps symbols in ΣA′ onto corresponding symbols in ΣA. This definition of projection in terms of homomorphisms is summarized in Figure 3.1. The closure properties of languages under arbitrary homomorphisms are well known: specifically, the regular, context free, and recursively enumerable languages are closed under arbitrary homomorphisms, and the rest are not.¹

    Let X be a subset of h1(A) × h2(B), where
        h1(ai) = the ith symbol in ΣA′
        h2(bi) = the ith symbol in ΣB′
        ΣA′ ∩ ΣB′ = ∅

    first(X) = h′(X), where
        h′(σi) = the ith symbol in ΣA   if σi ∈ ΣA′
        h′(σi) = ε                      if σi ∈ ΣB′

    Figure 3.1: Projection Defined as a Homomorphism.

    ¹More precisely, full trios and full AFLs are closed under arbitrary homomorphisms [Hopcroft and Ullman, 1979].

The closure properties of languages under projection, intersection, intersection with a regular grammar, and complement are summarized in Table 3.4. The closure properties of Cartesian product depend on the mapping, as discussed earlier.

    Operation                    Regular   DCFL   CFL   CSL   recursive   r.e.
    ∩                            ✓                       ✓     ✓           ✓
    ∩R                           ✓         ✓      ✓      ✓     ✓           ✓
    complement                   ✓         ✓             ?     ✓
    projection (homomorphisms)   ✓                ✓                        ✓

    Table 3.4: Closure Under Operations Needed to Compute the Solution Set.
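For finite languages, the homomorphism machinery above can be sketched in a few lines of Python (the symbol maps and example strings are ours):

    def homomorphism(h, language):
        # Replace each symbol of each string according to the map h;
        # symbols not in h are left unchanged.
        return {''.join(h.get(sym, sym) for sym in s) for s in language}

    rename = {'a': 'A', 'b': 'B'}    # h2: make the second alphabet disjoint
    erase = {'A': '', 'B': ''}       # h': map second-alphabet symbols to epsilon

    pairs = {'aAbB'}                 # shuffle of 'ab' with rename('ab')
    assert homomorphism(erase, pairs) == {'ab'}   # projection onto first element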
3.3.2 Computability of Solution Set

The closure information in Table 3.4 is sufficient to determine the most expressive classes of languages for C and P for which the solution set is recursively enumerable (computable). First, expressiveness bounds on (C × C) ∩ P are established from the r.e. bounds on the solution set, complement(first((C × C) ∩ P)) ∩ C, using the known closure properties of projection, intersection, and complement. The next step is to establish expressiveness bounds on C and P from the bounds on (C × C) ∩ P. The dependence between the closure properties of languages under Cartesian product and the mapping used to represent the product makes this somewhat imprecise, but an effective bound can still be obtained on the expressiveness of C and P.

3.3.2.1 Expressiveness Bounds on (C × C) ∩ P

The first step is to establish expressiveness bounds on (C × C) ∩ P. This is accomplished with the following theorem.

Theorem 1. complement(first((C × C) ∩ P)) ∩ C is computable (r.e.) if and only if (C × C) ∩ P is at most context free and C is at most r.e.

Proof. The proof of this theorem follows from the closure properties of the language representing (C × C) ∩ P under projection, complement, and intersection. If (C × C) ∩ P is context free, then first((C × C) ∩ P) is also context free. Closing the context free languages under complement yields the recursive languages, so the set complement(first((C × C) ∩ P)) could require a recursive language to express. Intersecting this set with C yields the solution set. Both recursive and r.e. languages are closed under intersection, so the solution set is r.e. as long as C is at most r.e. (i.e., C is not uncomputable).

If (C × C) ∩ P is not context free, then first((C × C) ∩ P) can be recursively enumerable, since this is the next most expressive language family that is closed under projection. The complement of a recursively enumerable set need not be recursively enumerable; in particular, the complement of an r.e. set that is not recursive is never r.e.² Therefore, complement(first((C × C) ∩ P)) is not computable in general. Intersecting this set with C yields the solution set, and intersecting an uncomputable set with another set in general yields an uncomputable set, so the solution set is uncomputable. □

²If a language L and its complement are both r.e., then L is recursive [Hopcroft and Ullman, 1979]. Hence for every r.e. language L that is not recursive, the complement of L is not r.e. This is stronger than simply stating that the r.e. languages are not closed under complement: among the non-recursive r.e. languages, there are no languages whose complements happen to be r.e.

3.3.2.2 Expressiveness Bounds on C and P

The expressiveness bounds on C and P can be derived from the bounds on (C × C) ∩ P established in the previous subsection. (C × C) ∩ P can be at most context free. Context free sets are not closed under intersection, so C × C and P cannot both be context free. Table 3.4 indicates that at most one of C × C and P can be context free, and the other must then be regular. This follows from the closure of context free grammars under intersection with regular grammars.

The expressiveness bound on C depends on what mapping is used to represent C × C, since different mappings yield different closure properties for languages under Cartesian product. Regular grammars are closed under Cartesian product for the concatenation mapping, and for arbitrary interleavings (e.g., shuffling). This follows from the closure of regular grammars under concatenation and interleaving [Hopcroft and Ullman, 1979]. Context free grammars are closed under concatenation, but not under arbitrary interleavings [Hopcroft and Ullman, 1979]. C can therefore be at most context free if the concatenation mapping is used, or at most regular if an interleaving mapping such as shuffle is used.

Another possible set representation is the null representation. The only set expressible in this representation is the empty set. An instantiation of KII that represents P with the null representation cannot use any preference information; the P set is always empty.
When P is represented this way, the solution set equation reduces to just C. This follows from the closure of the null representation under intersection with any set (even an uncomputable one): the intersection of any set with the empty set is just the empty set, so (C × C) ∩ P is empty, first of the empty set is empty, and the complement of the empty set intersected with C is just C. Since the solution set reduces to C when P is represented by the null representation, C can be at most recursively enumerable.

It is also possible to choose the null representation for C instead of for P. However, choosing this representation for C means that C is always empty. When C is always empty, so is the solution set, so this is not a very useful representation for C.

Table 3.5 summarizes the expressiveness bounds on C, P, (C × C) ∩ P, and the solution set, under the assumption that the representations for C and P are closed under the given mapping for Cartesian product.

    C           P           (C × C) ∩ P   complement(first((C × C) ∩ P)) ∩ C
    ≤ regular   ≤ regular   ≤ regular     ≤ regular
    ≤ regular   ≤ DCFL      ≤ DCFL        ≤ recursive
    ≤ regular   ≤ CFL       ≤ CFL         ≤ recursive
    ≤ CFL       ≤ regular   ≤ CFL         ≤ recursive
    > CFL       > CFL       ≥ CSL         uncomputable
    null        ≤ r.e.      null          null
    ≤ r.e.      null        null          ≤ r.e.

    Table 3.5: Summary of Expressiveness Bounds.

The requirement that the solution set be computable restricts the choice of languages for C and P to at most context free for one of them, and regular for the other. This holds when the mapping for Cartesian product is concatenation; for interleaving-type mappings, C can be at most regular, and P can be at most context free.

The integration operator further requires that the language for C be closed under intersection, and that the language for P be closed under union. Both the regular and context free languages are closed under union, so both of these are appropriate representations for P. Of these two languages, only the regular languages are closed under intersection, so C is limited to the regular languages. The only choice allowed by the restrictions of the integration operator is therefore to represent P as at most a context free grammar, and C as at most a regular grammar.

It is possible for C to be context free and P to be regular under certain limited conditions. Context free languages are not closed under intersection, but they are closed under intersection with regular sets. This means that at most one of the (C, P) pairs being integrated can have a context free C set, as long as the C sets of the remaining pairs are regular. Integrating all of these pairs results in a (C, P) pair in which C is context free and P is regular.

Chapter 4
RS-KII

RS-KII is an implementation of KII in which C and P are represented as regular sets. This section describes the regular set representation, and implementations of all the KII operations for regular sets: integrate, enumerate, and the solution-set queries. Translators are not discussed, since translator specifications and implementations are not fixed, but are provided by the user for each knowledge source and hypothesis space.

4.1 The Regular Set Representation

A regular set is specified by a regular grammar, and contains all strings recognized by the grammar. Regular sets are closed under intersection, union, complement, concatenation, and a number of other useful operations [Aho et al., 1974]. Since the solution set is defined in terms of these operations, the solution set is also expressible as a regular grammar. For every regular grammar there are equivalent deterministic finite automata (DFAs).
Since DFAs are easier to work with than regular grammars, regular sets are implemented as DFAs. DFAs are discussed in Section 4.1.1, and the set operations are defined in Section 4.1.2.

4.1.1 Definition of DFAs

A DFA is defined as a 5-tuple (Q, s, δ, F, Σ), where Q is a finite set of states, s ∈ Q is the start state, δ is a state transition function from Q × Σ to Q, F is a set of final (accept) states, and Σ is an alphabet of input symbols. A second move function, δ*, from Q × Σ* to Q, defined over strings instead of single symbols, can be derived from δ: δ*(q, w) returns the state reached from q by path w ∈ Σ*. A string w ∈ Σ* is accepted by a DFA iff δ*(s, w) ∈ F.

Every regular set can be represented by several equivalent DFAs. Among these DFAs there is exactly one minimal DFA [Hopcroft and Ullman, 1979]. A minimal DFA has the fewest states needed to recognize a given regular set. A non-minimal DFA can be minimized as follows.

1. Remove all states that cannot be reached from the start state.
2. Remove all states that cannot reach an accept state (these are dead states).
3. Merge equivalent states. Two states, p and q, are equivalent if for every input string w, δ*(p, w) is in F if and only if δ*(q, w) is in F. That is, the DFAs (Q, p, δ, F, Σ) and (Q, q, δ, F, Σ) recognize exactly the same strings.

There are well-known algorithms for minimizing DFAs, but we will not describe them here; see [Hopcroft and Ullman, 1979] for more information on this topic. In general, minimizing a DFA takes time O(|Σ||Q|²).
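A minimal explicit (extensional) DFA is easy to sketch in Python, with δ* derived from δ as above; the intensional variant actually used by RS-KII appears in Section 4.1.3. Names and the example machine are ours.

    class DFA:
        # The 5-tuple (Q, s, delta, F, Sigma); delta is a dict on (state, symbol).
        def __init__(self, states, start, delta, finals, alphabet):
            self.states, self.start = states, start
            self.delta, self.finals, self.alphabet = delta, finals, alphabet

        def move(self, q, w):
            # The derived move function delta*: follow path w from state q.
            for sym in w:
                q = self.delta[(q, sym)]
            return q

        def accepts(self, w):
            return self.move(self.start, w) in self.finals

    # A two-state DFA over {a, b} accepting strings with an odd number of a's.
    odd_a = DFA({0, 1}, 0,
                {(0, 'a'): 1, (1, 'a'): 0, (0, 'b'): 0, (1, 'b'): 1},
                {1}, {'a', 'b'})
    assert odd_a.accepts('ab') and not odd_a.accepts('aba')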
4.1.2 Definitions of Set Operations

The set operations used by KII for integrating (H, C, P) tuples are intersection and union. The representation for C must be closed under intersection, and the representation for P must be closed under union. The solution set is defined in terms of complement, Cartesian product, projection, intersection, and union. The representations for C and P do not need to be closed under these operations, since the solution set may require a more expressive representation than that used for either C or P. However, regular grammars happen to be closed under all of these operations. This guarantees that the solution set can also be expressed as a regular grammar. These operations are defined below for regular grammars.

Finally, an operation is needed for computing the transitive closure of P. The enumeration operator assumes that P is transitively closed, but this condition is not preserved by integration, so an operator is needed for transitively closing P prior to enumerating the solution set. This operator is described below, along with the other set operations.

4.1.2.1 Union

The union of two DFAs, A and B, results in a DFA that recognizes a string if the string is recognized by either A or B. The DFA for the union can be constructed from A and B as shown in Figure 4.1. A state in this DFA is a pair of states, (qA, qB), one from A and one from B. The move function for the union is computed from the move functions of A and B. On input σ, A ∪ B simulates a move in A from qA and a move in B from qB, both on input σ. A state (qA, qB) in the union is a final state if qA is a final state in A or if qB is a final state in B.

    (Q, s, δ, F, Σ) = (Q1, s1, δ1, F1, Σ1) ∪ (Q2, s2, δ2, F2, Σ2) where
        Q = Q1 × Q2
        s = (s1, s2)
        δ((q1, q2), σ) = (δ1(q1, σ), δ2(q2, σ))
        F = F1 × Q2 ∪ Q1 × F2
        Σ = Σ1 ∪ Σ2

    Figure 4.1: Definition of Union.

4.1.2.2 Intersection

A string is accepted by the intersection of two DFAs, A and B, if it is accepted by both A and B. The DFA for the intersection of A and B can be constructed from A and B as shown in Figure 4.2. As with union, a state in the DFA for A ∩ B is a pair of states, (qA, qB), one from A and one from B. On input σ, A ∩ B simulates a move in A from qA and a move in B from qB, both on input σ. If both qA and qB are accept states in their respective DFAs, then (qA, qB) is an accept state in the intersection.

    (Q, s, δ, F, Σ) = (Q1, s1, δ1, F1, Σ1) ∩ (Q2, s2, δ2, F2, Σ2) where
        Q = Q1 × Q2
        s = (s1, s2)
        δ((q1, q2), σ) = (δ1(q1, σ), δ2(q2, σ))
        F = F1 × F2
        Σ = Σ1 ∩ Σ2

    Figure 4.2: Definition of Intersection.

4.1.2.3 Complement

The complement of set A with respect to the universe Σ* is written Ā. A string is in Ā if and only if it is not accepted by the DFA for set A. The DFA for Ā is exactly the same as the DFA for A, except that accept states in A's DFA correspond to the reject states in Ā's DFA, and vice versa. A reject state is any state that is not an accept state. The construction for the complement of a DFA is shown in Figure 4.3.

    complement((Q, s, δ, F, Σ)) = (Q, s, δ, Q − F, Σ)

    Figure 4.3: Definition of Complement.
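Figures 4.1 through 4.3 share one product construction. The following sketch builds on the DFA class above; for simplicity it assumes both machines share an alphabet, and the helper names are ours.

    def product_dfa(a, b, is_final):
        # Run A and B in lockstep; is_final chooses union (or) vs intersection (and).
        states = {(qa, qb) for qa in a.states for qb in b.states}
        delta = {((qa, qb), sym): (a.delta[(qa, sym)], b.delta[(qb, sym)])
                 for (qa, qb) in states for sym in a.alphabet & b.alphabet}
        finals = {(qa, qb) for (qa, qb) in states if is_final(qa, qb)}
        return DFA(states, (a.start, b.start), delta, finals,
                   a.alphabet & b.alphabet)

    def union_dfa(a, b):        # Figure 4.1
        return product_dfa(a, b, lambda qa, qb: qa in a.finals or qb in b.finals)

    def intersect_dfa(a, b):    # Figure 4.2
        return product_dfa(a, b, lambda qa, qb: qa in a.finals and qb in b.finals)

    def complement_dfa(a):      # Figure 4.3: swap accept and reject states
        return DFA(a.states, a.start, a.delta, a.states - a.finals, a.alphabet)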
4.1.2.4 Cartesian Product

The Cartesian product of two sets, A and B, is the set of tuples {(a, b) | a ∈ A and b ∈ B}. A set of tuples cannot be directly represented as a regular grammar, or indeed as the language of any grammar, because the language of a grammar consists of strings, not tuples per se. A mapping between tuples and strings is required. Under mapping M, the set A × B is represented by a regular grammar whose language is {M(x, y) | (x, y) ∈ A × B}. This language is also denoted as M(A × B). The choice of mapping greatly determines which sets of tuples can be expressed by regular grammars.

One obvious mapping is concatenation. That is, the Cartesian product of two regular sets is represented by the concatenation of the two sets. Formally, M((x, y)) = xy, and M(A × B) = AB. Regular grammars are closed under concatenation, so regular grammars are also closed under Cartesian product for this mapping.

A less obvious mapping is to map (x, y) onto a string w, where w is an interleaving of the strings x and y. For example, the tuple (x1x2...xn, y1y2...yn) would be represented by the string x1y1x2y2...xnyn, where the symbols of x and y alternate. This mapping is called shuffling. Under this mapping, A × B is represented by shuffle(A, B), which is the set {shuffle(x, y) | x ∈ A and y ∈ B}. Regular grammars are closed under shuffling; a DFA for the shuffle of any two DFAs is specified at the end of this section (Figure 4.5), which proves this closure by construction.

Differences in Expressiveness. The Cartesian product of any two regular sets can be expressed as a regular set under both the concatenation and shuffle mappings. This follows from the closure of regular sets under both concatenation and shuffling. Every subset of a Cartesian product that can be expressed as a regular grammar under the concatenation mapping can also be expressed as a regular grammar under the shuffle mapping. However, some subsets can be expressed as regular grammars under the shuffle mapping, but not under the concatenation mapping. The shuffle mapping is strictly more expressive than the concatenation mapping in this sense.

Let S be a subset of A × B, where A and B are sets, but not necessarily regular sets. If S can be expressed as a regular grammar under the concatenation mapping, then S can also be expressed as a regular grammar under the shuffle mapping. To see why this is so, consider that S can be written as ⋃i Ai × Bi, where Ai ⊆ A and Bi ⊆ B. S can be expressed as a regular grammar under the concatenation mapping if and only if S can be written as a finite union of Ai × Bi pairs, where each Ai and Bi is a regular set. Under the concatenation mapping, this union is represented as ⋃(i=1..k) AiBi, where k is finite. Regular grammars are closed under concatenation and finite union, so ⋃(i=1..k) AiBi is a regular grammar. Under the shuffle mapping, S is represented as ⋃(i=1..k) shuffle(Ai, Bi). Since regular sets are closed under shuffling, this is also a regular grammar. Therefore, every subset of A × B expressible as a regular grammar under the concatenation mapping is also expressible as a regular grammar under the shuffle mapping.

The reverse is not necessarily true. Some subsets of a Cartesian product can be expressed as regular grammars under the shuffle mapping, but not under the concatenation mapping. For example, consider the set {(w, w) | w ∈ (a|b)*}. Under the concatenation mapping, M({(w, w) | w ∈ (a|b)*}) = {ww | w ∈ (a|b)*}, which is a context sensitive language [Hopcroft and Ullman, 1979]. Under the shuffle mapping, M({(w, w) | w ∈ (a|b)*}) = {shuffle(w, w) | w ∈ (a|b)*}, which is ((aa)|(bb))*.¹

An Illustration. The following is a simple example that will hopefully provide some intuition for the differences between these representations. Let H be the set of strings recognized by the regular grammar (a|b)*$, where $ indicates the end of a string. Let P be the preference {(x, y) ∈ H × H | x < y}, where < is the standard dictionary ordering (e.g., aab$ < ab$). P can be expressed under the shuffle mapping, but not under the concatenation mapping.

In the shuffle mapping, the tuple (x = x1x2..., y = y1y2...) maps onto the string x1y1x2y2.... If x1 = a and y1 = b, then x < y, so (x, y) is in P. If x1 = b and y1 = a, then y < x, so (x, y) is not in P. If x1 = y1, then a similar comparison is made between x2 and y2. If x2 and y2 are equal, then x3 and y3 are compared, and so on until the compared symbols differ, or one of the symbols is the string termination symbol. If x terminates before y, but the two strings are equal up to that point, then x comes before y in dictionary order, so (x, y) is also in P. If both terminate at the same time, or x is longer than y, then x ≥ y, so (x, y) is not in P. A regular grammar that recognizes {shuffle(x, y) | (x, y) ∈ (a|b)*$ × (a|b)*$ s.t. x < y} is shown in Figure 4.4.

This grammar recognizes a shuffled string, x1y1x2y2..., if adjacent pairs of symbols, xiyi, in the string are the same up to some pair xkyk. This is recognized by the production SAME. In the next pair, xk+1yk+1, xk+1 must be lexicographically less than yk+1, or xk+1 must be the end-of-string symbol ($). These conditions are recognized by X_LESS_THAN_Y and $(a|b), respectively. In the first case, the remaining symbols alternate between symbols from x and y until one of the strings has no more symbols, after which the remaining symbols are from the longer string; strings of this form are recognized by ANY_XY. In the second case, the string x has terminated, so the remaining symbols are all from y; strings of this form are recognized by ANY_Y.

    START         → SAME X_LESS_THAN_Y ANY_XY
                  | SAME ($(a|b)) ANY_Y
    SAME          → ((aa)|(bb))*
    X_LESS_THAN_Y → ab
    ANY_XY        → (a|b)(a|b) ANY_XY | (a|b)$ ANY_X | $(a|b) ANY_Y
    ANY_X         → (a|b)*$
    ANY_Y         → (a|b)*$

    Figure 4.4: Regular Grammar for {shuffle(x, y) | x, y ∈ (a|b)*$ s.t. x < y}.

    ¹shuffle(x1x2...xn, x1x2...xn) = x1x1x2x2...xnxn. Symbol xi can be either a or b, so x1x1x2x2...xnxn is a member of ((aa)|(bb))*.
P cannot be recognized by a regular grammar under the concatenation mapping. Under the concatenation mapping, (x, y) maps onto xy. Every symbol in x is seen by the DFA for the grammar before any symbol of y is seen. The DFA for the grammar would have to store enough information about x to determine whether x < y. However, in order to make this determination, all the symbols of x must be saved. The string x can be arbitrarily long, but by definition a DFA can have only a finite number of states. Therefore, the DFA cannot store enough information about x to make the determination. The shuffled grammar gets around this problem by changing the order in which the symbols are seen, so that all the information needed to make the determination is available locally. Very little state has to be saved, if any.

More formally, P cannot be expressed as a regular grammar under the concatenation mapping because P would not be a regular language. That this language is not regular can be proven by the pumping lemma, which says that a language L is not regular if for all n it is possible to select an integer i and a string z from L such that for every way of breaking z into strings u, v, and w, the string uv^i w is not in L. The length of uv must be at most n, and v cannot be the empty string.

Under the concatenation mapping, P is a set of strings of the form wαwβ, where w is a string in (a|b)*, and either α is in a(a|b)*$ and β is in b(a|b)*$, or α = $ and β is a string in (a|b)+$. For an arbitrary but fixed n, let z be the string b^n a$b^n b$. For all ways of breaking z into u, v, and w such that |uv| ≤ n and v is not empty, uv is the string b^k where 1 ≤ k ≤ n, and w is the string b^(n-k) a$b^n b$. The string uv^i w is b^(k-|v|) b^(i|v|) b^(n-k) a$b^n b$, which is equivalent to b^n b^(|v|(i-1)) a$b^n b$. This string is not in P when |v|(i-1) ≥ 1, since b^n b^(|v|(i-1)) a$ comes after b^n b$ in dictionary order under this condition. Since v must be chosen to be non-empty, |v|(i-1) ≥ 1 when i > 1. Therefore, there exists at least one i for which uv^i w is not in P, even though uvw is in P. Therefore, according to the pumping lemma, P is not regular.

Definition of Cartesian Product as Shuffling. Because of the superior expressive power of the shuffle mapping, RS-KII defines the Cartesian product of two regular sets, A and B, as shuffle(A, B). The DFA for shuffle(A, B) is constructed from DFAs A and B as shown in Figure 4.5. The inputs to the DFA are strings of the form shuffle(x, y). As mentioned above, this returns a string in which the symbols from x and y alternate. Shuffling is defined here a little differently than it was above, in order to address some subtle points. First, each symbol is preceded by an identifier that indicates whether the symbol belongs to x or y. This helps the DFA sort out the symbols. For example, shuffle(abc, xyz) = 1a2x1b2y1c2z. Second, x may have fewer symbols than y, or vice versa. If one string has fewer symbols than the other, then the symbols alternate until the shorter string is exhausted, after which the remaining symbols are all from the longer string. For example, shuffle(abcd, xy) = 1a2x1b2y1c1d.

The DFA for shuffle(A, B) processes the input string shuffle(x, y) by simulating moves in A on the symbols from x and simulating moves in B on the symbols from y.
The shuffled DFA accepts the string shuffle(x, y) iff x is accepted by A and y is accepted by B. More formally, a state in the shuffled DFA is a pair (qA, qB), where qA is a state in A and qB is a state in B. On input σ, a move is simulated in A if σ is from ΣA, and simulated in B if σ is from ΣB. The shuffled DFA accepts when both A and B are in accept states.

    (Q, s, δ, F, Σ) = (Q1, s1, δ1, F1, Σ1) × (Q2, s2, δ2, F2, Σ2) where
        Q = Q1 × Q2
        s = (s1, s2)
        δ((q1, q2), 1σ) = (δ1(q1, σ), q2)
        δ((q1, q2), 2σ) = (q1, δ2(q2, σ))
        F = F1 × F2
        Σ = Σ1 ∪ Σ2 ∪ {1, 2}, where 1 and 2 are not in Σ1 or Σ2

    Figure 4.5: Definition of Cartesian Product.
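The construction of Figure 4.5 can be sketched over the same DFA class introduced earlier, representing tagged symbols as pairs (1, sym) and (2, sym); the function name is ours.

    def shuffle_dfa(a, b):
        # Tag 1 drives a move in A, tag 2 a move in B; accept when both accept.
        states = {(qa, qb) for qa in a.states for qb in b.states}
        delta = {}
        for (qa, qb) in states:
            for sym in a.alphabet:
                delta[((qa, qb), (1, sym))] = (a.delta[(qa, sym)], qb)
            for sym in b.alphabet:
                delta[((qa, qb), (2, sym))] = (qa, b.delta[(qb, sym)])
        finals = {(qa, qb) for qa in a.finals for qb in b.finals}
        alphabet = {(1, s) for s in a.alphabet} | {(2, s) for s in b.alphabet}
        return DFA(states, (a.start, b.start), delta, finals, alphabet)

    # e.g. shuffle('ab', 'xy') is fed as [(1,'a'), (2,'x'), (1,'b'), (2,'y')]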
The set of possible guesses for ay is the alphabet for y, E y. Thus from a state q in A, the NDFA can move on input ax to any state in {^(q',< 7 y) | q' = 8 j\(q,ax) and < ry € E y}. The actual next state is selected non- determ inistically on every move. The NDFA effectively creates an input string for A composed of alternating symbols from x and its guess for y. An NDFA accepts a string x if there is some sequence of non-determ inistic moves on input x th at will lead to an accept state. In other words, the NDFA described above accepts x if there is some way to select symbols for y such th a t shuffle(x,y) is recognized by A. M o d ify in g th e N D F A . This NDFA works fine as long as x and y are the same length, but m ust be modified slightly to deal with cases where x and y are of different lengths. Let (x,y) be a pair such th a t shuffle(x,y) is in A. If x is shorter than y, then shuffle(x,y) alternates symbols from x and y until x is exhausted, after which the rem aining symbols are all from y. After the NDFA sees the last symbol in x, 46 it will have to guess the remaining symbols in y, and make moves in A on each of those symbols. The NDFA described above does not do this—it guesses only one y symbol for each symbol in x. A similar condition holds when x is longer than y. In this case, shuffle(x,y) alternates symbols from x and y until y ends, after which the remaining symbols are all from x. After y terminates, the NDFA should not make any guesses for y. This is equivalent to guessing the empty string for y on each move. However, the NDFA described above does not guess a string for y after each symbol from a:, but instead guesses a single symbol for y. For both cases, the NDFA must be able to guess a string of symbols for y after each symbol from x instead of guessing only a single symbol. The NDFA can be made to do this by changing the next-state function as follows. From state q on input crx, the NDFA can move to any state in {^(<7', w) | w G E* and q' = S^iq, o-x)}. Thus on every input x, the NDFA can move to any state in A reachable by a string in crxE*. In most cases, the only states reachable in A from q are of the form crxay. In this case the modified NDFA behaves just like the NDFA described above. However, when x and y are differing lengths, and all of the symbols in one of the strings have been exhausted, there will be states reachable from q of the form ax£*. One final modification is also needed. A expects symbols in x to be preceded by the identifier “1”, and symbols in y to be preceded by the identifier “2”. The NDFA’s move function must be modified to provide these additional identifier symbols to A. The NDFA with all of the above modifications is shown in Figure 4.6. (Q ,s , 6’,F, S) = first((Q,s,6,F, £ )) where w ) I w G (2S)* and q' = < $ (< 7 ,1 cr)} Figure 4.6: NDFA for Projection (First). C on vertin g th e N D F A to a D F A . The NDFA in Figure 4.6 can be converted into an equivalent DFA. Each state in the DFA corresponds to a set of states in the 47 NDFA. The state moved to from state q in the DFA on input a is the set of NDFA states that can be reached non-deterministically on input a from any of the NDFA states in q. A set of NDFA states is an accept state in the DFA if the set contains at least one state that is an accept state in the NDFA. The equivalent DFA for the NDFA of Figure 4.6 is shown in Figure 4.7. 
    (Q′, s′, δ′, F′, Σ) = first((Q, s, δ, F, Σ)) where
        Q′ = 2^Q (the power set of Q)
        s′ = {s}
        δ′({q₁, q₂, ..., qₙ}, σ) = ⋃ᵢ₌₁ⁿ {δ*(q′, w) | w ∈ (2Σ)* and q′ = δ(qᵢ, 1σ)}
        F′ = {q ∈ 2^Q | q ∩ F ≠ ∅}

Figure 4.7: DFA for Projection (First).

DFA for Second. Although it is not strictly necessary, the function second(A) is occasionally useful. This function projects a set of pairs onto their second elements. The DFA for second(A) is essentially the same as the DFA for first(A), except that guesses are made for the first element instead of the second. The DFA is shown in Figure 4.8.

    (Q′, s′, δ′, F′, Σ) = second((Q, s, δ, F, Σ)) where
        Q′ = 2^Q
        s′ = {s}
        δ′({q₁, q₂, ..., qₙ}, σ) = ⋃ᵢ₌₁ⁿ {δ(q′, 2σ) | q′ = δ*(qᵢ, w) and w ∈ (1Σ)*}
        F′ = {q ∈ 2^Q | q ∩ F ≠ ∅}

Figure 4.8: DFA for Projection (Second).

4.1.2.6 Transitive Closure

The preference set, P, is assumed to be transitively closed. However, this condition is not preserved by integration (union). An explicit operator is therefore defined for the purpose of re-establishing the transitive closure of a preference set.

Let tc(P) be the transitive closure of P. The regular grammar for tc(P) accepts (x,y) if and only if (x,y) ∈ P, or there is a finite sequence of one or more strings, z₁, z₂, ..., zₙ, such that (x,z₁) ∈ P and (z₁,z₂) ∈ P and (z₂,z₃) ∈ P and ... and (zₙ,y) ∈ P. That is, (x,y) is in the transitive closure of P either if (x,y) is in P, or if there is a finite sequence of relations in P connecting x to y.

There are a number of algorithms for computing the transitive closure of a binary relation defined over a finite set of objects (e.g., Warshall's algorithm [Warshall, 1962]). Every preference set, P, expressible as a regular grammar can be mapped onto a finite relation. The finite relation can be transitively closed using one of the standard algorithms, and the result mapped back into a regular grammar for the transitive closure of P.

This is simplest to see when P is expressed as a regular grammar under the concatenation mapping. Under this mapping, the regular grammar for P can be expressed as a finite union (A₁·B₁) ∪ (A₂·B₂) ∪ ... ∪ (Aₖ·Bₖ), where the Aᵢ and Bᵢ are regular subsets of the hypothesis space, H (see Section 4.1.2.4). Every hypothesis in Aᵢ is less preferred than every hypothesis in Bᵢ; that is, Aᵢ < Bᵢ. P can be mapped onto a finite relation, R, over the Aᵢ and Bᵢ sets, with each of these sets treated as a single object, Xᵢ, in R. The finite relation is then transitively closed: for every pair of relations X₁ < X₂ and X₂ < X₃ in R, the relation X₁ < X₃ is added to R if it is not already there. At most m such relations are added, where m ≤ k². The set tc(P) is P ∪ P₁ ∪ ... ∪ Pₘ, where each Pᵢ is one of the relations added to R. Each such relation is of the form Xᵢ < Xⱼ, and is represented by the regular grammar Xᵢ·Xⱼ.

If P is represented with the shuffle mapping, then P is a finite union of the form shuffle(X₁, Y₁) ∪ ... ∪ shuffle(Xₖ, Yₖ), where the Xᵢ and Yᵢ are regular subsets of the hypothesis space. Every hypothesis in Xᵢ is less preferred than every hypothesis in Yᵢ. The transitive closure of P is computed just as it was for the concatenation mapping. A relation R is defined where (Xᵢ, Yᵢ) is in R if shuffle(Xᵢ, Yᵢ) is in the finite union that specifies P. R is transitively closed using any standard algorithm (e.g., Warshall's algorithm). The set tc(P) is P ∪ P₁ ∪ ... ∪ Pₘ, where each Pᵢ is one of the relations added to R. Each of these relations is of the form A < B, where A and B are elements of {X₁, ..., Xₖ, Y₁, Y₂, ..., Yₖ}. The relation A < B is represented by the regular grammar shuffle(A, B).

4.1.3 DFA Implementation

A DFA is implemented in RS-KII in terms of the following components.

• A start state, s.
• An alphabet of input symbols, Σ.
• A next-state function, δ : Q × Σ → Q.
• A final-state function, F : Q → Boolean.
• A dead-state function, Dead : Q → {t, f, ?}.
• An accept-all-state function, AcceptAll : Q → {t, f, ?}.

With the exception of the last two functions, these components correspond to obvious elements of the standard (Q, s, δ, F, Σ) 5-tuple that specifies a DFA. The last two functions allow for fast identification of at least some dead states and accept-all states, respectively. A dead state is a reject (non-accept) state from which there is no path to an accept state. An accept-all state is an accept state from which every path leads to an accept state (i.e., no path leads to a reject state). The dead-state and accept-all-state functions can only identify some dead states and accept-all states. If a state cannot be identified by the appropriate function, the function returns "?", or "unknown."

Determining whether a state in a non-minimal DFA is a dead state or an accept-all state can require an exhaustive search of the DFA. However, many set operations preserve, or partially preserve, this information. These two functions provide a way to save such information. The ability to identify dead states and accept-all states inexpensively is central to the solution-set enumeration algorithm, which is described in Section 4.2.2.

Notably absent from the list of DFA components is Q, the set of states. States are generated as needed by applying the next-state function to the current state. In order to find a path through the DFA from start state to accept state, or to determine whether a string is accepted by the DFA, only a few of the states in the DFA are usually visited. The complexity of these searches is proportional to the number of states visited. However, if Q is represented extensionally, then the time needed to create the DFA far exceeds the cost of searching it.

By not representing the states explicitly, the time and space complexity of searching the DFA can be kept proportional to the number of states visited. In order for this to work, the time and space complexity of the move function and other components of the DFA must be significantly less than O(|Q|). A DFA that meets these criteria is called an intensional DFA. If the time or space complexity of the components is proportional to O(|Q|), then the DFA is said to be extensional. This occurs, for example, when the next-state function is implemented as an explicit look-up table.

DFAs constructed from set operations on other DFAs can be represented intensionally. Let R be the DFA resulting from a set operation over DFAs A and B. The idea is to implement the next-state function of R in terms of the next-state functions of A and B. The other components of R are implemented similarly. The space complexity of R's components is O(1), and their time complexity is proportional to that of A's components and B's components. DFAs represented this way are called recursive, since they are defined "recursively" in terms of more primitive DFAs. All of the DFAs resulting from set operations in RS-KII are represented as recursive DFAs. Recursive DFAs are discussed further in Section 4.1.3.1.

The DFAs for the H, C, and P sets are called primitive DFAs. These DFAs can be either intensional or extensional.
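These components map naturally onto a small programming interface. The sketch below is illustrative rather than RS-KII's actual implementation; the class and method names are hypothetical, and the three-valued results of Dead and AcceptAll are modeled with Python's True, False, and None (for "unknown").

from typing import Optional

class DFA:
    """The components of Section 4.1.3. The state set Q is deliberately
    absent: states are produced on demand by next_state, so searching the
    automaton costs time proportional only to the states actually visited."""

    start = None             # the start state, s
    alphabet = frozenset()   # the input alphabet, Sigma

    def next_state(self, q, symbol):       # delta : Q x Sigma -> Q
        raise NotImplementedError

    def is_final(self, q) -> bool:         # F : Q -> Boolean
        raise NotImplementedError

    def dead(self, q) -> Optional[bool]:
        """Dead : Q -> {t, f, ?}; None plays the role of '?' (unknown)."""
        return None

    def accept_all(self, q) -> Optional[bool]:
        """AcceptAll : Q -> {t, f, ?}; None means unknown."""
        return None

Defaulting dead and accept_all to None matches the convention that these functions are advisory: a caller may always fall back on an explicit search when the cheap answer is unknown.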
Primitive DFAs are discussed further in Section 4.1.3.2.

4.1.3.1 Recursive DFAs

Recursive DFAs are the results of set operations. A DFA resulting from an expression of set operations is essentially an expression tree of DFAs, where the root is the DFA representing the result of the expression, the internal nodes are DFAs resulting from intermediate set operations, and the leaves are the primitive DFAs input to the expression. The components of the DFA at each node are defined in terms of the components of the DFAs at the node's child nodes. The DFA at each node requires only enough space for pointers to its children, and definitions for the components. This is essentially constant space for each DFA.

The total space for the expression is the sum of the space consumed by each of the primitive DFAs at the leaves, plus a small constant amount of space for each node in the tree. If the tree is balanced and binary, there are about as many internal nodes as leaves. If there are n primitive DFAs, each consuming k units of space, then the total space cost is O(nk). The root DFA could have as many as kⁿ states, depending on the set operations in the expression tree, so considerable space is saved by not representing the DFA's states explicitly.

DFAs represented as expression trees are called recursive DFAs. The components of these DFAs are defined recursively in terms of the components of other DFAs. Eventually, this regression must ground out in DFAs in which the components are extensionally defined. These are called primitive DFAs. The C and P sets specifying the integration of two (C, P) tuples are represented in RS-KII by recursive DFAs. The DFAs for the C and P sets generated by translators are primitive DFAs, by definition.

4.1.3.2 Primitive DFAs

Primitive DFAs are DFAs in which the components are implemented extensionally. The start state is represented extensionally, as is the alphabet. In practice, all of the translators for a given hypothesis space can be made to generate DFAs that use the same alphabet. Thus in practice, the alphabet is stored once, and each of the primitive DFAs has a pointer to it.

There are several ways to implement the functions of a primitive DFA, but the most straightforward is with explicit lookup tables. It is generally a good idea to reduce the space complexity of primitive DFAs, since the complexity of recursive DFAs, such as the solution set, is proportional to the space complexity of the primitive DFAs from which the recursive DFA is constructed. The time complexities of the next-state and other functions are similarly related, so it is also worth using implementations of these functions for primitive DFAs with as low a time complexity as possible.

A simple lookup table can consume up to O(|Q||Σ|) units of space for the next-state function, and O(|Q|) units for each of the other functions. Table lookup is a constant-time operation, so the DFA's functions can all be executed in constant time. Compressing the lookup tables can reduce their space complexity, but may increase the time complexity of table lookup. Good implementations of compressed tables have lookup times of at most log(n), where n is the size of the table. In some DFAs, the next state can be computed from the current state and the input symbol in time proportional to the size of the state. This is an effectively constant time and space implementation as long as the size of the state is not a function of some relevant scale-up variable.
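As an illustration, a lookup-table primitive DFA along the lines just described might look as follows. This is a hypothetical sketch in the same style as the interface above, not RS-KII's code; the builder of the table (for example, a translator) is assumed to supply complete dead-state and accept-all information up front.

class TableDFA:
    """Primitive DFA: every component is stored extensionally."""

    def __init__(self, start, alphabet, table, finals, dead=(), accept_all=()):
        self.start = start
        self.alphabet = frozenset(alphabet)   # typically shared by pointer
        self._table = table                   # dict: (state, symbol) -> state
        self._finals = frozenset(finals)
        self._dead = frozenset(dead)          # all states known to be dead
        self._accept_all = frozenset(accept_all)

    def next_state(self, q, symbol):
        return self._table[(q, symbol)]       # constant-time table lookup

    def is_final(self, q):
        return q in self._finals

    def dead(self, q):
        # With an extensional table the dead states can be enumerated in
        # advance, so the answer is always a definite True or False.
        return q in self._dead

    def accept_all(self, q):
        return q in self._accept_all

# Example: a two-state DFA over {a, b} accepting strings that end in 'a'.
d = TableDFA(start=0, alphabet="ab",
             table={(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0},
             finals={1})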
Using minimal DFAs is another way to keep the space complexity down without increasing the time complexity. Translators should be able to generate minimal DFAs directly. If it is not possible to generate a minimal DFA directly, then the translator can create a non-minimal DFA and then minimize it. However, the space savings from using a minimal DFA may not be worth the computational complexity of explicit minimization.

One way to dramatically reduce the space complexity of any primitive DFA, in exchange for increased time complexity, is to explicitly store an equivalent NDFA and use the NDFA to simulate the DFA's functions, such as next-state and final-state. An NDFA generally has exponentially fewer states than an equivalent DFA, and the DFA move function can be simulated from the NDFA move function in O(n) time, where n is the number of states in the NDFA [Aho et al., 1974]. Testing whether a state is final is also O(n), as are the other functions.

4.2 RS-KII Operator Implementations

Recall that the operations defined by KII are translation, integration, enumeration, and queries. This section discusses RS-KII's implementation of the enumeration and integration operations. Translator and query implementations are not discussed. Translators depend on the hypothesis space and the knowledge being translated, so the translators are provided by the user for each hypothesis space and knowledge source rather than being a fixed part of RS-KII. Translators for specific induction tasks are discussed in Chapter 5 and Chapter 6. The queries are implemented in terms of the enumeration operator, as described in Section 2.2.4 of Chapter 2, and do not require further discussion. The integration operator is discussed in Section 4.2.1, and the enumeration operator is discussed in Section 4.2.2.

4.2.1 Integration

As discussed in Chapter 2, integration is defined as follows.

    Integrate((H, C₁, P₁), (H, C₂, P₂)) = (H, C₁ ∩ C₂, P₁ ∪ P₂)    (4.1)

This requires intersection and union. Implementations of these set operations are described below.

4.2.1.1 Intersection

The intersection of two DFAs, A and B, is implemented as a recursive DFA, as shown in Figure 4.9. The components of this DFA are derived from the components of A and B according to the definition of intersection given in Figure 4.2. States in the intersection are implemented as pairs of pointers to states in A and B.

The alphabet, Σ, is Σ_A ∩ Σ_B, the intersection of the alphabets for A and B. If Σ_A and Σ_B are different alphabets, then their intersection is computed and stored extensionally in the DFA for A∩B. If Σ_A and Σ_B are the same alphabet, then Σ_A ∩ Σ_B is just Σ_A (or equivalently, Σ_B). In this case, the DFA for A∩B just stores a pointer to Σ_A instead of storing an explicit copy. This is the most common case, since translators with the same domain can usually be made to output DFAs with the same alphabet.

The Dead function takes as input a state, (q₁, q₂), in A∩B, and determines whether or not the state is dead. A state is dead if there is no path from the state to an accept state. Determining this fact can require an exhaustive search of the states reachable from (q₁, q₂). However, there are cases where it can be quickly determined whether (q₁, q₂) is dead from information about q₁ and q₂. If the status of (q₁, q₂) can be quickly determined, then Dead((q₁, q₂)) returns true or false, as appropriate. Otherwise, it returns unknown, and it is up to the caller to decide whether to perform the expensive search or to do without the information.
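The true/false/unknown bookkeeping that Dead and AcceptAll perform is exactly three-valued logic. The helpers below are a hypothetical sketch (again using None for "unknown"). Two of the four combinations in the figures that follow are exactly three-valued AND and OR (AcceptAll under intersection, Dead under union); the other two must weaken a definite False to unknown, since component information alone cannot rule the property out.

from typing import Optional

def and3(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """Three-valued AND: False dominates, True requires both, else unknown."""
    if a is False or b is False:
        return False
    if a is True and b is True:
        return True
    return None

def or3(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """Three-valued OR: True dominates, False requires both, else unknown."""
    if a is True or b is True:
        return True
    if a is False and b is False:
        return False
    return None

def true_or_unknown(v: Optional[bool]) -> Optional[bool]:
    """Keep a definite True; anything weaker becomes unknown."""
    return True if v is True else None

# Figure 4.9 (intersection): dead = true_or_unknown(or3(dead1, dead2)),
# accept_all = and3(acc1, acc2).
# Figure 4.10 (union) is the mirror image: dead = and3(dead1, dead2),
# accept_all = true_or_unknown(or3(acc1, acc2)).

With these helpers in hand, the definitions in Figure 4.9 are mechanical to transcribe.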
    (s₁, δ₁, F₁, Dead₁, AccAll₁, Σ₁) ∩ (s₂, δ₂, F₂, Dead₂, AccAll₂, Σ₂)
        = (s, δ, F, Dead, AcceptAll, Σ) where
        s = (s₁, s₂)
        δ((q₁, q₂), σ) = (δ₁(q₁, σ), δ₂(q₂, σ))
        F((q₁, q₂)) = F₁(q₁) ∧ F₂(q₂)
        Dead((q₁, q₂)) = true  if Dead₁(q₁) = true or Dead₂(q₂) = true
                         else unknown
        AcceptAll((q₁, q₂)) = true   if AccAll₁(q₁) = true and AccAll₂(q₂) = true
                              false  if AccAll₁(q₁) = false or AccAll₂(q₂) = false
                              else unknown
        Σ = Σ₁ ∩ Σ₂       if Σ₁ ≠ Σ₂
            pointer to Σ₁  if Σ₁ = Σ₂

Figure 4.9: Intersection Implementation.

It can be quickly determined that a state, (q₁, q₂), in A∩B is dead if q₁ is a known dead state in A or if q₂ is a known dead state in B. This information can be obtained from the Dead functions for A and B, respectively. In all other cases, it is necessary to perform an exhaustive search of the states reachable from (q₁, q₂) to determine whether it is a dead state. If q₁ and q₂ are known to be non-dead states in A and B, then there is at least one path from q₁ to an accept state in A, and at least one path from q₂ to an accept state in B. However, if q₁ and q₂ have no such path in common, then the state (q₁, q₂) is a dead state in A∩B. The Dead function for A∩B returns unknown in this case. It also returns unknown if A's Dead function returns unknown for q₁, or if B's Dead function returns unknown for q₂.

The AcceptAll function is essentially the dual of the Dead function. A state, (q₁, q₂), in A∩B is an accept-all state if every path from (q₁, q₂) leads to an accept state ((q₁, q₂) need not be an accept state, though, unless there is a path from the state to itself). The function AcceptAll((q₁, q₂)) returns true if it can be quickly determined that (q₁, q₂) is an accept-all state, returns false if it can be quickly determined that it is not an accept-all state, and unknown otherwise. The state (q₁, q₂) is an accept-all state when q₁ is an accept-all state in A and q₂ is an accept-all state in B. The state is not an accept-all state if either q₁ or q₂ is not an accept-all state in its respective DFA. Both of these cases can be quickly determined from the AcceptAll functions of A and B, respectively. The accept-all status of (q₁, q₂) is unknown in A∩B only if the accept-all status of q₁ is unknown in A or the accept-all status of q₂ is unknown in B.

The AcceptAll function "preserves" accept-all state information from A and B, in the sense that if the accept-all status of every state in A and B is known (i.e., either true or false), then the accept-all status of every state in A∩B is also known. That is, the AcceptAll function for A∩B never returns unknown for any state in A∩B unless the AcceptAll function for A or B returns unknown for some state in A or B, respectively. The Dead function for A∩B does not preserve dead-state information from A and B: given two DFAs, A and B, in which all of the dead states are known, A∩B can contain states for which the Dead function returns unknown, as was described above.

4.2.1.2 Union

The union of two DFAs, A and B, is implemented similarly to intersection. The main differences are that the components of the union are derived from A and B according to the definition of union in Figure 4.1, and the alphabet Σ is Σ_A ∪ Σ_B instead of Σ_A ∩ Σ_B. If Σ_A and Σ_B are the same alphabet, as is usually the case, then Σ is just a pointer to one of these alphabets.
Otherwise, Σ is a pointer to a new alphabet containing the symbols in both Σ_A and Σ_B. The implementation of union is shown in Figure 4.10.

A state in A∪B is of the form (q₁, q₂), where q₁ is a state in A and q₂ is a state in B. The function Dead((q₁, q₂)) returns true or false, accordingly, when it can be quickly determined whether (q₁, q₂) is a dead state in A∪B. When this determination cannot be made quickly, the function returns unknown. A state, (q₁, q₂), in A∪B is a dead state if and only if there is no path from (q₁, q₂) to an accept state in A∪B.

    (s₁, δ₁, F₁, Dead₁, AccAll₁, Σ₁) ∪ (s₂, δ₂, F₂, Dead₂, AccAll₂, Σ₂)
        = (s, δ, F, Dead, AcceptAll, Σ) where
        s = (s₁, s₂)
        δ((q₁, q₂), σ) = (δ₁(q₁, σ), δ₂(q₂, σ))
        F((q₁, q₂)) = F₁(q₁) ∨ F₂(q₂)
        Dead((q₁, q₂)) = true   if Dead₁(q₁) = true and Dead₂(q₂) = true
                         false  if Dead₁(q₁) = false or Dead₂(q₂) = false
                         else unknown
        AcceptAll((q₁, q₂)) = true  if AccAll₁(q₁) = true or AccAll₂(q₂) = true
                              else unknown
        Σ = Σ₁ ∪ Σ₂       if Σ₁ ≠ Σ₂
            pointer to Σ₁  if Σ₁ = Σ₂

Figure 4.10: Union Implementation.

This only occurs when there is no path from q₁ to an accept state in A, and no path from q₂ to an accept state in B. Otherwise, (q₁, q₂) is not a dead state. Both of these conditions can be quickly determined from the Dead functions of A and B, respectively. The Dead function for A∪B only returns unknown for (q₁, q₂) when the dead-state status of one of q₁ and q₂ is unknown, and the status of the other state is either unknown or false.

The Dead function for the union of two DFAs, A and B, preserves the dead-state information of A and B, in that the dead-state status of every state in A∪B is known if the dead-state status of every state in A and B is known. Union is said to preserve dead-state information.

A state (q₁, q₂) in A∪B is an accept-all state if every path from (q₁, q₂) leads to an accept state. The function AcceptAll returns true or false, accordingly, when this determination can be made cheaply, and unknown when it cannot. It can be cheaply determined that (q₁, q₂) is an accept-all state if q₁ is an accept-all state in A or if q₂ is an accept-all state in B. Otherwise, the determination cannot be made cheaply, and AcceptAll((q₁, q₂)) returns unknown.

If neither q₁ nor q₂ is an accept-all state in A and B, respectively, then it is still possible that (q₁, q₂) is an accept-all state in A∪B. This occurs when the paths that do not lead to accept states from q₁ in A do lead to accept states in B from q₂, and vice versa. In general, ascertaining this fact requires an exhaustive search of the states reachable from (q₁, q₂) in A∪B to determine whether they are all accept states. The function AcceptAll((q₁, q₂)) returns unknown in this case. It also returns unknown if the accept-all status of one of q₁ and q₂ is unknown, and the accept-all status of the other state is either false or unknown.

The AcceptAll function for the union of two DFAs, A and B, does not preserve the accept-all information of A and B. Even if the accept-all status of every state in A and B is known, there can still be states in A∪B for which AcceptAll returns unknown. Union does not preserve accept-all information.

4.2.1.3 Minimizing after Integration

The integration of several (H, C, P) tuples is accomplished by intersecting all of the C sets and computing the union of the P sets. The result is a single tuple. The intersection and union operations in RS-KII return recursive DFAs.
These operations are constant time, and the resulting DFAs have a very low space complexity. However, the resulting DFAs have not been minimized, which means they can have far more states than the corresponding minimal DFAs, and they may have states for which the Dead and AcceptAll functions return unknown.

The combination of extraneous states in C and P, and the presence of dead states that cannot be identified by the Dead function, can increase the cost of enumerating a hypothesis from the solution set by introducing similar states into the DFA for the solution set. A hypothesis in the solution set can be generated by finding a path from the start state of the DFA to one of its accept states, a simple graph search. The presence of dead states that cannot be detected by the DFA's Dead function can cause backtracking: all of the states reachable from the unidentified dead state must be visited before it is obvious that the state cannot reach an accept state, and is therefore dead. If the DFA is non-minimal, there may be many such states reachable from each unidentifiable dead state.

If the DFA's Dead function can detect every dead state in the DFA, then the search will never backtrack, and the extraneous states are not detrimental. Backtracking only occurs when there is no path from the current state to any of the accept states, that is, when the current state is a dead state. If the DFA's Dead function can always determine whether a given state is dead, then the states reached by each edge out of the current state can be tested by this function, and one of the non-dead states is visited next. The current state always has at least one such child state: if it did not, then the current state would itself be dead, and it would never have been visited in the first place.

The DFA for C is the intersection of several DFAs, and since intersection does not preserve dead-state information, the DFA may have several states for which the Dead function returns unknown. The problem is that a state, (q₁, q₂), in the intersection is dead if q₁ or q₂ is dead in its respective DFA, or if neither q₁ nor q₂ is dead but there is no path (string) from q₁ to an accept state in the first DFA that is also a path from q₂ to an accept state in the second DFA. The first case is easy to detect, but the second requires an exhaustive search of the states reachable from q₁ and q₂, which can be expensive.

One way to reduce this cost is to minimize the DFAs for C and P after each integration, or at least do an exhaustive search of the states in C to identify all of the dead states. However, minimization takes time proportional to O(|Σ||Q|²), and identifying dead states takes time proportional to O(|Q|). This cost generally negates any benefit from storing the DFAs intensionally. The motivation for representing DFAs intensionally is that only a few of the states are visited while searching the DFA for a path from the start state to an accept state; only in the worst case are all of the states visited. By minimizing the DFA, or by identifying the dead states, the cost becomes proportional to the number of states in the average case as well as in the worst case. Therefore, RS-KII does not minimize after integration, or try to identify dead states in C. It is generally better to identify and remove only those dead states encountered during the search, since this usually involves only some of the states in the DFA.
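As a concrete illustration of the recursive DFAs produced by integration (Sections 4.2.1.1 and 4.2.1.2), the sketch below renders the intersection of Figure 4.9 on top of two component automata. It is hypothetical, not RS-KII's code; the component objects are assumed to expose the interface sketched in Section 4.1.3, with None standing for unknown.

class IntersectionDFA:
    """Recursive DFA for A ∩ B (Figure 4.9). Stores only pointers to its
    children; states are pairs (q1, q2) generated on demand."""

    def __init__(self, a, b):
        self.a, self.b = a, b
        self.start = (a.start, b.start)
        # Shared-alphabet case keeps a pointer; otherwise intersect.
        self.alphabet = (a.alphabet if a.alphabet is b.alphabet
                         else a.alphabet & b.alphabet)

    def next_state(self, q, symbol):
        q1, q2 = q
        return (self.a.next_state(q1, symbol), self.b.next_state(q2, symbol))

    def is_final(self, q):
        q1, q2 = q
        return self.a.is_final(q1) and self.b.is_final(q2)

    def dead(self, q):
        q1, q2 = q
        # A definite True from either side means (q1, q2) is dead; a
        # definite "not dead" cannot be concluded from the parts alone.
        if self.a.dead(q1) is True or self.b.dead(q2) is True:
            return True
        return None

    def accept_all(self, q):
        q1, q2 = q
        a1, a2 = self.a.accept_all(q1), self.b.accept_all(q2)
        if a1 is True and a2 is True:
            return True
        if a1 is False or a2 is False:
            return False
        return None

The union of Figure 4.10 is the mirror image: is_final uses or, dead becomes the fully informed case (True only when both sides are dead, False when either is not), and accept_all keeps only a definite True.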
Although identifying and removing dead states from C after integration is not generally cost-effective, there is at least one case where it can be beneficial. If (Q, s, δ, F, Σ) is the intersection of (Q₁, s₁, δ₁, F₁, Σ₁) and (Q₂, s₂, δ₂, F₂, Σ₂), then the size of Q is at most |Q₁||Q₂|. However, many of the states in Q may be dead states. If after removing these states the size of Q is bounded by the size of the larger of the two original DFAs, then the cost of intersecting and removing dead states from k such DFAs is only k|Q*|², where |Q*| is the number of states in the largest of the k DFAs. Removing dead states after each intersection can be much cheaper in this case than removing them after all of the DFAs have been intersected.

RS-KII does not have an explicit operation for removing dead states, but the enumeration and query operations do identify any dead states they happen to encounter. Applying a query, such as Empty, after each (H, C, P) tuple is integrated will identify and remove many of the dead states in C and P. This approach is used in the RS-KII emulation of the Candidate Elimination algorithm [Mitchell, 1982] in Section 6.4 in order to achieve benefits similar to those described in the previous paragraph.

4.2.2 Enumeration

The enumeration operator returns hypotheses from the deductive closure of (C, P). It is used both to select the induced hypothesis and to implement solution-set queries. Enumerate((C, P), A, n) returns a list of n hypotheses that are both in the solution set of (C, P) and in the regular set A. If there are only m < n hypotheses in the intersection of A and the solution set, Enumerate returns all m hypotheses. The A and n arguments are necessary to implement the queries. In order to induce a hypothesis, it is only necessary to select a single hypothesis from the deductive closure. This can be done by setting n = 1 and A = Σ*.

The deductive closure of (C, P) consists of the most preferred hypotheses in C, according to the preference relation P. The set first((C × C) ∩ P) contains exactly the hypotheses of C that are dominated by some other hypothesis in C, so the deductive closure is the complement of this set, intersected with C. Specifically, this is the set shown in Equation 4.2, where ¬ denotes complementation.

    ¬first((C × C) ∩ P) ∩ C    (4.2)

In this equation, P is assumed to be transitively closed. However, P is the union of several DFAs, and transitive closure is not preserved by union (see Section 2.2.3).
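The closure step itself reduces to Warshall's algorithm on the finite relation over the Xᵢ and Yᵢ sets described in Section 4.1.2.6. A minimal sketch follows, with the mapping between regular grammars and the finite relation assumed to happen elsewhere.

def transitive_closure(pairs):
    """Warshall's algorithm on a finite binary relation, given as a set of
    (a, b) pairs meaning a < b. Returns the transitively closed relation."""
    closed = set(pairs)
    objects = {x for pair in closed for x in pair}
    for k in objects:              # allow k as an intermediate element
        for i in objects:
            for j in objects:
                if (i, k) in closed and (k, j) in closed:
                    closed.add((i, j))
    return closed

# Example: {X1 < X2, X2 < X3} gains X1 < X3.
assert transitive_closure({("X1", "X2"), ("X2", "X3")}) == \
       {("X1", "X2"), ("X2", "X3"), ("X1", "X3")}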
The problem is th at information about which states are dead is not preserved by intersection, and the only way to tell that a state is dead is to visit all of the states reachable from the state, and determ ine th a t none of them are accept states. The am ount of backtracking can be reduced by using certain regularities in the structure of the DFA to increase the am ount of dead-state information available to the search. For example, P is a partial order, and therefore irreflexive, anti symm etric, and transitive. The solution set equation also contains a C x C term . This introduces a certain am ount of sym m etry into the DFA. These regularities can be easily exploited by a more sophisticated search algorithm , but are not captured in the DFA in such a way th at a graph search can easily make use of them . These regularities are m eta-inform ation true of all solution sets. For this reason, RS-KII enum erates the solution set using a modified branch- and-bound search, into which the regularities m entioned above are integrated. The solution set consists of the undom inated hypotheses in C, where the dom ination relation is determ ined by P. This m aps very well onto branch-and-bound, which also finds the undom inated elements of a set. However, a few modifications are required. The basic branch-and-bound algorithm takes a set of elements, X , and an evalu ation function, / : A’— that assigns a real num ber evaluation to each hypothesis. 61 The algorithm returns an elem ent of the set {x G X | Vy € X f(x) -ft f{y)}- This algorithm is described in Section 4.2.2.1 . Two modifications to the basic branch-and-bound algorithm are required in order to im plem ent the enum eration operator. Enumerate((C, P ), A, n) returns n hypothe ses from the set {a: G X \ \ /y £ X x -ftp y } fl A, where Xp is the partial ordering im posed by P. The first modification is to evaluate hypotheses according to the partial ordering -<p instead of the total ordering imposed by / . This modification, described in Section 4.2.2.2, returns a single elem ent of the set {x 6 X | Vy € X x -jip y}, which is the deductive closure of (C,P). This is sufficient to im plem ent Enumerate when n = 1 and A = S* (i.e., A is not used). In order to im plem ent Enumerate((C,P),A,n) for arbitrary values of n and A, the modified branch-and-bound algorithm m ust be extended to find n elements that are in both A and the deductive closure of (C , P). These modifications are described in Section 4.2.2.3. In order to perform efficiently, the modified branch-and-bound algorithm m ust im plem ent certain d ata structures efficiently. These issues are discussed briefly in Section 4.2.2.4. 4.2.2.1 The Basic Branch-and-Bound Algorithm Branch-and-Bound takes a set of elements, X , and an evaluation function / , and finds an undominated elem ent of X . An element x is dom inated if and only if there is an elem ent y in X such th at x <x y, where <x is a dom ination relation over elements of X . The dom ination relation <x can be any ordering relation on X . In the basic branch-and-bound algorithm , <x is a total ordering on X derived from the evaluation function / , where x <x y if and only if f(x) < f(y) The idea behind branch-and-bound is to find an undom inated elem ent of X by pruning regions of the search space, X , in which every elem ent in the region is dom inated by the best elem ent of another region. This is the “bound” part of the algorithm . 
It is followed by the "branch," which splits the unpruned regions into several sub-regions. This splitting and pruning process continues until all but one of the regions have been pruned, and the remaining region contains only a single element, x. This element is an undominated element of X.

A version of the basic branch-and-bound algorithm based on Kumar's formulation [Kumar and Laveen, 1983, Kumar, 1992] is shown in Figure 4.11. The search space is a set of elements, X, and regions of the search space are subsets of X. The algorithm maintains a collection of subsets, L, which initially contains only X. In each iteration, a subset X_s is selected from this collection by the Select function, and split into smaller subsets with the Split function. Dominated subsets are then pruned from the collection. Subset Xᵢ is dominated by subset Xⱼ if for every hypothesis in Xᵢ there exists a preferable hypothesis in Xⱼ. Formally, Xᵢ is dominated by Xⱼ if and only if (∀h ∈ Xᵢ)(∃h′ ∈ Xⱼ) h <_X h′. This process continues until there is only one subset left in the collection, and this subset contains exactly one element, x. This element is returned as the undominated element found by the search.

    BranchAndBound(X, Ḋ<) returns x ∈ X s.t. (∀y ∈ X) x ≮ y
        X is a set of elements.
        Ḋ< is a domination relation on subsets of X derived from <.
        < is a total ordering over elements of X.
    BEGIN
        L ← {X}
        WHILE L is not empty DO
            IF L is a singleton, {Xᵢ}, and Xᵢ is a singleton, {x}, THEN
                RETURN x
            ELSE
                X_s ← Select(L)
                Replace X_s in L with Split(X_s).
                Dominated ← {Xᵢ ∈ L | ∃Xⱼ ∈ L (Xᵢ Ḋ< Xⱼ)}
                L ← L − Dominated
        END WHILE
        RETURN failure
    END BranchAndBound.

Figure 4.11: Branch-and-Bound where Ḋ< is a Total Order.

The Domination Relation among Subsets. The domination relation among subsets is derived from the domination relation among individual hypotheses, <_X, and is expressed as a binary relation, D<, where (Xᵢ, Xⱼ) is an element of D< if and only if subset Xᵢ is dominated by subset Xⱼ. This is written Xᵢ D< Xⱼ. If (Xᵢ, Xⱼ) is not an element of D<, then Xᵢ is not dominated by Xⱼ, written (Xᵢ, Xⱼ) ∉ D<.

In order for branch-and-bound to be cost-effective, the cost of testing whether a subset is dominated, and should thus be pruned, must be significantly cheaper than the cost of the search avoided by pruning the subset. It may only be possible to cheaply determine the domination relation between some pairs of subsets. For this reason, the basic branch-and-bound algorithm uses the incomplete but inexpensive domination relation, Ḋ<, instead of the complete but potentially expensive relation D<. A pair of subsets, (Xᵢ, Xⱼ), is in Ḋ< if and only if Xᵢ D< Xⱼ and this relation can be determined by an inexpensive test. If Xᵢ D< Xⱼ but this fact cannot be determined inexpensively, Xᵢ Ḋ< Xⱼ will be false. Xᵢ Ḋ< Xⱼ is also false if Xᵢ is not dominated by Xⱼ. It is always possible to cheaply determine the domination relation between singleton sets, {x} and {y}, since this is a trivial matter of determining whether x <_X y, and <_X is defined between every pair of elements in X.

Even though Ḋ< is incomplete, this does not invalidate the branch-and-bound algorithm. Eventually, the subsets in the collection, L, will be split to the point where the subsets in L can be compared by Ḋ<. This requires that the Split function be defined so that X is eventually split into a finite number of subsets that can be compared by Ḋ<.
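The loop of Figure 4.11 is short enough to transcribe directly. The sketch below is illustrative only: select, split, and the inexpensive domination test dominates (playing the role of Ḋ<) are assumed to be supplied by the caller, and subsets are represented as frozensets of elements.

def branch_and_bound(X, dominates, select, split):
    """Figure 4.11: returns an undominated element of X, or None on failure.
    dominates(Xi, Xj) is the inexpensive, possibly incomplete test that
    every element of Xi is dominated by some element of Xj."""
    L = [frozenset(X)]
    while L:
        if len(L) == 1 and len(L[0]) == 1:
            (x,) = L[0]                    # the sole remaining element
            return x
        Xs = select(L)
        L.remove(Xs)
        L.extend(split(Xs))                # "branch"
        L = [Xi for Xi in L                # "bound": prune dominated subsets
             if not any(dominates(Xi, Xj) for Xj in L if Xj is not Xi)]
    return None

# With a total order, one cheap-but-incomplete test compares extremes:
# dominates = lambda Xi, Xj: max(Xi) < min(Xj), which prunes any subset
# whose best element is below the worst element of another subset.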
If X is finite, then the branch-and-bound algorithm is always guaranteed to halt, since in the worst case X will be split into a finite number of singleton subsets, and singleton subsets can always be compared by Ḋ<.

Total and Partial Orderings. The basic branch-and-bound algorithm assumes that Ḋ< is a total ordering over the subsets of X. When Ḋ< is a total order, there is at most one dominant subset among any collection of subsets. This guarantees that after splitting X into subsets that can be compared by Ḋ<, all but one subset will be pruned from the collection. The remaining subset can be split further until a singleton subset is identified that dominates the remaining subsets. The single element in this set is returned as the undominated element of X. If Ḋ< is a partial order, then in any collection of subsets there may be more than one undominated subset, in which case the algorithm will not halt.

The relation D< is derived from <_X. If <_X is a total order, and Split(X_s) partitions X_s into disjoint subsets, then D< is also a total order. Given any set of elements ordered by <_X, there is exactly one maximal element. Therefore, between two disjoint subsets, Xᵢ and Xⱼ, one subset is guaranteed to contain the maximal element of Xᵢ ∪ Xⱼ, and this subset dominates the other. If the subsets are not disjoint, then both Xᵢ and Xⱼ may contain the maximal element, in which case neither dominates the other. However, this is somewhat irrelevant, since the dominant element is the same in both subsets; one subset can be arbitrarily selected as the dominant one without fear of pruning away the dominant element.

If <_X is a partial ordering, then so is D<. For example, let a and b be the undominated elements of Xᵢ, and let y and z be the undominated elements of Xⱼ. If a <_X y and z <_X b, but there is no relation between a and z nor between b and y, then Xᵢ does not dominate Xⱼ, nor does Xⱼ dominate Xᵢ. There is no domination relation between Xᵢ and Xⱼ.

4.2.2.2 Branch-and-Bound with Partially Ordered Hypotheses

The deductive closure of (C, P) consists of the undominated hypotheses of C, according to P, the partially ordered dominance relation over individual hypotheses. The basic branch-and-bound algorithm finds an undominated element of a set X according to a totally ordered dominance relation, <_X. In order to enumerate an element of the deductive closure of (C, P), the basic branch-and-bound algorithm must be modified to accept a partially ordered dominance relation over the individual hypotheses instead of requiring a totally ordered relation.

The problem with using a partial order in the basic branch-and-bound algorithm is that the termination condition assumes the dominance relation <_X is a total order. The basic algorithm terminates when the collection contains only a single subset, and this subset is a singleton. In a total ordering, there is at most one undominated element; eventually, all of the other elements will be pruned, leaving only the undominated element in the collection. In a partial ordering, there can be several undominated elements. Eventually, all of the dominated elements will be pruned, leaving only the undominated elements, but since there can be more than one undominated element, the algorithm may not halt.

One solution is to modify the termination condition so that the algorithm halts as soon as one of the subsets in the collection contains nothing but undominated elements.
An arbitrary element of this subset is returned by the search as the undominated element. Other subsets in the collection may also contain undominated elements, but since only one element is needed, the termination condition described above is sufficient. One advantage of this approach is that it can be easily extended to return multiple hypotheses, as will be seen in Section 4.2.2.3.

A subset Xᵢ in the collection, L, contains only undominated elements if Xᵢ is undominated by all of the other subsets in L, and if no hypothesis in Xᵢ dominates any other hypothesis in Xᵢ. The first condition is true if (Xᵢ, Xⱼ) ∉ D< for every other Xⱼ in L. The second condition is true if (Xᵢ, Xᵢ) ∉ D<; that is, no hypothesis in Xᵢ dominates any other hypothesis in Xᵢ. These two conditions can be combined into one test, namely (∀Xⱼ ∈ L) (Xᵢ, Xⱼ) ∉ D<. If Xᵢ contains only undominated hypotheses, then any one of them can be returned by the search as the undominated element.

In order to evaluate this termination test, it must be possible to determine whether or not (X_s, Xⱼ) ∈ D<, where X_s is the selected subset and Xⱼ is an element of the collection. If the test is true, X_s contains only undominated hypotheses. Unfortunately, the complete dominance relation, D<, is not available to the branch-and-bound algorithm. Only the less expensive, but incomplete, relation Ḋ< is available. Since this relation is incomplete, (Xᵢ, Xⱼ) ∉ Ḋ< could mean either that (Xᵢ, Xⱼ) ∉ D<, or that the domination relation between Xᵢ and Xⱼ cannot be determined cheaply. In the latter case, (Xᵢ, Xⱼ) may or may not be a member of D<.

In order to discriminate between these two cases, the relation Ḋ≮ is defined. Xᵢ Ḋ≮ Xⱼ is true if and only if Xᵢ is not dominated by Xⱼ and this fact can be determined cheaply for Xᵢ and Xⱼ. The relation Xᵢ Ḋ≮ Xⱼ is false either if the dominance relation between Xᵢ and Xⱼ cannot be determined cheaply, or if Xᵢ D< Xⱼ; the two cases cannot be distinguished. The Ḋ≮ relation requires an inexpensive test for determining when one subset does not dominate another.

The modified algorithm, BranchAndBound-2, is shown in Figure 4.12. It uses the relation Ḋ≮ to determine when one of the subsets in the collection contains only undominated elements. An element of this subset is selected arbitrarily and returned as the undominated element found by the search. The relation Ḋ< is used to prune dominated subsets. The relations Ḋ< and Ḋ≮ are derived from the domination relation <_X over individual hypotheses. The relation <_X can be a partial order.

    BranchAndBound-2(X, Ḋ<, Ḋ≮) returns x ∈ {x ∈ X | ∀y ∈ X, x ⊀ y}
        X is a set of elements.
        Ḋ< is a domination relation on subsets of X derived from <.
        Ḋ≮ is a not-dominated relation on subsets of X derived from <.
        < is a partial ordering over the elements in X.
    BEGIN
        L ← {X}
        WHILE L is not empty DO
            X_s ← Select(L)
            IF (X_s Ḋ≮ Xᵢ) for every Xᵢ in L and (X_s Ḋ≮ X_s) THEN
                RETURN any element of X_s
            ELSE
                Replace X_s in L with Split(X_s).
                Dominated ← {Xᵢ ∈ L | ∃Xⱼ ∈ L (Xᵢ Ḋ< Xⱼ)}
                L ← L − Dominated
        END WHILE
        RETURN failure
    END BranchAndBound-2.

Figure 4.12: Branch-and-Bound where <_X is a Partial Order.

Given appropriate arguments, BranchAndBound-2 can find a single element of the deductive closure of (C, P), which is sufficient to implement the function Enumerate((C, P), Σ*, 1). The arguments passed to BranchAndBound-2 in order to implement Enumerate((C, P), Σ*, 1) are shown in Figure 4.13, and described in detail below.
The arguments to BranchAndBound-2 consist of X, and the dominance relations Ḋ< and Ḋ≮ over subsets of X, as derived from <_X. The Split and Select functions are also parameters to BranchAndBound-2, albeit implicit ones. The algorithm BranchAndBound-2 finds a single element of the set {x ∈ X | ∀y ∈ X, x ⊀_X y}. In order to implement Enumerate((C, P), Σ*, 1), the arguments must be set so that BranchAndBound-2 finds a single element of the deductive closure of (C, P), namely {x ∈ C | ∀y ∈ C, x ⊀_P y}.
The dominance relations, D < and D^ are incomplete versions of D < th at are defined only on subsets for which the dominance relation can be determ ined inexpensively. The complete dominance relation over subsets of the form (u>, q) will be defined first, followed by inexpensive definitions for D < and £> < ■ A subset (wi,qi) is dom inated by subset (u > 2,< 72) if and only if (\/he(wi,qi))(3h'€ (u)2 , 9 2 )) {hi h') £ P. T hat is, every hypothesis in (wi,qi) is less preferred than some hypothesis in ( ^ 2 ,^ 2 ), according to P. This relation can be expressed in term s of set operations on P and the two subsets as follows. The set ((tWi,9 i)x (io 2, 9 2 ))n P is {(x,y) \ x < y and x € {wi,qi) and y G ( ^ 2 ) < 7 2) }• Projecting this set onto its first elements yields the set of hypotheses in {w\,qi) th at are dom inated by one or more elements of (u> 2 , 9 2 )- If the projection is exactly (iuj,qi), then every hypothesis in (wi,qi) is dom inated by elem ents of 69 (w 2 ,q 2 ), so (wi,qi) D< {u > 2 , 9 2 )• Otherwise, some elements of (w 2 ,q2) are not dom inated, so (wi,qi) p K (w2 ,q2). The relation Z)< is therefore defined as shown in Equation 4.3. (u>i,9i) D< (1 0 2 , 9 2 ) iff first(((wi, qi) x (1V2 , 92))f~I.P) = (u>i,9i) (4.3) Approximating the Dominance Relation. This test is expensive to compute. A less expensive, though possibly overspecific test is needed to define D <. As a first approxim ation, a sufficient but not necessary condition for (wi,qi) D K (w 2 ,q 2 ) is ((wi, 9 1 ) x (w2 , 9 2 )) Q P- W hen this is true, every hypothesis in (Wi, 9 1 ) is dom inated by every hypothesis in (w 2 ,q 2 ). To this condition a test m ust be added to ensure th at (w 2 ,q2) is not empty, since does not allow a subset to be dom inated by an em pty set. The complete test is sum m arized in Equation 4.4. (wi,qi)D<(w2 ,q2) iff ((wi,q 1)x (w 2 ,q2))) C P and (w2 ,q2) ^ 0 (4.4) This definition is a step in the right direction, but it is still too expensive to compute. The second condition of this test determines whether {w2 ,q2) is empty. This is equivalent to testing whether q2 is a dead state, which can be expensive if this information is not already known by the DFA. The first condition is also expensive. Specifically, it determines whether ((wi,qi)x(w 2 ,q2)) is a subset of P, which is equivalent to testing whether ((wi,qi)x(w 2 ,q 2))C\P is empty. This is also an em ptiness test, and therefore potentially expensive. The two tests of Equation 4.4 are expensive to com pute in general, but for some subsets they can be com puted cheaply. A sufficient condition for the first test, ((iu i,g i)x (11) 2 , 9 2 )) Q Pi is 6p(s,shuffle(wi,w2)) = 9 and Accept All P(q) = tru e . T hat is, the string shitffle(wi,w2) leads to an accept-all state in P, which means th at every pair of strings in {(wix,w 2y) \ x 6 £* and y G E*} is a m em ber of P. This certainly includes all the strings in (wi,qi)x(w 2 ,q2), so (w-[,qi)x(w 2 ,q2) is a subset of P. Com puting 9 = Sp(startp, shuffle(wi,w2)) takes tim e proportional to the lengths of w\ and w2. Determ ining whether 9 is an accept-all state is accomplished by the AcceptAll function, which only returns t r u e or f a l s e when the answer can be com puted cheaply, and returns unknown when it cannot. This test can only compare subsets for which AcceptAll does not return unknown. 70 T he second test in Equation 4.4 determ ines whether (11)2 ,(72) is em pty— i.e., whether q2 is a dead state. 
If it can be determined cheaply whether q₂ is a dead state, then the function Dead_C(q₂) returns true or false, according to whether or not the state is dead. Otherwise, the function returns unknown. This leads to the definition for Ḋ< shown in Equations 4.5 and 4.6.

    (w₁, q₁) Ḋ< (w₂, q₂) iff AcceptAll_P(δ*_P(s_P, shuffle(w₁, w₂))) = true and    (4.5)
                             Dead_C(q₂) = false    (4.6)

Defining the Not-Dominated Relation. The definition of Ḋ≮ is derived similarly. The complete and correct test for determining whether one subset is not dominated by another is shown in Equation 4.7. This is the complement of the dominance relation, D<.

    (w₁, q₁) D≮ (w₂, q₂) iff first(((w₁, q₁) × (w₂, q₂)) ∩ P) ≠ (w₁, q₁)    (4.7)

This relation is expensive to determine. It must be replaced by a less expensive, though possibly incomplete, relation. A sufficient condition for this relation is ((w₁, q₁) × (w₂, q₂)) ∩ P = ∅. When this condition is true, no hypothesis in (w₁, q₁) is dominated by any hypothesis in (w₂, q₂). The approximation of the not-dominated relation based on this test is shown in Equation 4.8.

    (w₁, q₁) Ḋ≮ (w₂, q₂) if ((w₁, q₁) × (w₂, q₂)) ∩ P = ∅    (4.8)

This test is still too expensive to compute, since it involves testing whether the intersection of two sets is empty. However, a sufficient condition for this test can be computed inexpensively. Namely, ((w₁, q₁) × (w₂, q₂)) ∩ P = ∅ if the string shuffle(w₁, w₂) leads to a dead state in P, or if either q₁ or q₂ is a dead state in C. The Dead functions for C and P return unknown if it is expensive to determine whether a given state is dead, and otherwise return true if it is dead and false if it is not. Fortunately, P is a union of several DFAs, so if the dead-state information is known for all of the states in these DFAs, then the Dead function for P never returns unknown (see Section 4.2.1.2). The definition of Ḋ≮ is given formally in Equations 4.9 and 4.10.

    (w₁, q₁) Ḋ≮ (w₂, q₂) iff Dead_C(q₁) = true or Dead_C(q₂) = true or    (4.9)
                             Dead_P(δ*_P(s_P, shuffle(w₁, w₂))) = true    (4.10)

The Selection Criteria. Because of the way Split and Ḋ< are defined, it is possible for a subset, (w, q), in the collection to be empty. Subset (w, q) is empty if q is a dead state. This is expensive to determine, so empty subsets are not eliminated automatically by the Split function, nor are they identified as dominated subsets by Ḋ<. Eventually, (w, q) will be split into child subsets, (ww₁, q₁) through (wwₖ, qₖ), whose emptiness can be cheaply determined. In the worst case, q₁ through qₖ comprise all of the states reachable from q. At least one of these is a known dead state, and this information can then be propagated to the other states. However, the cost of splitting (w, q) to this point can be quite expensive, since it is proportional to the number of states that are reachable from q.

This expense can be mitigated by being intelligent about which subset is selected in each iteration. Given a choice between (w₁, q₁) and (w₂, q₂), choose (w₂, q₂) if shuffle(w₁, w₂) leads to an accept-all state in P, since this means that every string in (w₂, q₂) is preferred to every string in (w₁, q₁). The subset (w₁, q₁) could be pruned as dominated at this point, except that it is not known whether (w₂, q₂) is empty. Recall that the domination relation Ḋ< does not allow a subset to be dominated by an empty subset.
If the same selection criteria are applied consistently, the child subsets of (w₂, q₂) will also be selected over the subset (w₁, q₁). Eventually, (w₂, q₂) will be split into enough child subsets that it can either be determined that all of the children are empty, or that one of them is non-empty. In the former case, the empty children are pruned from the collection, since all empty sets are dominated, which frees up the selection criteria to select (w₁, q₁). In the latter case, there is at least one non-empty child subset of (w₂, q₂); call this subset (w*, q*). Since it is known that this subset is not empty, and it is also known that every extension of w* is preferred to every extension of w₁, the subset (w₁, q₁) is dominated by (w*, q*), and can be pruned from the collection.

Whether or not the selected subset eventually turns out to be empty, the selection criteria described above are often cheaper, and never worse, than the inverse policy. If (w₁, q₁) and its children were preferred over (w₂, q₂) in the above example, then the search cost would be higher: whether (w₁, q₁) is empty or non-empty, so long as (w₂, q₂) is non-empty, the child subsets of (w₁, q₁) are searched, followed by the children of (w₂, q₂), whereas under the first policy only the children of (w₂, q₂) would be searched. If (w₁, q₁) were non-empty and (w₂, q₂) were empty, then the children of both subsets must be searched under either selection criterion, and the costs are the same. Subsets are therefore selected from the collection according to the first policy, since it can reduce the amount of search required.

The selection policy prefers subset (w₂, q₂) over (w₁, q₁) when shuffle(w₁, w₂) leads to an accept-all state in P. This is the same as the definition of Ḋ<, except that the emptiness test for (w₂, q₂) is omitted. Call this new relation D′<. The subsets in the collection are partially ordered according to D′<, and one of the subsets at the top of the order is selected. The relation is defined formally in Equation 4.11, and the selection criterion is defined in Equation 4.12.

    (w₁, q₁) D′< (w₂, q₂) iff q = δ*_P(s_P, shuffle(w₁, w₂)) and AcceptAll_P(q) = true    (4.11)

    Select(L) returns X_s ∈ L s.t. (∀Xᵢ ∈ L) (X_s, Xᵢ) ∉ D′<    (4.12)

4.2.2.3 Fully Modified Branch-and-Bound Algorithm

The full implementation of Enumerate((C, P), A, n) must be able to enumerate up to n hypotheses that are both in the deductive closure of (C, P) and in the set A. This is accomplished with a few simple modifications, as shown in Figure 4.14. If the selected subset, X_s, contains only undominated hypotheses, then instead of returning a single element of X_s as is done in BranchAndBound-2, X_s is intersected with A, and elements are enumerated from X_s ∩ A until all n hypotheses have been enumerated or there are no more hypotheses in the intersection. X_s is marked as enumerated, and remains in the collection. If additional hypotheses are needed, the search continues until another undominated subset is found, and its elements are enumerated as mentioned above. When all n hypotheses have been enumerated, or all of the subsets in the collection have been marked, the search halts and the list of hypotheses enumerated so far is returned. This list contains no more than n hypotheses, but may contain fewer if the intersection of A and the undominated elements of X contains fewer than n hypotheses.
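Before turning to the full algorithm, the three inexpensive subset tests developed above (Equations 4.5-4.6, 4.9-4.10, and 4.11) can be summarized in a few lines of code. This is a hypothetical sketch: C and P are assumed to expose the DFA interface of Section 4.1.3 (dead and accept_all returning True, False, or None for unknown), and shuffle and delta_star (the extended next-state function δ*) are assumed helpers.

def cheap_dominated(C, P, s1, s2, shuffle, delta_star):
    """Ḋ< of Equations 4.5-4.6: (w1, q1) is certainly dominated by (w2, q2)."""
    (w1, q1), (w2, q2) = s1, s2
    q = delta_star(P, P.start, shuffle(w1, w2))
    return P.accept_all(q) is True and C.dead(q2) is False

def cheap_not_dominated(C, P, s1, s2, shuffle, delta_star):
    """Ḋ≮ of Equations 4.9-4.10: (w1, q1) is certainly not dominated."""
    (w1, q1), (w2, q2) = s1, s2
    if C.dead(q1) is True or C.dead(q2) is True:
        return True
    q = delta_star(P, P.start, shuffle(w1, w2))
    return P.dead(q) is True

def select_preference(P, s1, s2, shuffle, delta_star):
    """D'< of Equation 4.11: prefer s2 over s1 (no emptiness test)."""
    (w1, _), (w2, _) = s1, s2
    q = delta_star(P, P.start, shuffle(w1, w2))
    return P.accept_all(q) is True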
The algorithm incorporating these modifications is called BranchAndBound-3, and is shown in Figure 4.14.

    BranchAndBound-3(X, Ḋ<, Ḋ≮, n, A) returns n elements of
                                       {x ∈ X | ∀y ∈ X, x ⊀ y} ∩ A
        X is a set of hypotheses.
        Ḋ< is a domination relation on subsets of X derived from <.
        Ḋ≮ is a not-dominated relation on subsets of X derived from <.
        n is the number of solutions requested.
        < is a partial ordering over the elements of X.
    BEGIN
        L ← {X}
        Solutions ← {}
        Enumerated ← {}
        WHILE |Solutions| < n and L − Enumerated is not empty DO
            X_s ← Select(L − Enumerated)
            IF (X_s Ḋ≮ Xᵢ) for every Xᵢ in L THEN
                Add hypotheses from X_s ∩ A to Solutions until |Solutions| = n
                    or there are no more hypotheses in X_s ∩ A.
                Add X_s to Enumerated.
            ELSE
                Replace X_s in L with Split(X_s).
                Dominated ← {Xᵢ ∈ L | ∃Xⱼ ∈ L (Xᵢ Ḋ< Xⱼ)}
                L ← L − Dominated
            END IF
        END WHILE
        RETURN Solutions
    END BranchAndBound-3.

Figure 4.14: Branch-and-Bound That Returns n Solutions Also in A.

RS-KII implements Enumerate((C, P), A, n) by calling BranchAndBound-3 with the arguments shown in Figure 4.15. These are the same arguments passed to BranchAndBound-2 in order to implement Enumerate((C, P), Σ*, 1), with the addition of the n and A arguments.

    Enumerate((C, P), A, n) = BranchAndBound-3(C, Ḋ<, Ḋ≮, n, A) where

    • A subset of C is denoted (w, q), where w ∈ Σ* and q = δ*_C(s_C, w).

    • (w₁, q₁) Ḋ≮ (w₂, q₂) iff ((w₁, q₁) × (w₂, q₂)) ∩ P = ∅, where
      ((w₁, q₁) × (w₂, q₂)) ∩ P = ∅ iff Dead_C(q₁) = true or Dead_C(q₂) = true
      or (q = δ*_P(s_P, shuffle(w₁, w₂)) and Dead_P(q) = true)

    • (w₁, q₁) Ḋ< (w₂, q₂) iff ((w₁, q₁) × (w₂, q₂)) ⊆ P and (w₂, q₂) ≠ ∅, where
      ((w₁, q₁) × (w₂, q₂)) ⊆ P iff q = δ*_P(s_P, shuffle(w₁, w₂)) and
      AcceptAll_P(q) = true, and (w₂, q₂) ≠ ∅ iff Dead_C(q₂) = false

    • Split((w, q)) = {(w · σ, q′) | σ ∈ Σ and q′ = δ_C(q, σ)}

    • Select(L) returns X_s ∈ L s.t. (∀Xᵢ ∈ L) (X_s, Xᵢ) ∉ D′<, where
      (w₁, q₁) D′< (w₂, q₂) iff q = δ*_P(s_P, shuffle(w₁, w₂)) and
      AcceptAll_P(q) = true

Figure 4.15: Parameters to BranchAndBound-3 for Implementing Enumerate.

4.2.2.4 Efficient Selection and Pruning

The collection L must be represented intelligently in order to minimize the complexity of selecting subsets and pruning dominated subsets. In RS-KII, L is maintained as a lattice representing the known dominance relations among the subsets in L. Each node in L is a subset, and an edge from Xᵢ to Xⱼ indicates that Xⱼ D′< Xᵢ is true. Selecting a hypothesis is a matter of selecting an element from the top of the lattice. The nodes at the top are maintained as a list, with each top node pointing to the next top node in the list.

A subset Xᵢ is dominated if there is some subset Xⱼ in the collection such that Xᵢ D′< Xⱼ and Xⱼ is not empty. Thus, if the selected subset is discovered to be non-empty, every subset below it in the lattice is pruned. If the selected subset is discovered to be empty, then it is removed from the top of the lattice, and the subsets it used to dominate are promoted to the top of the lattice if they are not dominated by other top nodes.

When the selected subset, X_s, is split, it is removed from the lattice and replaced with its children. If some subset Xᵢ is below X_s in the lattice, then Xᵢ is below each child of X_s as well. This is because Xᵢ D′< X_s is true if and only if every hypothesis in Xᵢ is less preferred than every hypothesis in X_s, so the relation holds for every subset of X_s. However, D′< is not known for every pair of subsets.
Chapter 5

RS-KII and AQ-11

AQ-11 [Michalski, 1978] is an induction algorithm that induces a hypothesis in the VL1 hypothesis space language [Michalski, 1974] from examples and a user-defined evaluation function. RS-KII can emulate the AQ-11 algorithm by utilizing this knowledge and the biases implicit in AQ-11. When RS-KII utilizes only this knowledge, the computational complexity of RS-KII is a little worse than that of AQ-11. RS-KII can also utilize knowledge that AQ-11 cannot, such as a domain theory. By using this knowledge in conjunction with the AQ-11 knowledge sources described above, RS-KII effectively integrates this knowledge into AQ-11, forming a hybrid induction algorithm.

RS-KII's ability to emulate AQ-11 and to integrate additional knowledge are demonstrated in this chapter. The AQ-11 algorithm is described in Section 5.1. This is followed in Section 5.2 by a description of the knowledge used by AQ-11, and implementations of RS-KII translators for that knowledge. This section also describes RS-KII translators for domain theories, which AQ-11 cannot use. In Section 5.3, RS-KII uses these translators to solve a synthetic induction task. The knowledge made available by this task includes the knowledge used by AQ-11, plus a domain theory. It is demonstrated that when RS-KII uses only the knowledge used by AQ-11, both RS-KII and AQ-11 induce the same hypothesis. When RS-KII uses the domain theory in addition to the AQ-11 knowledge, RS-KII induces a more accurate hypothesis than the one induced by AQ-11. In Section 5.4, the computational complexity of RS-KII is analyzed and compared to that of AQ-11. When using only the knowledge available to AQ-11, RS-KII's complexity is a little worse than that of AQ-11.

5.1 AQ-11 Algorithm

The AQ-11 induction algorithm [Michalski, 1978, Michalski and Chilausky, 1980] induces a hypothesis from a list of noise-free positive and negative examples, and a user-defined evaluation function that Michalski calls a "lexicographic evaluation function," or LEF [Michalski, 1978]. Since the examples are assumed to be noise-free, AQ-11 induces a hypothesis that is strictly consistent with the examples. That is, the induced hypothesis covers all of the positive examples and none of the negative examples. There may be several hypotheses consistent with the examples, in which case the LEF is used to select among them. Ideally, the selected hypothesis is the one most preferred by the LEF. However, finding a global maximum of the LEF is intractable, so AQ-11 settles for a local maximum. A locally optimal hypothesis is found by a beam search, which is guided by the LEF.

The hypothesis space for AQ-11 is one described by the VL1 hypothesis space language. This language is described in Section 5.1.1. The instance space, from which the examples are selected, is described in Section 5.1.2. The algorithm itself is described in Section 5.1.3.

5.1.1 The Hypothesis Space

The hypothesis space consists of sentences in the VL1 concept description language [Michalski, 1974]. This is actually a family of languages, parameterized by a list of features and feature-values.
Specific hypothesis space languages are obtained by instantiating VL1 with these parameters. Figure 5.1 describes a regular grammar for the VL1 family of languages. This grammar is parameterized by a list of (f_i, V_i) pairs, where f_i is a feature name and V_i is the set of values that the feature f_i can take. The set of values is expressed as a regular grammar, which simplifies the specification of the VL1 grammar. Any finite or countably infinite set of values can be expressed as a regular grammar by mapping each value onto a binary integer in 1(0|1)* (i.e., the set of binary numbers greater than zero).

Parameters: (f1, V1), (f2, V2), ..., (fk, Vk)

    VL1      → TERM | TERM or VL1
    TERM     → SELECTOR | SELECTOR TERM
    SELECTOR → "[" f1 (< | ≤ | = | ≠ | > | ≥) V1 "]"
             | "[" f2 (< | ≤ | = | ≠ | > | ≥) V2 "]"
             | ...
             | "[" fk (< | ≤ | = | ≠ | > | ≥) Vk "]"

Figure 5.1: The VL1 Concept Description Language.

Sentences (hypotheses) in VL1 languages are disjunctive normal form (DNF) expressions of selectors. A sentence in this language is a disjunct of terms, and a term is a conjunct of selectors. A selector is an expression of the form [f_i # v], where f_i is a feature, # is a relation from the set {=, ≠, <, ≤, >, ≥}, and v is one of the values that feature f_i can take. For example, the following is a hypothesis in one possible VL1 language: [color = red][size > 20] or [size < 5].

5.1.2 Instance Space

An instance in AQ-11 is an ordered tuple, (v1, v2, ..., vk), in which v_i is the value of feature f_i for that instance. For example, if there are two features, size and color, then instances would be tuples such as (50, red) and (25, green). Examples are classified instances. An example is positive if it is covered by the target concept, and negative if it is not.

An instance is covered by a VL1 hypothesis if it satisfies the hypothesis. That is, the hypothesis must contain at least one term in which every selector is satisfied by the instance. A selector [f_i # v] is satisfied by instance (v1, v2, ..., vk) if v_i # v. Recall that # is a relation from {=, ≠, <, ≤, >, ≥}. For example, the hypothesis [color = red][size > 20] or [size < 5] covers the instance (50, red), since red = red and 50 > 20. The hypothesis does not cover (25, green), since the instance satisfies neither term.
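The coverage test just described is mechanical, and a small Python sketch may make it concrete. The encoding below (a hypothesis as a list of terms, a term as a list of (feature-index, relation, value) selectors) is an illustrative assumption, not the representation used by RS-KII or AQ-11.

import operator

# Relations of the VL1 selector language, keyed by an ASCII symbol.
RELATIONS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
             "!=": operator.ne, ">": operator.gt, ">=": operator.ge}

def covers(hypothesis, instance):
    """True iff some term has all of its selectors satisfied by the instance.

    hypothesis: list of terms; each term is a list of (i, rel, v) selectors,
    meaning [f_i rel v]. instance: tuple of feature values (v1, ..., vk).
    """
    return any(all(RELATIONS[rel](instance[i], v) for (i, rel, v) in term)
               for term in hypothesis)

# Example: [color = red][size > 20] or [size < 5], with f0 = size, f1 = color.
h = [[(1, "=", "red"), (0, ">", 20)], [(0, "<", 5)]]
assert covers(h, (50, "red"))        # covered by the first term
assert not covers(h, (25, "green"))  # satisfies neither term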
5.1.3 The Search Algorithm

The AQ-11 algorithm searches the hypothesis space for a hypothesis that is consistent with all of the examples and locally maximizes the LEF. The algorithm is shown in two parts in Figure 5.2 and Figure 5.3.

AQ-11 is essentially a separate-and-conquer [Pagallo and Haussler, 1990] algorithm. The positive examples are placed in a "pot." A term is found that covers at least one of the positive examples in the pot and none of the negative examples. The positive examples covered by the term are removed from the pot, and the process repeats on the remaining examples in the pot. Eventually, every positive example is covered by at least one of the terms, and no negative example is covered by any of the terms. The terms are disjoined and returned as the induced hypothesis. This main loop of the algorithm is shown in Figure 5.2.

The second part of the algorithm is an inner loop that searches the space of terms for a term that covers a positive example called the seed, but none of the negative examples. The seed is selected in each iteration of the main loop from the pot of uncovered positive examples. There are many terms that satisfy these criteria. The term returned to the main loop by the search is ideally one that maximizes the LEF. In practice, the space of terms is so huge that searching it for the best term would be intractable. Instead, a beam search with width b is used to find a locally optimal term. The width of the beam is determined by the user.

The beam initially contains the term true. On each iteration, each term in the beam is replaced with its children. Child terms are generated from the parent term by conjoining a selector to the end of the parent. This selector must cover the positive seed, and fail to cover at least one new negative example that was covered by the parent. This ensures progress towards a term that covers the seed but none of the negative examples. The best b terms in the beam are retained, as determined by the LEF. When every term in the beam covers the seed but none of the negative examples, the best term in the beam is returned to the main loop. The beam search is shown in Figure 5.3.

Algorithm AQ-11(PosList, NegList, LEF, b) returns h
    PosList is a list of positive examples
    NegList is a list of negative examples
    LEF is a term evaluation function
    b is the beam width
BEGIN
    h = false
    LOOP
        Select seed from PosList
        term = LearnTerm(seed, NegList, LEF, b)
        h = h ∨ term
        Remove from PosList every example covered by term
    UNTIL PosList is empty
    RETURN h
END AQ-11

Figure 5.2: The Outer Loop of the AQ-11 Algorithm.

Algorithm LearnTerm(seed, NegList, LEF, b) returns term
    seed is a positive example
    NegList is a list of negative examples
    LEF is a term evaluation function
    b is the beam width
BEGIN
    Terms = {true}
    LOOP
        (1) Select neg-seed from NegList
        (2) S = set of selectors covering seed but not neg-seed
        (3) Children = {t · selector | t ∈ Terms and selector ∈ S}
        (4) Terms = best b terms in Children according to LEF
        (5) Remove from NegList examples covered by none of the terms in Terms
    UNTIL NegList is empty
    RETURN the best element of Terms according to LEF
END LearnTerm

Figure 5.3: The Inner Loop of the AQ-11 Algorithm.
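Read together, Figures 5.2 and 5.3 translate almost line for line into code. The Python sketch below is a simplified rendering under stated assumptions: covers_term tests whether a term covers an instance, selectors_against(seed, neg) is a hypothetical helper producing the selectors of step (2), and lef scores a term with higher meaning better; tie-breaking and seed selection policy are ignored.

def learn_term(seed, negatives, selectors_against, covers_term, lef, b):
    """Beam search of Figure 5.3: find a term covering the seed, no negatives."""
    terms = [[]]                       # the empty term plays the role of `true`
    pot = list(negatives)
    while pot:
        neg = pot[0]                                         # step (1)
        S = selectors_against(seed, neg)                     # step (2)
        children = [t + [s] for t in terms for s in S]       # step (3)
        terms = sorted(children, key=lef, reverse=True)[:b]  # step (4)
        pot = [n for n in pot                                # step (5)
               if any(covers_term(t, n) for t in terms)]
    return max(terms, key=lef)

def aq11(positives, negatives, selectors_against, covers_term, lef, b):
    """Separate-and-conquer loop of Figure 5.2."""
    pot, hypothesis = list(positives), []
    while pot:
        seed = pot[0]
        term = learn_term(seed, negatives, selectors_against,
                          covers_term, lef, b)
        hypothesis.append(term)        # disjoin the new term onto h
        pot = [p for p in pot if not covers_term(term, p)]
    return hypothesis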
5.2 Knowledge Sources and Translators

The knowledge used by AQ-11 [Michalski, 1978] consists of positive and negative examples, a user-defined lexicographic evaluation function (LEF), a list of features, and a set of values for each feature. The features and values parameterize the VL1 concept description language, and the remaining knowledge drives the search.

In order for RS-KII to use this knowledge, it must be expressed in terms of constraints and preferences. This is accomplished by translators, which are described below. In addition to the knowledge used by AQ-11, RS-KII can utilize any other knowledge for which a translator can be constructed. A translator for one such knowledge source, a domain theory, is described below. Translators for other knowledge sources are also possible.

5.2.1 Examples

AQ-11 induces hypotheses that are strictly consistent with the examples. That is, the induced hypothesis must cover all of the positive examples and none of the negative examples. A positive example is translated into a constraint satisfied only by hypotheses that cover the example. A negative example is translated into a constraint satisfied only by hypotheses that do not cover the example.

The implementation of this example translator is shown in Figure 5.4. It takes as input a hypothesis space and an example. The hypothesis space is an instantiation of VL1 with appropriate features and values. The function Covers(H, i) returns the set of hypotheses in H that cover instance i, and the function Excludes(H, i) returns the set of hypotheses in H that do not cover instance i. Both sets are represented as regular expressions (i.e., regular sets). The regular expression returned by Covers(H, i) is shown in Figure 5.5. Excludes(H, i) returns the complement of Covers(H, i).

TranAQExample(VL1((f1, V1), (f2, V2), ..., (fk, Vk)), ((v1, v2, ..., vk), class)) → (C, {})
where
    C = Covers(VL1((f1, V1), ..., (fk, Vk)), (v1, v2, ..., vk))   if class = positive
    C = Excludes(VL1((f1, V1), ..., (fk, Vk)), (v1, v2, ..., vk)) if class = negative

Figure 5.4: Example Translator for VL1.

The regular grammar returned by Covers(H, i) recognizes all hypotheses in H that are satisfied by instance i. H is an instantiation of VL1, and i is an ordered vector of feature values, (v1, v2, ..., vk). A hypothesis covers an instance if the hypothesis contains at least one term in which every selector is satisfied by the instance. In general, selector [f_i # v] covers (v1, v2, ..., vk) if v_i # v, where # is a relation in {<, ≤, =, ≠, >, ≥}. For instance, the selector [f3 < 12] covers instance (red, 18, 9), since 9 < 12.

Figure 5.5 shows the regular expression returned by Covers(H, i). At the top level, the expression is (ANY-TERM or)* COVERING-TERM (or ANY-TERM)*. This says that at least one term in the hypothesis must cover the instance, namely the term recognized by COVERING-TERM. A term covers the instance if every selector in it covers the instance. COVERING-TERM expands to COVERING-SELECTOR+, which is a conjunction of covering selectors. COVERING-SELECTOR is the set of selectors that are satisfied by the instance. The instance is an ordered vector of feature values, (v1, v2, ..., vk). A selector covers (v1, v2, ..., vk) if it is of the form [f_i # v], where v_i # v and # is a relation in {<, ≤, =, ≠, >, ≥}. This set of selectors is recognized by the regular expression COVERING-SELECTOR.

The definition of COVERING-SELECTOR is a little obscure. An instance is covered by a selector of the form [f_i < v] if the value of the instance on feature f_i is less than v. That is, the instance is covered by all selectors in the set {[f_i < v] | v > v_i}, where v_i is the value of the instance on feature f_i. A similar analysis holds for the other relations, ≤, =, ≠, >, and ≥. COVERING-SELECTOR expands into a union of regular expressions for each of these sets. Each of these sets can be written as a regular expression. For example, the regular expression for {[f2 < v] | v > 10} is [f2 < (1-9)(0-9)+]. The function Excludes(H, i) returns the complement of the grammar returned by Covers(H, i).

Covers(VL1((f1, V1), (f2, V2), ..., (fk, Vk)), (v1, v2, ..., vk)) → G where

    G = (ANY-TERM or)* COVERING-TERM (or ANY-TERM)*
    ANY-TERM = SELECTOR+
    COVERING-TERM = COVERING-SELECTOR+
    SELECTOR = "[" f1 (< | ≤ | = | ≠ | > | ≥) V1 "]" | ... | "[" fk (< | ≤ | = | ≠ | > | ≥) Vk "]"
    COVERING-SELECTOR = {[f_i # v] | v_i # v and # ∈ {<, ≤, =, ≠, >, ≥}}
        =  "[" f1 < {v ∈ V1 | v > v1} "]" | ... | "[" fk < {v ∈ Vk | v > vk} "]"
         | "[" f1 ≤ {v ∈ V1 | v ≥ v1} "]" | ... | "[" fk ≤ {v ∈ Vk | v ≥ vk} "]"
         | "[" f1 = v1 "]" | ... | "[" fk = vk "]"
         | "[" f1 ≠ (V1 − {v1}) "]" | ... | "[" fk ≠ (Vk − {vk}) "]"
         | "[" f1 > {v ∈ V1 | v < v1} "]" | ... | "[" fk > {v ∈ Vk | v < vk} "]"
         | "[" f1 ≥ {v ∈ V1 | v ≤ v1} "]" | ... | "[" fk ≥ {v ∈ Vk | v ≤ vk} "]"
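The COVERING-SELECTOR sets above are straightforward to generate mechanically. As a rough illustration, the Python sketch below enumerates them for one feature of an instance, using ordinary strings rather than regular grammars; the value set is assumed finite so it can be listed explicitly, an assumption made here only to keep the example small.

def covering_selectors(i, values, v_i):
    """Selectors [f_i # v] satisfied by an instance whose i-th feature is v_i.

    values: the finite set V_i of legal values for feature f_i.
    Returns the selectors as plain strings, one per (relation, value) pair.
    """
    sels = [f"[f{i} = {v_i}]"]
    sels += [f"[f{i} != {v}]" for v in values if v != v_i]
    sels += [f"[f{i} < {v}]" for v in values if v > v_i]
    sels += [f"[f{i} <= {v}]" for v in values if v >= v_i]
    sels += [f"[f{i} > {v}]" for v in values if v < v_i]
    sels += [f"[f{i} >= {v}]" for v in values if v <= v_i]
    return sels

# For V2 = {5, 10, 20} and an instance with v2 = 10, the covering selectors
# include [f2 = 10], [f2 < 20], [f2 <= 10], [f2 > 5], and so on.
print(covering_selectors(2, {5, 10, 20}, 10))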
Figure 5.5: Regular Expression for the Set of VL1 Hypotheses Covering an Instance.

5.2.2 Lexicographic Evaluation Function

AQ-11 searches the hypothesis space in a particular order, and returns the first hypothesis in the order that is consistent with the examples. The search order therefore expresses a preference for hypotheses that are visited earlier in the search over those that are visited later. The search order is partly determined by the search algorithm, and partly determined by the lexicographic evaluation function (LEF), which guides the search. The LEF and the search algorithm are translated conjointly into a preference for hypotheses that occur earlier in the search.

Expressing this preference as a regular grammar can be somewhat difficult. Given two hypotheses, h1 and h2, determining which of them is visited first by the search is often difficult to extract just from the information available in h1 and h2. Often, the only way to determine their relative ordering is to run the search and see which one is visited first. This is clearly impractical. However, for some searches, the ordering can be determined directly from h1 and h2. For example, in best-first search, h1 is generally visited before h2 if h1 is preferred by the evaluation function, although the topology of the search tree also affects the visitation order.

The beam search used by AQ-11 falls into the class of search orderings that are difficult to express. However, when the beam width is one, beam search devolves into hill-climbing search, which can be expressed as a regular grammar, though somewhat awkwardly. Although the biases imposed by some search algorithms are difficult to represent, it should be remembered that these biases are themselves approximations of intractable biases. For example, AQ-11 uses a beam search to find a locally optimal approximation to the globally optimal term. Although the beam search is difficult to encode in RS-KII, it may be possible to find some other approximation of the intractable bias that can be expressed more naturally in RS-KII. This is an area for future work. The remainder of this subsection describes the search ordering used by AQ-11, and how a restriction of this ordering can be expressed as a regular grammar in RS-KII.

AQ-11 uses a two-tiered search. At the bottom level, it conducts a beam search through the space of terms for a term that covers a specified positive example and none of the negative examples. This search is guided by the LEF. Let the search order over the terms be specified by ≺_t, where t1 ≺_t t2 if t1 and t2 are terms and t1 is visited before t2. The LEF may assign equivalent evaluations to some terms, in which case it does not matter which term is visited first; in this case, there is no ordering between the two terms.

The top level is a search through the space of hypotheses. In each iteration, the term found by the beam search is disjoined to the end of the current hypothesis. This can be viewed as a non-backtracking hill-climbing search, where children are generated by disjoining a term to the parent hypothesis, and the terms are ordered according to ≺_t, with terms occurring earlier in ≺_t being selected first. Let ≺_h be the order in which AQ-11 visits hypotheses, where h1 ≺_h h2 if h1 and h2 are both hypotheses and h1 is visited before h2.
The LEF and the fixed search algorithm are translated into a preference defined in terms of ≺_h. Namely, the preference set P is {(x, y) ∈ H × H | y ≺_h x}, which says that hypothesis x is less preferred than y if x is visited after y. P is essentially the preference set for {(a, b) ∈ H × H | a ≺_h b}, except that the order of the tuples is reversed. If ≺_h can be expressed as a regular grammar, it is likely that P can also be expressed as a regular grammar. To express ≺_h as a regular grammar, a decision procedure is first defined that determines whether h1 ≺_h h2 for arbitrary hypotheses h1 and h2. This procedure is then expressed as a regular grammar.

A hill-climbing search first visits the root of the search tree, followed by the best child of that root. The subtree rooted at the child is searched recursively in the same fashion until a goal node is found. If no goal were found after searching to the bottom of the tree, the search would backtrack from the leaf node and search the next best child of the leaf's parent. This is a pre-order traversal of the search tree. The root is visited first, and then the subtrees rooted at each child node are searched recursively from best child to worst child, as determined by the evaluation function.

The top-level search of AQ-11 is a hill-climbing search. Let hypothesis h1 be t1,1 ∨ t1,2 ∨ ... ∨ t1,n, and let hypothesis h2 be t2,1 ∨ t2,2 ∨ ... ∨ t2,m, where t_i,j is a term. The top-level search visits h1 before h2 if h1 comes before h2 in pre-order. This can be determined by comparing the terms of each hypothesis from left to right. If t1,1 ≺_t t2,1, then h1 comes first, and if t2,1 ≺_t t1,1, then h2 comes first. If the first terms of both hypotheses are the same, then the next pair of terms, t1,2 and t2,2, are compared in the same fashion. This continues until the terms differ, or until one hypothesis runs out of terms. In the latter case, the longer hypothesis is an extension of the shorter one, and therefore a descendant of the shorter one. Hill climbing visits descendants after visiting the parent, so the shorter one is visited first.

This decision procedure is based on ≺_t, the order in which the beam search visits the terms, as guided by the LEF. In a beam search, the terms in the beam can be visited in any order. There is one beam for each iteration of the search, and the terms in each beam are visited after the terms in all of the beams for prior iterations. Terms that participate in none of the beams are never visited, and are therefore least preferred. Determining whether term t1 is visited before t2 is difficult, since it requires knowing which beams the terms are in, which in turn requires knowing the beams for each of the previous iterations. This information is difficult to compute from t1 and t2, other than by running a beam search to determine whether t1 or t2 is visited first.

However, when the beam width is set to one, a beam search becomes a hill-climbing search, for which it is easier to specify the search order. A hill-climbing search visits the search tree in pre-order, as described above for the top-level search. Nodes in this search tree are terms. A term is a conjunction of selectors, s1 s2 ... sn. To determine which of two terms is visited first, compare their selectors from left to right. Let t1 = s1,1 s1,2 ... s1,n and let t2 = s2,1 s2,2 ... s2,m. Term t1 is visited first if the LEF assigns a better evaluation to s1,1 than to s2,1.
If s2,1 is better, then t2 comes first. If both s1,1 and s2,1 are equally good, then compare s1,2 and s2,2 in the same fashion. Compare from left to right until one term has a better selector. If one term has no more selectors, then the shorter term is preferred. One caveat for this search is that the LEF is an evaluation for terms, not for selectors. For a term s1 s2 ... sn, the evaluation of selector sn is LEF(s1 s2 ... sn), not LEF(sn).

To summarize, the preference order between two hypotheses, h1 and h2, is determined as follows. Compare the terms of the two hypotheses from left to right according to ≺_t. The terms are compared by ≺_t in the same fashion: the selectors of each term are compared from left to right until one term runs out of selectors, or until the selector of one term has a lower evaluation than the corresponding selector of the other term. Recall that the evaluation of the ith selector in a term is LEF(s1 s2 ... si), where s1 through s_{i-1} are the preceding selectors in the term. This is the procedure for determining whether h1 ≺_h h2.

The ordering ≺_h is essentially a lexicographic ordering. A lexicographic ordering is an alphabetical ordering over strings in some alphabet, where the symbols in the alphabet are ranked according to some total ordering. For example, a lexicographic ordering over strings in (a-z)* is the standard dictionary order. The ≺_h ordering can be mapped onto a lexicographic ordering over strings in (0-9)*, where the symbols in the alphabet (0-9) have the usual ordering for digits. This ordering can be easily expressed as a regular grammar, as can the mapping from hypotheses onto digit strings. These two grammars can be composed into a single regular grammar according to a construction that will be explained later. This is the regular grammar for ≺_h.

A hypothesis is mapped onto a digit string as follows. Each selector is replaced by the evaluation assigned to it by the LEF. Evaluations are assumed to be fixed-length integers of length l, where lower valued integers correspond to better evaluations. The ∨ symbols between terms are replaced by a zero. This mapping is shown in Figure 5.6. Hypothesis h1 is preferred to hypothesis h2 if M(h1) ≺_lex M(h2), where ≺_lex is the standard lexicographic ordering over strings of digits.

    M(t1 ∨ t2 ∨ ... ∨ tn) = Mt(t1) · 0 · Mt(t2) · 0 · ... · 0 · Mt(tn)
    Mt(s1 s2 ... sm) = LEF(s1) LEF(s1 s2) ... LEF(s1 s2 ... sm)

Figure 5.6: Mapping Function from Hypotheses onto Strings.

TranLEF(H, LEF, examples) → (C, P) where
    C = H (i.e., no constraints)
    P = {(x, y) ∈ H × H | M(y) ≺_lex M(x)}
    (M has access to H, the LEF, and the examples)

Figure 5.7: Translator for the LEF.

The reason for placing a zero between terms can be seen in the following example. Let h1 be t1 ∨ t2 ∨ s1 and let h2 be t1 ∨ t2 s2 ∨ t3, where t2 s2 is the conjunction of term t2 and selector s2. Without the zeros between terms, M(h1) is Mt(t1) · Mt(t2) · LEF(s1), and M(h2) is Mt(t1) · Mt(t2) · LEF(t2 s2). If LEF(t2 s2) is less than LEF(s1), then h2 will be preferred to h1, which is not correct. The evaluation of t2 s2 is being improperly compared to the evaluation of s1, the start of a new term, instead of registering that t2 has terminated, because the presence of the disjunction (∨) is not being taken into account. Appending a zero to the end of a term's evaluation assigns a better evaluation to a terminated term than to any extension of that term, thereby producing the correct comparison.
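A small Python sketch of the mapping M may help. It assumes a lef callable that scores a prefix of selectors with a fixed-width integer in which smaller means better (the l-digit encoding described above, with digits drawn from 1-9); the representation of a hypothesis as a list of lists of selectors is likewise an assumption of the sketch.

def map_hypothesis(hypothesis, lef, width):
    """Figure 5.6: map a hypothesis onto a digit string.

    Each selector contributes lef(prefix) for the prefix of its term ending
    at that selector; a run of `width` zeros marks each term boundary.
    Smaller strings come earlier lexicographically, i.e. are preferred.
    """
    pieces = []
    for t, term in enumerate(hypothesis):
        if t > 0:
            pieces.append("0" * width)          # the zero between terms
        for i in range(len(term)):
            score = lef(term[: i + 1])          # evaluation of s1 ... si
            pieces.append(str(score).zfill(width))
    return "".join(pieces)

# With this encoding, h1 is preferred to h2 exactly when
# map_hypothesis(h1, lef, l) < map_hypothesis(h2, lef, l) as Python strings,
# which is the standard lexicographic comparison.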
Given this mapping, the preference ordering over the hypotheses imposed by the LEF can be expressed as a preference set P. A hypothesis is preferred if its mapped string comes earlier in the lexicographic ordering, and less preferred if it comes later. The preference set P is therefore {(x, y) ∈ H × H | M(y) ≺_lex M(x)}, where H is the hypothesis space and ≺_lex is the standard lexicographic ordering over strings of digits. This leads to the translator for the LEF shown in Figure 5.7. This translator takes as input the hypothesis space (H) and the LEF. Since the LEF often depends on additional knowledge, this information is also provided to the translator. In most cases, this information is the examples covered and not covered by the term being evaluated, so the translator takes as input the list of examples.

The next step is to express C and P as regular sets. C is an instantiation of the VL1 hypothesis space, all of which can be expressed as regular grammars as shown in Figure 5.1. Representing P as a regular set is a little more difficult. This is discussed in Section 5.2.2.1, below.

5.2.2.1 Constructing P

A pair of hypotheses, (x, y), is a member of P if M(x) comes after M(y) lexicographically. This can be expressed as a composition of two simpler DFAs, one for the lexicographic ordering, and one for the mapping, M.

The Lexicographic Ordering. The DFA for the standard lexicographic ordering over strings of digits, ≺_lex, is fairly simple. The DFA takes as input a pair of shuffled strings, w = shuffle(w1, w2), where w1 and w2 are strings of digits from (0-9)*. The symbols in w alternate between w1 and w2, so that the first two digits in w are the first digit from w1 and the first digit from w2. These two digits are compared, and if the digit from w1 is lower, then w is accepted (w1 ≺_lex w2). If the digit from w1 is higher, then w1 comes after w2, and w is rejected. Otherwise, the next pair of symbols is compared in the same manner. If one string runs out of symbols, and the two strings are equivalent up to this point, the shorter string is preferred: if w1 is shorter, then w is accepted, and if w1 is longer, w is rejected. This DFA is shown in Figure 5.8.

[State-transition diagram]

Figure 5.8: DFA Recognizing {shuffle(x, y) ∈ (0-9)^2k | x ≺_lex y}.
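The acceptance condition of this DFA is easy to mimic procedurally. The following Python sketch decides the same relation on a shuffled pair of digit strings; it illustrates the acceptance behavior rather than the DFA's state structure, and the convention of padding the shorter string with the end marker $ is an assumption of the sketch.

def shuffle(x, y):
    """Interleave x and y, padding the shorter string with '$'."""
    n = max(len(x), len(y))
    x, y = x.ljust(n, "$"), y.ljust(n, "$")
    return "".join(a + b for a, b in zip(x, y))

def shuffled_lex_less(w):
    """Decide x ≺_lex y given w = shuffle(x, y)."""
    pairs = [(w[i], w[i + 1]) for i in range(0, len(w), 2)]
    for a, b in pairs:
        if a == "$":            # x ran out first: the shorter string wins
            return True
        if b == "$":            # y ran out first: x comes after y
            return False
        if a < b:
            return True         # the first differing digit decides
        if a > b:
            return False
    return False                # equal strings: neither precedes the other

assert shuffled_lex_less(shuffle("129", "13"))   # 129 precedes 13
assert shuffled_lex_less(shuffle("12", "129"))   # a prefix precedes its extension
assert not shuffled_lex_less(shuffle("2", "1"))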
The Mapping. The mapping, M, is expressed as a kind of DFA known as a Moore machine [Hopcroft and Ullman, 1979]. A Moore machine is a DFA that has an output string associated with each state. The output string for each state is composed of symbols from an output alphabet, Δ, and is emitted by the machine whenever it enters that state. The Moore machine for M takes as input a hypothesis, h, and emits a string w such that w = M(h). Formally, a Moore machine is specified by the tuple (Q, s, δ: Q × Σ → Q, out: Q → Δ^l, Σ, Δ), where l is a (usually small) fixed integer.

The Moore machine for M is constructed as follows. Since the mapping depends on the LEF, the Moore machine is parameterized by the LEF. The machine takes a hypothesis as input. When it recognizes a selector, it emits the string LEF(s1 s2 ... sn), where sn is the selector it has just recognized, and s1 through s_{n-1} are the preceding selectors in the term.

Since each state in the machine can have only one output string, there must be at least one state in the machine for each possible term. This is clearly impossible. However, most evaluation functions evaluate a term based on the number of positive and negative examples covered by the term. Instead of having one state for each term, the machine has one state for each way of partitioning the examples into covered and uncovered subsets. Each partitioning corresponds to a set of terms that are all assigned the same evaluation by the LEF. The list of covered and uncovered examples is passed to a modified version of the LEF, which returns an evaluation of the term based on this information rather than by looking at the term itself. The evaluation returned by the LEF is the output string for this state.

There are 2^n distinct ways to partition a list of n examples into sets of covered and uncovered examples, so the machine can have at least 2^n states. Fortunately, the DFA is represented as an implicit DFA, so all of these states do not have to be kept in memory. Even though this Moore machine has O(2^n) states, the cost of induction can be considerably less than O(2^n); it is in fact polynomial when using only the knowledge used by AQ-11, as is demonstrated in Section 5.4. Enumerating a hypothesis from the deductive closure of (C, P) involves visiting some of the states in C and P. Only in the worst case (e.g., when the deductive closure is empty) are all of the states visited. Since the DFAs for C and P are represented implicitly, the cost of enumeration is proportional to the number of states visited, not to the total number of states in each DFA.

The Moore machine for the mapping function is specified in Figure 5.9.

Behavior of the Moore Machine. The machine takes as input a hypothesis, h, and outputs M(h), where M is the mapping described previously in Figure 5.6. Recall that M(t1 ∨ t2 ∨ ... ∨ tk), where t1 through tk are terms, maps onto the string Mt(t1) · 0 · Mt(t2) · 0 · ... · 0 · Mt(tk). The mapping Mt maps a term, s1 s2 ... sm, where s1 through sm are selectors, onto the string LEF(s1) · LEF(s1 s2) · ... · LEF(s1 s2 ... sm). The LEF assigns an l-digit integer to each term. As was discussed previously, this value depends on which examples are covered by the term, and on which positive examples are covered by the previous terms in the hypothesis. Recall that a term is any sequence of selectors, so that s1, s1 s2, and so on are all terms, not just s1 s2 ... sm.

The machine keeps track of which examples are covered by the current term, and which examples are covered by the previous terms in the hypothesis. This is done by maintaining two binary vectors, Et and Eh, of length n, where n is the number of examples. The vector Et indicates which examples are covered by the current term, and Eh indicates which examples are covered by the current hypothesis (all terms but the current one). A one in position i of the vector indicates that example i is covered, and a zero indicates that example i is not covered. The vector for the hypothesis is initially all zeros, since the initial hypothesis is false. The vector for the current term is all ones, since the term is initially true.
Parameters:
    E, an ordered list of n examples
    LEF: (Eh, Et) → (1-9)^l, where
        Eh is a binary vector of length n indicating which examples are covered by the hypothesis
        Et is a binary vector of length n indicating which examples are covered by the current term
        l is a small, fixed integer
    SEL, a DFA that recognizes selectors
    f1 through fk, the features

A Moore machine for M is (Q, s, δ: Q × Σ → Q, out: Q → Δ^l, Σ, Δ) where:

    Q = {(Eh, Et, end) | Eh, Et ∈ (0|1)^n, end ∈ {0, 1}} ∪ {d}
        end is a flag indicating whether the machine has just seen the end of a
        term (1), or is in the middle of a term (0); d is the terminal state
    s = (0^n, 1^n, 0)
    δ*((Eh, Et, end), w):
        if w ∈ SEL: clear bit i of Et if example i is not covered by selector w;
            set end to 0; go to state (Eh, Et, 0)
        if w = "∨": ∨ marks the end of a term; set end to 1; set bit i of Eh to 1
            if bit i of Et is 1, else leave it unchanged; set every bit in Et
            to 1; go to state (Eh, Et, 1)
        if w = "$": go to state d, the terminal state
    δ*(d, w) = d
    out((Eh, Et, end)) = LEF(Eh, Et) if end = 0; 0^l if end = 1
    out(d) = ε, the empty string
    Σ = {[, f1, f2, ..., fk, <, ≤, =, ≠, >, ≥, 0-9, ], ∨, $}
    Δ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

Figure 5.9: Moore Machine for the Mapping.

When a selector is seen, Et is updated by clearing bit i of the vector if the selector does not cover example i. Since a term is a conjunction of selectors, if any selector in the term fails to cover an example, there is no way to cover the example by adding more selectors to the term. Therefore, seeing a selector never causes a bit in Et to be turned on. After seeing the selector, the machine outputs the value assigned by the LEF to the current term. This value depends on the examples covered by the current hypothesis and the current term: specifically, the machine outputs LEF(Eh, Et).

The machine continues in this fashion until the symbol ∨ is seen. This symbol marks the completion of the current term and the beginning of a new term. The completed term is now considered part of the current hypothesis, and a new current term is started. The new current hypothesis covers every example covered by the completed term, since the term is a disjunct of the hypothesis. Bit i of Eh is set to one if bit i of Et is one, and is left unchanged otherwise. This is equivalent to computing the logical or of Eh and Et. Since a new current term has started, the vector Et is set to all ones. The machine outputs a 0 to mark the end of the term. More specifically, it outputs a string of l zeros, since the output strings of the machine must all be of the same length, and the strings output by the LEF are of length l.

The machine continues to process symbols as described above until the $ symbol is seen, signifying the end of the hypothesis, at which point the machine halts. Specifically, it moves into a state that has an edge to itself on every input symbol, and has no output string associated with it.

The only memory available to a Moore machine is its states. Therefore, the vectors Eh and Et correspond to states in the machine. The vectors are of finite length, so there is a finite number of states in the machine. When a selector is seen, the vectors are recomputed as described above, and the machine moves to this state. Each state has an output string associated with it, which is emitted when the machine moves to that state. Specifically, the machine emits the string LEF(Eh, Et) after seeing each selector, and the string 0^l after seeing each ∨ symbol. Since the tuple (Eh, Et) does not indicate whether a selector or a ∨ symbol was just seen, an additional bit, labeled end, is added to the state. A state in the machine is therefore of the form (Eh, Et, end). When end is one, indicating that a ∨ symbol has just been seen, the output for the state is 0^l. When end is zero, the output for the state is LEF(Eh, Et).
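The state-update logic of Figure 5.9 can be phrased compactly with integer bit vectors. The Python sketch below simulates the machine at the level of whole selectors and ∨ symbols rather than individual input characters; covers_mask, which returns the bit mask of examples covered by a selector, and lef are assumed callables supplied by the task.

def run_mapping_machine(hypothesis, n, covers_mask, lef, width):
    """Simulate the Moore machine of Figure 5.9 on a hypothesis.

    hypothesis: list of terms, each a list of selectors. n: number of
    examples. Emits the digit string M(h): one LEF value per selector,
    and a run of `width` zeros at each term boundary.
    """
    eh = 0                        # Eh: covered by completed terms (all zeros)
    out = []
    for t, term in enumerate(hypothesis):
        et = (1 << n) - 1         # Et: current term starts as `true` (all ones)
        if t > 0:
            out.append("0" * width)                    # output for the ∨ state
        for sel in term:
            et &= covers_mask(sel)                     # clear lost examples
            out.append(str(lef(eh, et)).zfill(width))  # output LEF(Eh, Et)
        eh |= et                  # completed term joins the hypothesis (Eh ∨ Et)
    return "".join(out)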
Composing the Grammars. The DFA for P = {(x, y) ∈ H × H | M(y) ≺ M(x)} is computed from M (the Moore machine for the mapping) and from the DFA for ≺ (the standard lexicographic ordering over strings of digits). The idea is to use two copies of M, and pass x to one of them and y to the other. The output from the copies of M is passed to the DFA for ≺, which determines whether M(x) ≺ M(y).

Input strings to P are of the form shuffle(a, b) = a1 b1 a2 b2 ... an bn, where a and b are hypotheses, and a_i and b_i are symbols in a and b. P maintains two copies of M, Ma and Mb. On odd-numbered input symbols (i.e., symbols from a), P simulates a move on Ma, and on even-numbered symbols (from b) P simulates a move on Mb. When one move has been made in both Ma and Mb, the output strings, wa and wb, from each Moore machine are shuffled together. Moves are then simulated in the DFA for ≺ on the shuffled input string, shuffle(wb, wa); that is, the DFA tests whether wb ≺ wa. P is in an accept state iff the DFA for ≺ is in an accept state. P recognizes the set {(a, b) ∈ H × H | M(b) ≺ M(a)}.

The above construction for P is essentially a substitution of the regular language generated by the Moore machine into the regular language accepted by the DFA for ≺. Regular languages are closed under substitution [Hopcroft and Ullman, 1979], and all Moore machines generate regular languages, so this construction always yields a regular grammar for P.

An Information Gain Metric. One common way to compare terms is with an information gain metric (e.g., [Quinlan, 1986]). The information of a term is a measure of how well it distinguishes between the positive and negative examples. Terms with higher information are preferred.

Let p0 be the number of positive examples covered by the term, let n0 be the number of negative examples covered by the term, and let p1 and n1 be the number of positive and negative examples that are not covered by the term. Let p and n be the total number of positive and negative examples, respectively. The information of the term is shown in Equation 5.1.

    I = ((p0 + n0)/(p + n)) · [ (p0/(p0 + n0)) log2(p0/(p0 + n0)) + (n0/(p0 + n0)) log2(n0/(p0 + n0)) ]
      + ((p1 + n1)/(p + n)) · [ (p1/(p1 + n1)) log2(p1/(p1 + n1)) + (n1/(p1 + n1)) log2(n1/(p1 + n1)) ]    (5.1)

When this information metric is used to evaluate terms, the LEF function is defined as follows. The LEF takes as input the vectors Et and Eh, which indicate the examples covered by the current term and by the previous terms in the hypothesis. The evaluation is the information of the current term with respect to the negative examples and the positive examples that have not yet been covered by the previous terms. If the positive examples covered by the previous terms were not excluded, the same term would be selected in each call to LearnTerm. Bit i of Et is on if the term covers example i, and is off if the example is not covered. The LEF is assumed to know which bits in each vector correspond to positive and negative examples.

The total number of negative examples covered and not covered by the term (n0 and n1) can be computed from Et by counting the number of ones and zeros in the appropriate positions. To compute the number of positive examples in the pot, a bit-wise complement of Eh is first performed. A bit in the resulting vector is on if example i is not covered by any of the previous terms. The number of ones in this vector corresponding to positive examples is computed. This is p, the total number of positive examples in the pot, the set of positive examples not yet covered by the current hypothesis. The number of positive examples in the pot that are covered by the term (p0) can be determined by computing the logical and of Et with the complement of Eh. Bit i in this vector is on if example i is covered by the term, but not by any of the previous terms. To compute p1, the number of positive examples in the pot that are not covered by the term, subtract p0 from p. The values for n0, n1, p0, p1, p, and n are then substituted into Equation 5.1, which returns a real number between -1 and 1. The first l digits of this number constitute the digit string returned by LEF(Eh, Et).
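A sketch of this LEF in Python follows. It assumes the examples are identified by bit positions, with pos_mask and neg_mask marking the positive and negative positions, and it returns the raw information value of Equation 5.1 rather than the fixed-width digit encoding; turning the value into l digits is left out for brevity.

import math

def _entropy_part(p_cov, n_cov, total):
    """One bracketed component of Equation 5.1, weighted by its group size."""
    group = p_cov + n_cov
    if group == 0 or total == 0:
        return 0.0
    term = 0.0
    for c in (p_cov, n_cov):
        if c:
            term += (c / group) * math.log2(c / group)
    return (group / total) * term

def information_lef(eh, et, pos_mask, neg_mask):
    """Information of the current term (Equation 5.1), computed over the pot:
    positives not covered by previous terms (Eh), plus all negatives."""
    n = bin(neg_mask).count("1")
    n0 = bin(et & neg_mask).count("1")        # negatives covered by the term
    n1 = n - n0
    p = bin(~eh & pos_mask).count("1")        # positives still in the pot
    p0 = bin(et & ~eh & pos_mask).count("1")  # pot positives covered by the term
    p1 = p - p0
    return _entropy_part(p0, n0, p + n) + _entropy_part(p1, n1, p + n)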
5.2.3 Domain Theory

A domain theory encodes background knowledge about the target concept as a collection of horn-clause inference rules that explain why an instance is a member of the target concept. The way in which this knowledge biases induction depends on assumptions about the correctness and completeness of the theory. Each of these assumptions requires a different translator, since the bias maps onto different constraints and preferences.

A complete and correct theory can explain why every instance covered by the target concept is a member of the concept. The theory exactly describes the target concept. This is a very strong form of bias, since the theory identifies the target concept uniquely and accurately; no other knowledge is necessary. The bias can be expressed as a constraint that is satisfied only by the target concept. This bias is used in algorithms such as Explanation-Based Learning [DeJong and Mooney, 1986, Mitchell et al., 1986].

Domain theories are not always complete or correct. An incomplete theory can explain some instances, but not all of them. This occurs, for example, when the theory is overspecific. In an overspecific theory, the target concept is some generalization of the theory. The bias imposed by an overspecific theory can be translated as a constraint that is satisfied only by generalizations of the theory.

In addition to being incomplete, a theory can also be incorrect. An incorrect theory misclassifies some instances. One common kind of incorrectness is overgenerality. An overgeneral theory explains all of the examples covered by the target concept, and some that are not. An overgeneral theory can be expressed as a constraint that is satisfied only by hypotheses that are specializations of the theory. Overgeneral theories are used by algorithms such as IOE [Flann and Dietterich, 1989] and OCCAM [Pazzani, 1988].

A translator for a particular overgeneral domain theory is described below. The theory being translated is derived from the classic "cup" theory [Mitchell et al., 1986, Winston et al., 1983], as shown in Figure 5.10. Since the theory is assumed to be overgeneral, the target concept must be a specialization of the theory. This bias is expressed as a constraint, which is satisfied only by specializations of the theory. More precisely, the theory is a disjunction of several sufficient conditions for CUP, and it is assumed that some disjunction of these conditions is the correct definition for CUP.

    cup(X) :- hold_liquid(X), liftable(X), stable(X), drink_from(X).
    hold_liquid(X) :- plastic(X) | china(X) | metal(X).
    liftable(X) :- small(X), graspable(X).
    graspable(X) :- small(X), cylindrical(X) | small(X), has_handle(X).
    stable(X) :- flat_bottom(X).
    drink_from(X) :- open_top(X).

Figure 5.10: CUP Domain Theory.
The domain theory is a union of several sufficient conditions that can be expressed in terms of the operational predicates. The target concept is assumed to be equivalent to a disjunction of one or more of these conditions. The operational predicates are those that meet user-defined criteria for ease-of-evaluation and comprehensibility, among others. For this theory, the only operationality criterion for predicates is that they can be evaluated on an instance (e.g., small(X) can be determined for instance X, but cup(X) cannot). For this theory, the leaf predicates are the operational predicates. The sufficient conditions for the theory are as follows:

1. cup(X) :- plastic(X), small(X), cylindrical(X), flat_bottom(X), open_top(X).
2. cup(X) :- china(X), small(X), cylindrical(X), flat_bottom(X), open_top(X).
3. cup(X) :- metal(X), small(X), cylindrical(X), flat_bottom(X), open_top(X).
4. cup(X) :- plastic(X), small(X), has_handle(X), flat_bottom(X), open_top(X).
5. cup(X) :- metal(X), small(X), has_handle(X), flat_bottom(X), open_top(X).
6. cup(X) :- china(X), small(X), has_handle(X), flat_bottom(X), open_top(X).

The target concept is assumed to be a disjunction of one or more of these conditions. This bias is translated as a constraint satisfied by hypotheses that are equivalent to these conditions. The regular grammar for the constraint is computed by mapping each condition onto an equivalent hypothesis (or set of hypotheses), and computing the union of these hypotheses.

In order to map the conditions, the hypothesis space language is assumed to have features corresponding to each of the operational predicates in the theory. The operational predicates are all Boolean valued (e.g., small(X) is either true or false), so the corresponding features are also Boolean valued. This leads to selectors such as [small = true] and [small = false]. The features are listed below. Each feature has the same name as its corresponding predicate.

    plastic    cylindrical    flat_bottom
    china      has_handle     open_top
    metal      small

The regular grammar for the set of hypotheses corresponding to disjunctions of sufficient conditions (specializations) of the domain theory is shown in Figure 5.11. This grammar is a little less general than it could be, since it does not allow all permutations of the selectors within each term. However, the more general grammar contains considerably more rules, and permuting the selectors does not change the semantics of a hypothesis.

    C → TERM | C or TERM
    TERM → SUFFICIENT-CONDITION
    SUFFICIENT-CONDITION → (PLASTIC | CHINA | METAL) (HAS-HANDLE | CYLINDRICAL)
                           SMALL FLAT_BOTTOM OPEN_TOP
    PLASTIC → [plastic = true]
    CHINA → [china = true]
    METAL → [metal = true]
    HAS-HANDLE → [has_handle = true]
    CYLINDRICAL → [cylindrical = true]
    SMALL → [small = true]
    FLAT_BOTTOM → [flat_bottom = true]
    OPEN_TOP → [open_top = true]

Figure 5.11: Grammar for VL1 Hypotheses Satisfying the CUP Theory Bias.
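The constraint of Figure 5.11 can be pictured as the finite set of hypotheses it generates. The following Python sketch enumerates that set for the CUP theory; the representation of a condition as a list of feature names, and the helper name sufficient_conditions, are assumptions of the sketch.

from itertools import product, combinations

def sufficient_conditions():
    """The six operational sufficient conditions of the CUP theory: one
    material, one graspability feature, plus the three fixed features."""
    materials = ["plastic", "china", "metal"]
    grasps = ["has_handle", "cylindrical"]
    fixed = ["small", "flat_bottom", "open_top"]
    return [[m, g] + fixed for m, g in product(materials, grasps)]

def theory_hypotheses():
    """Every disjunction of one or more sufficient conditions, each condition
    rendered as a VL1 term of [feature = true] selectors."""
    terms = [" ".join(f"[{f} = true]" for f in cond)
             for cond in sufficient_conditions()]
    hyps = []
    for r in range(1, len(terms) + 1):
        for subset in combinations(terms, r):
            hyps.append(" or ".join(subset))
    return hyps

# Ignoring order and repetition of terms, the bias leaves
# 2^6 - 1 = 63 candidate hypotheses.
print(len(theory_hypotheses()))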
5.3 An Induction Task

This section provides two concrete examples of AQ-11 and RS-KII solving simple induction tasks. In the first task, both AQ-11 and RS-KII use only the knowledge available to AQ-11, and both algorithms learn the same hypothesis. The second task is a synthetic one designed to show the effects of making a domain theory available. The theory is the CUP theory described in the previous section. AQ-11 cannot use the theory, and so is forced to learn from the examples and the LEF alone. RS-KII has access to the domain theory as well, and this improves the accuracy of the induced hypothesis.

5.3.1 Iris Task

The first induction task is taken from the Iris domain [Fisher, 1936]. The data set contains three classes of fifty instances each. Each class corresponds to a different kind of iris:

1. Iris Setosa
2. Iris Versicolour
3. Iris Virginica

The goal is to learn a VL1 hypothesis that identifies one of the classes as distinct from the other two. For this task, the goal is to learn the first class, Iris Setosa. Instances are 4-tuples consisting of the values of four objective measurements taken from a given plant. Specifically, these features are as follows (in order):

1. Sepal length in millimeters
2. Sepal width in millimeters
3. Petal length in millimeters
4. Petal width in millimeters

The values for these features are integers between 0 and 300.

5.3.1.1 Translators for Task Knowledge

The VL1 hypothesis space is parameterized with a list of features and a set of values for each feature. There are four features, each of which can be a positive integer. This leads to the list of parameters for VL1 shown in Table 5.1. Instantiating VL1 with these parameters yields the grammar shown in Figure 5.12.

    feature    values
    f1         (1-9)(0-9)*
    f2         (1-9)(0-9)*
    f3         (1-9)(0-9)*
    f4         (1-9)(0-9)*

Table 5.1: Parameters for VL1.

    VL1      → TERM | TERM or VL1
    TERM     → SELECTOR | SELECTOR TERM
    SELECTOR → "[" f1 RELATION (1-9)(0-9)* "]"
             | "[" f2 RELATION (1-9)(0-9)* "]"
             | "[" f3 RELATION (1-9)(0-9)* "]"
             | "[" f4 RELATION (1-9)(0-9)* "]"
    RELATION → < | ≤ | = | ≠ | > | ≥

Figure 5.12: Grammar for the Instantiated VL1 Language.

Translators are needed for the positive and negative examples. An example is positive if it is an instance of the target class (e.g., Iris Setosa), and negative otherwise. Examples are translated by the TranAQExample translator shown previously in Figure 5.4.

The LEF is not specified by the Iris task. The information gain translator described earlier is a good general-purpose LEF, and is the one used in this example. The translator for the LEF is the one described in Section 5.2.2.

5.3.1.2 Results of Learning

The concept definition learned by both AQ-11 and RS-KII for Iris Setosa is the one-selector hypothesis [f3 < 30].

5.3.2 The CUP Task

The second task is a synthetic one based on the CUP domain. The knowledge consists of a list of examples, an information-gain metric for evaluating hypotheses, and the CUP domain theory of Figure 5.10. RS-KII can make use of all of this knowledge, but AQ-11 cannot use the domain theory.

The features for this task are the operational predicates of the CUP theory, as shown below. All of these features are Boolean valued.

    f1: plastic    f4: cylindrical    f7: flat_bottom
    f2: china      f5: has_handle     f8: open_top
    f3: metal      f6: small

There are also two irrelevant features, color (f9) and percent-full (f10). The color feature can take the values {red, blue, green}, and the percent-full feature can take integer values between 0 and 100 indicating the amount of liquid in the cup.

The target concept is "plastic cups with handles," which is a specialization of the CUP domain theory. The concept is represented by the VL1 hypothesis [f1 = true][f5 = true][f6 = true][f7 = true][f8 = true].
There are four examples of this concept, as shown in Table 5.2.

    ID   class   f1   f2   f3   f4   f5   f6   f7   f8   f9      f10
    e1   +       t    f    f    f    t    t    t    t    blue    10
    e2   +       t    f    f    f    t    t    t    t    red     50
    e3   -       f    t    f    f    t    t    t    t    green   60
    e4   -       t    f    f    t    t    t    t    t    blue    20

Table 5.2: Examples for the CUP Task.

The other available knowledge sources are the CUP domain theory shown previously in Figure 5.10, and an information gain metric for evaluating hypotheses. The examples are translated according to TranAQExample, as shown previously in Figure 5.4. The domain theory is translated into (C, {}), where the regular grammar for C is as shown in Figure 5.11. The information-gain metric is translated according to the translator described in Section 5.2.2.

5.3.2.1 Results of Learning

AQ-11 has no access to the domain theory. It learns a hypothesis that is consistent with the examples and preferred by the information gain metric. It learns the following hypothesis: [f2 = false][f4 = false]. This hypothesis covers any object that is neither china nor cylindrical, even if it is not a cup. In order to learn a more accurate concept, considerably more examples are necessary.

RS-KII can learn the correct hypothesis from the same number of examples. This is because RS-KII has access to the domain theory, which provides a very strong bias on the hypothesis space: it considers only hypotheses that are specializations of the domain theory. This drastically reduces the search space, increasing the accuracy of the induced concept. RS-KII learns the following concept:

    [f1 = true][f5 = true][f6 = true][f7 = true][f8 = true]

This is the only hypothesis that is both consistent with the examples and a specialization of the domain theory.

5.4 Complexity Analysis

The computational complexity of RS-KII depends on the number of knowledge fragments it utilizes as well as the nature of those fragments. A general analysis of RS-KII's complexity would depend on the nature of the knowledge, which is impossible to quantify meaningfully. However, it is possible to obtain a complexity analysis for particular classes of knowledge. This is the approach taken in this section to compare the complexity of RS-KII to that of AQ-11.

5.4.1 Complexity of AQ-11

In the following complexity analysis, e denotes the number of examples, k the number of features, and b the width of the beam search.

5.4.1.1 Cost of AQ-11's Main Loop

The AQ-11 algorithm consists of a main loop that calls LearnTerm once per iteration. On each iteration, the term found by LearnTerm is disjoined to the end of the current hypothesis, and the positive examples covered by the term are removed from the pot. Removing covered examples from the pot takes time proportional to e. The main loop iterates until the pot is empty, that is, until every positive example is covered by at least one of the terms. Since each term is guaranteed to cover at least one positive example that the other terms do not cover, the main loop makes at most one call to LearnTerm for each positive example. The worst-case complexity of the main loop is therefore O(e(x + e)), where x is the complexity of LearnTerm.

5.4.1.2 Cost of LearnTerm

LearnTerm conducts a beam search through the space of terms to find a term that covers a given positive example (the seed) and none of the negative examples.
The search maintains a "pot" of negative examples that are covered by at least one term in the beam. A set of selectors is computed that covers the positive seed but not the negative examples. Child terms are generated from the terms in the beam by extending them with the selectors in this set. The child terms are evaluated according to the LEF, and the best b terms are retained in the beam.

The cost of LearnTerm depends partly on the number of selectors that cover the positive seed but not the negative example. In theory, this set can be quite large, or even infinite, which would make AQ-11 intractable. In practice, the set can be made much smaller with a simple bias. The generation of this set of selectors is discussed below, and an upper bound on its size is derived. The cost of LearnTerm is then computed using this bound.

Generating the Set of Selectors. The complete set of selectors that cover the positive seed but not the negative example is generated as follows. Let v_{p,i} be the value of the positive seed on feature f_i, and let v_{n,i} be the value of the negative example on feature f_i. The union of the following sets of selectors over all k features is the set of selectors that cover the positive seed but not the negative example:

• {[f_i = v_{p,i}]} if v_{p,i} ≠ v_{n,i}
• {[f_i ≠ v_{n,i}]} if v_{p,i} ≠ v_{n,i}
• {[f_i < v] | v_{p,i} < v ≤ v_{n,i}} if v_{p,i} < v_{n,i}
• {[f_i ≤ v] | v_{p,i} ≤ v < v_{n,i}} if v_{p,i} < v_{n,i}
• {[f_i > v] | v_{n,i} ≤ v < v_{p,i}} if v_{p,i} > v_{n,i}
• {[f_i ≥ v] | v_{n,i} < v ≤ v_{p,i}} if v_{p,i} > v_{n,i}

The last four sets in this list can contain a large to infinite number of selectors, depending on whether the feature values map onto the integers or the reals. However, many of these selectors are redundant, in that they can be partitioned into classes of "equivalent" selectors. For any hypothesis, one selector can be substituted for any other in the same class, and the resulting hypothesis will still cover the same examples. Of course, the two hypotheses may assign different classes to new instances. This suggests a way to bias the set of selectors, namely, to include only one selector from each class in the set. As will be argued below, the number of classes for a given feature is bounded by the number of examples. If there are k features, the biased set of selectors contains at most O(ek) elements.

Consider the set of selectors {[f_i < v] | v_{p,i} < v ≤ v_{n,i}}, where v_{p,i} < v_{n,i}. A selector partitions the examples into a set of covered and a set of uncovered examples. The selectors in this set can induce at most e different partitions on the examples. To see why this is so, take the value of each example on feature f_i and order the values from smallest to largest. If there are e examples, there are at most e partitions that can be induced by selectors of the form [f_i < v]: the first example in the list is covered, the first two are covered, and so on, up to all of the examples being covered. For each of these partitions, the characteristic selector for the equivalence class is [f_i < v], where v is one of the values in the list of examples. The set of characteristic selectors is {[f_i < v] | v_{p,i} < v ≤ v_{n,i} and v is the value of some example on feature f_i}. The characteristic selectors for the other sets of selectors are computed similarly.
Let E be the set of positive and negative examples, and let π_i(E) be the projection of the examples onto their values for feature f_i; that is, π_i(E) is the set of values held by one or more of the examples on feature f_i. The set of characteristic selectors that cover the positive seed but not the negative seed is the union of the following sets for i between one and k:

• {[f_i = v_{p,i}]} if v_{p,i} ≠ v_{n,i}
• {[f_i ≠ v_{n,i}]} if v_{p,i} ≠ v_{n,i}
• {[f_i < v] | v_{p,i} < v ≤ v_{n,i} and v ∈ π_i(E)} if v_{p,i} < v_{n,i}
• {[f_i ≤ v] | v_{p,i} ≤ v < v_{n,i} and v ∈ π_i(E)} if v_{p,i} < v_{n,i}
• {[f_i > v] | v_{n,i} ≤ v < v_{p,i} and v ∈ π_i(E)} if v_{p,i} > v_{n,i}
• {[f_i ≥ v] | v_{n,i} < v ≤ v_{p,i} and v ∈ π_i(E)} if v_{p,i} > v_{n,i}

Each of these sets contains at most e elements, since π_i(E) contains at most e = |E| values. The union of these sets over all k features contains at most O(ke) selectors. The time needed to generate this set is also O(ke).
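The construction of the characteristic selector set is mechanical enough to sketch directly. The Python function below assumes numeric feature values and the same plain-string rendering of selectors used in the earlier sketches.

def characteristic_selectors(seed, negative, examples):
    """Characteristic selectors covering `seed` but not `negative`.

    seed, negative: tuples of numeric feature values. examples: the list E.
    Only one selector per equivalence class is kept, drawn from the
    projection π_i(E) of example values on each feature.
    """
    selectors = []
    for i, (vp, vn) in enumerate(zip(seed, negative)):
        if vp == vn:
            continue                        # no selector on f_i separates them
        proj = {ex[i] for ex in examples}   # π_i(E)
        selectors += [f"[f{i} = {vp}]", f"[f{i} != {vn}]"]
        if vp < vn:
            selectors += [f"[f{i} < {v}]" for v in proj if vp < v <= vn]
            selectors += [f"[f{i} <= {v}]" for v in proj if vp <= v < vn]
        else:
            selectors += [f"[f{i} > {v}]" for v in proj if vn <= v < vp]
            selectors += [f"[f{i} >= {v}]" for v in proj if vn < v <= vp]
    return selectors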
5.4.2 Complexity of RS-KII when Emulating AQ-11

When emulating AQ-11, RS-KII induces a hypothesis by enumerating a single hypothesis from the solution set of (C, P). This is accomplished by the branch-and-bound algorithm described in Figure 4.14 of Chapter 4. The parameters to the algorithm are shown in Figure 4.15 of the same chapter. The branch-and-bound algorithm maintains L, a collection of subsets of C. On each iteration, one of these subsets, X_s, is selected. X_s is compared to all of the subsets in L. If X_s is not dominated by any of the subsets, then it contains only solutions, and an element of X_s is enumerated. Otherwise, X_s is replaced by Split(X_s) in the collection, and dominated subsets are pruned from the collection. This continues until a hypothesis is enumerated, or L is empty.

The cost of one iteration depends on the number of subsets in L. An iteration consists of four steps, the costs of which are analyzed below.

1. Select X_s from L. This can be done in constant time if L is maintained according to the efficient schemes mentioned in Section 4.2.2.4.

2. Determine whether X_s is undominated by all of the subsets in L. This takes time O(|L| t_≮), where t_≮ is the cost of comparing two subsets with the not-dominated relation, D_≮.

3. Split X_s. A subset of C is of the form w_i · q_i, where w_i is some string and q_i = δ_C(start_C, w_i), the state reached by w_i in C. Split(w_i · q_i) generates a child subset of w_i · q_i by extending w_i with a symbol σ from Σ_C and changing q_i to be the state reached by the new string, w_i σ. There is at most one child subset for each symbol, for a total cost of O(|Σ_C|).

4. Replace X_s with Split(X_s) in L, and prune dominated subsets from L. None of the existing subsets in L dominate each other. The child subsets need to be compared to each other, and to each of the existing subsets in L. There are at most |Σ_C| child subsets, and a − 1 existing subsets, where a is the number of subsets in L prior to replacing X_s with its children. The children are added to the collection as follows. Each child is compared to the a − 1 existing subsets, and each of the a − 1 existing subsets is compared to the child. If an existing subset is dominated, it is removed from the collection. If the child is not dominated by any of the existing subsets, the child is added to the list; otherwise the child is discarded. For the first child, there are 2(a − 1) comparisons. If the child is added to the collection, the collection contains a subsets, so determining whether the second child should be added requires 2a comparisons in this case. If each child is added to the collection, the total number of comparisons required is the sum over i from 0 to |Σ_C| − 1 of 2((a − 1) + i), for a total cost of O((|Σ_C|^2 + |Σ_C||L|) t_<), where t_< is the cost of comparing two subsets with the domination relation D_<.

The total cost of one iteration is the sum of these costs. The costs of the two dominance relations, t_< and t_≮, are of the same order, so these values can be replaced by a single variable, t. The resulting computational complexity for one iteration of BranchAndBound-3 is shown in Equation 5.6.

    O((|Σ_C|^2 + |Σ_C||L|) t)    (5.6)

The Size of L. L initially contains a single subset that is equivalent to C. In each iteration, a subset is selected and split into at most |Σ_C| subsets. L can grow geometrically unless subsets are pruned aggressively. The pattern of growth for L when RS-KII emulates AQ-11 is cyclic: L grows geometrically for a few iterations, and is then pruned back to a single subset.
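Before analyzing the growth of L, the loop just described can be rendered schematically as follows (Python; split, dominated, and pick_element are parameters standing in for RS-KII's Split function, D_< test, and enumeration operator, so this is an illustration rather than the actual implementation, and the subset selection order is elided):

    def branch_and_bound(C_subset, split, dominated, pick_element):
        L = [C_subset]                        # L holds subsets of C
        while L:
            X_s = L.pop(0)                    # select a subset X_s
            if not any(dominated(X_s, Y) for Y in L):
                return pick_element(X_s)      # undominated: enumerate a solution
            children = split(X_s)             # one child per symbol in Sigma_C
            L = [Y for Y in L                 # prune subsets dominated by a child
                 if not any(dominated(Y, c) for c in children)]
            L.extend(c for c in children      # keep only undominated children
                     if not any(dominated(c, Y) for Y in L))
        return None                           # L is empty: no hypothesis exists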
Call the initial single subset X_1 = w_1 · q_1. This subset is split by appending symbols from Σ_C to w_1, the prefix string for X_1. Symbols in Σ_C are not selectors, but rather components of selectors (i.e., feature names, the six relations, and digits from which to construct the value field). These subsets cannot be meaningfully compared, since their prefix strings end in partially constructed selectors. Additional symbols are needed to form a complete selector, at which point the subsets can be compared. Selectors are strings of up to l symbols, where l depends on the number of digits that can appear in the value field of a selector. Subsets with the shortest prefix strings are selected first, so the subsets are expanded breadth-first until all of the subsets have complete selectors. L grows geometrically during these iterations.

At the end of this growth cycle, all of the subsets in L have complete selectors. The subsets can be compared with the dominance relation D_<, and the dominated subsets pruned from L. The relation prunes a subset w_i · q_i if there is another subset in the collection, w_j · q_j, such that shuffle(w_i, w_j) leads to an accept-all state in P and w_j · q_j is not empty (i.e., q_j is not a dead state in C). Omitting the empty test yields the relation D'_<, which is used to order the subsets when selecting from the collection. Since P is a lexicographic ordering, if w_i comes before w_j, then every extension of w_i also comes before w_j and its extensions. This means that shuffle(w_i, w_j) leads to an accept-all state in P. Since a lexicographic ordering is a total ordering, for every pair of subsets w_i · q_i and w_j · q_j in the collection, either w_i comes before w_j, or w_j comes before w_i, or w_i and w_j are equivalent. In all but the last case, one of the two subsets is dominated, assuming the other is not empty. In a total ordering there can be at most one maximal element, so all but one subset can potentially be pruned, assuming the dominant subset can be shown to be non-empty.

However, it is not usually possible to tell whether a subset is empty, since C's Dead function returns unknown for most states. Therefore, L is not pruned. The way in which subsets are selected mitigates this. The subsets are ordered according to D'_<, and the best subset is selected as X_s in the next iteration. The D'_< relation is exactly D_< without the empty test. The branch-and-bound algorithm sorts the subsets in L once in each iteration according to D'_<. L is stored as a lattice reflecting the known dominance relations among the subsets. The elements at the top of the order are the undominated elements according to D'_<. After splitting a subset sufficiently, it may be identified as either empty or non-empty instead of unknown. If it is non-empty, all the subsets it dominates in L are pruned. Otherwise, the empty subset is removed from L. The representation for L effectively allows the subsets to be compared according to D'_< instead of D_<. The subsets at the top of the lattice are the undominated elements according to D'_<. Since D'_< does not check for emptiness, it allows a subset to be dominated by an empty subset. However, dominated subsets are not pruned; they are only moved lower in the lattice. If the dominating subset later turns out to be empty, the dominated subsets may become new top elements.
This allows incorrectly "pruned" subsets to be reinstated in L. Since D'_< is a total ordering defined over all subsets whose prefix strings contain complete selectors, there is at most one subset at the top of the lattice after the growth stage. When children are added, they are only compared to the subsets at the top of the lattice, since if they dominate these elements, they certainly dominate those below them. In effect, |L| is always one after each growth stage.

The size of L grows |Σ_C|-fold in each iteration until |L| equals the number of selectors (i.e., L contains all one-selector extensions of the initial single subset, X_1). There is one iteration where |L| = 1, |Σ_C| iterations where |L| = O(|Σ_C|), |Σ_C|^2 iterations where |L| = O(|Σ_C|^2), and so on up to |Σ_C|^l iterations where |L| = O(|Σ_C|^l). The cost of the last tier dominates the cost of all the other tiers. The cost of this tier is |Σ_C|^l times the cost of an iteration in which |L| = |Σ_C|^l. The cost of the growth stage is bounded by this cost, which is O(|Σ_C|^{2l} t). The maximum size to which L can grow is the number of selectors with which the initial single subset, X_1, can be extended. There are at most O(ek) such selectors, as was shown in Section 5.4.1. Since the size of L is |Σ_C|^l, this term can be replaced with ek in O(|Σ_C|^{2l} t). This yields the cost bound for the growth stage shown in Equation 5.7; this cost includes pruning L back to a single element.

    O((ek)^2 t)    (5.7)

A Special Case. The dominance relation D'_< is not entirely accurate, in that it allows a subset w_1 · q_1 to be dominated by w_2 · q_2 even when q_2 is a dead state. This means a subset can be dominated by an empty subset, which is not correct. This approximation is necessitated by the expense of determining whether q_2 is dead. Dominated subsets are removed from L, but since D'_< is not entirely correct, some subsets may be removed because D'_< considers them dominated by subsets that are in fact empty. This is mitigated by maintaining L in such a way that a dominated subset can be reinstated to L if the subsets that supposedly dominate it later turn out to be empty. The selected subset, X_s, is empty if, for each of its child subsets w_i · q_i, the state q_i is recognizably dead. In this case, X_s is marked as empty, and the subsets that X_s was thought to dominate are reinstated to L if they are not otherwise dominated. The emptiness of X_s is propagated to the supersets of X_s, since some of these may also be empty. If they are in fact empty, then the subsets they dominated may also be reinstated to L. All of the reinstated subsets are then compared to the other subsets of L, as the children of X_s were. The complexity of this step depends on the number of subsets dominated by X_s and its empty parent subsets.

However, X_s is never empty when RS-KII is emulating AQ-11, because all of the dead states in C are immediately recognizable. C is an intersection of constraint sets from each example. A constraint set for an example contains the hypotheses that are consistent with the example, and an intersection of two of these sets, C_1 ∩ C_2, contains the hypotheses consistent with both examples. A partial hypothesis, w, leads to a state w · (q_1, q_2) in the intersection. The state (q_1, q_2) is recognizably dead if either q_1 or q_2 is dead; that is, if there is no extension of hypothesis w that is consistent with one of the examples.
A dead state fails to be recognizably dead only when there is one extension of w that is consistent with the first example, and a second extension of w that is consistent with the second example, but no extension that is consistent with both examples. This never happens, since any VL1 hypothesis can be extended to be consistent with any set of mutually consistent examples. If the examples are not consistent, then C is immediately recognizable as empty. Since X_s is never empty unless C is empty, none of the dominated subsets ever needs to be reinstated. The cost of the reinstatement operation is therefore omitted from the computational complexity of RS-KII when emulating AQ-11.

Number of Iterations. The search algorithm makes one iteration for every selector in the induced hypothesis. The hypothesis induced by RS-KII is consistent with the positive and negative examples. Furthermore, each selector in the hypothesis makes progress towards consistency with the examples, and no selector or term is added unnecessarily (e.g., once a term is consistent with the negative examples, additional selectors are not appended to it). These latter conditions are guaranteed by the LEF. By the same arguments used in the analysis of AQ-11, the induced hypothesis contains at most one term per positive example, and each term contains at most one selector per negative example. If there are p positive examples and n negative examples, the hypothesis contains at most pn selectors.

Total Computational Complexity. The search algorithm makes at most pn iterations. The cost to induce a hypothesis is therefore pn times the cost of the growth stage, in which all of the selectors are generated (Equation 5.7). This yields the cost complexity for RS-KII shown in Equation 5.8 when RS-KII uses only the knowledge used by AQ-11.

    O(pn (ek)^2 t)    (5.8)

The cost of comparison, t, is proportional to the cost of computing the next-state function in P. A state in P is a vector of length e, so the cost of computing the next state is O(e). Substituting e for t, and e^2 for pn, in Equation 5.8 yields the following computational complexity for RS-KII when emulating AQ-11:

    O(e^5 k^2)    (5.9)

The complexity of AQ-11 with a beam size of one is O(e^4 k). The complexity of RS-KII is worse by a factor of ek. This comes from the comparisons between incomplete selectors performed during the growth stage. If the Split function is defined so that the string w in a subset (w, q) is extended by a full selector instead of by a single symbol, then the complexity is O(ke^4), the same as for AQ-11.

5.5 Summary

RS-KII can utilize all of the knowledge used by AQ-11 when the beam size is one, and thus functionally subsumes AQ-11 for this beam width. The complexity of RS-KII when utilizing only the knowledge of AQ-11 is a little worse than that of AQ-11, but still polynomial in the number of examples and features. RS-KII can also utilize a domain theory, which AQ-11 cannot. RS-KII can utilize a domain theory at the same time it is using AQ-11 knowledge, effectively integrating this knowledge into AQ-11.

Chapter 6

RS-KII and IVSM

Incremental version space merging (IVSM) [Hirsh, 1990] is an extension of Mitchell's candidate elimination algorithm (CEA) [Mitchell, 1982] that can utilize noisy examples, domain theories, and other knowledge sources in addition to the noise-free examples to which CEA is limited. IVSM strictly subsumes CEA.
Each fragment of knowledge is translated into a version space of hypotheses that are consistent with the knowledge. The version spaces for each knowledge fragment are intersected, yielding a new version space consistent with all of the knowledge. IVSM is exactly an instantiation of KII in which constraints are represented as version spaces, or convex sets, and preferences are represented by the null representation, which can express only the empty set (i.e., no preference information can be utilized). This instantiation is called CS-KII, for convex-set KII. The IVSM and CEA algorithms are described further in Section 6.1, and the equivalence of IVSM and CS-KII is demonstrated.

RS-KII can utilize the same knowledge as IVSM, and thereby subsume both IVSM and CEA, but only for some hypothesis spaces. IVSM expresses knowledge as a set of hypotheses that are consistent with the knowledge. For some hypothesis spaces, every subset of the hypothesis space that can be expressed as a convex set can also be expressed as a regular set. In these spaces, every knowledge fragment that can be expressed as a convex set in IVSM can be expressed as an equivalent regular set in RS-KII. These are the hypothesis spaces for which RS-KII subsumes IVSM. For other spaces, there are at least some subsets that can be expressed as a convex set but not as a regular set. For these spaces, RS-KII and IVSM overlap in the knowledge they can use, or the knowledge utilizable by IVSM and RS-KII is disjoint. There are no non-trivial spaces for which IVSM subsumes RS-KII, since for every non-trivial hypothesis space there is at least one regular set that cannot be expressed as a convex set. The expressiveness of RS-KII and IVSM is investigated further in Section 6.2.1, and sufficient conditions are identified for hypothesis spaces for which RS-KII subsumes IVSM.

One class of hypothesis spaces for which RS-KII subsumes IVSM is the class of spaces described by conjunctive feature languages. These are the hypothesis spaces commonly used by CEA and IVSM. This class is defined more precisely in Section 6.2.2.1. The subsumption of IVSM by RS-KII for this class of spaces is investigated in that section as well. This includes both a general proof, and empirical support in the form of translators for a number of common knowledge sources that are also used by IVSM.

RS-KII also subsumes IVSM in terms of computational complexity for this class of hypothesis spaces. Section 6.3 compares the complexity of set operations in the regular and convex set representations for subsets of these hypothesis spaces. For these spaces, the complexity of RS-KII is up to a squared factor less than that of IVSM, and in all cases RS-KII's worst-case complexity is bounded by IVSM's worst-case complexity. For at least one space, there is a set of examples for which the complexity of IVSM is exponential in the number of examples, but for which the complexity of RS-KII is only polynomial. These examples are the well-known examples demonstrated by Haussler [Haussler, 1988] to cause exponential behavior in CEA. The behavior of RS-KII for these examples is investigated in Section 6.4. Concluding remarks are made in Section 6.5.

6.1 The IVSM and CEA Algorithms

6.1.1 The Candidate Elimination Algorithm

The candidate elimination algorithm (CEA) [Mitchell, 1982] induces a hypothesis from positive and negative examples of the target concept.
The implicit assumption of the algorithm is that the target concept is both a member of the hypothesis space and consistent with all of the examples. CEA maintains a set of hypotheses, called a version space, that consists of the hypotheses consistent with all of the examples seen so far. Examples are processed incrementally. Each example is processed by removing hypotheses from the version space that are inconsistent with the example. If the example is positive, then all hypotheses not covering the example are removed. Conversely, a negative example is processed by removing the hypotheses that do cover the example. If after processing an example the version space is empty, then the algorithm halts. No hypothesis in the hypothesis space is consistent with all of the examples, and the version space is said to have collapsed. If the version space contains only a single hypothesis, then the version space has converged. This single hypothesis is uniquely identified by all of the examples seen so far. If the implicit assumptions of CEA are correct, then this hypothesis is the target concept.

If there are multiple hypotheses in the version space, then it is assumed that one of them is the target concept. However, the target concept cannot be discriminated from the other candidates without observing additional examples. If more examples are available, they can be processed in the hope that the version space will converge. Otherwise, a hypothesis can be selected arbitrarily from the version space as the induced hypothesis, or the entire version space can be used to classify unseen instances as follows. If the instance is covered by all of the hypotheses in the version space, then the instance is also covered by the target concept and is classified as a positive instance. Likewise, if none of the hypotheses in the version space covers the instance, neither does the target concept, and the instance is classified as negative. Otherwise, the instance may or may not be covered by the target concept. In this case, the instance is classified as unknown.

6.1.2 Incremental Version Space Merging

Incremental version-space merging (IVSM) [Hirsh, 1990] is an extension of the candidate elimination algorithm that learns from non-example knowledge as well as from examples. Each knowledge fragment is translated into a constraint on the hypothesis space, and this constraint is represented as a version space of hypotheses that satisfy the constraint. The version spaces for each fragment are intersected to produce a single version space consistent with all of the knowledge. Every hypothesis in this set satisfies all of the constraints, and is therefore equally preferred by the knowledge for being the target concept. When the knowledge consists solely of noise-free examples, IVSM computes exactly the same version space as CEA, and with essentially the same time and space complexities [Hirsh, 1990]. That is, IVSM subsumes CEA. A hypothesis can be selected arbitrarily from the final version space as the target concept, or instances can be classified against all of the hypotheses in the version space, as was described for CEA. Other queries about the version space are also possible. The queries that have generally proven useful are membership, emptiness (collapse), uniqueness (convergence), and subset [Hirsh, 1992].
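For intuition, the CEA updates and version-space classification described above can be sketched over an explicitly enumerated version space (illustrative only: CEA and IVSM represent the space by its boundary sets rather than by enumeration, and covers is an assumed predicate):

    def process_example(version_space, instance, label, covers):
        # Remove hypotheses inconsistent with the example. An empty result
        # means the version space has collapsed.
        if label == 'positive':
            return {h for h in version_space if covers(h, instance)}
        return {h for h in version_space if not covers(h, instance)}

    def classify(version_space, instance, covers):
        votes = [covers(h, instance) for h in version_space]
        if all(votes):
            return 'positive'    # covered by every remaining hypothesis
        if not any(votes):
            return 'negative'    # covered by none of them
        return 'unknown'         # the remaining hypotheses disagree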
6.1.3 Convex Set Representation

In both IVSM and CEA, the version space is represented as a convex set. A convex set consists of all hypotheses "between" two boundary sets, S and G, where S is the set of most specific hypotheses in the version space and G is the set of most general hypotheses in the version space. A hypothesis is in the version space if it is more general than or equivalent to some hypothesis in S, and more specific than or equivalent to some hypothesis in G. Formally, a convex set is specified by the tuple (H, ≺, S, G), where H is the hypothesis space, ≺ is a partial ordering of generality over H, and S and G are the boundary sets. When the values of H and ≺ are obvious, a convex set can be written as just (S, G). The convex set (H, ≺, S, G) consists of all hypotheses h such that s ⪯ h ⪯ g for some s in S and g in G. The relation ⪯ is derived from ≺ such that x ⪯ y iff x ≺ y or x = y.

    (H, ≺, S, G) = {h ∈ H | ∃s∈S ∃g∈G (s ⪯ h ⪯ g)}    (6.1)

6.1.4 Equivalence of CS-KII and IVSM

IVSM essentially provides a collection of operations on convex sets: translation of knowledge into constraints expressed as convex sets, intersection of convex sets, enumeration of hypotheses from a convex set, and queries on convex sets such as membership, emptiness, uniqueness, and subset. These correspond to the operations provided by KII. IVSM is equivalent to CS-KII, an instantiation of KII in which constraints are represented as convex sets, and preferences are represented by the null representation, which can represent only the empty set (i.e., preference information is not allowed). Both CS-KII and IVSM represent knowledge as convex sets, and provide the same operations on convex sets.

In IVSM, each knowledge fragment is translated into a convex set containing all of the hypotheses that are consistent with the knowledge. The fragments are integrated by intersecting their corresponding convex sets into a single set containing the hypotheses consistent with all of the fragments. This set corresponds to the version space in CEA. A hypothesis can be selected from this set as the target concept, or instances can be classified against the entire set, in the same way that CEA classifies instances against the version space. IVSM also supports other operations on the convex set that represents all of the integrated knowledge, namely membership, emptiness (collapse), and uniqueness (convergence).

CS-KII represents knowledge in essentially the same way, and provides the same operations. In CS-KII, a knowledge fragment is translated into an (H, C, P) tuple, where H and C are convex sets, and P is the empty set. The representation for P can represent only the empty set, so P is always empty, regardless of the knowledge. The C set contains all of the hypotheses that are consistent with the knowledge fragment. This is exactly the same convex set that IVSM would use to represent the knowledge fragment. IVSM and CS-KII both represent a given knowledge fragment in terms of constraints on the hypothesis space, represented as a convex set of the hypotheses satisfying the constraints. Neither IVSM nor CS-KII can represent knowledge in terms of preference information. In CS-KII, this lack of knowledge is expressed explicitly by an empty P set. In IVSM, the lack of preference knowledge is implicit, so the P set can be omitted. The two representations are equivalent.
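As an aside, the boundary-set membership test of Equation 6.1, on which both representations rely, is direct to state programmatically. A minimal sketch, assuming finite boundary sets and a helper leq that implements ⪯:

    def in_convex_set(h, S, G, leq):
        # h is in (H, ≺, S, G) iff s ⪯ h for some s in S
        # and h ⪯ g for some g in G.
        return (any(leq(s, h) for s in S) and
                any(leq(h, g) for g in G))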
Knowledge is integrated in CS-KII by intersecting (H, C, P) tuples. The C sets of the pairs are intersected, and the P sets are unioned. Since the P sets are always empty, the union of two P sets is also empty. The result of integrating several knowledge fragments is (H, C, ∅), where C is the convex set containing the hypotheses consistent with all of the integrated knowledge fragments. This C set is the same convex set that IVSM would use to represent the same collection of knowledge fragments. Each fragment is represented by the same convex set in both IVSM and CS-KII, and both IVSM and CS-KII integrate knowledge by intersecting these convex sets.

Finally, CS-KII provides the same operations on convex sets as IVSM. In IVSM the queries are applied to the version space, and in CS-KII they are applied to the solution set of (C, P). Since P is always empty, the solution set of (C, P) is just C. This is the same convex set computed by IVSM. The queries in IVSM and CS-KII therefore both apply to C, the convex set of hypotheses consistent with all of the knowledge. The provided queries are membership, emptiness, and uniqueness. Both IVSM and CS-KII can also test whether C is a subset of another convex set. CS-KII also provides an explicit operator for enumerating hypotheses from C. IVSM does not provide an explicit enumeration operator, but its existence is implied by IVSM's ability to select an arbitrary hypothesis from the version space as the target concept. In both IVSM and CS-KII, hypotheses are enumerated from a convex set, so they can both use the same implementation of the enumeration operator.

IVSM provides one additional operation that CS-KII does not provide directly: the classification of instances against the entire version space (see Section 6.1.1). CS-KII does not provide this classification operation directly, but it can be implemented in terms of two operations that CS-KII already provides: translation and the subset query. The implementation is shown in Figure 6.1. The idea is to translate the instance as if it were a positive example. This yields the set of hypotheses that cover the instance. If the version space (solution set) is a subset of this set, then every hypothesis in the version space covers the instance, and the instance is assigned the positive classification. A similar procedure determines whether every hypothesis fails to cover the instance: translate the instance as if it were negative, and determine whether the solution set is a subset of the resulting set. If the version space is a subset of neither translation, then the instance is assigned the unknown classification.

Algorithm Classify(i, (C, P))
  i: unclassified instance
  (C, P): a COP
BEGIN
  (C+, P+) ← TranExample(H, ≺, (i, positive))
  (C−, P−) ← TranExample(H, ≺, (i, negative))
  IF Subset((C, P), (C+, P+)) THEN
    RETURN positive
  ELSE IF Subset((C, P), (C−, P−)) THEN
    RETURN negative
  ELSE
    RETURN unknown
  END IF
END Classify

Figure 6.1: Classify Instances Against the Version Space.

CS-KII and IVSM are equivalent. Both represent knowledge as convex sets of hypotheses consistent with the knowledge, and both provide the same operations on this representation. These operations are implemented the same way in both IVSM and CS-KII, except that CS-KII represents the lack of preference knowledge explicitly with an empty P set. The representation for P can express only the empty set.
One could imagine an instantiation of KII in which both C and P were represented as convex sets. However, convex sets are not closed under union [Hirsh, 1990], so integration would not be defined in this instantiation of KII.

The candidate elimination algorithm is subsumed by both IVSM and CS-KII. This follows from the equivalence of CS-KII and IVSM, and the fact that IVSM subsumes CEA [Hirsh, 1990]. One might ask whether AQ-11 is also subsumed by CS-KII and IVSM, at least to the extent that AQ-11 is subsumed by RS-KII (see Chapter 5). The answer is a qualified no. AQ-11 finds a VL1 hypothesis that is strictly consistent with the examples and preferred by the LEF. The ability to express preference information is required in order to utilize the LEF, but CS-KII cannot utilize preference information. The only P set that can be expressed in CS-KII is the empty set. Extending the P representation to convex sets does not help much either, since convex sets are not closed under union, which makes it impossible to integrate (C, P) tuples. However, CS-KII as it stands can at least find a VL1 hypothesis consistent with the examples. The generality ordering over VL1 hypotheses is well defined, so convex subsets of VL1 can certainly be expressed. Each example is translated as a convex set of hypotheses consistent with the example, the convex sets are intersected, and a hypothesis is selected arbitrarily from the intersection as the target concept.

6.2 Subsumption of IVSM by RS-KII

RS-KII subsumes IVSM, but only for hypothesis spaces for which every convex subset of the space is also expressible as a regular set. In general, the convex set representation and the regular set representation overlap in expressiveness, but neither subsumes the other. Section 6.2.1 discusses the relative expressiveness of the two representations in general terms, and identifies conditions under which every convex subset of a hypothesis space can also be expressed as a regular set. These are the hypothesis spaces for which RS-KII subsumes IVSM. A specific class of hypothesis spaces for which RS-KII subsumes IVSM is identified in Section 6.2.2.1. Since RS-KII subsumes IVSM for these spaces, it should be possible to construct RS-KII translators for the knowledge used by IVSM. Section 6.2.3 provides such translators as additional empirical support for the subsumption of IVSM by RS-KII.

6.2.1 Expressiveness of Regular and Convex Sets

Convex sets and regular sets overlap in expressiveness. There is some knowledge that can be expressed in both representations, and some that can be expressed in one representation but not the other. The convex set representation places no restrictions on either the hypothesis space or the partial ordering of generality. The hypothesis space and the generality relation could require arbitrarily powerful Turing machines to represent them. This means that convex sets can express many sets that regular sets cannot, at least in theory. In practice, operations on convex sets, such as finding the most specific common generalization of two hypotheses, may be computationally infeasible for sufficiently expressive hypothesis spaces or generality orderings. Although the convex set representation is very expressive, it does not subsume the regular set representation. For every partial ordering having transitive chains of length three or more (e.g., w ≺ x ≺ y), there are regular subsets of the hypothesis space that cannot be expressed as convex sets with that ordering.
For example, consider a set containing two elements, g and s, where g is a very general hypothesis and s is a very specific hypothesis. If there is at least one hypothesis x between these two hypotheses in the partial ordering, such that s ≺ x ≺ g, then the set cannot be represented by a convex set, since every convex set containing s and g must also contain x. However, the set {s, g} can be represented by a regular set: there are only two elements, and any finite set can be expressed as a regular set. Another way of stating this phenomenon is that convex sets cannot have "holes", but regular sets can. A convex set must contain every hypothesis between the boundary sets. It cannot exclude hypotheses unless all of the remaining hypotheses are between some other pair of boundary sets. Regular sets, on the other hand, can represent sets with holes. Whether a given subset of the hypothesis space has a "hole" depends on the generality ordering. Under one ordering, ≺, a given set may have a hole, but under another ordering, ≺′, the same set may not have a hole. However, there will be other subsets of the hypothesis space that do have holes in the ≺′ ordering. At least some of these can be expressed as regular sets.

6.2.2 Spaces for which RS-KII Subsumes CS-KII

For some hypothesis spaces, every convex subset of the space can be expressed as a regular set. These are the hypothesis spaces for which RS-KII subsumes CS-KII. A convex set, (S, G, ≺, H), contains all hypotheses that are more specific than or equal to some element of G, and more general than or equivalent to some element of S. S is the set of maximally specific elements in the set, and G is the set of maximally general elements, where every element of S is more specific than one or more elements of G. A convex set (S, G, ≺, H) is therefore equivalent to the following set:

    S ∪ G ∪ ⋃_{s∈S, g∈G} ({h ∈ H | s ≺ h} ∩ {h ∈ H | h ≺ g})    (6.2)

Every convex subset of H can be expressed as a regular grammar if, for every hypothesis x in the hypothesis space, the sets {h ∈ H | x ≺ h} and {h ∈ H | h ≺ x} can be expressed as regular sets. This follows from the closure of regular sets under intersection and union, and the finiteness of the S and G sets. It is difficult to determine whether an arbitrary hypothesis space satisfies this condition, and is therefore a hypothesis space for which RS-KII subsumes CS-KII. However, there is a generalization that is easier to test, namely that both the hypothesis space (H) and the generalization ordering (≺) can be expressed as regular sets. The set {h ∈ H | h ≺ x} is equivalent to first((H × {x}) ∩ R_≺), where R_≺ is a grammar encoding the generality ordering ≺. That is, R_≺ = {(x, y) ∈ H × H | x ≺ y}. Since regular grammars are closed under intersection, Cartesian product, and projection, {h ∈ H | h ≺ x} is expressible as a regular set for every hypothesis x. Similarly, the set {h ∈ H | x ≺ h} is equivalent to second(({x} × H) ∩ R_≺), and is therefore expressible as a regular grammar as long as both H and R_≺ are expressible as regular grammars.

6.2.2.1 Conjunctive Feature Languages

One class of languages for which RS-KII subsumes CS-KII is a subset of the conjunctive feature languages. A hypothesis in a conjunctive feature language is a conjunct of feature values, with one value for each of the features. For example, in a language with k features, hypotheses are of the form (V_1, V_2, ..., V_k), where V_i is an abstract value corresponding to a set of ground values.
An instance is a vector of ground values, (v_1, v_2, ..., v_k). An instance is covered by a hypothesis if v_i ∈ V_i for every i between one and k. If the "single representation trick" is used (e.g., [Dietterich et al., 1982]), there is an abstract value for each ground value. Feature values consist of both ground values and abstract values. The ground values are the specific values that the feature can take, and the abstract values correspond to sets of ground values. For example, the feature color may have ground values such as green and red, and abstract values such as dark-color and light-color. Features may be continuously valued or discrete. If the ground values are continuous (e.g., the real numbers), then the abstract values are sets of ground values, such as the set of real numbers between one and five, inclusive.

The abstract values for each feature can be partially ordered by generality. Value V_1 is more general than value V_2 if V_2 is a subset of V_1. If neither value is a subset of the other, then there is no generality ordering between them. The generality ordering over the abstract values is called a generalization tree. A value in the tree is more general than its descendants. Typically, the leaves of the tree are abstract values that cover only a single ground value.

The conjunctive feature languages are a family of languages. The family is parameterized by the number of features, and a generalization tree for each of the features. A generalization tree is a tuple (F, ≺), where F is a set of abstract values, and x ≺ y means that abstract value x is more specific than abstract value y. The grammar for the language is a concatenation of the grammars for the values of each feature, where G(F_i) is the grammar for the abstract values of feature F_i:

    ConjunctiveFeatureLanguage((F_1, ≺_1), ..., (F_k, ≺_k)) → G(F_1) · G(F_2) · ... · G(F_k)    (6.3)

The generality relation among hypotheses is derived from the relations on individual features. A hypothesis, (x_1, x_2, ..., x_k), is more general than or equal to (y_1, y_2, ..., y_k) if y_i ⪯_i x_i for every i between one and k. Hypothesis (x_1, x_2, ..., x_k) is strictly more general than (y_1, y_2, ..., y_k) if there is at least one feature, f_i, such that y_i ≺_i x_i, and for the remaining features, f_{j≠i}, y_j ⪯_j x_j. The strict relation is stated formally in Equation 6.4, and the non-strict relation in Equation 6.5:

    (x_1, x_2, ..., x_k) ≺ (y_1, y_2, ..., y_k) iff (∃i ∈ {1,...,k}: x_i ≺_i y_i) and (∀j ∈ {1,...,k}: x_j ⪯_j y_j)    (6.4)

    (x_1, x_2, ..., x_k) ⪯ (y_1, y_2, ..., y_k) iff ∀j ∈ {1,...,k}: x_j ⪯_j y_j    (6.5)

6.2.2.2 Conjunctive Languages where RS-KII Subsumes CS-KII

RS-KII subsumes CS-KII for hypothesis spaces in which both the hypothesis space and the generality ordering over the hypotheses can be expressed as regular grammars. This is true of conjunctive feature languages in which the values for each feature can be expressed as a regular grammar, and the generalization tree for each feature can be expressed as a regular grammar. The grammar for the hypothesis space is the concatenation of the grammars for each feature. The grammar for the generality ordering, ≺, over hypotheses is defined as follows. Let R_{≺,i} be the regular set {(x, y) ∈ F_i × F_i | x ≺_i y}, which represents the generalization hierarchy for feature f_i. Let R_{⪯,i} be the regular set {(x, y) ∈ F_i × F_i | x ⪯_i y}, where x ⪯_i y if x ≺_i y or x = y.
This is the union of R_{≺,i} and R_{=,i}, where R_{=,i} = {(w, w) | w ∈ F_i}. The generalization hierarchy over hypotheses, R_≺, is constructed from the regular sets for the individual feature hierarchies as follows:

    R_≺ = ⋃_{i=1}^{k} R_{⪯,1} · ... · R_{⪯,i−1} · R_{≺,i} · R_{⪯,i+1} · ... · R_{⪯,k}    (6.6)

If both the hypothesis space and the generality orderings for each feature, R_{≺,i} and R_{⪯,i}, can be expressed as regular grammars, then every convex subset of the hypothesis space is expressible as a regular set. When this is true, any knowledge that can be expressed by CS-KII can also be expressed by RS-KII.

What generality orderings over features can be expressed as regular grammars? A generality ordering, R_{≺,i} or R_{⪯,i}, is a subset of F_i × F_i. If F_i can be expressed as a regular grammar, then so can F_i × F_i, since regular grammars are closed under Cartesian product. However, not every subset of F_i × F_i can be expressed as a regular grammar (see Section 4.1.2.4), so not all generality orderings over F_i are expressible as regular grammars. It is difficult to specify necessary and sufficient conditions under which a given subset of F_i × F_i can be expressed as a regular grammar. However, it is possible to identify some sufficient conditions. For example, any finite set can be expressed as a regular grammar, so any generality ordering over F_i can be expressed as a regular grammar if F_i is finite. Likewise, the equality relation, R_{=,i}, can also be expressed as a regular grammar if F_i is finite. When F_i is infinite, some but not all orderings can be expressed. R_{=,i} can always be expressed as a regular set under the shuffle mapping if F_i is a regular set (see Section 4.1.2.4). However, only some generality orderings can be expressed as regular grammars. Among the subsets of F_i × F_i that can be expressed as regular grammars under the shuffle mapping are those of the form {(x, y) ∈ F_i × F_i | x ≤ y}, where ≤ is a lexicographic (dictionary) ordering. This provides a way to represent total orderings over F_i.

In one common generality ordering over features with ordinal values (e.g., integers), each abstract value is a range of values, and one abstract value is more general than another if its range properly contains the other. For example, a range might be of the form [x..y], where x ≤ y, and one range would be more general than another if the first range contained the second (e.g., [5..10] ≺ [0..20]). To represent the set of abstract values, F_i, as a regular grammar, the range [x..y] is represented as the string x ⋄ y, where ⋄ is a special symbol that indicates the end of x and the beginning of y. F_i is the set of all possible ranges, {x ⋄ y | x, y ∈ INT and x ≤ y}, where INT is a regular expression for the set of integers, (0−9) | (1−9)(0−9)+. F_i can be expressed as the regular set {shuffle(x, y) | x, y ∈ INT and x ≤ y}. The generality ordering over F_i is {([x_1..y_1], [x_2..y_2]) ∈ F_i × F_i | x_2 ≤ x_1 and y_1 ≤ y_2}. This is the set of all pairs of ranges such that the first range is contained in the second. The regular set for this ordering takes as input strings of the form shuffle(w_1, w_2), where w_1 and w_2 are strings encoding ranges in F_i. A range in F_i is encoded as a string of the form shuffle(x, y), where x and y are ground values for the feature and x ≤ y. An input to the grammar is therefore an interleaving of four strings, a, b, c, and d, where the input is accepted if the range [a..b] is contained within the range [c..d], and both of these are valid ranges (i.e., a ≤ b and c ≤ d).
The grammar for the generality ordering over F_i is constructed from four simpler grammars. Create four copies of a regular grammar that recognizes the set {shuffle(x, y) | x, y ∈ INT and x ≤ y}. This is a simple modification of the DFA described in Section 4.1.2.4 of Chapter 4. Call these grammars M_1 through M_4. For each quartet of symbols in the input string, a_i b_i c_i d_i, pass a_i b_i to M_1, c_i d_i to M_2, c_i a_i to M_3, and b_i d_i to M_4. The machines M_1 and M_2 ensure that the ranges [a..b] and [c..d] are valid, in that a ≤ b and c ≤ d. The machines M_3 and M_4 ensure that [a..b] is contained within [c..d], that is, c ≤ a and b ≤ d. If all four machines accept their input, the input string is accepted by the composite regular grammar.
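Setting aside the automaton machinery, the acceptance condition that the four machines jointly enforce can be stated directly (a Python sketch of the condition only; the point of the four-machine construction is that the same check can be carried out symbol by symbol on the shuffled string, which is what keeps the relation regular):

    def range_contained(a, b, c, d):
        valid = a <= b and c <= d        # machines M1 and M2: both ranges are valid
        contained = c <= a and b <= d    # machines M3 and M4: [a..b] lies in [c..d]
        return valid and contained

    # e.g. [5..10] is contained in [0..20]:
    print(range_contained(5, 10, 0, 20))   # True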
The set of features values covering a :,- is given by the following formula: {u € Fi | Xi r<i u} = first(R^tif]({xi}xFi)) (6.7) This set is expressible as a regular gram m ar if a;,-, f?-<, and Fi are regular sets, since regular sets are closed under union, intersection, and projection. The set of hypotheses covering the instance is the Cartesian product of these sets. The regular gram m ar for this set is the concatenation of the regular gram m ars for the set of values covering the instance on each feature. A translator for noise-free examples based on this construction is given in Figure 6.2.3.1. This translator takes as input the hypothesis space, expressed as a list of value sets for each feature, F\ through Fk\ the generality ordering f?-<, expressed as a list of generality orderings for each feature, R ^ i through R-<Xi ai*d an example of the form ( u i c ) , where Vi is a value for feature /,• and c is a classification (e.g., positive or negative). 6.2.3.2 N oisy Examples with Bounded Inconsistency Bounded inconsistency [Hirsh, 1990] is a kind of noise in which each feature of the example can be wrong by at m ost a fixed am ount. For example, if the width value for each instance is m easured by an instrum ent with a m axim um error of ±0.3m m , then the width values for these instances have bounded inconsistency. The idea for translating examples with bounded inconsistency is to use the error m argin to work backwards from the noisy example to com pute the set of possible noise-free examples. One of these examples is the correct noise-free version of the observed example, into which noise was introduced to produce the observed noisy example. The target concept is strictly consistent with this noise-free example. Let e be the noisy observed example, E be the set of noise-free examples from which e could have been generated, and let e' be the correct noise-free example from 127 TranExample(Fi, F2, ..., Fk, -^<,1 > R-<,2i • • • » R^,k {(xu x 2, . .., ®fc),class)) -> (C , 0) where Fi x F2x ... xFk is the hypothesis space R <>1 ' R*fl • • • • • is R< {{x\,x2, . . . ,Xk),class) is an example q — f Vi x l^ x . . . x 14 if class = n e g a tiv e V 1 XV2 X ...X I 4 if class = p o s it i v e Vi = {u e Fi | v r< a:,} = ^ rst(i2 ^ > t -n({a:f} xi*})) Figure 6.2: RS-KII Translator for Noise-free Examples. which e was in fact generated. Ideally, e is translated into (C , 0), where C is the set of hypotheses consistent with e'. However, it is unknown which example in E is e'. Because of this uncertainty, a noisy example is translated as ((7,0), where C is the set of hypotheses th at are strictly consistent with at least one of the examples in E. Hypotheses th a t are consistent w ith none of the examples in E are also not consistent with e', and therefore not the target concept. This is the approach used by Hirsh [Hirsh, 1990] in IVSM to translate noisy examples with bounded inconsistency. This suggests the following RS-KII translator for examples with bounded incon sistency. The set of possible noise-free examples, E , is com puted from the noisy ex amples and the error margins for each feature. Each example, e,-, in this set is trans lated using the RS-KII translator for noise-free examples, TranExample(H,R^,ei) (Figure 6.2.3.1), which translates example e ,- into (C;, 0). C, • is the set of hypotheses th at are strictly consistent with e,-. The translator for the bounded inconsistent ex am ple returns (C = Ui=i C,-, 0). 
C is the set of hypotheses consistent with at least one of the examples in E. The set E is com puted from the observed example, (®i, x 2, ..., Xk, class), and the error m argins for each feature, ± ( > 1 through ±< 5fc, as follows. If the observed value for feature i is a;,-, and the error m argin is ± <5,-, then the correct value for feature i is in 128 TranExampleBI(H, , (61, 62, . . . , 6k),((x 1,X2,...,Xk),dass)) -> ( C ,P ) where E = [xi,± 5 j]x[x 2,±<52]x ••• {class} where [a:,-, ± 6 ,- ] = {u | x,- — 6 { < v < x,- -f £,} C = [J Ci s.t. (Ci, 0) = TranExample(H, R^, et) Figure 6.3: RS-KII Translator for Examples with Bounded Inconsistency. {u | Xi — S{ < v < x;+<5,}. Call this set [x,-, ± < 5 ,-] for short. Since examples are ordered lists of feature values and a class, E is [xi,±<$i]x[x2 , ± 6 2 ] x . . . x[x*, ±<5^]x {class}. A translator for examples with bounded inconsistency based on this approach is shown in Figure 6.2.3.2. It takes as input the hypothesis space {H), a regular gram m ar encoding the generality ordering {R-<), the error margin for each feature ( ± 6 1 through ±(5/t), and an example. The hypothesis space input is represented in factored form as an ordered list of regular sets, F\ through Fk, where F{ is the set of abstract values for feature /,-. The input R< is also represented in factored form as an ordered list of regular sets, R<ti through R-<,k, where R ^ j represents the generality ordering for feature /,-. The factored representation is used since the translator makes a call to TranExample(H,R < ,e), which expects H and R< to be represented in this fashion. 6.2.3.3 Domain Theory A dom ain theory is a set of rules th at explain why an instance is a m em ber of the target concept. For instance, a particular object is a cup because it is liftable, and has a stable bottom and an open top. There are m any ways to use a dom ain theory, depending on assum ptions about the completeness and correctness of the theory. For example, if the theory is overspecial, then there are instances which cannot be explained as m em bers of the target concept. The target concept is a generalization of the dom ain theory. The theory provides an initial guess at the target concept, and additional knowledge indicates which generalization of the theory has the most 129 support for being the target concept. O ther assum ptions about the correctness and completeness of the theory are also possible (see Section 6.2.3.3). For the sake of brevity, only a translator for overspecial theories will be described. If the hypothesis is overspecial, then the target concept is some generalization of the dom ain theory. The theory could therefore be translated as a constraint satisfied only by generalizations of the theory. Determ ining which hypotheses are generalizations of the theory is somewhat difficult. One approach is to m ap the theory onto an equivalent hypothesis, and then use the generalization ordering to determ ine the generalizations. However, it is not clear how to perform this mapping. A second approach is to translate both the theory and a positive example at the same tim e. An explanation of the example by the theory corresponds to a sufficient condition for the concept described by the theory. The target concept is a generalization of the explanation, and an an explanation can be easily m apped onto the hypothesis space. The explanation is a conjunction of operational predicates, or features, and exactly corresponds to a m em ber of the hypothesis space. 
6.2.3.3 Domain Theory

A domain theory is a set of rules that explain why an instance is a member of the target concept. For instance, a particular object is a cup because it is liftable, and has a stable bottom and an open top. There are many ways to use a domain theory, depending on assumptions about the completeness and correctness of the theory. For example, if the theory is overspecial, then there are instances that cannot be explained as members of the target concept. The target concept is a generalization of the domain theory. The theory provides an initial guess at the target concept, and additional knowledge indicates which generalization of the theory has the most support for being the target concept. Other assumptions about the correctness and completeness of the theory are also possible. For the sake of brevity, only a translator for overspecial theories will be described.

If the theory is overspecial, then the target concept is some generalization of the domain theory. The theory could therefore be translated as a constraint satisfied only by generalizations of the theory. Determining which hypotheses are generalizations of the theory is somewhat difficult. One approach is to map the theory onto an equivalent hypothesis, and then use the generalization ordering to determine the generalizations. However, it is not clear how to perform this mapping. A second approach is to translate both the theory and a positive example at the same time. An explanation of the example by the theory corresponds to a sufficient condition for the concept described by the theory. The target concept is a generalization of the explanation, and an explanation can be easily mapped onto the hypothesis space. The explanation is a conjunction of operational predicates, or features, and exactly corresponds to a member of the hypothesis space. Once the mapping has been done, it is a simple matter to compute the set of hypotheses more general than the explanation. Only a positive example can be used for this translation, since the theory can explain why an instance is a member of the target concept, and therefore a positive example, but cannot explain why an instance is not a member of the target concept. To utilize negative examples in this way, a theory is needed for the complement of the target concept. Complementary theories for positive and negative examples are also used in IVSM [Hirsh, 1990].

A translator for an overspecial domain theory is shown in Figure 6.4. In this translator, H is the hypothesis space, R_≺ is a regular grammar for the generalization ordering, T is a domain theory, and e is a positive example. The explanation of an example by the theory is essentially a generalized example. This example is passed to the translator for noise-free examples described in Section 6.2.3.1.

TranOverSpecialDT(H, R_≺, T, e = (inst, positive)) → (C, P)
  where (C, P) = TranExample(H, R_≺, (explain(inst, T), positive))

Figure 6.4: RS-KII Translator for an Overspecial Domain Theory.

6.3 Complexity of Set Operations

The computational complexity of IVSM (CS-KII) and RS-KII is determined by the complexity of integrating knowledge, and the complexity of enumerating hypotheses from the solution set. The complexity of these operations is determined in turn by the computational complexity of the set operations that define them. When the P set is empty, the only set operation that determines the computational complexity of integration is intersection. Integration involves intersecting the C sets and computing the union of the P sets, but since the P sets are empty, their union is always empty. The union operation can be computed in constant time, and is dominated by the complexity of intersecting the C sets. Since P is empty, the deductive closure of (C, ∅) is just C. The complexity of enumerating a hypothesis from the deductive closure of (C, P) therefore depends on the size of the DFA for C.

In CS-KII, the P set is always empty, since only the empty set is expressible in the representation for P. Therefore, the costs of integration and enumeration in CS-KII depend on the cost of intersecting convex sets and the cost of enumerating hypotheses from an intersection of convex sets, respectively. In RS-KII, the P set can be non-empty. However, when only the knowledge expressible in CS-KII is utilized, the P set is always empty. The costs of integration and enumeration in RS-KII are then determined by the cost of intersecting regular sets and the cost of enumerating hypotheses from an intersection of regular sets. The computational complexity of intersection and enumeration for regular sets is derived in Section 6.3.1 and Section 6.3.2, respectively. The computational complexity of intersection and enumeration for convex sets is derived in Section 6.3.3 and Section 6.3.4. A comparison between the complexity equations for regular and convex sets is made in Section 6.3.5.

6.3.1 Complexity of Regular Set Intersection

Recall that the intersection of two DFAs is implemented in RS-KII by constructing an intentional DFA, as shown in Figure 6.5.
The intentional DFA for the intersection of two DFAs can be constructed in constant time. {si,Si,Fi,Deadi, AccAlli,T>\) fl (s2 , 8 2 ,F2, Dead2, Acc/l//2, E 2) = (s, 8 , F, Dead, AcceptAll, S) where s = -F((?l)?2)) = Dead((qi,q2)) = (5 1 , 5 2 ) (tfi(9i> c ) » ^2(92,0-)) F 1(q1) A F 2(q2) t r u e if else unknown t r u e if Accept All ((^ 1 , 9 2 )) = f a l s e if ■ { else unknown EjDSz Pointer to S i Deadi(qi) = t r u e or Dead 2 (q2) = t r u e AccAlli(qi) = t r u e and AccAll 2 (q2) = tru e AccAll\(qi) = f a l s e or AccAll 2 (q2) = f a l s e if S j ^ S 2 if S i = S 2 Figure 6.5: Intersection Implementation. 6.3.2 C om plexity of Regular Set Enum eration Although constructing an implicit DFA for the intersection of two DFAs is a constant tim e operation, enum erating hypotheses from this DFA is not. A hypothesis is enum erated by finding a path from the start state to an accept state. This can take both tim e and space proportional to the num ber of states in the DFA. The implicit DFA for the intersection represents the explicit DFA shown in Fig ure 6.6. The implicit DFA generates states and edges in the explicit DFA as they are needed by the search. Thus the space cost is bounded by the num ber of states actually searched instead of by the total num ber of states in the explicit DFA. This saves space on average, but the worst case complexities are still the same. 132 (Q ,s , 6 ,F ,E ) = (Qi,s-i, 6 i,Fi,'E)n(Q 2 ,s 2 , 6 2 ,F 2 ,'E) where Q = Q 1 XQ 2 s = ( s i,s 2) 5((?i»9a),ff) = {Si(qi,(r),62{q2,cr)} F = F \x F 2 Figure 6.6: Explicit DFA for the Intersection of Two DFAs. The cost of enum erating a hypothesis from this DFA is proportional to the num ber of states in the DFA, and the cost of computing the next-state function, 6 . The explicit DFA has |Q i||Q 2| states, and the cost of com puting 6 is cost ( 6 1 ) + cost( 6 2). The tim e and space needed to enum erate a hypothesis from the intersection of two DFAs is bounded by Equation 6.8. 0(|Q i||< 32|[cosi(*i) + cosf(£2)]) (6.8) 6.3.3 C om plexity of Convex Set Intersection The intersection of two convex sets is computed in two phases. In the first phase, a convex set is com puted th at represents the intersection of the two sets, but whose boundary sets contain extraneous hypotheses. In the second phase, these hypotheses are elim inated from the boundary sets, yielding a m inim al convex set. This analysis follows [Hirsh, 1990]. Phase One. The non-minimal intersection of two convex sets, (H, X, Si,G\) and (H,~(,S 2 ,G 2 ), is defined in Figure 6.7. In this definition, LUB(a,b,~<) returns the least upper bounds of a and b in the partial order X, and GLB(a, 6, -<) returns the greatest lower bounds of a and b in T hat is, L£/B(a, 6, X) returns the m ost spe cific common generalizations of a and b, and GLB(a, b, X) returns the m ost general common specializations of a and b. Com puting the non-minim al intersection involves finding LUB(a, b, X) for every pair of hypotheses (a, 6) in S i x S 2, and GLB(x,y, x ) for every pair of hypotheses (x,y) in Gi x G2. There are |<S'i||52| + |G i||G 2| such pairs. The cost of computing 133 (S,G,^,H) = (Si,Gu ^,H)n(S 2,G 2,H,^) where 5 = {s | 3slesllS26S2 s.t. s € LUB(sl,s2,-<)} G = { g | 331eGllff2G G 2 s.t. g € GLB(gl,g2,^)} Figure 6.7: Intersection of Convex Sets, Phase One. the non-minim al intersection is proportional to the num ber of pairs and the cost of com puting the GLB s and LUBs. 
Assuming that the cost of computing the LUB of two hypotheses is t_lub, and the cost of computing the GLB is t_glb, the cost of computing the non-minimal intersection is given by Equation 6.9.

    |S_1||S_2| t_lub + |G_1||G_2| t_glb    (6.9)

Phase Two. In the second phase, the boundary sets are minimized by removing from S hypotheses that are more general than other elements of S, or that are not more specific than some element of G. The first test requires |S|^2 comparisons. Removing elements of S that are not more specific than or equivalent to some element of G requires |S|(|G_1| + |G_2|) comparisons. This makes use of the observation that since G contains the GLBs of every pair of elements in G_1 and G_2, an element s ∈ S is more specific than some element of G if and only if s is more specific than an element of G_1 and an element of G_2. The cost of minimizing S is therefore [|S|^2 + |S|(|G_1| + |G_2|)] t_≼, where t_≼ is the cost of comparing two hypotheses to determine which is more general.

S contains the LUBs of all pairs of elements from S_1 and S_2, so |S| ≤ |S_1||S_2|. The cost of minimizing the S set is therefore bounded by [(|S_1||S_2|)^2 + |S_1||S_2|(|G_1| + |G_2|)] t_≼. A symmetric analysis holds for minimizing G. The total cost of minimizing the intersection of two convex sets is given by Equation 6.10.

    [(|S_1||S_2|)^2 + |S_1||S_2|(|G_1| + |G_2|) + (|G_1||G_2|)^2 + |G_1||G_2|(|S_1| + |S_2|)] t_≼    (6.10)

Total Cost. The cost of intersecting two convex sets is the sum of Equation 6.9 and Equation 6.10. The values of t_≼, t_lub, and t_glb depend on both the hypothesis space and the generality ordering (≼). The time cost of computing the intersection of two convex sets, (S_1, G_1) and (S_2, G_2), is bounded by Equation 6.11:

    |S_1||S_2| t_lub + |G_1||G_2| t_glb +
    [(|S_1||S_2|)^2 + |S_1||S_2|(|G_1| + |G_2|) + (|G_1||G_2|)^2 + |G_1||G_2|(|S_1| + |S_2|)] t_≼    (6.11)

6.3.4 Complexity of Enumerating Convex Sets

Enumerating a hypothesis from an intersection of convex sets is inexpensive. If only one or two hypotheses are needed, the first hypotheses in the S and G sets can be returned. These are both constant-time operations. This is sufficient to induce a hypothesis, and to answer the emptiness and uniqueness queries.

If additional hypotheses are needed, they can be enumerated by starting with an element of the S set and climbing the generalization tree. This is done by generating the immediate parents of the hypothesis in the generalization tree, and enumerating only those parents that are also covered by at least one G set element. The complexity of generating an immediate parent depends on the representation of the generality ordering. For conjunctive languages with tree-structured and lattice-structured features, it is a constant-time operation. Verifying that a hypothesis is covered by at least one G set element takes time proportional to |G| t_≼. The time to generate each element is therefore O(|G| t_≼), at least for conjunctive languages with tree- or lattice-structured features.
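A generator for this enumeration scheme might look as follows. This is a hypothetical sketch: parents(h) is an assumed constant-time function returning the immediate generalizations of h, and more_general is the ordering test; neither is specified by the dissertation.

    from collections import deque

    def enumerate_version_space(S, G, parents, more_general):
        """Yield hypotheses of the convex set (S, G), climbing upward from S.

        Only hypotheses still covered by some element of G are expanded.
        """
        covered = lambda h: any(g == h or more_general(g, h) for g in G)  # O(|G| t_≼)
        frontier = deque(h for h in S if covered(h))
        seen = set(frontier)
        while frontier:
            h = frontier.popleft()
            yield h
            for p in parents(h):              # immediate generalizations, O(1) each
                if p not in seen and covered(p):
                    seen.add(p)
                    frontier.append(p)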
6.3.5 Complexity Comparison

The intersection and enumeration costs for convex sets and regular sets are in terms of entirely different quantities. The costs for convex sets are in terms of the S and G set sizes, whereas the costs for regular sets are in terms of the number of states in the DFA. In order to compare these costs we need to know the relationship between these different quantities for equivalent sets.

It is difficult to find a general equation relating these quantities, but it is possible to relate them when the sets are assumed to be subsets of a specific hypothesis space. The idea is to devise an algorithm for translating an arbitrary convex subset of the hypothesis space to an equivalent regular set, such that the size of the resulting regular set can be expressed in terms of |S| and |G|. This provides the necessary relation between the sizes of equivalent regular and convex subsets of the hypothesis space. This relation is only valid for the given hypothesis space, or class of hypothesis spaces, and the approach requires that every convex subset of the hypothesis space can be expressed as an equivalent regular set.

The computational complexities of CS-KII and RS-KII will be compared using this approach for the class of hypothesis spaces described by a conjunctive language with tree-structured features, where the generalization tree has a fixed depth bound. A similar comparison will be made for conjunctive languages in which the features are lattice-structured, where the generalization hierarchy for each feature has a fixed depth bound. These hypothesis spaces are commonly used with IVSM and CEA, and every convex subset of these hypothesis spaces can be expressed as a regular set (see Section 6.2.2.1).

6.3.5.1 Equating Regular and Convex Sets

A hypothesis in a conjunctive language can be written as a vector of feature values, (v_1, v_2, ..., v_k), where v_i is one of the values in the generalization hierarchy for feature f_i. A hypothesis is generalized by replacing at least one feature value, v_i, with a more general value. A hypothesis is specialized by replacing at least one feature value with a more specific value. The generality ordering over the hypotheses has a maximally general hypothesis, U, that classifies every instance as positive, and a maximally specific element, ∅, that classifies every instance as negative.

A convex set in this language can be written as shown in Equation 6.12, where x ≼ y if x and y are hypotheses in the language and x is either more specific than y, or x = y.

    (S, G) = {h | ∃ s ∈ S, ∃ g ∈ G such that s ≼ h ≼ g}
           = ∪_{s∈S} {h | s ≼ h ≼ U} ∩ ∪_{g∈G} {h | ∅ ≼ h ≼ g}    (6.12)

In a conjunctive language, a hypothesis can be written as a vector of feature values, (v_1, v_2, ..., v_k). Equation 6.12 can therefore be rewritten as shown in Equation 6.13. In this definition, f_i(h) is the value of hypothesis h on feature f_i.

    (S, G) = ∪_{s∈S} {(v_1, ..., v_k) | f_1(s) ≼ v_1 ≼ f_1(U) ∧ ... ∧ f_k(s) ≼ v_k ≼ f_k(U)}
           ∩ ∪_{g∈G} {(v_1, ..., v_k) | f_1(∅) ≼ v_1 ≼ f_1(g) ∧ ... ∧ f_k(∅) ≼ v_k ≼ f_k(g)}
           = ∪_{s∈S} {v_1 | f_1(s) ≼ v_1 ≼ f_1(U)} × ... × {v_k | f_k(s) ≼ v_k ≼ f_k(U)}
           ∩ ∪_{g∈G} {v_1 | f_1(∅) ≼ v_1 ≼ f_1(g)} × ... × {v_k | f_k(∅) ≼ v_k ≼ f_k(g)}    (6.13)

Let A_i(s) be the regular set {v_i | f_i(s) ≼ v_i ≼ f_i(U)}, and let B_i(g) be the regular set {v_i | f_i(∅) ≼ v_i ≼ f_i(g)}. The convex set described in Equation 6.13 can be expressed as the regular set shown in Equation 6.14. Regular grammars are closed under concatenation, intersection, and finite union, so the resulting set is a regular grammar.

    (S, G) = ∪_{s∈S} A_1(s)A_2(s)...A_k(s) ∩ ∪_{g∈G} B_1(g)B_2(g)...B_k(g)    (6.14)

The number of states in the regular set described in Equation 6.14 depends on the number of states in the A_i(s) and B_i(g) grammars for each s and g.
The sizes of these grammars depend on the depth of the generalization hierarchies (trees) for each feature, and on whether these hierarchies are tree-structured or lattice-structured.

6.3.5.2 Tree Structured Hierarchies

If the generalization hierarchy for each feature is tree-structured, then there is at most one hypothesis in the S set [Bundy et al., 1985]. Equation 6.14 reduces to Equation 6.15, where s is the single element of S.

    (S, G) = ∪_{g∈G} (A_1(s) ∩ B_1(g))(A_2(s) ∩ B_2(g)) ... (A_k(s) ∩ B_k(g))    (6.15)

For brevity, the set A_i(s) ∩ B_i(g) will also be referred to as X_i(s, g). The set X_i(s, g) is equivalent to {v_i ∈ F_i | f_i(s) ≼ v_i ≼ f_i(g)}, where F_i is the set of values for feature f_i. It contains all values between f_i(s) and f_i(g) in the generalization hierarchy for f_i. In a tree-structured generalization hierarchy with depth d, X_i(s, g) contains at most d values. This is because s and g must be on the same branch of the tree in order for s to be less general than g, and no branch of the tree has more than d nodes. The DFA for X_i(s, g) therefore has at most O(d) states. The DFA for X_1(s, g)X_2(s, g) ... X_k(s, g) has at most O(kd) states. Call this DFA X(s, g) for short. The DFA for ∪_{g∈G} X(s, g) has at most O(|G|kd) states.

For hypothesis spaces described by conjunctive languages in which the generalization hierarchies for each feature are tree-structured and have a maximum depth of d, every convex subset of the space can be expressed as a regular set as shown in Equation 6.14, and the DFA for this set has at most O(|G|kd) states. This relation between the sizes of equivalent convex and regular sets allows the time complexity of intersecting two regular sets and enumerating a hypothesis from their intersection (Equation 6.8) to be rewritten in terms of the boundary set sizes of two equivalent convex sets, as shown in Equation 6.16. In this equation, it is assumed that the regular set (Q_1, s_1, δ_1, F_1, Σ) represents the same set of hypotheses as the convex set (S_1, G_1), and that (Q_2, s_2, δ_2, F_2, Σ) represents the same set of hypotheses as (S_2, G_2). Equation 6.16 is derived by substituting |G_1|kd for |Q_1| and |G_2|kd for |Q_2| in Equation 6.8.

    O(|Q_1||Q_2| [cost(δ_1) + cost(δ_2)]) = O(|G_1||G_2|(kd)^2 [cost(δ_1) + cost(δ_2)])    (6.16)

For most grammars, cost(δ_1) and cost(δ_2) are O(1). The cost of enumerating a hypothesis from the intersection of two regular sets has now been expressed in terms of the sizes of S and G. This cost can be compared to the cost of enumerating a hypothesis from the intersection of two equivalent convex sets, (S_1, G_1) and (S_2, G_2). The cost of intersecting two convex sets is shown in Equation 6.11. For conjunctive languages with tree-structured features, the S set never has more than one element, and the cost of determining the generality relation between two hypotheses, t_≼, is O(kd) [Hirsh, 1990]. The cost of finding the least upper bound of two hypotheses (t_lub) is also O(kd) when the generalization hierarchies for each feature are tree-structured, as is the cost of finding the greatest lower bound (t_glb). Substituting these values into Equation 6.11 yields Equation 6.17.

    kd + |G_1||G_2|kd + (|G_1| + |G_2| + (|G_1||G_2|)^2 + 2|G_1||G_2|)kd    (6.17)

This cost is bounded by Equation 6.18:

    O((|G_1||G_2|)^2 kd)    (6.18)

The cost of enumerating a hypothesis from the intersection of two regular sets is O(|G_1||G_2|(kd)^2), and the cost of enumerating a hypothesis from two equivalent convex sets is O((|G_1||G_2|)^2 kd).
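Before comparing these two bounds, a small sketch may make the factoring concrete. The code below is a toy illustration, not the dissertation's construction: the feature taxonomy, hypotheses, and function names are invented, and the "explicit" count is an upper bound since the per-g products may overlap.

    from math import prod

    # Toy tree-structured taxonomy shared by every feature: child -> parent links.
    PARENT = {"red": "warm", "orange": "warm", "blue": "cool",
              "warm": "any", "cool": "any"}

    def chain_up(v):
        """Values from v up to the root (the tree branch containing v)."""
        out = [v]
        while out[-1] in PARENT:
            out.append(PARENT[out[-1]])
        return out

    def interval(lo, hi):
        """X_i(s,g): values between lo and hi on one branch; at most d of them."""
        chain = chain_up(lo)
        return chain[:chain.index(hi) + 1] if hi in chain else []

    # One most-specific hypothesis s and a G set with two most-general hypotheses.
    s = ("red", "blue", "red")
    G = [("warm", "any", "warm"), ("any", "cool", "red")]

    factored = [[interval(si, gi) for si, gi in zip(s, g)] for g in G]
    factored_size = sum(len(x) for per_g in factored for x in per_g)   # ~ |G|·k·d
    explicit_size = sum(prod(len(x) for x in per_g) for per_g in factored)
    print(factored_size, explicit_size)  # the explicit form multiplies out

The factored form stores on the order of |G|·k·d values, while spelling the version space out multiplies the per-feature interval sizes together, which is exactly the gap the DFA representation exploits.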
If kd ≪ |G_1||G_2|, then the cost for regular sets is significantly less than the cost for convex sets. Since the size of the G sets can grow exponentially in the number of examples, this is often a valid assumption. When a convex set is represented as an equivalent regular set, and the generalization hierarchies are tree-structured, the regular set is effectively a factored representation of the convex set. This allows the set to be represented more compactly, and improves the computational complexity. Similar complexity improvements can be obtained for convex sets by maintaining them in factored form [Subramanian and Feigenbaum, 1986]. However, there are still expressive differences between convex and regular sets. In particular, regular sets can represent sets with "holes", whereas convex sets cannot. Also, convex sets are not closed under union [Hirsh, 1990], but regular sets are [Hopcroft and Ullman, 1979].

The computational complexity of enumerating a hypothesis from an intersection of n convex sets in IVSM is n|G*|^4 kd, where |G*| is the maximum size attained by the G set. The size of G* can be as high as 2^n [Haussler, 1988]. The computational complexity of enumerating a hypothesis in RS-KII from the same n sets represented as regular sets is bounded by O(|G|^n (kd)^n), where |G| is the size of the largest G set among the convex sets being intersected. The worst-case complexities of IVSM and RS-KII are equivalent. Although RS-KII can be exponential, this is a very loose upper bound. As will be seen in Section 6.4, the complexity of RS-KII can be polynomial when the complexity of IVSM is exponential.

6.3.5.3 Lattice Structured Features

When the generalization hierarchies for the features are lattice-structured rather than tree-structured, the relation between the sizes of convex and regular sets changes somewhat.
The S set is not guaranteed to be singleton in this language [Bundy et al., 1985]. Convex subsets are therefore expressed as shown in Equation 6.19.

    (S, G) = ∪_{s∈S} {v_1 | f_1(s) ≼ v_1 ≼ f_1(U)} × ... × {v_k | f_k(s) ≼ v_k ≼ f_k(U)}
           ∩ ∪_{g∈G} {v_1 | f_1(∅) ≼ v_1 ≼ f_1(g)} × ... × {v_k | f_k(∅) ≼ v_k ≼ f_k(g)}    (6.19)

Let A_i(s) be the regular set {v_i | f_i(s) ≼ v_i ≼ f_i(U)}, and let B_i(g) be the regular set {v_i | f_i(∅) ≼ v_i ≼ f_i(g)}. The regular set equivalent to the one in Equation 6.19 is shown in Equation 6.21, where X_i(s, g) = A_i(s) ∩ B_i(g).

    (S, G) = ∪_{s∈S} A_1(s)A_2(s)...A_k(s) ∩ ∪_{g∈G} B_1(g)B_2(g)...B_k(g)    (6.20)
           = ∪_{(s,g)∈S×G} X_1(s, g)X_2(s, g)...X_k(s, g)    (6.21)

X_i(s, g) is the set {v_i ∈ F_i | f_i(s) ≼ v_i ≼ f_i(g)}, where F_i is the set of values in the generalization hierarchy for feature f_i, and x ≼ y means that either value x is less general than value y in the generalization hierarchy for F_i, or x = y. When the hierarchy for F_i is lattice-structured and has a maximum depth of d, there can be O(wd) values between s and g, where w is the width (branching factor) of the lattice. The DFA for X_i(s, g) therefore has O(wd) states.

Let X(s, g) stand for X_1(s, g)X_2(s, g)...X_k(s, g). The DFA for X(s, g) is a concatenation of the DFAs for each of the features, X_i(s, g), and has at most O(kwd) states. The DFA for ∪_{(s,g)∈S×G} X(s, g) has at most O(|S||G|kwd) states, by arguments similar to the ones used for tree-structured features.

This relation between the number of states in a regular set and the boundary set sizes of an equivalent convex set can be used to compare the complexities of intersecting and enumerating hypotheses from both regular and convex sets. Let (Q_1, s_1, δ_1, F_1, Σ_1) be a regular set that represents the same set of hypotheses as the convex set (S_1, G_1), and let (Q_2, s_2, δ_2, F_2, Σ_2) be a regular set that represents the same set of hypotheses as the convex set (S_2, G_2). The cost of intersecting the two regular sets and enumerating a hypothesis from the intersection is bounded by O(|Q_1||Q_2|[cost(δ_1) + cost(δ_2)]) (Equation 6.8), where cost(δ_1) and cost(δ_2) are the costs of computing the next-state functions, δ_1 and δ_2. These costs can generally be assumed to be constant unless they are proportional to a relevant scale-up variable. Substituting O(|S_1||G_1|kwd) for the size of Q_1, and O(|S_2||G_2|kwd) for the size of Q_2, yields the cost of intersecting and enumerating a hypothesis from the two regular grammars in terms of the boundary set sizes of the equivalent convex sets. This cost is shown in Equation 6.22.

    O(|S_1||S_2||G_1||G_2|(kwd)^2)    (6.22)

This cost of intersecting two regular sets and enumerating a hypothesis from their intersection can now be compared to the cost of performing the same operations on two convex sets. Let (S_1, G_1) and (S_2, G_2) be two convex sets that represent, respectively, the same sets of hypotheses as the two regular sets (Q_1, s_1, δ_1, F_1, Σ_1) and (Q_2, s_2, δ_2, F_2, Σ_2). Since the cost of enumerating a hypothesis from a convex set is O(1) (see Section 6.3.4), the cost of intersection and enumeration is just the cost of intersecting two convex sets, as given in Equation 6.11. This equation depends on the boundary set sizes of both convex sets, the cost of determining the generality relation between two hypotheses (t_≼), and the costs of finding the least upper bound (t_lub) and greatest lower bound (t_glb) of two hypotheses in the generalization hierarchy.

For lattice-structured feature hierarchies, the cost of determining whether one hypothesis is more general than another is O(kwd) [Hirsh, 1990]. The costs of finding the least upper bound or greatest lower bound of two hypotheses in the generalization hierarchy are also bounded by O(kwd). The intuition behind this bound is that all three of these operations are essentially searches in the generalization hierarchy, starting from the hypotheses in question. The hierarchy for each feature can be searched independently. There are k features, each with a generalization hierarchy of width w and depth d. This leads to k independent searches, each visiting at most wd nodes in the hierarchy. Substituting O(kwd) for t_lub, t_glb, and t_≼ in Equation 6.11 yields the cost bound shown in Equation 6.23.

    |S_1||S_2|kwd + |G_1||G_2|kwd +
    [(|S_1||S_2|)^2 + |S_1||S_2|(|G_1| + |G_2|) + (|G_1||G_2|)^2 + |G_1||G_2|(|S_1| + |S_2|)]kwd    (6.23)

This is in turn bounded by Equation 6.24.

    O([(|S_1||S_2|)^2 + |S_1||S_2|(|G_1| + |G_2|) + (|G_1||G_2|)^2 + |G_1||G_2|(|S_1| + |S_2|)]kwd)    (6.24)

When the feature generalization hierarchies are lattice-structured, the cost of intersecting two regular sets and enumerating a hypothesis from their intersection (Equation 6.22) is roughly equivalent to the cost of performing the same operation on convex sets (Equation 6.24), depending on the relative sizes of the boundary sets and the feature hierarchies.
If the boundary sets are all about the same size, x, then the cost for regular sets becomes O(x^4 (kwd)^2) and the cost for convex sets becomes O(x^4 kwd). The convex sets are a little more efficient, by a factor of kwd, the cost of comparing two hypotheses. The regular sets are less efficient because the comparison is encoded in the states of the DFA, and these extra states combine multiplicatively when the DFAs are intersected. If the size of the boundary sets is significantly larger than kwd, then this additional factor does not have much of an impact, and the complexities of regular and convex sets are about the same with respect to the combined intersection and enumeration operation.

When the boundary set sizes differ, so that the S sets are much smaller or larger than the G sets, regular sets are more efficient. Let the sizes of S_1 and S_2 each be s, and the sizes of G_1 and G_2 each be g. The complexity of the intersection and enumeration operation for regular sets is O(s^2 g^2 (kwd)^2), and the complexity of the same operation on convex sets is O([s^4 + g^4 + s^2 g + g^2 s]kwd). This difference is most likely attributable to the fact that convex set intersection involves a minimization step that requires O(s^4 + g^4) comparisons among the hypotheses of the boundary sets, whereas regular set intersection has no corresponding minimization step. The complexity for convex sets is a squared factor greater than the complexity for regular sets if kwd ≪ (s + g)^2. If kwd dominates (s + g)^2, then convex sets are more efficient than regular sets.

6.4 Exponential Behavior in CS-KII and RS-KII

For both RS-KII and CS-KII, the worst-case complexity for inducing a hypothesis from n examples is O(2^n). This exponential behavior occurs in CS-KII and RS-KII for different reasons. The computational cost of inducing a hypothesis in RS-KII is proportional to the size of the DFA for the solution set. When P is empty, the solution set is just C. The C set is the intersection of n DFAs, one for each knowledge fragment. The intersection of two DFAs, A and B, results in a new DFA with |Q_A||Q_B| states, where |Q_A| is the number of states in A, and |Q_B| is the number of states in B. Intersecting n DFAs, each of size r, can result in a final DFA with up to r^n states.

For CS-KII, the cost of integrating an example is proportional to the size of the boundary sets for the current version space. The worst case occurs when integrating an example causes the boundary sets to grow geometrically. When a new negative example is processed, there may be more than one way to specialize each hypothesis in the G set in order to exclude the example. If each hypothesis has two specializations, and none of these hypotheses needs to be pruned from G, then the size of the G set doubles. Recall that a hypothesis is pruned from G if it is more specific than some other element of G, or if it is not more general than any element of S. The S set grows in the same fashion when there is more than one way to generalize the hypotheses in S in order to cover a new positive example. This kind of growth can also occur when non-example knowledge fragments are integrated. This geometric increase in the size of a boundary set is known as fragmentation. After processing n fragmenting examples or knowledge fragments, the affected boundary set contains 2^n hypotheses.
In a geometric series, the cost of processing the last example dominates the cost of processing all of the previous examples, so the computational complexity is O(2^n).

Although both RS-KII and CS-KII can be exponential in the worst case, the complexity of RS-KII is bounded by that of CS-KII for some hypothesis space languages, such as the conjunctive languages described in Section 6.2.2.1. In one case, the complexity of RS-KII is considerably less than that of CS-KII. This is the hypothesis space and set of examples described by Haussler [Haussler, 1988], which induces exponential behavior in CEA (and in IVSM and CS-KII by extension). For this same task, the complexity of RS-KII is only polynomial in the number of examples.

6.4.1 Haussler's Task

Haussler [Haussler, 1988] describes a hypothesis space and sequence of examples that cause the boundary sets of CEA to grow geometrically. Both the space complexity and time complexity of CEA are exponential in the number of examples.

    (f, f, t, t, ..., t, t)
    (t, t, f, f, ..., t, t)
    ...
    (t, t, t, t, ..., f, f)

Figure 6.8: Haussler's Negative Examples.

Hypotheses in Haussler's task are conjuncts of feature values, with one value for each of k features. The possible values for each feature are true, false, and don't-care, abbreviated as t, f, and d, respectively. For example, a hypothesis for k = 4 might be (t, f, d, t). The don't-care value is more general than the true and false values, and neither true nor false is more general than the other.

The examples for this task consist of k/2 negative examples and a single positive example. The positive example is (t, t, t, ..., t). The negative examples have two adjacent features with the value false, and the remaining features have the value true. For the i-th negative example, features 2i and 2i + 1 are false. See Figure 6.8. The positive example is presented first, followed by the k/2 negative examples in order.

6.4.2 Performance of CS-KII and RS-KII

For each negative example in Haussler's task there are two ways to specialize each element of the G set in order to exclude the negative example, one for each of the example's false values. None of the specializations can be pruned from G, so the size of the G set doubles after processing each negative example. After processing all k/2 examples, the G set contains 2^{k/2} hypotheses. The cost of induction in CS-KII is proportional to the sizes of the S and G sets, so the cost of inducing a hypothesis from n = k/2 of these examples is O(2^n).

The cost of inducing a hypothesis from these same examples in RS-KII is only O(n^2). This dramatic change is due to the representational differences between convex sets and DFAs. Processing one of Haussler's negative examples doubles the number of most general hypotheses in the version space. This doubles the size of the convex set representation of the version space, but does not double the size of the equivalent DFA representation, at least not in this hypothesis space. The size of the DFA may grow geometrically in other hypothesis spaces, or for other (non-example) knowledge sources. In this hypothesis space, processing one of Haussler's negative examples only adds a single state to the DFA. Thus, the size of the DFA grows only linearly in the number of examples.
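The doubling is easy to reproduce by brute force. The sketch below is a toy Python re-implementation for illustration (not the dissertation's code): it enumerates the version space over {t, f, d}^k for Haussler's examples and reports the boundary-set sizes, showing |G| = 2^{k/2} while |S| = 1.

    from itertools import product

    def covers(h, x):
        """A conjunctive hypothesis covers an instance iff every feature
        is don't-care ('d') or matches the instance's value."""
        return all(hv == "d" or hv == xv for hv, xv in zip(h, x))

    def haussler_examples(k):
        pos = [("t",) * k]
        neg = [tuple("f" if j in (2 * i, 2 * i + 1) else "t" for j in range(k))
               for i in range(k // 2)]
        return pos, neg

    def boundary_sets(k):
        pos, neg = haussler_examples(k)
        vs = [h for h in product("tfd", repeat=k)
              if all(covers(h, p) for p in pos)
              and not any(covers(h, n) for n in neg)]
        # In this language, a is strictly more general than b iff a ≠ b and
        # a "covers" b when b is read as an instance.
        strictly_more_general = lambda a, b: a != b and covers(a, b)
        S = [h for h in vs if not any(strictly_more_general(h, o) for o in vs)]
        G = [h for h in vs if not any(strictly_more_general(o, h) for o in vs)]
        return S, G

    S, G = boundary_sets(6)
    print(len(S), len(G))   # 1 8  -- i.e. |G| = 2^(6/2)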
The initial DFA has k states, so the cost of processing k/2 negative examples is Σ_{i=1}^{k/2} (k + i) = O(k^2). Substituting n for k/2 yields O(n^2).

An example of RS-KII solving Haussler's task for six features (k = 6) is given below, followed by a more detailed complexity analysis motivated by this example. The detailed analysis includes some additional costs that do not appear in the simplified analysis above, but the overall complexity is still O(n^2).

6.4.2.1 RS-KII's Performance on Haussler's Task

For this instantiation of Haussler's task there are six features. The examples therefore consist of one positive example and three (k/2) negative examples:

• p_0 = (t, t, t, t, t, t)
• n_1 = (f, f, t, t, t, t)
• n_2 = (t, t, f, f, t, t)
• n_3 = (t, t, t, t, f, f)

Each example is translated into a (C, P) pair, where C is the set of hypotheses consistent with the example, and P is the empty set. C corresponds to the version space of hypotheses consistent with the example. Translations of the four examples are shown in Table 6.1, where ¬[...] denotes the complement with respect to the hypothesis space: a negative example rules out exactly the hypotheses that cover it. The corresponding DFAs for the C sets are shown in Figure 6.9 for the positive example, and in Figure 6.10, Figure 6.11, and Figure 6.12 for the three negative examples. The unshaded nodes are dead states.

    Example                  C                                          P
    p_0 = (t,t,t,t,t,t)      C_0 = (d|t)(d|t)(d|t)(d|t)(d|t)(d|t)       P_0 = {}
    n_1 = (f,f,t,t,t,t)      C_1 = ¬[(d|f)(d|f)(d|t)(d|t)(d|t)(d|t)]    P_1 = {}
    n_2 = (t,t,f,f,t,t)      C_2 = ¬[(d|t)(d|t)(d|f)(d|f)(d|t)(d|t)]    P_2 = {}
    n_3 = (t,t,t,t,f,f)      C_3 = ¬[(d|t)(d|t)(d|t)(d|t)(d|f)(d|f)]    P_3 = {}

Table 6.1: Translations of Haussler's Examples.

Figure 6.9: DFA for C_0, the Version Space Consistent with p_0.
Figure 6.10: DFA for C_1, the Version Space Consistent with n_1.
Figure 6.11: DFA for C_2, the Version Space Consistent with n_2.
Figure 6.12: DFA for C_3, the Version Space Consistent with n_3.

The four examples are integrated into RS-KII one at a time. After integrating each example, the solution set of the resulting (C, P) pair is tested for emptiness. If the solution set of (C, P) is empty, then there are no hypotheses consistent with the examples seen so far, much less with all of the examples, so no more examples are processed. This emulates the behavior of CEA, which integrates each example one at a time, and after integrating each example tests the resulting version space for emptiness. If it is empty, then the version space has collapsed, and the remaining examples are not integrated.

The test for emptiness is important for the polynomial behavior of RS-KII. The empty test identifies and eliminates dead states from the DFA for C in the process of ascertaining whether C is empty. If these dead states are not removed, then when the C set for the next example is intersected, those dead states combine multiplicatively with the states in C, so that the number of dead states in the intersection is proportional to the product of the number of states in C and the number of original dead states. This geometric growth means that the DFA for C_0 ∩ C_1 ∩ ... ∩ C_{k/2} has O(2^{k/2}) states, most of which are dead states. Intersecting these DFAs takes time proportional to O(k/2), since intersection is a constant-time operation in RS-KII, but enumerating a hypothesis from the intersection can take time proportional to the number of states in the DFA, or O(2^{k/2}). The complexity of CEA on this task is also O(2^{k/2}).
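The translations in Table 6.1 can be checked mechanically. In the sketch below (Python; ordinary regular expressions stand in for RS-KII's DFAs, and each complement is applied by negating a match rather than complementing the automaton), the filter recovers the same version space as the brute-force enumeration above.

    import re
    from itertools import product

    k = 6
    cover_p0 = "[dt]" * k                     # C_0: hypotheses covering p0
    neg_cover = {                              # hypotheses covering each n_i
        "n1": "[df][df]" + "[dt]" * 4,
        "n2": "[dt]" * 2 + "[df][df]" + "[dt]" * 2,
        "n3": "[dt]" * 4 + "[df][df]",
    }

    def in_solution(h):
        """h ∈ C0 ∩ C1 ∩ C2 ∩ C3: it covers p0 and covers no negative."""
        return (re.fullmatch(cover_p0, h) is not None
                and all(re.fullmatch(p, h) is None for p in neg_cover.values()))

    vs = [h for h in ("".join(t) for t in product("tfd", repeat=k))
          if in_solution(h)]
    print(len(vs))          # size of the version space for k = 6
    print("tttttt" in vs)   # the single most specific hypothesis is present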
If the dead states are eliminated after each intersection, however, then the DFA for C_0 ∩ C_1 ∩ ... ∩ C_i has only a few more states than the DFA for C_0 ∩ C_1 ∩ ... ∩ C_{i−1}. This linear growth leads to a polynomial time complexity for RS-KII on this task, as will be seen below. The final DFA for the intersection of the C sets for all k/2 examples has only O(k) states, and the time spent in eliminating dead states is only O(k^2).

The positive example, p_0, is seen first. It is translated into (C_0, P_0) and integrated into RS-KII. This is the first knowledge seen, so (C, P) = (C_0, P_0). The negative examples are then processed one at a time, starting with n_1. Example n_1 is translated into (C_1, P_1) and integrated with (C, P), yielding (C_0 ∩ C_1, P_0 ∪ P_1).

Constructing the implicit DFAs for C_0 ∩ C_1 and P_0 ∪ P_1 are constant-time operations. The solution set for (C_0 ∩ C_1, P_0 ∪ P_1) is then tested for emptiness. The P sets are empty for all of the examples in this task, so P_0 ∪ P_1 = ∅, and the solution set is just C_0 ∩ C_1. Testing whether C_0 ∩ C_1 is empty takes time proportional to the number of states in the DFA representing the intersection. The DFA for C_0 ∩ C_1 is shown in Figure 6.13. The unshaded states are dead states.

Figure 6.13: DFA for C_0 ∩ C_1.

The empty test tries to find a path from the start state of the DFA to an accept state. Along the way, it eliminates any dead states it finds. If all of the states are identified as dead states and eliminated, then the DFA is empty. Removing the dead states from the DFA for C_0 ∩ C_1 results in the DFA shown in Figure 6.14.

Figure 6.14: DFA for C_0 ∩ C_1 After Empty Test.

The same process is repeated on the remaining negative examples. Example n_2 is seen next, and translated into (C_2, P_2). Integrating (C_2, P_2) with (C_0 ∩ C_1, ∅) yields (C_0 ∩ C_1 ∩ C_2, ∅). The DFA for C_0 ∩ C_1 ∩ C_2 is shown in Figure 6.15. This DFA is tested for emptiness, and its dead states are removed as a side effect. This results in the DFA of Figure 6.16.

Figure 6.15: DFA for C_0 ∩ C_1 ∩ C_2.
Figure 6.16: DFA for C_0 ∩ C_1 ∩ C_2 After Empty Test.

Finally, n_3 is seen. Translating and integrating n_3 yields (C = C_0 ∩ C_1 ∩ C_2 ∩ C_3, ∅). The DFA for C is shown in Figure 6.17. After the empty test removes the dead states, the DFA for C is as shown in Figure 6.18.

Figure 6.17: DFA for C_0 ∩ C_1 ∩ C_2 ∩ C_3.
Figure 6.18: DFA for C_0 ∩ C_1 ∩ C_2 ∩ C_3 After Empty Test.

C_0 ∩ C_1 ∩ C_2 ∩ C_3 corresponds to the version space consistent with all four examples. This is the same version space computed by IVSM from these examples, though represented as a DFA instead of as a convex set. The equivalent convex set is shown below.

    S = {(t,t,t,t,t,t)}
    G = {(t,d,t,d,t,d), (t,d,t,d,d,t), (t,d,d,t,t,d), (t,d,d,t,d,t),
         (d,t,t,d,t,d), (d,t,t,d,d,t), (d,t,d,t,t,d), (d,t,d,t,d,t)}

The S set contains one hypothesis, and the G set contains eight hypotheses, each with six features. This consumes 54 units of space. The equivalent regular set has eleven states, with three edges per state, for a total of 33 units of space. Edges that lead directly to a dead state have been omitted from the figures for clarity.
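The empty test amounts to a reachability computation. The following sketch is hypothetical (an explicit state-table DFA is used for simplicity rather than RS-KII's intensional representation): it marks as dead every state from which no accept state is reachable, prunes those states, and decides emptiness along the way.

    def prune_dead_states(delta, start, accepts):
        """Empty test with dead-state elimination.

        `delta` maps (state, symbol) -> state. A state is dead iff no accept
        state is reachable from it. Returns (is_empty, live_delta), where
        live_delta keeps only transitions among live, reachable states.
        """
        # Forward pass: states reachable from start, with inverted edges.
        reachable, stack, preds = {start}, [start], {}
        while stack:
            q = stack.pop()
            for (p, a), r in delta.items():
                if p == q:
                    preds.setdefault(r, set()).add(p)
                    if r not in reachable:
                        reachable.add(r)
                        stack.append(r)
        # Backward closure from accept states: everything else is dead.
        live = set(a for a in accepts if a in reachable)
        stack = list(live)
        while stack:
            q = stack.pop()
            for p in preds.get(q, ()):
                if p not in live:
                    live.add(p)
                    stack.append(p)
        live_delta = {(p, a): r for (p, a), r in delta.items()
                      if p in live and r in live}
        return start not in live, live_delta

    # Tiny example: state 'X' is a dead trap; it disappears after pruning.
    delta = {("s", "t"): "ok", ("s", "f"): "X", ("X", "t"): "X", ("ok", "t"): "ok"}
    empty, pruned = prune_dead_states(delta, "s", {"ok"})
    assert not empty
    assert all(p != "X" and r != "X" for (p, _), r in pruned.items())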
6.4.3 Complexity Analysis

The costs involved in processing the examples are the cost of translating each example, the cost of integrating each example, and the cost of applying the empty test to the solution set of the resulting C set. The cost of translating example i into (C_i, P_i) is the cost of constructing the DFAs for C_i and P_i. The cost of constructing one of these DFAs is proportional to the number of states and edges it has. P_i is always empty, so the cost of constructing P_i is O(1). The DFA for the C set of the positive example has k + 2 states and (k + 2)|Σ| edges. The DFA for the C set of each negative example has 2k + 1 states and (2k + 1)|Σ| edges. The cost of translating the examples is the sum of the sizes of these DFAs:

    [(k + 2) + (k + 2)|Σ|] + (k/2)[(2k + 1) + (2k + 1)|Σ|] = (k^2 + 3k/2 + 2)(|Σ| + 1)    (6.25)

Each example is processed by integrating its translation, (C_i, P_i), into the (C, P) pair that represents all of the examples integrated so far, and then testing whether the solution set of (C ∩ C_i, P ∪ P_i) is empty. Integration in RS-KII is a constant-time operation, so the cost of integrating all 1 + k/2 examples is O(k).

The cost of Empty((C ∩ C_i, P ∪ P_i)) is proportional to the number of states in the DFA for the solution set of (C ∩ C_i, P ∪ P_i). Since P and P_i are always empty in this task, the solution set is just C ∩ C_i. The cost of the empty test is therefore proportional to the number of states in the DFA for C ∩ C_i.

The positive example is the first one integrated, so the empty test is applied to (C_0, P_0), the translation of p_0. Since P_0 is empty, this is equivalent to testing whether C_0 is empty. The DFA for C_0 is minimal, as are the DFAs for all of the examples. Determining the emptiness of a minimal DFA is a constant-time operation, so the emptiness of C_0 can be determined in constant time.

For a negative example, n_i, the empty test is applied to C ∩ C_i, where (C_i, ∅) is the translation of n_i, and C is the set of hypotheses consistent with all of the previously observed examples (p_0 through n_{i−1}). Determining the emptiness of C ∩ C_i takes time proportional to the number of states in the DFA for C ∩ C_i. This number can be expressed as a function of i and the number of features, k.

After processing examples p_0 through n_{i−1}, the DFA for C is of the form shown in Figure 6.19. The DFA for C consists of i − 1 "diamonds", each with three states, and a chain of k − 2(i − 1) states.

Figure 6.19: DFA for C_0 ∩ C_1 ∩ ... ∩ C_{i−1} After Empty Test.

Intersecting the DFA for C with the DFA for C_i results in a DFA of the form shown in Figure 6.20. This DFA is identical to the DFA for C up to state i − 1. After that, it has one more "diamond", two chains of length k − 2i, and a dead state. There are i diamonds, each with three states, for a total of 3i states. This does not include the state labeled i. There are two branches of k − 2i states each, and a dead state. The total number of states in C ∩ C_i is therefore:

    3i + 1 + 2(k − 2i) + 1 = 2k − i + 2    (6.26)

Figure 6.20: DFA for C ∩ C_i.

The lower branch in Figure 6.20 (unfilled circles) consists entirely of dead states, and is pruned by the empty test. The dead branch has k − 2i + 1 states, so after applying the empty test, the number of states in C ∩ C_i is given by the following equation:

    (2k − i + 2) − (k − 2i + 1) = k + i + 1    (6.27)

The cost of applying the empty test to C ∩ C_i for all k/2 negative examples is given by Equation 6.28.
    Σ_{i=1}^{k/2} (2k − i + 2) = k^2 + k − Σ_{i=1}^{k/2} i
                               = k^2 + k − (k/2)(k/2 + 1)/2
                               = k^2 + k − (k^2 + 2k)/8
                               = (7/8)k^2 + (3/4)k    (6.28)

The cost for the entire induction task is the translation cost (Equation 6.25) plus the integration cost (O(k)) plus the cost of the empty test (Equation 6.28). The sum of these costs is given by the following equation:

    (k^2 + 3k/2 + 2)(|Σ| + 1) + k + (7/8)k^2 + (3/4)k
        = (15/8 + |Σ|)k^2 + (13/4 + (3/2)|Σ|)k + 2|Σ| + 2
        = O(k^2 |Σ|)

In this task, Σ = {t, f, d} regardless of the number of features, so |Σ| can be treated as a constant. Substituting n for k/2, we get a total cost of O(n^2).

6.4.4 Summary

The improvements of RS-KII over CS-KII on this task are due to representational differences between convex and regular sets. A convex set stores the entire G set and S set. In Haussler's task, the size of the G set doubles with each negative example, so that it contains O(2^n) elements after processing n examples. Representing the same version space as a DFA requires only a constant number of additional states per example. The DFA effectively stores the version space in factored form. Factored convex-set representations have been shown to have similar complexity improvements [Subramanian and Feigenbaum, 1986].

The regular set representation has similar complexity results for a range of hypothesis spaces, including the conjunctive languages with tree-structured features, and to a lesser degree the conjunctive languages with lattice-structured features. For the latter hypothesis spaces, the cross-feature fragmentation is controlled by the DFA representation, but there is also within-feature fragmentation that increases the complexity. However, the complexity is still less than it would be for a completely unfactored representation.

6.5 Discussion

Regular and convex sets overlap in the knowledge they can represent, but neither representation strictly subsumes the other. There are many knowledge fragments that can be expressed as either a regular set or a convex set, but there are also fragments that can be expressed in only one of the representations and not the other.

The relative computational complexities of RS-KII and CS-KII depend on the knowledge being utilized, but for conjunctive languages with tree- and lattice-structured features, both RS-KII and CS-KII have the same worst-case complexity. RS-KII effectively represents the version space in factored form [Subramanian and Feigenbaum, 1986], which makes the cost of enumerating a hypothesis from the intersection of regular sets cheaper than the cost of intersecting equivalent convex sets by about a square factor. The factored representation also allows RS-KII to learn from Haussler's examples [Haussler, 1988] in polynomial time, whereas IVSM and CEA both take exponential time.

Chapter 7

Related Work

7.1 IVSM

Incremental Version Space Merging (IVSM) [Hirsh, 1990] was one of the first knowledge integration systems for induction, and provided much of the motivation for KII. IVSM integrates knowledge by translating each knowledge fragment into a version space of hypotheses consistent with the knowledge, and then intersecting the version spaces for the fragments to get a version space consistent with all of the knowledge. This theoretically allows IVSM to utilize any knowledge that can be expressed as a constraint on the hypothesis space.
In practice, IVSM represents version spaces as convex sets, and this limits the constraints that can be expressed, and therefore the knowledge that can be used. In later work, Hirsh suggests using representations other than convex sets for version spaces, and identifies representations that guarantee polynomial time bounds on the cost of induction [Hirsh, 1992]. KII expands on this work by extending the space of set representations for the version space (i.e., C) from the few suggested by Hirsh to the space of all possible set representations. KII also expands on IVSM by allowing knowledge to be expressed in terms of preferences as well as constraints, thereby increasing the kinds of knowledge that can be utilized. Finally, KII facilitates formalization of the space of set representations by mapping them onto the space of grammars, and using results from automata and formal language theory to establish upper bounds on the expressiveness of set representations. KII strictly subsumes IVSM, in that IVSM can be cast as an instantiation of KII with convex sets (CS-KII).

7.2 Grendel

Grendel [Cohen, 1992] is another cognitive ancestor of KII. The motivation behind Grendel is to better understand the effects of inductive biases on learning by expressing the biases explicitly in the form of a grammar. Grendel expresses the biases as hard constraints on the hypothesis space, and as preferences on the order in which the space is searched. The constraints are expressed as an antecedent description grammar¹, and the preferences are represented by marking productions in the grammar as preferred or deferred. The grammar representing a given collection of biases is constructed by a translator that takes as input all of the biases and outputs the grammar. The language of the grammar is the biased hypothesis space. The grammar is then searched for a hypothesis (string) consistent with the examples. The search is guided by an information gain metric and by the preference markings on the grammar productions.

Grendel can utilize a wide range of knowledge (biases), but it cannot integrate knowledge. The integration work is done by the translator, not by Grendel itself. A translator takes as input all of the biases, and outputs a single grammar. It is not possible to translate the biases independently and integrate the grammars, as is done in KII, because the grammar is essentially context-free and not closed under intersection. In Grendel, each new combination of biases requires a new translator. This is in contrast to KII, where knowledge fragments can be translated and integrated independently, so that it is possible to have a single translator for each knowledge fragment. This allows knowledge to be added or omitted much more flexibly than in Grendel. This independence also means that the knowledge integration effort in KII occurs primarily within KII's integration operator, as opposed to occurring primarily within a single translator, as is the case with Grendel.

KII's greater flexibility in integrating knowledge comes from two sources. First, constraints can be expressed in languages that are closed under intersection, which allows constraints to be specified independently and composed via set intersection. Second, preferences are represented as a separate grammar instead of as orderings on the productions of the constraint grammar, as is the case in Grendel.
This removes 1An antecedent description grammar is essentially a context free grammar. 156 a source of dependence between preferences and the biases encoded in the constraint gram m ar, and provides a potentially more expressive language for the preferences. KII also removes the somewhat arbitrary distinction between biases and exam ples. Grendel treats these as separate entities, but KII treats them both equally. This uniformity facilitates certain analyses—such as determ ining an upper bound on the expressiveness of set representations for which induction is com putable—th at would be more difficult if examples were treated differently than other forms of knowledge. Grendel treats examples in a fixed way, assuming th at they are noise- free and using an information gain m etric to select among the strictly consistent hypotheses. KII allows examples to be translated in more than one way, which per m its other assum ptions about the examples, such as th at they have noise conforming to the bounded inconsistency model. 7.3 B ayesian P aradigm s Buntine [Buntine, 1991] describes a knowledge integration system in which knowl edge is represented as prior probability distributions over the hypotheses. The prior probability for a hypothesis is its prior probability of being the target concept. The hypothesis space is then searched for a hypothesis with the highest prior probability. This is similar to KII, except th at knowledge is expressed as probability distribu tions instead of as constrained optim ization problems. Each of these representations make different trade-offs between expressiveness, efficiency, and ease of constructing translators. W hether the constraint paradigm or the Bayesian paradigm is more appropriate depends on the available knowledge. In the Bayesian paradigm , it may be difficult to find priors th a t adequately express a piece of knowledge. Solving the probability equations to find the m ost probable concept may also be difficult, depending on the distribution. KII is most appropriate when the knowledge can be easily expressed as constraints and pref erences in set representations for which induction is com putable, and preferably polynomial. 157 7.4 D eclarative Specification o f B iases Russell and Grosof [Russel and Grosof, 1987] describe a knowledge integration sys tem for induction in which knowledge, in the form of inductive biases, is translated into determinations. The determ inations and the examples deductively imply the target concept. It is also possible to search for a desired set of determ inations, such as those th at do not overconstrain the solution, and deduce the target concept from these determ inations. This yields the ability to dynamically shift the bias. This system shares w ith KII the idea th a t induction is the process of identifying a hypothesis th at is deductively implied by the knowledge. The inductive leaps come from unsupported assum ptions m ade by the biases (knowledge), and from having to select a hypothesis arbitrarily when more than one hypothesis is deductively implied by the knowledge. A determ ination expresses a bias—th at is, a preference for selecting certain hy potheses over others as the target concept. This can certainly be expressed in term s of constraints and preferences over the hypothesis space. For example, a deter m ination of the form n a tio n a lity ( ? x , ?n) ::> la n g u a g e (? x ,? 
For example, a determination of the form nationality(?x, ?n) ::> language(?x, ?l) can be translated, along with an example nationality(Fritz, German), language(Fritz, German), into a preference for hypotheses that imply language(?x, German) is true whenever nationality(?x, German) is true. The exact form of the preference would depend on the hypothesis space and the preference set representation. Determinations are no more expressive than (H, C, P) tuples, but it may be more natural to express some knowledge in terms of determinations than in terms of constraints and preferences, and vice versa. As with the Bayesian paradigm, the appropriateness of a given framework depends on how naturally the knowledge at hand can be expressed in that framework.

KII can have instantiations at various trade-off points between expressiveness and complexity. This is useful for investigating the effects of knowledge on induction, and for generating induction algorithms that guarantee certain time complexities. The framework of Russell and Grosof cannot make such trade-offs directly. It may be possible to find restricted determination languages with desirable complexity properties, but this is not supported in any principled fashion by the framework. It can, however, shift the bias by selecting which knowledge to utilize. KII has no equivalent capability, although the choice of set representation does determine which biases can be utilized, albeit at a much coarser grain than in Russell and Grosof's system.

7.5 PAC Learning

The PAC learning literature (e.g., [Vapnik and Chervonenkis, 1971], [Valiant, 1984]) investigates, in part, the conditions under which a concept can be learned in polynomial time from a polynomial number of examples. One of the main results from this literature is that all concepts in a family of concepts are polynomially learnable if any given concept in the family can be identified within a given error margin and confidence level by a polynomial number of examples, and the time needed to identify a hypothesis consistent with the examples is polynomial in the number of examples [Valiant, 1984].

The KII framework is concerned with the hypothesis identification half of this result. KII provides operations for identifying hypotheses consistent with a collection of knowledge fragments, not just examples. The set representation determines the complexity of identification, and determines the knowledge that can be expressed. If the representations for C and P are such that a hypothesis can be enumerated from the solution set in polynomial time, then the cost of identification is polynomially bounded for any knowledge expressible in these representations. This complements the PAC results, which deal with the cost of identifying a hypothesis from examples only.

The accuracy of a hypothesis induced from examples depends on the Vapnik-Chervonenkis dimension (VC-dimension) [Vapnik and Chervonenkis, 1971] of the family of concepts to which the target concept belongs. The VC-dimension determines how many examples are needed in order to guarantee that the hypotheses consistent with the examples will have a given level of accuracy. KII has nothing formal to say as yet about the number of knowledge fragments needed to guarantee a certain level of accuracy. This is an area for future research. However, accuracy is generally correlated with the number of correct knowledge fragments, and this depends in turn on the expressiveness of the set representation.
More expressive set 159 representations can represent more of the available knowledge, and thus the accu racy of the learned hypothesis tends to increase with the expressiveness of the set representation. 160 C hapter 8 F uture W ork 8.1 Long Term V ision The ultim ate vision for this work is to provide a general framework for integrating knowledge into induction. I envision a library of translators for different knowledge sources, hypothesis spaces, and set representations, and a library of set representa tions with which to instantiate KII. This would provide the user m axim um flexibility in deciding which hypothesis space to use, which of the available knowledge to use, and in w hat set representation to express the knowledge. The ability to choose a set representation is particularly im portant, since it allows the user to make trade-offs between expressiveness and complexity. If it is im por tan t to utilize all of the knowledge, a very expressive representation may be most appropriate. If complexity is more of an issue, an inexpressive representation with low complexity bounds may be more desirable. The representation also helps the user evaluate the utility of available knowledge fragm ents. If a knowledge fragm ent requires an expressive representation, but the rem aining knowledge can be expressed in a more restricted representation, then the benefits of using the more expressive knowledge may not be worth the added complexity. Instead of om itting overly expensive knowledge altogether, it m ay be possible to use an approxim ation of the knowledge th a t is easier to express. For example, a preference for the globally shortest hypothesis consistent with the knowledge may be difficult to express, but a preference for the locally shortest hypothesis consistent with the knowledge may be easier to express. The set representation provides a way to evaluate the cost of various approxim ations. 161 8.2 Im m ed iate Issues The near term goals are to take steps in this direction by developing RS-KII trans lators for the knowledge used by additional induction algorithm s, and to identify set representations th at guarantee polynomial-time identification. Developing RS-KII translators for additional knowledge sources is needed to un derstand the full scope of RS-KII’s expressiveness, and its practicality as an induc tion algorithm . The expressiveness of regular sets makes it seems likely th at RS-KII can subsume a num ber of existing algorithm s, but this same expressiveness suggests th at RS-K II’s complexity could be worse th an th at of the original, more specialized algorithms. RS-K II’s complexity with respect to AQ-11 and CEA is close to th at of the original algorithm s, and in some cases much better. The expressiveness and complexity of RS-KII, and thus its ultim ate practicality as an induction algorithm , is an area for future research. Another key issue is identifying instantiations of KII th at can integrate n knowl edge fragm ents and enum erate a hypothesis from the solution set in tim e polynomial in n. This would provide a tractable induction algorithm th at can potentially utilize a range of knowledge other than examples. Additionally, the set representation for the instantiation effectively defines a class of knowledge from which hypotheses can be induced in polynomial tim e. This would complement the results in the PAC lit erature, which deal w ith polynomial-time learning from examples only (e.g.,[Vapnik and Chervonenkis, 1971], [Valiant, 1984], [Blummer et al., 1989]). 
Finally, the order in which an induction algorithm searches the hypothesis space is an implicit bias of the algorithm. Hypotheses that occur earlier in the search are preferred over those that come later, since an induction algorithm usually selects the first hypothesis it finds that also satisfies the goal conditions. It is often difficult to express search orderings in terms of binary preferences between hypotheses. In order to determine whether one hypothesis comes before another in the search, it is often necessary to emulate the search. In a few cases, such as best-first or hill-climbing search in certain hypothesis spaces, it is possible to extract the search order from the hypotheses themselves. In a best-first search, this is a simple matter of determining which hypothesis has the better evaluation. In a hill-climbing search, it is more awkward, as evidenced by the LEF translator for AQ-11 described in Section 5.2.2. The issue of expressing search-order biases needs to be investigated in more detail.

One way to circumvent this problem is to replace the search-order bias with a bias that can be more naturally expressed as an (H, C, P) tuple. Since the search order is often an approximation of a more restrictive bias, an alternate approximation may be well justified. In the case of AQ-11, the strict bias is to prefer hypotheses that maximize the LEF. The cost of finding such a hypothesis is prohibitive, so AQ-11 uses a beam search to find a locally maximal hypothesis. It may be possible to find some other approximation of the LEF that can be expressed more naturally as an (H, C, P) tuple.

Chapter 9

Conclusions

KII is a framework for integrating arbitrary knowledge into induction. This theoretically allows all of the available knowledge to be utilized by induction, thereby increasing the accuracy of the induced hypothesis. It also allows hybrid induction algorithms to be constructed by "mixing and matching" the knowledge and implicit biases of various algorithms.

Knowledge is expressed uniformly in terms of constraints and preferences on the hypothesis space, which are expressed as sets. Theoretically, just about any knowledge can be expressed this way, but in practice the set representation determines what knowledge can be expressed, and the cost of inducing a hypothesis from that knowledge. This reflects what seems to be an inherent trade-off between the computational complexity of induction and the breadth of utilizable knowledge.

Instantiations of KII at various trade-off points between complexity and expressiveness can be generated by selecting an appropriate set representation. The space of possible set representations can be mapped onto the space of grammars. This provides a principled way to investigate the space of possible trade-offs, and to establish the trade-off limits. One such limit is that (C × C) ∩ P can be at most context-free, which effectively limits C to the regular languages, and P to the context-free languages. If the ability to integrate knowledge is sacrificed, C can be context-free and P can be regular. Otherwise, the solution set is not computable, so it is not possible to induce a hypothesis from the knowledge.

This expressiveness bound also applies to search in general, and by extension to other induction algorithms. The C set corresponds to the goal conditions, and the P set to the relative "goodness" of goals. The solution set consists of the best goals.
If C and P are too expressive, then it is not possible to find the best goal. The goal conditions must be relaxed, or the goodness ordering must be altered to allow sub-optimal goals. To the extent that other induction algorithms use search to identify the target concept, they are also limited by these bounds.

The vision motivating KII is the desire to integrate arbitrary knowledge into induction. The reality is that complexity tends to increase with expressiveness, which places an ultimate upper bound on expressiveness. RS-KII, an expressive instantiation of KII, was developed to test the range of knowledge that can be practically expressed and integrated within these bounds. RS-KII can utilize the knowledge used by two disparate induction algorithms, AQ-11 and (for some hypothesis spaces) CEA. RS-KII can also utilize noisy examples with bounded inconsistency, and can utilize domain theories. This knowledge can be integrated with the knowledge from AQ-11 and/or CEA, thereby forming hybrid induction algorithms.

It is likely that RS-KII can utilize the knowledge used by many other induction algorithms as well. This would allow RS-KII not only to subsume these algorithms, but also to form hybrid algorithms by "mixing and matching" knowledge from different algorithms. Since RS-KII is expressive, it can also be computationally expensive. However, when RS-KII uses only the knowledge used by AQ-11, its computational complexity is only slightly worse than that of AQ-11. When using only the knowledge of CEA, the complexity of RS-KII is comparable to that of CEA. For at least one collection of knowledge (Haussler's examples), the complexity of CEA is exponential, but the complexity of RS-KII is only polynomial. Similar results may obtain when using the knowledge of other induction algorithms, but developing translators for additional knowledge sources is an area for future work.

Reference List

[Aho et al., 1974] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, MA, 1974.

[Blummer et al., 1989] A. Blummer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 36(4):929-965, 1989.

[Brieman et al., 1984] L. Brieman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, 1984.

[Bundy et al., 1985] A. Bundy, B. Silver, and D. Plummer. An analytical comparison of some rule-learning programs. Artificial Intelligence, 27, 1985.

[Buntine, 1991] W. Buntine. Classifiers: A theoretical and empirical study. In 12th International Joint Conference on Artificial Intelligence, pages 638-644, Sydney, 1991.

[Chomsky, 1959] N. Chomsky. On certain formal properties of grammars. Information and Control, 2, 1959.

[Clark and Niblett, 1989] P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3:261-283, 1989.

[Cohen, 1992] W. W. Cohen. Compiling prior knowledge into an explicit bias. In D. Sleeman and P. Edwards, editors, Machine Learning: Proceedings of the Ninth International Workshop, pages 102-110, Aberdeen, 1992.

[DeJong and Mooney, 1986] G. F. DeJong and R. Mooney. Explanation-based learning: An alternative view. Machine Learning, 1(2):145-176, 1986.
[Dietterich et al., 1982] T. Dietterich, B. London, K. Clarkson, and G. Dromey. Learning and inductive inference. In P. Cohen and E. Feigenbaum, editors, The Handbook of Artificial Intelligence, Volume III. William Kaufmann, Los Altos, CA, 1982.

[Fisher, 1936] R. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179-188, 1936.

[Flann and Dietterich, 1989] N. S. Flann and T. G. Dietterich. A study of explanation-based methods for inductive learning. Machine Learning, 4:187-226, 1989.

[Gordon and desJardins, 1995] D. Gordon and M. desJardins. Evaluation and selection of biases in machine learning. Machine Learning, 20(1/2):5-22, 1995.

[Haussler, 1988] D. Haussler. Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36:177-221, 1988.

[Hirsh, 1990] H. Hirsh. Incremental Version Space Merging: A General Framework for Concept Learning. Kluwer Academic Publishers, Boston, MA, 1990.

[Hirsh, 1992] H. Hirsh. Polynomial-time learning with version spaces. In AAAI-92: Proceedings, Tenth National Conference on Artificial Intelligence, pages 117-122, San Jose, CA, 1992.

[Hopcroft and Ullman, 1979] J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, MA, 1979.

[Kumar and Laveen, 1983] V. Kumar and K. Laveen. A general branch and bound formulation for understanding and synthesizing AND/OR tree search procedures. Artificial Intelligence, 21:179-198, 1983.

[Kumar, 1992] V. Kumar. Search, branch and bound. In Encyclopedia of Artificial Intelligence, pages 1000-1004. John Wiley & Sons, Inc., second edition, 1992.

[Michalski and Chilausky, 1980] R. S. Michalski and R. L. Chilausky. Learning by being told and learning from examples: An experimental comparison of the two methods of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis. Policy Analysis and Information Systems, 4(3):219-244, September 1980.

[Michalski, 1974] R. S. Michalski. Variable-valued logic: System VL1. In Proceedings of the Fourth International Symposium on Multiple-Valued Logic, Morgantown, West Virginia, May 1974.

[Michalski, 1978] R. S. Michalski. Selection of most representative training examples and incremental generation of VL1 hypotheses: The underlying methodology and the descriptions of programs ESEL and AQ11. Technical Report 877, Department of Computer Science, University of Illinois, Urbana, Illinois, May 1978.

[Mitchell et al., 1986] T. Mitchell, R. Keller, and S. Kedar-Cabelli. Explanation-based generalization: A unifying view. Machine Learning, 1:47-80, 1986.

[Mitchell, 1980] T. M. Mitchell. The need for biases in learning generalizations. Technical Report CBM-TR-117, Rutgers University, 1980.

[Mitchell, 1982] T. M. Mitchell. Generalization as search. Artificial Intelligence, 18(2):203-226, March 1982.

[Pagallo and Haussler, 1990] G. Pagallo and D. Haussler. Boolean feature discovery in empirical learning. Machine Learning, 5:71-99, 1990.

[Pazzani and Kibler, 1992] M. J. Pazzani and D. Kibler. The utility of knowledge in inductive learning. Machine Learning, 9:57-94, 1992.

[Pazzani, 1988] M. Pazzani. Learning causal relationships: An integration of empirical and explanation based learning methods. PhD thesis, University of California, Los Angeles, CA, 1988.

[Quinlan, 1986] J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.

[Quinlan, 1990] J. R. Quinlan. Learning logical definitions from relations. Machine Learning, 5:239-266, 1990.
[Rosenbloom et al., 1993] P. S. Rosenbloom, H. Hirsh, W. W. Cohen, and B. D. Smith. Two frameworks for integrating knowledge in induction. In K. Krishen, editor, Seventh Annual Workshop on Space Operations, Applications, and Research (SOAR '93), pages 226-233, Houston, TX, 1993. Space Technology Interdependency Group. NASA Conference Publication 3240.

[Russel and Grosof, 1987] S. Russell and B. Grosof. A declarative approach to bias in concept learning. In Sixth National Conference on Artificial Intelligence, pages 505-510, Seattle, WA, 1987. AAAI.

[Subramanian and Feigenbaum, 1986] D. Subramanian and J. Feigenbaum. Factorization in experiment generation. In Proceedings of the National Conference on Artificial Intelligence, pages 518-522, Philadelphia, PA, August 1986.

[Valiant, 1984] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.

[Vapnik and Chervonenkis, 1971] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971.

[Warshall, 1962] S. Warshall. A theorem on Boolean matrices. Journal of the Association for Computing Machinery, 9(1):11-12, 1962.

[Winston et al., 1983] P. Winston, T. Binford, B. Katz, and M. Lowry. Learning physical descriptions from functional definitions, examples, and precedents. In Proceedings of the National Conference on Artificial Intelligence, pages 433-439, Washington, D.C., August 1983. AAAI, Morgan-Kaufmann.