
1 Introduction

Pattern mining [1, 13] aims at discovering descriptive rules learned from data. Redescription mining [19] shares this goal, but seeks multiple descriptions of the same patterns by using two or more disjoint sets of descriptive attributes, which are finally presented to the user. It is an unsupervised, descriptive knowledge discovery task. This analysis allows finding similarities between different elements and connections between different descriptive attribute sets (views), which ultimately lead to a better understanding of the underlying data. Redescription mining is highly applicable in biology, economics, pharmacy, ecology and many other fields where it is important to understand connections between different descriptors and to find regularities that are valid for different element subsets. Redescriptions are represented in the form of rules, and the aim is to make these rules understandable and interpretable.

The field of redescription mining was introduced by Ramakrishnan et al. [19]. Their paper presents a novel algorithm to mine redescriptions based on decision trees, called CARTwheels. The algorithm works by building two decision trees (one for each view) that are joined in the leaves. Redescriptions are found by examining the paths from the root node of the first tree to the root node of the second, and the algorithm uses multi-class classification to guide the search between the two views. Other approaches to mining redescriptions include the approach proposed by Zaki and Ramakrishnan [23], which uses a lattice of closed descriptor sets to find redescriptions. Further, Parida and Ramakrishnan [17] introduce algorithms for mining exact and approximate redescriptions, and Gallo et al. [10] present the greedy and the MID algorithm based on frequent itemset mining.

Galbrun and Miettinen [6] present a novel greedy algorithm for mining redescriptions. In this work they extend the greedy approach of Gallo et al. [10] to work on numeric data, since all previous approaches worked only on Boolean data. Redescription mining was extended by Galbrun and Kimmig to the relational setting [5] and by Galbrun and Miettinen to the interactive setting [8]. Recently, two novel tree-based algorithms were proposed by Zinchenko [24]; they explore the use of decision trees in a non-Boolean setting and present different methods of layer-by-layer tree construction, which allows making informed splits based on nodes at each level of the tree.

In this work, we explore the creation and iterative improvement of redescription sets containing a user-defined number of redescriptions. With this goal in mind, we developed a novel algorithm for mining redescriptions based on multi-target predictive clustering trees (PCTs) [3, 14]. Our approach uses multi-target classification or regression to find highly accurate, statistically significant redescriptions, which differentiates it from other tree-based approaches, especially the CARTwheels approach. Each node in a tree represents a separate rule that is used as a target in the construction of a PCT from the opposite view. Using multi-target PCTs allows us to build one model to find multiple redescriptions using nodes at all levels of the tree; further, it allows finding features that are connected with multiple target features (rules); and finally, due to inductive transfer [18], multi-target trees can outperform single-label classification or regression trees. We have developed a procedure for rule minimization that finds the smallest subset of attributes describing a given pattern, which yields shorter rules even when trees of greater depth are used. The approach is related to multi-view [2] and multilayer [11] clustering, though the main goal here is to find accurate redescriptions of interesting subsets of data, while clustering tends to find clusters that are not always easy to interpret.

After introducing the necessary notation (Sect. 2), we present the algorithm, introduce the procedure for rule minimization and analyse the run-time of the redescription mining process (Sect. 3). We use the algorithm to iteratively improve a redescription set describing 199 different world countries based on their trading behaviour [21] and general country information [22] for the year 2012 (Sect. 4). The main focus is on rules containing only logical conjunction operators, since these rules are the most interpretable and easiest to understand. In Sect. 5 we analyse redescription sets mined with a state-of-the-art redescription mining algorithm, optimize redescription sets of equal size with our approach, compare these sets using several criteria and discuss the results. Finally, we conclude and outline directions for future work in Sect. 6.

2 Notation and Definitions

Redescription mining in general considers redescriptions constructed on a set of views \(\{W_1,W_2,\dots ,W_n\}\); in this paper, however, we use only two views \(\{W_1,W_2\}\). The corresponding attribute (variable) sets are denoted by \(V_1\) and \(V_2\). Both views contain the same set of |E| elements but two different sets of attributes, of sizes \(|V_1|\) and \(|V_2|\). The value \(W_1(i,j)\) is the value of element \(e_i\) for the attribute \(a_j\) in view \(W_1\). The data \(D=(V_1,V_2,E,W_1,W_2)\) is a quintuple of the attribute sets, the element set, and the corresponding view mappings. A query (denoted q) is a logical formula over the attributes of one view; a query \(q_1\) contains literals from \(V_1\) and a query \(q_2\) contains literals from \(V_2\). The set of elements described by a query is called its support. A redescription \(R=(q_1, q_2)\) is defined as a pair of queries, one for each view in the data. The support of a redescription is the set of elements supported by both queries that constitute this redescription: \(supp(R)=supp(q_1)\cap supp(q_2)\). We use attr(R) to denote the multiset of attributes used in the redescription R. The accuracy of a redescription \(R=(q_1,q_2)\) is measured using the Jaccard coefficient (Jaccard similarity index):

$$\begin{aligned} JS(R)=\frac{|supp(q_1) \cap supp(q_2)|}{|supp(q_1)\cup supp(q_2)|} \end{aligned}$$

The Jaccard coefficient is not the only measure used in the field, because it is possible to obtain redescriptions covering huge element subsets that necessarily have a very good overlap of their queries. In such cases it is preferable to have redescriptions that reveal more specific knowledge about the studied problem, which is harder to obtain by random sampling from the underlying data distribution. This is why we compute the statistical significance (p-value) of each obtained redescription. We denote the marginal probabilities of the queries \(q_1\) and \(q_2\) with \(p_1=\frac{|supp(q_1)|}{|E|}\) and \(p_2=\frac{|supp(q_2)|}{|E|}\), respectively. We define the set of elements in the intersection of the queries as \(o=supp(q_1)\cap supp(q_2)\). The corresponding p-value [9] is defined as

$$\begin{aligned} pV(q_1,q_2)=\sum _{n=|o|}^{|E|} \binom{|E|}{n}(p_1\cdot p_2)^n\cdot (1-p_1\cdot p_2)^{|E|-n} \end{aligned}$$
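To make these two measures concrete, the following minimal Python sketch computes the Jaccard coefficient and the p-value directly from the query supports represented as sets of element identifiers. The function names and the toy queries are illustrative and not part of the original implementation.

```python
from math import comb

def jaccard(supp_q1, supp_q2):
    """Jaccard similarity of two query supports given as sets of element ids."""
    union = supp_q1 | supp_q2
    return len(supp_q1 & supp_q2) / len(union) if union else 0.0

def p_value(supp_q1, supp_q2, n_elements):
    """Binomial tail: probability that two random queries with the same marginal
    probabilities share at least |o| elements (the pV formula above)."""
    p = (len(supp_q1) / n_elements) * (len(supp_q2) / n_elements)
    o = len(supp_q1 & supp_q2)
    return sum(comb(n_elements, n) * p**n * (1 - p)**(n_elements - n)
               for n in range(o, n_elements + 1))

# Toy example: 10 elements, queries overlapping on 3 of them.
q1, q2 = {0, 1, 2, 3, 4}, {2, 3, 4, 5}
print(jaccard(q1, q2), p_value(q1, q2, 10))
```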

The p-value tells us whether we can dismiss the null hypothesis, which assumes that the given subset of elements was obtained by joining two random rules with marginal probabilities equal to the fraction of covered elements. If the obtained p-value is lower than some predefined threshold, called the significance level, this null hypothesis should be rejected. This is a somewhat optimistic criterion, since the assumption that all elements can be sampled with equal probability need not hold for all datasets.

3 The CLUS-RM Algorithm

In this section, we describe the algorithm for mining redescriptions, named CLUS-RM, which at each step improves a redescription set of user-defined size. The algorithm uses multi-target predictive clustering trees (PCTs) [3, 14] to create a cluster hierarchy that is later transformed into redescriptions. We start by explaining the pseudo code of the algorithm (Algorithm 1) and then go into the details of each procedure in the algorithm.

Algorithm 1. The CLUS-RM algorithm (pseudo code).

The algorithm starts by creating initial clusters for both views (lines 2 and 3 in Algorithm 1), which is achieved by transforming a non-labelled dataset into a labelled dataset of positive elements and artificially generated negative elements. For each element in the original view, we construct one negative, synthetic element (see Fig. 1) in such a way that the original correlations among the attributes are broken. We achieve this by randomly shuffling attribute values between the elements. The procedure allows experimentation with the number of shuffling steps and the number of attributes that are copied from the original elements to the artificial element. Complete randomization is achieved when the number of shuffling steps equals the number of attributes in the dataset and exactly one attribute value is copied to the artificial element at each step from a randomly chosen original element. The original elements are assigned a target label of 1.0, while the artificial elements are assigned a target label of 0.0 (see Table 1). The division between the original and the artificial elements (an idea previously used in [11]) allows us to construct a cluster hierarchy while simultaneously creating descriptions of the original elements. The described procedure is one possible way to construct the initial clusters; other approaches include assigning a random target attribute or using clusters computed by some other clustering algorithm. However, the initialization procedure used in our algorithm preserves strong (specific) connections and correlations that exist in the original data, which are broken when an approach that assigns random target labels is used.

Table 1. Creation of artificial elements for the random initialization procedure.
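The complete-randomization variant of this initialization can be sketched in a few lines of Python: each artificial row copies every attribute value from a randomly chosen original element, breaking the attribute correlations. This is a minimal illustration under the assumptions stated in the comments, not the authors' implementation.

```python
import random

def make_artificial_elements(view, seed=None):
    """One artificial (negative) element per original element: every attribute
    value is copied from a randomly chosen original element, corresponding to
    the complete-randomization setting that breaks attribute correlations."""
    rng = random.Random(seed)
    n_rows, n_cols = len(view), len(view[0])
    return [[view[rng.randrange(n_rows)][j] for j in range(n_cols)]
            for _ in range(n_rows)]

# Original elements get target label 1.0, artificial elements 0.0 (cf. Table 1).
view1 = [[1.2, 0.0, 3.4], [0.5, 1.1, 2.2], [0.9, 0.3, 1.7]]
negatives = make_artificial_elements(view1, seed=42)
targets = [1.0] * len(view1) + [0.0] * len(negatives)
```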

After creating the initial dataset, we build predictive clustering trees on both views by performing regression on the target label and using the other attributes as descriptive. The decision to use regression trees instead of classification trees is purely technical: it generates more rules because of the additional threshold associated with the target variable. These trees are converted to rules that describe element sets and are necessary for the next step of the algorithm. The rule lists \(RW_1\) and \(RW_2\) contain the generated rules; a new rule is added to a list if it differs from all other rules in a predefined number of attributes or if it describes a new, unique element subset (the \(\cup _{*}\) operator in Algorithm 1). The iterative process of the algorithm begins right after rule creation. Here, we create targets based on the rules obtained in the previous step or in the initialization step. The rules obtained by predictive clustering on \(W_1\) are used to build targets for clustering on \(W_2\) (denoted \(W_1T_1\), \(W_1T_2\)), and vice versa. For each element in the dataset we assign the label 1.0 if the element is described by the specific rule, and 0.0 otherwise (see Table 2). For example, the attribute \(W_2T_1\) from the dataset for view 1 represents the condition \(IF\ W_2A_1=TRUE\) (constructed on the dataset for view 2), which describes elements \(E_1\), \(E_2\), \(E_4\), \(E_5\). By placing this target attribute in the view 1 dataset, we guide the PCT construction to create a cluster containing and describing the same set of elements with descriptive variables of view 1 (a choice that satisfies this condition is \(IF\ W_1A_3>0\)).

Table 2. Intermediate generation of labels based on discovered rules.
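As an illustration of this target construction, the following Python sketch mirrors Table 2 and turns rules mined on view 2 into binary target columns for the view 1 dataset. The rule names and element identifiers are illustrative only.

```python
def rule_targets(rules, elements):
    """One binary target column per rule from the opposite view:
    1.0 if the rule's query describes the element, 0.0 otherwise."""
    return {name: [1.0 if e in support else 0.0 for e in elements]
            for name, support in rules.items()}

# Toy example: rules mined on view 2 become targets (here named W2T1, W2T2)
# that guide PCT construction on view 1.
elements = ["E1", "E2", "E3", "E4", "E5"]
rules_view2 = {"W2T1": {"E1", "E2", "E4", "E5"}, "W2T2": {"E3"}}
targets_for_view1 = rule_targets(rules_view2, elements)
# {'W2T1': [1.0, 1.0, 0.0, 1.0, 1.0], 'W2T2': [0.0, 0.0, 1.0, 0.0, 0.0]}
```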

Rules obtained in the previous step are combined into redescriptions if they satisfy a given set of constraints ConstSet. The set of constraints consists of the minimal Jaccard coefficient (minJS), the maximum allowed p-value (maxPval), and the minimum and maximum support (\(minSupp,\ maxSupp\)), which have to be satisfied for a redescription to be considered as a candidate for the redescription set.

3.1 The Procedure for Creating Redescriptions

The algorithm for creating redescriptions from rules (Algorithm 2) joins view 1 rules (or their negations, if allowed by the user) with view 2 rules (or their negations) (see Fig. 1 and line 2 in Algorithm 2). We distinguish three cases of creating redescriptions from rules (expansion types):

  1. Unguided initial: \( UInit\leftarrow (RW_1 \times ^{opSet\backslash \{\vee \}}_{ConstSet} RW_2)\)

  2. Unguided: \(U\leftarrow (RW_{1_{newRuleIt}} \times ^{opSet\backslash \{\vee \}}_{ConstSet} RW_{2_{newRuleIt}})\)

  3. Guided: \(G\leftarrow (RW_{1_{newRuleIt}} \times ^{opSet\backslash \{\vee \}}_{ConstSet} RW_{2_{oldRuleIt}}) \cup (RW_{1_{oldRuleIt}}\times ^{opSet\backslash \{\vee \}}_{ConstSet} RW_{2_{newRuleIt}})\)

The \(\times ^{opSet}_{ConstSet}\) operator denotes a Cartesian product of two sets that allows the use of logical operators from opSet and keeps only those redescriptions that satisfy the given set of constraints ConstSet. The unguided expansion allows obtaining redescriptions with more diverse subsets of elements, which can later be improved through the iteration process.
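A minimal Python sketch of this filtered Cartesian product for the conjunction-only case is given below. The constraint thresholds are the illustrative values used later in the experiments, and the function is a sketch rather than a definitive implementation.

```python
from itertools import product
from math import comb

def combine_rules(rules_w1, rules_w2, n_elements,
                  min_js=0.6, max_pval=0.01, min_supp=5, max_supp=120):
    """Filtered Cartesian product: conjunctively pair every view-1 rule with
    every view-2 rule and keep only pairs satisfying ConstSet."""
    kept = []
    for (name1, s1), (name2, s2) in product(rules_w1.items(), rules_w2.items()):
        supp, union = s1 & s2, s1 | s2
        js = len(supp) / len(union) if union else 0.0
        p = (len(s1) / n_elements) * (len(s2) / n_elements)
        pval = sum(comb(n_elements, n) * p**n * (1 - p)**(n_elements - n)
                   for n in range(len(supp), n_elements + 1))
        if min_supp <= len(supp) <= max_supp and js >= min_js and pval <= max_pval:
            kept.append((name1, name2, js, pval))
    return kept
```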

Fig. 1. Illustration of rule and redescription construction and iterations.

Algorithm 2. The procedure for creating redescriptions from rules (pseudo code).

The algorithm first finds numRed redescriptions and then iteratively enriches this set by exchanging the redescription with the worst comparative score with a newly created redescription (lines 3–14 in Algorithm 2). The algorithm uses four arrays (elFreq, attrFreq, redScoreEl, redScoreAt) to incrementally add and improve redescriptions in the redescription set. The element/attribute frequency arrays contain the number of times each element/attribute from the dataset occurs in redescriptions from the redescription set. Redescription scores are computed as \(redScoreEl(R)= \sum _{e\in supp(R)} (elFreq[e]-1)\) and \(redScoreAt(R)= \sum _{a\in attr(R)} (attrFreq[a]-1)\). The score of a new redescription is computed in the same way, using the existing frequencies from the set. If the algorithm finds a redescription \(R'\) and an \(R_i=argmax_{R\in \mathcal {R},\ R.pval\ge R'.pval}\ score(R',R)\), where \(score(R',R)=\frac{(1.0-R'.elSc)+(1.0-R'.attrSc)+R'.JS}{3}-\frac{(1.0-R.elSc)+(1.0-R.attrSc)+R.JS}{3}\), all arrays are updated so that the frequencies of elements described by \(R_i\) and attributes contained in its queries are decreased by one, while the frequencies of elements and attributes associated with \(R'\) are increased. This score favours redescriptions that describe elements with low frequency using infrequent attributes. At the same time, it finds redescriptions that are as accurate and significant as possible.
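The Python sketch below outlines one possible reading of this replacement step. The normalisation of the element and attribute scores (division by the support and query sizes) is an assumption made for illustration; the paper only specifies the frequency-based sums, so this is a sketch rather than the authors' exact scoring.

```python
from dataclasses import dataclass

@dataclass
class Red:
    support: set   # elements covered by the redescription
    attrs: list    # multiset of attributes used in its queries
    js: float      # Jaccard coefficient
    pval: float    # p-value

def quality(red, el_freq, attr_freq):
    # elSc/attrSc: frequency-based scores, here normalised by the number of
    # elements/attributes (an assumption; the paper leaves the scaling implicit).
    el_sc = sum(el_freq.get(e, 0) - 1 for e in red.support) / max(len(red.support), 1)
    at_sc = sum(attr_freq.get(a, 0) - 1 for a in red.attrs) / max(len(red.attrs), 1)
    return ((1.0 - el_sc) + (1.0 - at_sc) + red.js) / 3.0

def try_replace(red_set, cand, el_freq, attr_freq):
    """Swap cand with the existing redescription against which it improves the
    comparative score most, considering only redescriptions whose p-value is
    not better than cand's, and update the frequency arrays."""
    candidates = [r for r in red_set if r.pval >= cand.pval]
    if not candidates:
        return red_set
    worst = min(candidates, key=lambda r: quality(r, el_freq, attr_freq))
    if quality(cand, el_freq, attr_freq) <= quality(worst, el_freq, attr_freq):
        return red_set
    for e in worst.support: el_freq[e] -= 1
    for a in worst.attrs:   attr_freq[a] -= 1
    for e in cand.support:  el_freq[e] = el_freq.get(e, 0) + 1
    for a in cand.attrs:    attr_freq[a] = attr_freq.get(a, 0) + 1
    red_set[red_set.index(worst)] = cand
    return red_set
```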

Element weighting has been used before in subgroup discovery [12, 15] to model covering importance for elements. Our approach is similar, but it uses a different weighting mechanism, adapts it to the redescription mining setting by combining element and attribute weights, and incorporates it into the framework of iterative redescription set refinement, in which some redescriptions can be replaced with more suitable candidates.

The algorithm can use three types of logical operators (disjunction, conjunction and negation). The disjunction operator is used to increase redescription accuracy and support (lines 15–26 in Algorithm 2). For a redescription \(R=(q_1,q_2)\), we find rules r that maximize:

  1. \(JS(supp(q_1\vee r) \backslash supp(R), supp(q_2)\backslash supp(R))\)

  2. \(JS(supp(q_1\vee \lnot r) \backslash supp(R), supp(q_2)\backslash supp(R))\)

  3. \(JS(supp(q_1) \backslash supp(R), supp(q_2\vee r)\backslash supp(R))\)

  4. \(JS(supp(q_1) \backslash supp(R), supp(q_2 \vee \lnot r)\backslash supp(R))\)

The rule r is found so that it covers elements that are supported by \(q_2\) but not by \(q_1\) (\(R.maxRef(r'),\ r'\in RW_1\)) and vice versa.
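A small Python sketch of the first of these four cases is given below: among candidate view-1 rules, it picks the one whose disjunction best recovers elements covered by \(q_2\) but missed by \(q_1\). The function and variable names are illustrative.

```python
def best_disjunctive_refinement(q1_supp, q2_supp, candidate_rules):
    """Case 1 above: pick the view-1 rule r maximising the Jaccard of
    (supp(q1) or supp(r)) minus supp(R) against supp(q2) minus supp(R),
    i.e. how well r recovers elements covered by q2 but missed by q1."""
    covered = q1_supp & q2_supp          # supp(R)
    target = q2_supp - covered           # elements q1 should additionally cover
    def gain(r_supp):
        left = (q1_supp | r_supp) - covered
        union = left | target
        return len(left & target) / len(union) if union else 0.0
    if not candidate_rules:
        return None
    name, supp = max(candidate_rules.items(), key=lambda kv: gain(kv[1]))
    return name, gain(supp)
```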

3.2 Rule Size Minimization

The rule minimization procedure is applied in the final step of redescription set creation. The main goal of this procedure is to find, for every rule contained in a redescription, a minimal attribute set that describes the same pattern as the original redescription. This leads to better understandability and readability of the returned redescriptions.

The method minimizes conjunctive formulas \(F=v_1\wedge v_2\wedge v_3 \wedge v_4 \wedge \dots \wedge v_n\), where each \(v_i\) denotes one literal, of the form \(v_i=c\) in the case of Boolean or categorical attributes, or \(c_1 \le v_i \le c_2\) in the case of numerical attributes. The procedure takes each \(v_i\) in turn, computes \(\mathcal {S}_{v_i}=supp(v_i)\backslash supp(F)\) and then finds a minimal set \(\mathcal {T}=\{v_k,\dots , v_m\}\) such that \(\forall e\in \mathcal {S}_{v_i}, \ \exists v_j\in \mathcal {T},\ e\notin supp(v_j)\) and \(\bigcap _{v_k\in \mathcal {T}} supp(v_k) = supp(F)\) (see Fig. 2). The procedure returns a family of sets \(\mathcal {F}=\{\mathcal {T}_i,\ i=1,\dots , n \}\) and chooses as the representative the set containing the smallest number of attributes.
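The sketch below captures the idea with a simple greedy search for a smallest literal subset whose conjunction keeps exactly supp(F). It abstracts away the per-literal construction of the family of sets described above, so it illustrates the goal rather than the paper's exact procedure.

```python
def minimize_rule(literal_supports, n_elements):
    """Greedily keep a subset of literals whose conjunction still has exactly
    the support of the full conjunctive rule F (a sketch of the idea only)."""
    universe = set(range(n_elements))
    supp_F = set.intersection(*literal_supports.values())
    to_exclude = universe - supp_F        # elements the kept literals must reject
    kept, remaining = [], dict(literal_supports)
    while to_exclude:
        # pick the literal that rejects the most still-unrejected elements
        name, supp = max(remaining.items(), key=lambda kv: len(to_exclude - kv[1]))
        kept.append(name)
        to_exclude &= supp                # elements still accepted by all kept literals
        del remaining[name]
    return kept

# Toy example: v3 is redundant because v1 and v2 already pin down supp(F).
lits = {"v1": {0, 1, 2, 3}, "v2": {1, 2, 3, 4}, "v3": {1, 2, 3, 5}}
print(minimize_rule(lits, 6))   # e.g. ['v1', 'v2']
```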

Fig. 2. Rule minimization procedure.

The procedure is related to the procedure for finding a minimal set of generators in [23]. However, it is constructed with the purpose of minimizing rules contained in already constructed redescriptions, whereas the minimal set of generators is used to construct redescriptions, which requires computing a closed lattice of descriptors.

3.3 Algorithm Time Complexity

In this subsection we analyse the algorithm's time complexity. We start from the known result [20] that predictive clustering tree construction has a worst-case time complexity of \(O(z\cdot m\cdot |E|^2)\) to completely induce the tree, where m denotes the number of descriptive variables in a selected view and z the total number of internal nodes in the tree.

We use the HashSet and HashMap data structures with open addressing to store elements; these have an expected time complexity of O(1) for add, remove, contains and size, assuming the hash function behaves in a sufficiently random manner (uniform hashing).

The initialization step has a complexity of \(O(|E|\cdot (|V_1|+|V_2|))\) and the PCT-to-rules transformation has a complexity of O(z). Creating a redescription via extraction/filtering of a pair obtained from the Cartesian product of two rule sets has a worst-case time complexity of \(O(n + n')\), where n equals the number of elements covered by the rule created on \(W_1\) and \(n'\) denotes the number of elements covered by the rule created on \(W_2\). To compute the Cartesian product of two rule sets we make \(\sum _{i \in R_L} \sum _{j \in R_R} (n_i + n_j)\) steps. As both \(n\le |E|\) and \(n'\le |E|\), the worst-case time complexity of this step is \(O(z^2\cdot |E|)\). However, if we have a balanced tree, the complexity is closer to \(O(z\cdot d\cdot |E|)\), where d equals the tree depth. Updating the attribute and element frequency tables and the total redescription scores has a complexity of O(|E|). The computation of rules containing negation and disjunction operators has a complexity of \(O(z\cdot |E|)\).

The minimization procedure has a time complexity of \(O(|\mathcal {R}|\cdot ((a+a')\cdot |E|+(a^3+a'^3)\cdot |E|))\), where a and \(a'\) represent the numbers of attributes in the redescription rules, which are bounded by the tree depth d (or a constant multiple of d in the case of rules containing disjunctions). As the number of elements in the support of such constrained attributes is much smaller than |E|, the worst-case time complexity is \(O(d^3\cdot |E|)\).

The total algorithm time complexity is \(O(|E|\cdot (|V_1|+|V_2|)+z\cdot |V_1|\cdot |E|^2+z\cdot |V_2|\cdot |E|^2+2\cdot z+z^2\cdot |E|+z^2\cdot |E|+2\cdot z\cdot |E|+d^3\cdot |E|)\), which is \(O(z\cdot (|V_1|+|V_2|)\cdot |E|^2+z^2\cdot |E|)\). The pessimistic worst-case time complexity, assuming an inadequate hash function, is \(O(z^2\cdot |E|^2+z\cdot (|V_1|+|V_2|)\cdot |E|^2)\).

Optimizations that could speed up the computation of redescriptions include rule indexing, which would allow combining only those rules certain to cross the user-defined thresholds, and Locality Sensitive Hashing [4].

4 Mining Redescriptions on Data Describing Countries

We present the experimental results of mining redescriptions with our algorithm on data describing 199 world countries in the year 2012 ([11, 21, 22]). The dataset has two views, both containing numerical attributes with possible missing values. One view contains 312 attributes representing the importance of import and export of different commodities for countries, while the second view contains 49 attributes with country information provided by the World Bank.

Several techniques for computing the Jaccard coefficient when the data contain missing values are described in [6]. We compute the Jaccard coefficient guided by the principle that an element cannot be in the support of a rule containing only conjunction operators if it has missing values for some of the attributes occurring in the rule's conditions. We use the notation from [6] to denote \(E_{1,1}=supp(q_1)\cap supp(q_2)\), \(E_{1,0}= supp(q_1)\backslash supp(q_2)\), \(E_{0,1}=supp(q_2)\backslash supp(q_1)\), \(E_{1,?}=supp(q_1)\cap missing(q_2)\), \(E_{?,1}=missing(q_1)\cap supp(q_2)\), where \(R=(q_1,q_2)\) and missing(q) represents the set of elements for which we cannot determine whether they are in the support of q due to missing values. We define the Jaccard coefficient as:

$$\begin{aligned} JS_m(R)=\frac{|E_{1,1}|}{|E_{1,0}|+|E_{0,1}|+|E_{1,1}|+|E_{1,?}|+|E_{?,1}|} \end{aligned}$$

It holds that \(JS_{pes}(R) \le JS_m(R)\le JS_{opt}(R)\), where \(JS_{opt}\) and \(JS_{pes}\) denote optimistic and pessimistic estimate of JS when dealing with missing values.
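A small Python sketch of \(JS_m\) is given below. To avoid double counting, elements with missing values for the other query are counted only in the \(E_{1,?}\) and \(E_{?,1}\) terms; this is an interpretation of the definitions above rather than the authors' code.

```python
def js_missing(supp1, miss1, supp2, miss2):
    """JS_m: elements whose membership in the other query is unknown
    (E_{1,?}, E_{?,1}) enter only the denominator; determined disagreements
    (E_{1,0}, E_{0,1}) are taken to exclude the unknown ones."""
    e11 = supp1 & supp2
    e10 = supp1 - supp2 - miss2
    e01 = supp2 - supp1 - miss1
    e1q = supp1 & miss2
    eq1 = miss1 & supp2
    denom = len(e10) + len(e01) + len(e11) + len(e1q) + len(eq1)
    return len(e11) / denom if denom else 0.0
```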

The algorithm was tested with 50, 200 and 800 iterations, and only rules containing the conjunction operator were allowed. For each number of iterations, we performed 10 runs of the algorithm, computed redescription sets containing 50 redescriptions, and measured the average Jaccard coefficient and the average redescription support. Allowed redescription supports were in the range [5, 120], the maximum p-value was 0.01 and the minimum Jaccard coefficient was 0.6. We used complete randomization in the initialization procedure.

Fig. 3. A summary of the results for different numbers of iterations (top to bottom: 800, 200, 50): (a) average redescription support size, (b) average Jaccard coefficient, (c) fraction of all elements described by a redescription.

Figure 3 shows that with an increased number of iterations the algorithm finds redescriptions with higher accuracy, but describing smaller subsets of countries. The mean overall coverage of elements by the redescription set varies between \(47\,\%\) and \(53\,\%\). This indicates that the algorithm managed to find highly accurate redescriptions describing a significant fraction of all elements in the dataset.

We demonstrate one highly accurate, statistically significant redescription mined on the Country dataset. Several additional examples can be seen in [16].

(The queries of this example redescription are shown as a figure in the original.)

This redescription describes 14 world countries (United Kingdom, Switzerland, Sweden, Spain, Singapore, Netherlands, Malta, Luxembourg, Germany, France, Finland, Denmark, Cyprus and Austria) with a Jaccard coefficient of 1.0. For these countries, vulnerable employment lies in \([5.6,12.5]\%\), the percentage of population aged \(0-14\) is in \([13.2,18.3]\%\), and domestic credit to the private sector is in \([99.2,305.1]\%\) of GDP. In addition, the export to import ratio of medicinal and pharmaceutical products is in \([0.9,4.6]\%\), the export to import ratio of basic food is in \([0.4, 1.5]\%\), and this ratio for pulp and waste paper is in \([0.3,859.2]\%\). This is a statistically highly significant redescription with a p-value of \(1.5\cdot 10^{-13}\); it contains 3 descriptive variables for view 1 and 3 variables for view 2. Based on its rule size, it is a medium-sized redescription.

5 Algorithm Evaluation and Comparison

In this section, we compare the rules produced by our algorithm with those of the current state-of-the-art algorithm ReReMi, described in [9]. We used the Siren tool [7] to perform redescription mining with the ReReMi algorithm on the Country dataset described in Sect. 4. The layered/split tree algorithms (described in [24]) currently do not work with data that contain missing values.

In the literature, redescription mining algorithms have mainly been compared by selecting and discussing properties of individual redescriptions. We attempt an objective evaluation of redescription sets produced by different algorithms by using the same set of constraints to construct redescriptions. Another condition we impose is that the final redescription sets have the same size. This is done by first finding a redescription set with ReReMi, and then forcing the same redescription set size on CLUS-RM, since it produces many more redescriptions than the ReReMi algorithm.

We divided the results based on the operators allowed for query construction. In the first experiment we allow disjunctions, conjunctions and negations; in the second experiment only conjunctions. For ReReMi, we used max product buckets = 200 and max number of pairs = 500 when using all logical operators, and max number of pairs = 1000 when using only conjunctions. We also allowed a maximum of 15 variables per query. Redescriptions were required to have a maximal \(p\)-value of 0.01, a minimal Jaccard coefficient of 0.5 and a minimal support of 5 elements. After obtaining redescriptions with the ReReMi algorithm, we used the Filter redundant redescriptions option to remove duplicate and redundant redescriptions, with the max overlap option set to 0.99. For each redescription set, we optimized a redescription set of the same size using the CLUS-RM algorithm. We used 800 iterations, keeping the constraints on the Jaccard coefficient, the \(p\)-value and the support. The maximum allowed average tree depth was set to 8 and we used complete randomization in the initialization procedure.

For the generated redescription sets, we plot comparative boxplots for the Jaccard coefficient, the \(log_{10}\) of the \(p\)-value, the element overlap, the attribute overlap and the rule size. The element overlap is the average Jaccard coefficient of the elements covered by one redescription with respect to all other redescriptions in the redescription set; similarly, the attribute overlap is the average Jaccard coefficient of the attributes contained in a redescription's queries compared to every other redescription in the set. To emphasize the importance of redescription size for understandability (\(|attr(R)|\ge 20\) is considered highly complex to understand), we calculate the normalized redescription size as follows:

$$ R_{size}= \begin{cases} \frac{|attr(R)|}{20}, & |attr(R)|< 20\\ 1, & 20\le |attr(R)| \end{cases} $$
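The two set-level overlap measures and the normalized size can be sketched in a few lines of Python; the names are illustrative and this is not the evaluation code used in the paper.

```python
def normalized_size(n_attrs):
    """R_size as defined above: linear up to 20 attributes, capped at 1."""
    return min(n_attrs / 20.0, 1.0)

def average_overlap(own, others):
    """Average Jaccard of one redescription's element (or attribute) set
    against every other redescription in the set."""
    if not others:
        return 0.0
    return sum(len(own & o) / len(own | o) if (own | o) else 0.0
               for o in others) / len(others)
```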

To obtain comparable results, we optimized \(JS_{pes}\) with the ReReMi algorithm and then recalculated the score for each redescription to obtain \(JS_{m}\).

Fig. 4. Comparison of the CLUS-RM and ReReMi algorithms on two redescription sets: one constructed using disjunctions, conjunctions and negations (120 redescriptions) and one using only the conjunction operator (36 redescriptions).

Figure 4 shows that CLUS-RM found statistically significant redescriptions with a higher Jaccard coefficient than those produced by the ReReMi algorithm. Due to its goal of finding highly accurate but minimally overlapping redescriptions in terms of elements and attributes, it found redescriptions with smaller support when conjunctions, disjunctions and negations were allowed. Importantly, this was achieved with redescriptions that mostly have a smaller query size than those produced by ReReMi. We report two more statistics: the element coverage (the percentage of all elements described by at least one redescription) and the attribute coverage (the percentage of attributes used in redescription rules). CLUS-RM described \(99\,\%\) of the elements while ReReMi described \(100\,\%\). CLUS-RM used \(47\,\%\) of all attributes in its rules and ReReMi used \(41\,\%\).

The evaluation on the redescription sets constructed using only the conjunction operator showed that CLUS-RM produced redescriptions with a higher Jaccard coefficient, higher support and smaller \(p\)-value than the ReReMi algorithm. As a consequence, CLUS-RM has a higher element overlap but also somewhat smaller query sizes in its redescriptions. Attribute overlaps are comparable between the approaches. CLUS-RM covered \(25\,\%\) of the elements and the ReReMi algorithm \(53\,\%\), while the attribute coverage is \(27\,\%\) and \(36\,\%\), respectively.

The CLUS-RM approach produces redescriptions containing mainly conjunction operators, while the ReReMi approach mostly uses disjunctions when allowed. The redescription sets obtained with CLUS-RM contained highly accurate, statistically significant, mostly non-overlapping redescriptions. There are two techniques available for obtaining redescriptions with higher support with the CLUS-RM algorithm: increasing the minimal support or increasing the redescription set size. We believe that the proposed approach complements the ReReMi approach by finding many significant conjunction-based redescriptions.

6 Conclusion

This work introduces a novel redescription mining framework that optimizes a redescription set of user-defined size. The algorithm is based on multi-target predictive clustering trees, which allows using the element coverage of rules constructed on one view as targets for the other view. Produced redescriptions incrementally improve the redescription set according to a predefined set of criteria (the Jaccard coefficient, the p-value, the element overlap and the attribute overlap). The ability to construct many different redescriptions and use them to optimize a set of fixed size differentiates the approach from currently proposed solutions. We analysed the algorithm's time complexity and measured its performance on data describing world countries. The results show that, when finding redescriptions containing only the conjunction operator, there are benefits to using more iterations. The generated redescriptions are statistically relevant, with p-values smaller than \(10^{-5}\). Many generated rules contained at most 6 attributes per rule in a redescription. Finally, we compared some characteristics of redescription sets generated by the CLUS-RM and the ReReMi algorithms. These results and the comparison reveal the main difference in algorithm preference: CLUS-RM produces more accurate redescriptions using many more conjunctive rules.

In future work, we plan to extend the current framework by deploying Random Forests of PCTs, which should further improve the resulting redescription sets in terms of size, diversity and quality. We also intend to work on a more comprehensive and objective evaluation of redescription sets.