CCAR: An efficient method for mining class association rules with itemset constraints

https://doi.org/10.1016/j.engappai.2014.08.013

Abstract

Class association rules (CARs) are primarily used to build classification models for prediction; they can also be used to describe correlations between itemsets and class labels. The latter use is very popular in mining medical data. For example, epidemiologists often consider rules which indicate the relations between risk factors (itemsets) and HIV test results (class labels). However, in the real world, end users are often interested in only a subset of class association rules. In particular, they may consider only rules whose antecedent contains at least one itemset from a user-defined set of itemsets. For example, when identifying which populations are at high risk for HIV infection, epidemiologists often concentrate on rules that include demographic information such as sex, age, and marital status in the rule antecedent. Two naive strategies solve this problem by applying the itemset constraints in a pre-processing or post-processing step; however, such approaches are time-consuming. This paper thus proposes an efficient method for integrating the constraints into the class association rule mining process. The experimental results show that the proposed algorithm outperforms these two basic approaches in both mining time and memory consumption. The practical benefits of our method are demonstrated by a real-life application in the HIV/AIDS domain.

Introduction

The problem of mining class association rules (CARs) is to find, from a dataset, the complete set of CARs that satisfy the user-specified minimum support and minimum confidence thresholds. Numerous approaches have been proposed to solve this problem. Examples include the Apriori-based algorithm CBA (Liu et al., 1998), the FP tree-based algorithm CMAR (Li et al., 2001), mining CARs based on the vertical dataset layout (Zhao et al., 2009), the use of an equivalence class rule tree (Vo and Le, 2009), the lattice-based approach for mining CARs (Nguyen et al., 2012), the use of a modified ECR tree with Obidset (Nguyen et al., 2013), and parallel mining of CARs on the multi-core processor architecture (Nguyen et al., 2014).
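To make the two thresholds concrete, the following is a minimal sketch of how the support and confidence of a single rule could be computed. It assumes the usual definitions (support as the fraction of records that contain the antecedent and carry the class label, confidence as that count divided by the number of records containing the antecedent); some formulations use an absolute count for support instead. The function name `rule_stats` and the toy dataset are illustrative assumptions, not part of the paper.

```python
from typing import Dict, FrozenSet, List, Tuple

Item = Tuple[str, str]                 # (attribute, value), e.g. ("A1", "a11")
Record = Tuple[Dict[str, str], str]    # (attribute values, class label)


def rule_stats(dataset: List[Record],
               antecedent: FrozenSet[Item],
               label: str) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent -> label.

    Assumed definitions: support is relative to the dataset size,
    confidence is relative to the records matching the antecedent.
    """
    # Class labels of all records that contain every item of the antecedent.
    matches = [cls for attrs, cls in dataset
               if all(attrs.get(a) == v for a, v in antecedent)]
    covered = len(matches)
    correct = sum(1 for cls in matches if cls == label)
    support = correct / len(dataset) if dataset else 0.0
    confidence = correct / covered if covered else 0.0
    return support, confidence


# Hypothetical toy dataset with two attributes and two class labels.
dataset: List[Record] = [
    ({"A1": "a11", "A2": "a21"}, "c1"),
    ({"A1": "a11", "A2": "a22"}, "c1"),
    ({"A1": "a12", "A2": "a21"}, "c2"),
    ({"A1": "a11", "A2": "a21"}, "c2"),
]

support, confidence = rule_stats(dataset, frozenset({("A1", "a11")}), "c1")
```

A rule would be reported only if both values reach the user-specified minimum support and minimum confidence.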

Mining CARs to discover associations between itemsets and class labels is very popular and useful in practice, especially in mining medical data. However, end users often consider only a subset of CARs, for instance, those that contain at least one itemset from a user-defined set of itemsets. Itemset constraints reduce the number of obtained CARs and shrink the search space, improving the performance of the mining process. Additionally, constrained CARs help to discover rules that are interesting or useful to a particular end user. For example, in cancer treatment applications, biologists often focus on rules involving new drugs to understand the effectiveness of new treatment strategies. Thus, the present study considers constraints in the form of Boolean expressions over the presence of itemsets in the antecedents of classification rules. The main contributions of this paper are as follows. Firstly, a tree structure named the Constraint Class Rule tree (CCR-tree) is proposed for efficiently mining CARs with itemset constraints. At the first level, the tree contains both constrained nodes, which include constrained itemsets, and frequent nodes, which include frequent 1-itemsets. At the following levels, the tree contains constrained nodes only. Using this tree structure, only nodes that contain constrained itemsets are generated. Secondly, two theorems for quickly pruning infrequent itemsets are derived. Finally, an efficient algorithm for mining CARs with itemset constraints is developed. Unlike the two existing pre- and post-processing approaches, the proposed method does not generate all rules, which significantly reduces both the mining time and the memory consumption. The experimental results show that the proposed algorithm can achieve speedups of up to 3× and 12× over the pre- and post-processing methods, respectively.
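As an illustration of the constraint itself, the sketch below checks whether a rule antecedent contains at least one itemset from a user-defined set, and uses that check as a naive post-processing filter over already-mined rules. This is the baseline strategy the paper compares against, not the CCAR algorithm; the names `satisfies_constraint` and `post_filter`, the attribute names, and the rules are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Tuple

Item = Tuple[str, str]            # (attribute, value)
Itemset = FrozenSet[Item]
Rule = Tuple[Itemset, str]        # (antecedent, class label)


def satisfies_constraint(antecedent: Itemset,
                         constraints: Iterable[Itemset]) -> bool:
    """True if the antecedent contains at least one constraint itemset."""
    return any(c <= antecedent for c in constraints)


def post_filter(rules: Iterable[Rule],
                constraints: List[Itemset]) -> List[Rule]:
    """Naive post-processing: keep only rules that satisfy the constraint."""
    return [r for r in rules if satisfies_constraint(r[0], constraints)]


# Hypothetical constraint: antecedent must mention sex=female,
# or both age=20-29 and marital=single.
constraints = [
    frozenset({("sex", "female")}),
    frozenset({("age", "20-29"), ("marital", "single")}),
]

# Hypothetical rules, as if produced by an unconstrained CAR miner.
rules: List[Rule] = [
    (frozenset({("sex", "female"), ("risk", "r1")}), "positive"),
    (frozenset({("risk", "r1")}), "positive"),
    (frozenset({("age", "20-29"), ("marital", "single"), ("risk", "r2")}), "negative"),
    (frozenset({("age", "20-29")}), "negative"),
]

kept = post_filter(rules, constraints)
```

The filter only runs after every rule has been generated, which is why the post-processing strategy pays the full cost of unconstrained mining.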

The rest of this paper is organized as follows. In Section 2, some preliminary concepts of CAR mining are briefly given. Work related to mining association rules with itemset constraints and mining class association rules with itemset constraints is introduced in Section 3. The primary contributions are presented in Section 4, in which the CCR-tree structure is presented and two theorems for eliminating infrequent itemsets are provided. The proposed algorithm, Constraint Class Association Rule (CCAR), for efficiently mining CARs with itemset constraints is also described in this section. The experimental results are presented in Section 5. Section 6 describes a real-life application of the proposed method in the HIV/AIDS domain. Finally, conclusions and future work are discussed in Section 7.

Section snippets

Preliminary concepts

Let D be a dataset with n attributes {A1, A2,...,An} and |D| records (objects) where each record has an object identifier (OID). Let C={c1,c2,...,ck} be a list of class labels. A specific value of an attribute Ai and class C is denoted by lower-case letters aim and cj, respectively.

Definition 1

An item is described as an attribute and a specific value for that attribute, denoted by 〈(Ai,aim)〉, e.g. 〈(A1,a11)〉, 〈(A1,a12)〉, 〈(A2,a21)〉, etc.

Definition 2

An itemset is a set of items, e.g., 〈(A1,a11),(A2,a21)〉, 〈(A1,a11),(A3,

Mining association rules with itemset constraints

The problem of mining association rules with itemset constraints has been widely researched in the literature. Since the introduction of mining association rules with itemset constraints (Srikant et al., 1997), three main strategies have been proposed. The first group, post-processing methods, first mines frequent itemsets by using an algorithm such as Apriori (Agrawal and Srikant, 1994) or FP-Growth (Han et al., 2000) and then filters out the ones that do not satisfy the itemset constraints in

Tree structure

This study proposes the CCR-tree structure. In the CCR-tree, each node contains one itemset along with the following information:

  • (1) $(Obidset_1, Obidset_2, \ldots, Obidset_k)$: each $Obidset_i$ is the set of identifiers of the objects that contain both the itemset and class $c_i$. Note that $k$ is the number of classes in the dataset.

  • (2) pos: stores the position of the class with the maximum cardinality of $Obidset_i$, i.e., $pos = \arg\max_{i \in [1,k]} \{|Obidset_i|\}$.

  • (3) total: stores the sum of the cardinalities of all $Obidset_i$, i.e., $total = \sum_{i=1}^{k} |Obidset_i|$.
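The node contents listed above can be sketched as a small data structure. This is an illustrative sketch rather than the authors' implementation (which was written in C#): the class name `CCRNode`, the 0-based indexing of classes, and the `confidence` helper are assumptions. The helper relies on each object having exactly one class label, so that `total` equals the number of objects containing the itemset.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple

Item = Tuple[str, str]  # (attribute, value)


@dataclass
class CCRNode:
    """Hypothetical sketch of a CCR-tree node."""
    itemset: FrozenSet[Item]
    # obidsets[i] holds the OIDs of objects that contain the itemset
    # and belong to the class at 0-based index i.
    obidsets: List[Set[int]]

    @property
    def pos(self) -> int:
        """0-based index of the class with the largest Obidset.

        Ties resolve to the lowest index (an assumption of this sketch).
        """
        return max(range(len(self.obidsets)),
                   key=lambda i: len(self.obidsets[i]))

    @property
    def total(self) -> int:
        """Sum of the cardinalities of all Obidsets."""
        return sum(len(s) for s in self.obidsets)

    def confidence(self) -> float:
        """Confidence of the rule itemset -> class at pos.

        Assumes every object has exactly one class label.
        """
        return len(self.obidsets[self.pos]) / self.total if self.total else 0.0


# Itemset <(A1, a11)> occurs in objects 1 and 2 with the first class
# and in object 4 with the second class (hypothetical data).
node = CCRNode(frozenset({("A1", "a11")}), [{1, 2}, {4}])
```

Storing `pos` and `total` with the node means the best class for an itemset and the confidence of the corresponding rule can be read off without rescanning the dataset.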

Experiments

All experiments were conducted on a computer with an Intel Core i5-540M CPU at 2.53 GHz and 4 GB of RAM running Windows 7 Enterprise (32-bit) SP1. The experimental datasets were obtained from the UCI Machine Learning Repository (http://mlearn.ics.uci.edu). The algorithms were coded in C# using MS Visual Studio .NET 2010 Express.

An application of the proposed method in the HIV/AIDS domain

Data mining has practical applications in many areas such as business, retail, banking, education, healthcare, science, engineering, etc. In the engineering domain, applications of data mining are becoming popular. Kamsu-Foguem et al. (2013) applied sequential rule mining to the production process for quality improvement. Their study reported some interesting results for the drill production process. Sequential rule mining was also used in intelligent tutoring agents (Faghihi et al., 2012,

Conclusions and future work

This study has proposed an efficient method for mining CARs with itemset constraints. Unlike post-processing and pre-processing approaches, our approach generates only rules that satisfy the itemset constraints. The framework of the proposed algorithm is based on a novel tree structure which includes only nodes containing constrained itemsets and two theorems for quickly pruning infrequent itemsets. To validate the efficiency of the proposed method, a series of experiments was conducted on four

Acknowledgments

This work was funded by Vietnam's National Foundation for Science and Technology Development (NAFOSTED) under Grant no. 102.01-2012.17. The authors would like to thank the Ho Chi Minh City Provincial AIDS Committee (PAC), which provided the real VCT dataset used in this study.

References (32)

  • Potes Ruiz, P., et al., 2014. Generating knowledge in maintenance from experience feedback. Knowl.-Based Syst.
  • Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules in large databases. In: Proceedings of the...
  • Campbell, C., et al., 1997. The role of HIV counseling and testing in the developing world. AIDS Educ. Prev.
  • Dat, T., Ray, R., Binh, N., Vinh, D., Thang, N., Cuong, N., Mitchell, W., Long, N., An, C., 2009. Application of...
  • Han, J., Pei, J., Yin, Y., 2000. Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM...
  • Kamb, M., et al., 1998. Efficacy of risk-reduction counseling to prevent human immunodeficiency virus and sexually transmitted diseases: a randomized controlled trial. J. Am. Med. Assoc.
