CCAR: An efficient method for mining class association rules with itemset constraints

doi:10.1016/j.engappai.2014.08.013

Engineering Applications of Artificial Intelligence

Volume 37, January 2015, Pages 115-124

https://doi.org/10.1016/j.engappai.2014.08.013

Abstract

Class association rules (CARs) are primarily used to build a classification model for prediction; they can also be used to describe correlations between itemsets and class labels. The latter use is common in mining medical data. For example, epidemiologists often consider rules which indicate the relations between risk factors (itemsets) and HIV test results (class labels). However, in the real world, end users are often interested in only a subset of class association rules. In particular, they may consider only rules whose antecedent contains at least one itemset from a user-defined set of itemsets. For example, when classifying which populations are at high risk for HIV infection, epidemiologists often concentrate on rules that include demographic information such as sex, age, and marital status in rule antecedents. Two naive strategies solve this problem by applying the itemset constraints in a pre-processing or post-processing step. However, such approaches are time-intensive. This paper thus proposes an efficient method for integrating the constraints into the class association rule mining process. The experimental results show that the proposed algorithm outperforms the two basic approaches in terms of mining time and memory consumption. The practical benefits of our method are demonstrated by a real-life application in the HIV/AIDS domain.

Introduction

The problem of mining class association rules (CARs) is to find the complete set of CARs in a dataset that satisfy the user-specified minimum support and minimum confidence thresholds. Numerous approaches have been proposed to solve this problem. Examples include the Apriori-based algorithm CBA (Liu et al., 1998), the FP tree-based algorithm CMAR (Li et al., 2001), mining CARs based on the vertical dataset layout (Zhao et al., 2009), the use of an equivalence class rule tree (Vo and Le, 2009), the lattice-based approach for mining CARs (Nguyen et al., 2012), the use of a modified ECR tree with Obidset (Nguyen et al., 2013), and parallel mining of CARs on multi-core processor architectures (Nguyen et al., 2014).
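To make the two thresholds concrete, the following is a minimal sketch of how the support and confidence of a single rule could be computed by scanning a dataset. It assumes the common definitions (support as the fraction of records that contain the antecedent and carry the class label, confidence as that count divided by the number of records containing the antecedent); the paper's own definitions may use absolute counts instead. The function name `rule_support_confidence` and the toy records are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Item = Tuple[str, str]                 # (attribute, value), an assumed encoding
Record = Tuple[FrozenSet[Item], str]   # (items of the record, class label)


def rule_support_confidence(
    dataset: List[Record], antecedent: FrozenSet[Item], label: str
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent -> label."""
    # Class labels of all records whose items include the antecedent.
    covered = [cls for items, cls in dataset if antecedent <= items]
    # Records that contain the antecedent and also carry the target label.
    hits = sum(1 for cls in covered if cls == label)
    support = hits / len(dataset) if dataset else 0.0
    confidence = hits / len(covered) if covered else 0.0
    return support, confidence


# Hypothetical toy dataset with two attributes and two classes.
toy = [
    (frozenset({("A1", "a11"), ("A2", "a21")}), "c1"),
    (frozenset({("A1", "a11"), ("A2", "a22")}), "c1"),
    (frozenset({("A1", "a11"), ("A2", "a21")}), "c2"),
    (frozenset({("A1", "a12"), ("A2", "a21")}), "c2"),
]
support, confidence = rule_support_confidence(
    toy, frozenset({("A1", "a11")}), "c1"
)
```

In this toy data the rule is matched by two of four records, and three records contain the antecedent, so a rule would pass only if the thresholds are at most 0.5 for support and about 0.67 for confidence.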

Mining CARs to discover associations between itemsets and class labels is very popular and useful in practice, especially in mining medical data. However, end users often consider only a subset of CARs, for instance, those that contain at least one itemset from a user-defined set of itemsets. Itemset constraints reduce the number of obtained CARs and shrink the search space, improving the performance of the mining process. Constrained CARs also help to discover rules that are interesting or useful to a particular end user. For example, in cancer treatment applications, biologists often focus on rules involving new drugs to understand the effectiveness of new treatment strategies. Thus, the present study considers constraints in the form of Boolean expressions over the presence of itemsets in the antecedents of classification rules.

The main contributions of this paper are as follows. Firstly, a tree structure named the Constraint Class Rule tree (CCR-tree) is proposed for efficiently mining CARs with itemset constraints. At the first level, the tree contains both constrained nodes, which include constrained itemsets, and frequent nodes, which include frequent 1-itemsets. At the following levels, the tree contains constrained nodes only. Using this tree structure, only nodes that contain constrained itemsets are generated. Secondly, two theorems for quickly pruning infrequent itemsets are derived. Finally, an efficient algorithm for mining CARs with itemset constraints is developed. Unlike the pre- and post-processing approaches, the proposed method does not generate all rules, which significantly reduces both the mining time and the memory consumption. The experimental results show that the proposed algorithm achieves up to 3× and 12× speedups over the pre- and post-processing methods, respectively.
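The itemset constraint itself can be illustrated with a short sketch. It assumes the simplest form of the constraint, a disjunction in which a rule qualifies when its antecedent is a superset of at least one user-defined itemset, and it applies that test as a post-processing filter, i.e. the naive baseline rather than the CCAR algorithm. The names `satisfies_constraint` and `filter_rules` and the example rules are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Tuple

Item = Tuple[str, str]              # (attribute, value), an assumed encoding
Rule = Tuple[FrozenSet[Item], str]  # (antecedent itemset, class label)


def satisfies_constraint(
    antecedent: FrozenSet[Item], constraint_itemsets: Iterable[FrozenSet[Item]]
) -> bool:
    """True if the antecedent contains at least one constrained itemset."""
    return any(c <= antecedent for c in constraint_itemsets)


def filter_rules(
    rules: Iterable[Rule], constraint_itemsets: Iterable[FrozenSet[Item]]
) -> List[Rule]:
    """Post-processing baseline: keep only rules that satisfy the constraint."""
    constraints = list(constraint_itemsets)  # allow any iterable, reuse per rule
    return [r for r in rules if satisfies_constraint(r[0], constraints)]


# Hypothetical rules and constraints in the spirit of the HIV example.
rules = [
    (frozenset({("sex", "female"), ("age", "20-29")}), "positive"),
    (frozenset({("occupation", "driver")}), "positive"),
]
constraints = [
    frozenset({("sex", "female")}),
    frozenset({("marital_status", "single")}),
]
kept = filter_rules(rules, constraints)
```

A filter of this kind only runs after every rule has been generated, which is the cost that integrating the constraints into the mining process is meant to avoid.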

The rest of this paper is organized as follows. In Section 2, some preliminary concepts of CAR mining are briefly given. Work related to mining association rules with itemset constraints and mining class association rules with itemset constraints is introduced in Section 3. The primary contributions are presented in Section 4, in which the CCR-tree structure is presented and two theorems for eliminating infrequent itemsets are provided. The proposed algorithm, Constraint Class Association Rule (CCAR), for efficiently mining CARs with itemset constraints is also described in this section. The experimental results are presented in Section 5. Section 6 describes a real-life application of the proposed method in the HIV/AIDS domain. Finally, conclusions and future work are discussed in Section 7.

Preliminary concepts

Let D be a dataset with n attributes {A₁, A₂,...,A_n} and |D| records (objects), where each record has an object identifier (OID). Let C={c₁,c₂,...,c_k} be a list of class labels. Specific values of an attribute A_i and of the class C are denoted by the lower-case letters a_im and c_j, respectively.

Definition 1

An item is described as an attribute and a specific value for that attribute, denoted by 〈(A_i,a_im)〉, e.g. 〈(A₁,a₁₁)〉, 〈(A₁,a₁₂)〉, 〈(A₂,a₂₁)〉, etc.

Definition 2

An itemset is a set of items, e.g., 〈(A₁,a₁₁),(A₂,a₂₁)〉, 〈(A₁,a₁₁),(A₃,

Mining association rules with itemset constraints

The problem of mining association rules with itemset constraints has been widely researched in the literature. Since the introduction of mining association rules with itemset constraints (Srikant et al., 1997), three main strategies have been proposed. The first group, post-processing methods, first mines frequent itemsets by using an algorithm such as Apriori (Agrawal and Srikant, 1994) or FP-Growth (Han et al., 2000) and then filters out the ones that do not satisfy the itemset constraints in

Tree structure

This study proposes the CCR-tree structure. In the CCR-tree, each node contains one itemset along with the following information:

(1) (Obidset₁,Obidset₂,...,Obidset_k): each Obidset_i is the set of identifiers of the objects that contain both the itemset and class c_i. Note that k is the number of classes in the dataset.
(2) pos: stores the position of the class with the maximum cardinality of Obidset_i, i.e., pos=argmax_i∈[1,k]{|Obidset_i|}.
(3) total: stores the sum of the cardinalities of all Obidset_i, i.e., $\mathit{total} = \sum_{i=1}^{k} |\mathit{Obidset}_{i}|$.
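The node contents listed above can be sketched as a small data structure. This is an illustrative layout, not the authors' C# implementation: the class name `CCRNode` and its fields are hypothetical, positions are zero-based here (the text indexes classes from 1 to k), and the `confidence` helper assumes the usual definition in which the best rule of a node has confidence |Obidset_pos| / total.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple

Item = Tuple[str, str]  # (attribute, value), an assumed encoding


@dataclass
class CCRNode:
    """One tree node: an itemset plus one Obidset per class label."""

    itemset: FrozenSet[Item]
    obidsets: List[Set[int]]  # obidsets[i] = OIDs with the itemset and class i

    @property
    def pos(self) -> int:
        # Zero-based index of the class whose Obidset is largest
        # (the first such class wins on ties).
        return max(range(len(self.obidsets)), key=lambda i: len(self.obidsets[i]))

    @property
    def total(self) -> int:
        # Number of objects that contain the itemset, over all classes.
        return sum(len(obidset) for obidset in self.obidsets)

    def confidence(self) -> float:
        # Assumed definition: share of covered objects in the majority class.
        return len(self.obidsets[self.pos]) / self.total if self.total else 0.0


# Hypothetical node: the itemset occurs in objects 1 and 2 (first class)
# and in object 3 (second class).
node = CCRNode(
    itemset=frozenset({("A1", "a11")}),
    obidsets=[{1, 2}, {3}],
)
```

Computing pos and total on demand keeps the sketch short; an implementation that queries them often would more likely store them once when the node is created, as the description suggests.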

Experiments

All experiments were conducted on a computer with an Intel Core i5-540M CPU at 2.53 GHz and 4 GB of RAM running Windows 7 Enterprise (32-bit) SP1. The experimental datasets were obtained from the UCI Machine Learning Repository (http://mlearn.ics.uci.edu). The algorithms were coded in C# using MS Visual Studio.NET 2010 Express.

An application of the proposed method in the HIV/AIDS domain

Data mining has practical applications in many areas such as business, retail, banking, education, healthcare, science, engineering, etc. In the engineering domain, applications of data mining are becoming popular. Kamsu-Foguem et al. (2013) applied sequential rule mining to the production process for quality improvement. Their study reported some interesting results for the drill production process. Sequential rule mining was also used in intelligent tutoring agents (Faghihi et al., 2012,

Conclusions and future work

This study has proposed an efficient method for mining CARs with itemset constraints. Unlike post-processing and pre-processing approaches, our approach generates only rules that satisfy the itemset constraints. The framework of the proposed algorithm is based on a novel tree structure which includes only nodes containing constrained itemsets and two theorems for quickly pruning infrequent itemsets. To validate the efficiency of the proposed method, a series of experiments was conducted on four

Acknowledgments

This work was funded by Vietnam's National Foundation for Science and Technology Development (NAFOSTED) under Grant no. 102.01-2012.17. The authors would like to thank the Ho Chi Minh City Provincial AIDS Committee (PAC), which provided the real VCT dataset used in this study.

References (32)

H. Duong et al. An efficient method for mining frequent itemsets with double constraints. Eng. Appl. Artif. Intell. (2014)

U. Faghihi et al. A computational model for causal learning in cognitive agents. Knowl.-based Syst. (2012)

P. Fournier-Viger et al. CMRules: mining sequential rules common to several sequences. Knowl.-based Syst. (2012)

B. Kamsu-Foguem et al. Mining association rules for the quality improvement of the production process. Expert Syst. Appl. (2013)

G. Köksal et al. A review of data mining applications for quality improvement in manufacturing industry. Expert Syst. Appl. (2011)

A. Mirabadi et al. Application of association rules in Iranian Railways (RAI) accident data analysis. Saf. Sci. (2010)

D. Nguyen et al. Efficient strategies for parallel mining class association rules. Expert Syst. Appl. (2014)

L.T. Nguyen et al. Classification based on association rules: a lattice-based approach. Expert Syst. Appl. (2012)

L.T. Nguyen et al. CAR-Miner: an efficient algorithm for mining class-association rules. Expert Syst. Appl. (2013)

R. Nkambou et al. Learning task models in ill-defined domain using an hybrid knowledge discovery framework. Knowl.-based Syst. (2011)

P. Potes Ruiz et al. Generating knowledge in maintenance from experience feedback. Knowl.-based Syst. (2014)

Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules in large databases. In: Proceedings of the...

C. Campbell et al. The role of HIV counseling and testing in the developing world. AIDS Educ. Prev. (1997)

Dat, T., Ray, R., Binh, N., Vinh, D., Thang, N., Cuong, N., Mitchell, W., Long, N., An, C., 2009. Application of...

Han, J., Pei, J., Yin, Y., 2000. Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM...

M. Kamb et al. Efficacy of risk-reduction counseling to prevent human immunodeficiency virus and sexually transmitted diseases: a randomized controlled trial. J. Am. Med. Assoc. (1998)


