Ranking discovered rules from data mining with multiple criteria by data envelopment analysis

https://doi.org/10.1016/j.eswa.2006.08.007

Abstract

In data mining applications, it is important to develop evaluation methods for selecting high-quality, profitable rules. This paper uses a non-parametric approach, Data Envelopment Analysis (DEA), to estimate and rank the efficiency of association rules with multiple criteria. The interestingness of association rules is conventionally measured by support and confidence. For specific applications, domain knowledge can additionally be encoded as measures for evaluating the discovered rules. For example, in market basket analysis, the product value and cross-selling profit associated with an association rule can serve as essential measures of rule interestingness. In this paper, these domain measures are also included in the rule ranking procedure for selecting valuable rules for implementation. A market basket analysis example illustrates the DEA based methodology for measuring the efficiency of association rules with multiple criteria.

Introduction

Data mining techniques have become widespread in business. Moreover, various rules may be obtained using data mining techniques, and only a small number of these rules may be selected for implementation due, at least in part, to limitations of budget and resources. Association rule mining differs from traditional machine learning techniques by permitting decision makers to pick from the many potential models that can be supported by the data (Webb & Zhang, 2005). Generally, association rule mining discovers all rules that meet certain sets of criteria or constraints, such as minimum support and minimum confidence, rather than generating a single model that best matches the data.

Evaluating the interestingness or usefulness of association rules is important in data mining. In many business applications, it is necessary to rank rules from data mining because of the large number of quality rules (Tan & Kumar, 2000) and business resource constraints (Choi, Ahn, & Kim, 2005). Selecting the more valuable rules for implementation increases the possibility of success in data mining. For example, in market basket analysis, marketing analysts are interested both in which products customers usually buy together and in how beneficial cross-selling promotions are to sellers. The former helps sellers provide appropriate products by considering the customers’ preferences, and the latter allows sellers to increase profits by considering the sellers’ profits. Customers’ preferences can be measured based on support and confidence in association rules. On the other hand, seller profits can be assessed using domain related measures such as sale profit and cross-selling profit associated with the association rules.

Since high value products are bought relatively rarely by customers, a rule that is profitable to sellers may not be discovered when constraints of minimum support and minimum confidence are set in the mining process. Cohen et al. (2000) described a good example of this, namely the Ketel vodka and Beluga caviar problem. Although most customers infrequently buy either of these two products, and they rarely appear in frequent itemsets, their profits may be potentially higher than those of many lower value products that are more frequently bought. Another example of interesting infrequent itemsets is described in Tao, Murtagh, and Farid (2003). The association rule [wine ⇒ salmon, 1%, 80%] may be more interesting to analysts than [bread ⇒ milk, 3%, 80%] despite the first rule having lower support. The items in the first rule are typically associated with more profit per unit sale.

As the examples of Ketel vodka and Beluga caviar and of wine and salmon show, infrequent itemsets may be interesting for certain applications provided that domain information is considered (Tao et al., 2003; Webb & Zhang, 2005). However, the traditional association rule mining algorithms (Agrawal et al., 1993; Srikant & Agrawal, 1997) cannot classify such infrequent products as interesting itemsets, since the subjective domain knowledge is ignored. A lower threshold can be set to identify the infrequent itemsets with a high value. However, numerous association rules are consequently generated, and it is extremely difficult for analysts to select the useful rules among them.

In previous studies dealing with the discovery of subjectively interesting association rules, most approaches require manual input or interaction by asking users to explicitly distinguish between interesting and uninteresting rules (Liu, Hsu, Chen, & Ma, 2000). Liu et al. briefly reviewed these existing approaches. The measures of interestingness are specified as constraints in the mining process, and only the rules that satisfy these constraints are retrieved. Klemettinen, Mannila, Ronkainen, Toivonen, and Verkamo (1994) proposed an item constraint, which describes the occurrence of certain items in the conditional (left hand side) and consequent (right hand side) parts. Srikant, Vu, and Agrawal (1997) also proposed a mining algorithm that considered the item and item hierarchy constraints specified by analysts. Moreover, Ng, Lakshmanan, Han, and Pang (1998) extended the approach developed by Srikant et al. to consider much more complicated constraints, including domain, class, and SQL-style aggregate constraints. The approach developed by Ng et al. can support constraint based, human-centered exploratory mining of association rules. Goethals and Van den Bussche (2000) also proposed an interactive approach based on querying conditions within the association rule mining process.

Liu et al. (2000) proposed an approach to assist analysts in finding interesting rules from a set of mined association rules by analyzing the rules using the domain information. The mined rules are then ranked according to two subjective interestingness measures, unexpectedness and actionability. The degree of unexpectedness of rules can be measured by the extent to which they surprise the analyst (Liu & Hsu, 1996; Silberschatz & Tuzhilin, 1996). Meanwhile, the degree of actionability can be measured by the extent to which analysts can use the discovered rules to their advantage. The system developed by Liu et al. (2000) is an interactive and iterative post-processing technique. This system first asks analysts to specify their existing domain knowledge, and then analyzes the discovered rules to identify the potentially interesting ones. However, Liu et al. focused on unexpectedness as the measure of interest.

Choi et al. (2005) proposed a group decision making approach based on the Analytic Hierarchy Process (AHP) to rank the association rules generated from data mining. This approach constructs a consensus when a group of managers work together to select discovered rules. The rule quality can be improved by considering both objective criteria and the subjective preferences of managers. However, this approach requires considerable human interaction to determine the weights of the criteria by aggregating the opinions of various managers.

Most existing association rule mining algorithms rely on high support to find frequent itemsets, and all items are considered to have equal weight (Tao et al., 2003). Therefore, these approaches are unsuitable for discovering the interesting infrequent itemsets described in the above two examples. Tao et al. developed an approach that used an improved model of weighted support. In weighted association rule mining, itemsets are no longer simply counted as they appear in a transaction, and subjective measures (e.g., profit) are also adopted for rule evaluation.

Most of the abovementioned approaches focus on computational efficiency by embedding the subjective constraints in the mining procedure to prune the search space. However, a huge amount of subjective domain knowledge may exist, all of which can be considered as potential subjective constraints and interestingness measures. It is difficult to determine the subjective constraints and interestingness measures before any rules have been discovered. If the constraints are not adequately stated, the interesting rules may not be discovered by the mining procedure. Additionally, rule interestingness may be a relative measure rather than an absolute one. Generally, decision makers can suitably select interesting rules for implementation after making comparisons between some potential rules.

In data mining, it is important to bring together statistic based rule extraction and profit based action to meet the enterprises’ objectives (Wang, Zhou, & Han, 2002). This paper aims to use a non-parametric approach, Data Envelopment Analysis (DEA), to estimate and rank the efficiency (interestingness or usefulness) of association rules with multiple criteria. The interestingness of association rules is measured by multiple criteria involving support, confidence and domain related measures. This paper uses DEA as a post-processing approach. After the rules have been discovered by the association rule mining algorithms, DEA is used to rank those discovered rules based on the specified criteria. The remainder of this paper is organized as follows. Section 2 introduces the concept of association rules. Section 3 then presents the DEA method. The proposed approach is described in Section 4. Next, an example of market basket analysis is illustrated in Section 5. Finally, Section 6 concludes the paper.

Section snippets

Association rules

Association rule mining discovers the relationships between items from the set of transactions. These relationships can be expressed by association rules such as [i1 ⇒ i2, i3; support = 3.5%, confidence = 45%]. This association rule implies that 3.5% of all the transactions under analysis show that items i1, i2 and i3 appear jointly. A confidence of 45% indicates that 45% of the transactions containing i1 also contain i2 and i3. Associations may include any number of items on either side of the rule.
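The two measures defined above can be sketched in a few lines of Python. This is only an illustration of the definitions: the function names and the five toy transactions are assumptions made for the example and do not come from the paper's data.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    ant = frozenset(antecedent)
    both = ant | frozenset(consequent)
    ant_hits = sum(1 for t in transactions if ant <= t)
    if ant_hits == 0:
        return 0.0  # rule is undefined when the antecedent never occurs
    both_hits = sum(1 for t in transactions if both <= t)
    return both_hits / ant_hits


# Hypothetical toy transactions (not from the paper).
transactions = [frozenset(t) for t in (
    {"i1", "i2", "i3"},
    {"i1", "i2"},
    {"i1", "i2", "i3"},
    {"i2", "i3"},
    {"i1"},
)]

# Rule [i1 => i2, i3]: i1, i2, i3 appear jointly in 2 of 5 transactions,
# and i1 appears in 4 of them.
rule_support = support({"i1", "i2", "i3"}, transactions)
rule_confidence = confidence({"i1"}, {"i2", "i3"}, transactions)
```

Here the rule's support counts transactions containing all three items, while the confidence conditions on the transactions containing i1 only, matching the reading of the 3.5% and 45% figures in the text.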

Data envelopment analysis

Data Envelopment Analysis (DEA) was introduced in 1978 by Charnes, Cooper and Rhodes (CCR), who demonstrated how to transform a fractional linear measure of efficiency into a linear programming model (Charnes, Cooper, & Rhodes, 1978). DEA was defined by Charnes et al. as: a mathematical programming model applied to observational data, which provides a new method of obtaining empirical estimates of extremal relations, such as the production functions and/or efficient production possibility
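For reference, the standard textbook form of the CCR transformation mentioned above is sketched below. The notation is an assumption, since the excerpt does not show the paper's own symbols: unit $o$ is the one being evaluated among $n$ units, $x_{ij}$ and $y_{rj}$ are the $i$-th input and $r$-th output of unit $j$, and $v_i$, $u_r$ are the weights to be chosen.

```latex
% Fractional (ratio) form of CCR efficiency for unit o
\max_{u,v}\; \theta_o = \frac{\sum_{r=1}^{s} u_r\, y_{ro}}{\sum_{i=1}^{m} v_i\, x_{io}}
\quad \text{s.t.} \quad
\frac{\sum_{r=1}^{s} u_r\, y_{rj}}{\sum_{i=1}^{m} v_i\, x_{ij}} \le 1,\; j = 1,\dots,n;
\qquad u_r \ge 0,\; v_i \ge 0.

% Equivalent linear program, obtained by normalizing the weighted input of unit o to 1
\max_{u,v}\; \theta_o = \sum_{r=1}^{s} u_r\, y_{ro}
\quad \text{s.t.} \quad
\sum_{i=1}^{m} v_i\, x_{io} = 1, \qquad
\sum_{r=1}^{s} u_r\, y_{rj} - \sum_{i=1}^{m} v_i\, x_{ij} \le 0,\; j = 1,\dots,n;
\qquad u_r \ge 0,\; v_i \ge 0.
```

Fixing the denominator of the evaluated unit at 1 removes the ratio from the objective, and multiplying each ratio constraint by its (positive) denominator makes the constraints linear.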

Proposed post-processing approach

The interestingness of a rule can be used to filter a large number of rules and report only those which may be useful to decision makers (Mitra, Pal, & Mitra, 2002). The thresholds of support and confidence are selected by only considering the database perspective. However, the interestingness of an association rule is commonly application-dependent (Srikant et al., 1997). The domain information in application areas can potentially provide useful criteria for picking important rules, and can be

Illustrative example

An example of market basket data is used to illustrate the proposed approach presented in Section 4. Association rules are first discovered by the Apriori algorithm, with minimum support and minimum confidence set to 1.0% and 10.0%, respectively. Forty-six rules are then identified in Step 2. The itemset values and cross-selling profits for these 46 rules are then calculated, as summarized in Table 1.
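The thresholding step can be illustrated with a minimal sketch. It is a brute-force enumeration written for clarity, not the Apriori algorithm itself (it performs no candidate pruning), and the function name, the five toy baskets, and the thresholds of 40% and 80% are all assumptions chosen for the example rather than the paper's data or settings.

```python
from itertools import combinations


def mine_rules(transactions, min_support, min_confidence, max_size=3):
    """Enumerate rules (antecedent, consequent, support, confidence) that
    meet both thresholds. Brute force over itemsets up to `max_size` items."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    rules = []
    for size in range(2, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            joint = count(itemset)
            if joint == 0 or joint / n < min_support:
                continue  # itemset is not frequent enough
            # Split the frequent itemset into antecedent => consequent.
            for k in range(1, size):
                for ant in combinations(combo, k):
                    antecedent = frozenset(ant)
                    conf = joint / count(antecedent)
                    if conf >= min_confidence:
                        consequent = tuple(sorted(itemset - antecedent))
                        rules.append((ant, consequent, joint / n, conf))
    return rules


# Hypothetical toy baskets (not from the paper).
baskets = [frozenset(b) for b in (
    {"bread", "milk"},
    {"bread", "milk", "wine"},
    {"wine", "salmon"},
    {"bread", "milk", "salmon"},
    {"milk"},
)]

strict = mine_rules(baskets, min_support=0.4, min_confidence=0.8)
loose = mine_rules(baskets, min_support=0.2, min_confidence=0.5)
```

With the stricter toy thresholds only the bread and milk rule survives, while lowering them also admits the wine and salmon rules along with many others, which is the situation that motivates ranking the mined rules afterwards.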

Preference scores (Zi) for each rule listed in Table 1 are calculated by Cook and

Conclusion

Association rule discovery is one of the popular techniques recently developed in the area of data mining. Evaluating the interestingness or usefulness of association rules is an essential task in data mining applications. In market basket analysis, marketing analysts are no longer satisfied by a set of rules or patterns discovered by a data mining algorithm. Instead, marketing analysts wish to develop rules or patterns that are ranked with respect to certain criteria. The complexity of rule

Acknowledgements

The authors would like to thank the National Science Council of the Republic of China, Taiwan for financially supporting this research under Contract No. NSC 95-2416-H-009-034-MY3.

References (33)

  • R. Srikant et al. Mining generalized association rules. Future Generation Computer Systems (1997)
  • Y.-F. Wang et al. A personalized recommender system for the cosmetic business. Expert Systems with Applications (2004)
  • R. Agrawal et al. Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD Conference on Management of Data (1993)
  • M.-C. Chen. Configuration of cellular manufacturing systems using association rule induction. International Journal of Production Research (2003)
  • Chen, M.-C., & Lin, C.-P. (in press). A data mining approach to product assortment and shelf space allocation, Expert...
  • M.-S. Chen et al. Data mining: an overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering (1996)