Ranking discovered rules from data mining with multiple criteria by data envelopment analysis
Introduction
Data mining techniques have become widespread in business. Moreover, various rules may be obtained using data mining techniques, and only a small number of these rules may be selected for implementation due, at least in part, to limitations of budget and resources. Association rule mining differs from traditional machine learning techniques by permitting decision makers to pick from the many potential models that can be supported by the data (Webb & Zhang, 2005). Generally, association rule mining discovers all rules that meet certain sets of criteria or constraints, such as minimum support and minimum confidence, rather than generating a single model that best matches the data.
Evaluating the interestingness or usefulness of association rules is important in data mining. In many business applications, it is necessary to rank rules from data mining because of the large number of quality rules (Tan & Kumar, 2000) and business resource constraints (Choi, Ahn, & Kim, 2005). Selecting the more valuable rules for implementation increases the likelihood of success in data mining. For example, in market basket analysis, marketing analysts are interested both in which products customers usually buy together and in how cross-selling promotions benefit sellers. The former enables sellers to offer appropriate products by considering customers’ preferences, and the latter allows sellers to increase their profits. Customers’ preferences can be measured by the support and confidence of association rules. Seller profits, on the other hand, can be assessed using domain-related measures such as the sale profit and cross-selling profit associated with the rules.
Since high-value products are bought relatively rarely, a rule that is profitable to sellers may not be discovered when constraints of minimum support and minimum confidence are set in the mining process. Cohen et al. (2000) described a good example of this, namely the Ketel vodka and Beluga caviar problem. Although most customers rarely buy either of these two products, so that they seldom appear in frequent itemsets, their profits may be higher than those of many lower-value products that are bought more frequently. Another example of interesting infrequent itemsets is described in Tao, Murtagh, and Farid (2003). The association rule [wine ⇒ salmon, 1%, 80%] may be more interesting to analysts than [bread ⇒ milk, 3%, 80%] despite the first rule having lower support, because the items in the first rule are typically associated with more profit per unit sale.
As the examples of Ketel vodka and Beluga caviar and of wine and salmon show, infrequent itemsets may be interesting for certain applications provided that domain information is considered (Tao et al., 2003; Webb & Zhang, 2005). However, the traditional association rule mining algorithms (Agrawal et al., 1993; Srikant & Agrawal, 1997) cannot classify such infrequent products as interesting itemsets because subjective domain knowledge is ignored. A lower threshold can be set to identify infrequent itemsets with a high value. However, numerous association rules are then generated, and it is extremely difficult for analysts to select the useful rules among them.
In previous studies dealing with the discovery of subjectively interesting association rules, most approaches require manual input or interaction, asking users to explicitly distinguish between interesting and uninteresting rules (Liu, Hsu, Chen, & Ma, 2000). Liu et al. briefly reviewed these existing approaches. The measures of interestingness are specified as constraints in the mining process, and only the rules that satisfy these constraints are retrieved. Klemettinen, Mannila, Ronkainen, Toivonen, and Verkamo (1994) proposed an item constraint, which describes the occurrence of certain items in the conditional (left-hand side) and consequent (right-hand side) parts. Srikant, Vu, and Agrawal (1997) also proposed a mining algorithm that considers the item and item hierarchy constraints specified by analysts. Moreover, Ng, Lakshmanan, Han, and Pang (1998) extended the approach developed by Srikant et al. to consider much more complicated constraints, including domain, class, and SQL-style aggregate constraints. The approach developed by Ng et al. supports constraint-based, human-centered exploratory mining of association rules. Goethals and Van den Bussche (2000) also proposed an interactive approach based on querying conditions within the association rule mining process.
Liu et al. (2000) proposed an approach to assist analysts in finding interesting rules from a set of mined association rules by analyzing the rules using domain information. The mined rules are then ranked according to two subjective interestingness measures, unexpectedness and actionability. The degree of unexpectedness of rules can be measured by the extent to which they surprise the analyst (Liu & Hsu, 1996; Silberschatz & Tuzhilin, 1996). Meanwhile, the degree of actionability can be measured by the extent to which analysts can use the discovered rules to their advantage. The system developed by Liu et al. (2000) is an interactive and iterative post-processing technique. This system first asks analysts to specify their existing domain knowledge, and then analyzes the discovered rules to identify the potentially interesting ones. However, Liu et al. focused on unexpected rules, so the ranking relies mainly on the unexpectedness measure.
Choi et al. (2005) proposed a group decision making approach based on the Analytic Hierarchy Process (AHP) to rank the association rules generated from data mining. This approach builds a consensus when a group of managers work together to select discovered rules. The rule quality can be improved by considering both objective criteria and the subjective preferences of managers. However, this approach requires considerable human interaction to determine the weights of the criteria by aggregating the opinions of various managers.
Most existing association rule mining algorithms rely on a large support to find frequent itemsets, and all items are considered to have equal weight (Tao et al., 2003). Therefore, these approaches are unsuitable for discovering interesting infrequent itemsets such as those in the two examples above. Tao et al. developed an approach that uses an improved model of weighted support. In weighted association rule mining, itemsets are no longer simply counted as they appear in a transaction, and subjective measures (e.g., profit) are also adopted for rule evaluation.
Most of the abovementioned approaches focus on computational efficiency by embedding the subjective constraints in the mining procedure to prune the search space. However, a huge amount of subjective domain knowledge may exist, all of which could serve as potential subjective constraints and interestingness measures. It is difficult to determine the subjective constraints and interestingness measures before any rules have been discovered. If the constraints are not adequately stated, the interesting rules may not be discovered by the mining procedure. Additionally, rule interestingness may be a relative measure rather than an absolute one. Generally, decision makers can suitably select interesting rules for implementation after comparing a number of potential rules.
In data mining, it is important to bring together statistics-based rule extraction and profit-based action to meet the enterprise’s objectives (Wang, Zhou, & Han, 2002). This paper uses a non-parametric approach, Data Envelopment Analysis (DEA), to estimate and rank the efficiency (interestingness or usefulness) of association rules with multiple criteria. The interestingness of association rules is measured by multiple criteria involving support, confidence and domain-related measures. This paper uses DEA as a post-processing approach. After the rules have been discovered by the association rule mining algorithms, DEA is used to rank those discovered rules based on the specified criteria. The remainder of this paper is organized as follows. Section 2 introduces the concept of association rules. Section 3 presents the DEA method. The proposed approach is described in Section 4. An example of market basket analysis is illustrated in Section 5. Finally, Section 6 concludes the paper.
Association rules
Association rule mining discovers the relationships between items in a set of transactions. These relationships can be expressed by association rules such as [i1 ⇒ i2, i3, support = 3.5%, confidence = 45%]. A support of 3.5% means that items i1, i2 and i3 appear jointly in 3.5% of all the transactions under analysis. A confidence of 45% indicates that 45% of the transactions containing i1 also contain i2 and i3. Associations may include any number of items on either side of the rule.
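The two measures can be illustrated with a minimal sketch. The function names and the toy transactions below are assumptions made for illustration only and are not taken from the paper; the definitions follow the description above (support is the share of all transactions containing every item of the rule, confidence is the share of transactions containing the left-hand side that also contain the right-hand side).

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Fraction of transactions containing `lhs` that also contain `rhs`."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0:
        return 0.0  # the rule is undefined when the left-hand side never occurs
    return support(transactions, set(lhs) | set(rhs)) / lhs_support


# Hypothetical toy data, not the market basket data used in the paper.
transactions = [
    {"i1", "i2", "i3"},
    {"i1", "i2"},
    {"i1"},
    {"i2", "i3"},
]

# Rule i1 => i2, i3
rule_support = support(transactions, {"i1", "i2", "i3"})
rule_confidence = confidence(transactions, {"i1"}, {"i2", "i3"})
print(f"support = {rule_support:.2%}, confidence = {rule_confidence:.2%}")
```

In this toy data the three items appear together in one of four transactions, and i1 appears in three transactions, only one of which also contains i2 and i3.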
Data envelopment analysis
In 1978, Data Envelopment Analysis (DEA) was initiated by Charnes, Cooper and Rhodes (CCR), and they demonstrated how to change a fractional linear measure of efficiency into a linear programming model (Charnes, Cooper, & Rhodes, 1978). DEA was defined by Charnes et al. as: a mathematical programming model applied to observational data, which provides a new method of obtaining empirical estimates of extremal relations-such as the production functions and/or efficient production possibility
Proposed post-processing approach
The interestingness of a rule can be used to filter a large number of rules and report only those which may be useful to decision makers (Mitra, Pal, & Mitra, 2002). The thresholds of support and confidence are selected by only considering the database perspective. However, the interestingness of an association rule is commonly application-dependent (Srikant et al., 1997). The domain information in application areas can potentially provide useful criteria for picking important rules, and can be
Illustrative example
An example of market basket data is used to illustrate the proposed approach presented in Section 4. Association rules are first discovered by the Apriori algorithm, in which minimum support and minimum confidence are set to 1.0% and 10.0%, respectively. Forty-six rules are then identified in Step 2. The itemset values and cross-selling profits for these 46 rules are then calculated, as summarized in Table 1.
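The rule discovery step can be sketched as a small level-wise Apriori procedure. This is only an illustrative sketch under stated assumptions: the function names (`apriori`, `generate_rules`), the toy transactions, and the thresholds of 40% and 60% are hypothetical and chosen to suit five transactions; they are not the paper's data or its 1.0% and 10.0% settings.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frequent itemset: support} using level-wise candidate generation."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)
    candidates = {frozenset([item]) for basket in baskets for item in basket}
    frequent = {}
    k = 1
    while candidates:
        # Count each candidate and keep those that meet the support threshold.
        level = {}
        for cand in candidates:
            sup = sum(1 for basket in baskets if cand <= basket) / n
            if sup >= min_support:
                level[cand] = sup
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates and prune any
        # candidate that has an infrequent k-subset.
        k += 1
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in level for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return frequent


def generate_rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) tuples meeting min_confidence."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for lhs_items in combinations(sorted(itemset), size):
                lhs = frozenset(lhs_items)
                # Every subset of a frequent itemset is itself frequent.
                conf = sup / frequent[lhs]
                if conf >= min_confidence:
                    rules.append((lhs, itemset - lhs, sup, conf))
    return rules


# Hypothetical toy baskets, not the paper's data.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk"},
    {"wine", "salmon"},
]

frequent = apriori(transactions, min_support=0.4)
rules = generate_rules(frequent, min_confidence=0.6)
for lhs, rhs, sup, conf in rules:
    print(sorted(lhs), "=>", sorted(rhs), f"support={sup:.0%} confidence={conf:.0%}")
```

With these toy thresholds the wine and salmon pair is not reported because it occurs in only one basket; lowering `min_support` to 0.2 brings it back, at the cost of more itemsets to inspect.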
Preference scores (Zi) for each rule listed in Table 1 are calculated by Cook and
Conclusion
Association rule discovery is one of the popular techniques recently developed in the area of data mining. Evaluating the interestingness or usefulness of association rules is an essential task in data mining applications. In market basket analysis, marketing analysts are no longer satisfied by a set of rules or patterns discovered by a data mining algorithm. Instead, marketing analysts wish to develop rules or patterns that are ranked with respect to certain criteria. The complexity of rule
Acknowledgements
The authors would like to thank the National Science Council of the Republic of China, Taiwan for financially supporting this research under Contract No. NSC 95-2416-H-009-034-MY3.
References (33)
- et al. Review of ranking methods in the data envelopment analysis context. European Journal of Operational Research (2002)
- et al. Foundations of data envelopment analysis for Pareto–Koopmans efficient empirical production functions. Journal of Econometrics (1985)
- et al. Measuring the efficiency of decision making units. European Journal of Operational Research (1978)
- et al. A multiplicative model for efficiency analysis. Socio-Economic Planning Sciences (1982)
- et al. Mining changes in customer behavior in retail marketing. Expert Systems with Applications (2005)
- et al. Aggregation of orders in distribution centers using data mining. Expert Systems with Applications (2005)
- et al. An association-based clustering approach to order batching considering customer demand patterns. Omega-International Journal of Management Science (2005)
- A ranked voting system using a DEA/AR exclusion model: a note. European Journal of Operational Research (1997)
- et al. A method for discriminating efficient candidates with ranked voting data. European Journal of Operational Research (2003)
- et al. Measuring DEA efficiency in internet companies. Decision Support Systems (2005)
- Mining generalized association rules. Future Generation Computer Systems
- A personalized recommender system for the cosmetic business. Expert Systems with Applications
- Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD Conference on Management of Data
- Configuration of cellular manufacturing systems using association rule induction. International Journal of Production Research
- Data mining: an overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering