doi:10.1016/j.datak.2005.10.004
Copyright © 2005 Elsevier B.V. All rights reserved.
Mining itemset utilities from transaction databases
Department of Computer Science, University of Regina, 3737 Wascana Parkway, Regina, SK, Canada S4S 0A2
Received 13 October 2005; accepted 13 October 2005.
Available online 18 November 2005.
Abstract
The rationale behind mining frequent itemsets is that only itemsets with high frequency are of interest to users. However, the practical usefulness of frequent itemsets is limited by the significance of the discovered itemsets. A frequent itemset reflects only the statistical correlation between items; it does not reflect the semantic significance of the items. In this paper, we propose a utility-based itemset mining approach to overcome this limitation. The proposed approach permits users to quantify their preferences concerning the usefulness of itemsets using utility values. The usefulness of an itemset is characterized as a utility constraint. That is, an itemset is interesting to the user only if it satisfies a given utility constraint. We show that the pruning strategies used in previous itemset mining approaches cannot be applied to utility constraints. In response, we identify several mathematical properties of utility constraints. Then, two novel pruning strategies are designed. Two algorithms for utility-based itemset mining are developed by incorporating these pruning strategies. The algorithms are evaluated by applying them to synthetic and real-world databases. Experimental results show that the proposed algorithms are effective on the databases tested.
Keywords: Utility mining; Data mining; Semantic significance; User preference; Itemset
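As an illustration of the idea summarized in the abstract, the following minimal sketch computes itemset utilities on a tiny made-up database. It assumes the common definition in which the utility of an itemset is the sum, over all transactions containing every item of the itemset, of quantity times unit profit; the abstract itself does not state this formula. The data, the names `transactions`, `unit_profit`, `support`, `utility`, and `high_utility_itemsets`, and the brute-force enumeration are all hypothetical and are not the UMining or UMining_H algorithms.

```python
from itertools import combinations

# Hypothetical data (not from the paper): item -> quantity per transaction.
transactions = [
    {"A": 1, "B": 1},
    {"A": 1},
    {"A": 1, "B": 4},
    {"B": 1, "C": 2},
]
# Hypothetical unit profit for each item.
unit_profit = {"A": 1, "B": 5, "C": 3}


def support(itemset, db):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in db if all(i in t for i in itemset))


def utility(itemset, db, profit):
    """Assumed definition: total quantity * unit profit of the itemset's
    items, summed over the transactions that contain the whole itemset."""
    return sum(
        sum(t[i] * profit[i] for i in itemset)
        for t in db
        if all(i in t for i in itemset)
    )


def high_utility_itemsets(db, profit, min_utility):
    """Brute-force enumeration with no pruning; for illustration only."""
    items = sorted(profit)
    result = {}
    for k in range(1, len(items) + 1):
        for itemset in combinations(items, k):
            u = utility(itemset, db, profit)
            if u >= min_utility:
                result[itemset] = u
    return result


for s in [("A",), ("A", "B")]:
    print(s, "support:", support(s, transactions),
          "utility:", utility(s, transactions, unit_profit))
print(high_utility_itemsets(transactions, unit_profit, 10))
```

In this toy data, the itemset {A} has utility 3 while its superset {A, B} has utility 27, even though support drops from 3 to 2. With a minimum utility of 10, {A} fails the constraint but {A, B} satisfies it, so discarding every superset of a failing itemset, as support-based pruning does, would lose a valid result. This is consistent with the abstract's claim that earlier pruning strategies do not carry over to utility constraints.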
Fig. 1. Itemset semi-lattice for the UMining algorithm. (a) Complete search space and (b) portion used by UMining.
Fig. 2. Itemset semi-lattice for the UMining_H algorithm. (a) Complete search space and (b) portion used by UMining_H.
Fig. 3. Graph of itemsets ranked by their utility values.
Table 1. A transaction database
Table 2. The unit profit for the items
Table 3. The support and profit for all itemsets
Table 4. A transaction database causing erroneous pruning
Table 5. The UMining algorithm
Table 6. The Scan function for the UMining algorithm
Table 7. The CalculateAndStore function for the UMining algorithm
Table 8. The Discover function for the UMining algorithm
Table 9. The Generate function for the UMining algorithm
Table 10. The Prune function for the UMining algorithm
Table 11. The CalculateAndStore_H function for the UMining_H algorithm
Table 12. The Prune_H function for the UMining_H algorithm
Table 13. Experimental results on synthetic dataset for UMining and UMining_H
Table 14. The accuracy and time saved for UMining_H on the synthetic dataset
Table 15. Experimental results on the customer database for UMining and UMining_H
Table 16. The accuracy and time saved for UMining_H on the customer database
Table 17. The effect of the size of the commercial database on the running time
Table 18. Comparison of UMining and SIP on the IBM synthetic dataset
Table 19. Comparison of UMining and SIP on the commercial database
Table 20. Itemsets ranked by their utility values and support values: (a) 16 itemsets, (b) ranked by utility and (c) ranked by support