Top-k high utility pattern mining with effective threshold raising strategies
Introduction
Data mining finds useful information hidden in large databases. Pattern mining, one of the data mining techniques, discovers meaningful information in the form of patterns composed of items. In pattern mining, users generally set a minimum threshold to extract important patterns from a database. As a result, they obtain the set of patterns whose values are not lower than the threshold, so the size of the mining result depends on the user-specified threshold. However, it is not easy for users to determine an appropriate minimum threshold in real-world applications. If the threshold is set too high, no useful pattern may be found. If it is set too low, an enormous number of patterns may be extracted, which degrades mining performance because of the large search space. In other words, users can neither predict the exact number of patterns mined with a given threshold nor control the mining result precisely.
To address this issue, top-k mining has been proposed [3], [8], [23], [32]. Instead of a minimum threshold, it allows users to set the desired number of patterns, k, and discovers the patterns ranked from the highest value down to the kth highest value. In top-k frequent pattern mining [3], [5], [11], [26], a set of top-k frequent patterns is extracted from binary databases, in which each item is simply present or absent in a transaction and all items are treated with the same importance. In this framework, the anti-monotone property (also known as the downward closure property) [1] is used to reduce the search space, which contributes significantly to the efficiency of the mining process. The property states that if a pattern is infrequent, all of its super patterns, which contain every item of the pattern plus at least one other item, are also infrequent. Meanwhile, items in real-world applications such as retail market data analysis have their own importance, such as profit. Moreover, multiple copies of an item can be sold within a transaction. Top-k frequent pattern mining cannot take these characteristics of real-world databases into account. Although this problem can be solved with the concept of utility mining [12], [14], [17], [33], utility does not satisfy the anti-monotone property. Therefore, techniques of top-k frequent pattern mining that rely on anti-monotonicity are hard to apply directly to top-k utility pattern mining. To overcome this difficulty, overestimation methods [19], [29], [30] can be employed. A relevant algorithm [35] has been suggested in recent years for top-k high utility pattern mining with one of these methods [29], [30]. However, it generates a huge number of candidate patterns, which leads to performance degradation.
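To make the contrast concrete, the following minimal sketch computes itemset utilities on a tiny quantitative database and compares them with an overestimated value. The database, the profit table, and all function names are hypothetical and are not taken from the paper. The overestimate used here is the transaction-weighted utility, a common measure that is assumed to be representative of the overestimation methods cited above.

```python
# Hypothetical toy data: each transaction maps item -> purchased quantity.
TRANSACTIONS = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"a": 1, "b": 1, "c": 3},
]
# Hypothetical unit profit of each item.
PROFIT = {"a": 1, "b": 5, "c": 2}


def contains(transaction, itemset):
    """True if the transaction includes every item of the itemset."""
    return all(item in transaction for item in itemset)


def transaction_utility(transaction):
    """Utility of a whole transaction: sum of quantity * unit profit."""
    return sum(qty * PROFIT[item] for item, qty in transaction.items())


def utility(itemset):
    """Exact utility: utility of the itemset's items, summed over supporting transactions."""
    return sum(
        sum(t[item] * PROFIT[item] for item in itemset)
        for t in TRANSACTIONS
        if contains(t, itemset)
    )


def twu(itemset):
    """Transaction-weighted utility: sum of the utilities of supporting transactions."""
    return sum(transaction_utility(t) for t in TRANSACTIONS if contains(t, itemset))
```

On this toy data, `utility(("a",))` is 4 while `utility(("a", "b"))` is 17, so a pattern with low utility can have a super pattern with high utility, and pruning by utility alone would be unsafe. In contrast, `twu(("a",))` is 27 and `twu(("a", "b"))` is 23: the overestimate never increases when items are added and never falls below the exact utility, which is what makes it usable for pruning.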
In the framework of high utility pattern mining, reducing the number of extracted candidate patterns is a significant issue, since the more candidates an algorithm produces, the longer its execution time becomes [30], [41], [2]. Consequently, the main challenge of top-k high utility pattern mining is how to raise the minimum utility threshold effectively under the overestimation model, so that fewer candidates are generated and performance improves.
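The general idea of raising a threshold during top-k mining can be sketched as follows. The sketch keeps the k largest exact utilities seen so far in a min-heap and uses the smallest of them as the current threshold. The class `TopKThreshold` and its methods are hypothetical helpers for illustration only; they are not the strategies of the proposed algorithm, and the sketch assumes k is at least 1 and utilities are non-negative.

```python
import heapq


class TopKThreshold:
    """Tracks the k highest exact utilities seen so far (hypothetical helper)."""

    def __init__(self, k):
        self.k = k          # assumed to be >= 1
        self._heap = []     # min-heap holding at most k utilities

    @property
    def threshold(self):
        # Until k utilities are known, no candidate can be pruned safely.
        return self._heap[0] if len(self._heap) == self.k else 0

    def offer(self, utility):
        """Record the exact utility of a distinct pattern and return the new threshold."""
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, utility)
        elif utility > self._heap[0]:
            heapq.heapreplace(self._heap, utility)
        return self.threshold

    def can_prune(self, overestimated_utility):
        """A candidate whose upper bound is below the threshold cannot be in the top-k."""
        return overestimated_utility < self.threshold
```

For example, with `k = 2`, offering the utilities 4, 17, and 9 in turn yields thresholds 0, 4, and 9. A candidate whose overestimated utility is 8 can then be discarded, whereas one with an overestimated utility of 23 must still be checked. The earlier the threshold rises, the more candidates are discarded.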
Top-k high utility pattern mining can play a significant role in real-world applications with non-binary databases, such as web click stream analysis, mobile commerce environment planning [27], cross-marketing in retail stores, and biological gene database analysis. In retail market data analysis in particular, an enormous amount of sales data generated by all branches every day must be analyzed in order to establish sales strategies related to company benefits, such as inventory preparation, product arrangement, and promotion. Furthermore, the huge sales databases of the branches must be analyzed rapidly during non-opening hours so that the branches can run smoothly according to plan. To satisfy this requirement, mining performance in terms of execution time is important.
Motivated by the above, in this paper we propose an algorithm called Raising threshold with Exact and Pre-calculated utilities for Top-k high utility pattern mining (REPT). Its strategies effectively increase the minimum utility threshold during top-k high utility pattern mining and thereby reduce the search space, so that top-k high utility patterns can be mined efficiently from non-binary databases with item importance. The major contributions of this paper are summarized as follows:
1. We develop three strategies to raise a minimum utility threshold effectively in the construction of a global tree structure. They utilize exact and pre-evaluated utilities of itemsets with the length of 1 or 2.
2. We suggest a strategy that reduces the search space in the identification process of actual top-k high utility patterns from candidates by sorting them and increasing the threshold with the exact and pre-calculated utilities.
3. We also propose an efficient algorithm for top-k high utility pattern mining with highly decreased candidates based on the four strategies as well as two other techniques [35].
4. Comprehensive experiments on both real and synthetic datasets are conducted to evaluate performance of the proposed algorithm compared to state-of-the-art ones. Experimental results show that our algorithm outperforms the state-of-the-art methods.
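The idea of raising the threshold from exact utilities of itemsets of length 1 or 2 can be illustrated with the sketch below. It is only an illustration under stated assumptions: the function `initial_threshold`, the toy database, and the profit table are hypothetical, utilities are assumed to be non-negative, and the code does not reproduce the strategies of the proposed algorithm. It relies on a simple observation: if k distinct itemsets are known to have an exact utility of at least u, then the kth highest utility over all itemsets is also at least u.

```python
import heapq
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each transaction maps item -> purchased quantity.
TRANSACTIONS = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"a": 1, "b": 1, "c": 3},
]
# Hypothetical unit profit of each item.
PROFIT = {"a": 1, "b": 5, "c": 2}


def initial_threshold(transactions, profit, k):
    """Lower bound on the k-th highest utility, from exact 1- and 2-itemset utilities."""
    exact = Counter()  # itemset (sorted tuple) -> exact utility accumulated so far
    for t in transactions:
        item_utils = {item: qty * profit[item] for item, qty in t.items()}
        # Every single item of the transaction contributes to its 1-itemset.
        for item, u in item_utils.items():
            exact[(item,)] += u
        # Every pair of items in the transaction contributes to its 2-itemset.
        for x, y in combinations(sorted(item_utils), 2):
            exact[(x, y)] += item_utils[x] + item_utils[y]
    if len(exact) < k:
        return 0  # fewer than k known itemsets: no safe threshold yet
    return heapq.nlargest(k, exact.values())[-1]
```

On the toy data, the exact utilities of the six itemsets of length 1 or 2 are 4, 15, 8, 17, 11, and 11, so `initial_threshold(TRANSACTIONS, PROFIT, 3)` returns 11. Any candidate whose overestimated utility is below 11 can be discarded before longer itemsets are examined, because at least three itemsets already reach that utility.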
The remainder of this paper is organized as follows. In Section 2, we introduce related work. In Section 3, we illustrate the proposed algorithm with the strategies in detail. In Section 4, we show and analyze experimental results for performance evaluation. Finally, conclusions are given in Section 5.
Section snippets
Frequent pattern mining
Frequent pattern mining [13], [31], [39], [4] is one of the fundamental research topics in data mining; it mines patterns whose supports are not smaller than a given minimum support threshold. Numerous relevant studies have been conducted [7], [22], [25], [40], including two well-known representative algorithms: Apriori [1] and FP-Growth [9]. Apriori applies a level-wise candidate generation-and-test approach, and thereby it has problems of scanning databases multiple times and extracting a large number
Mining top-k high utility patterns with effective threshold raising techniques
In this section, we propose an efficient algorithm, called REPT, for mining top-k high utility patterns. Moreover, we develop four strategies, PUD, RIU, RSD, and SEP, which can reduce the search space by increasing a minimum threshold effectively in the mining process. The framework of the proposed method consists of three steps. In the first step, our algorithm scans the original database twice to construct a global tree. In the second step, the proposed method generates top-k high utility
Experimental environment and datasets
In this section, we conduct experiments for performance evaluation of the proposed algorithm, REPT, compared to the previous tree-based algorithms: TKU [35], UP-Growth [29], [30], and UP-Growth+ [29]. Note that, in the experiments, we use optimal parameters for the high utility pattern mining algorithms, UP-Growth and UP-Growth+, in order to mine the same number of patterns as the case of REPT and TKU. The reason for this is that it is necessary for miners to set a minimum utility threshold in
Discussion
Mining all high utility patterns is a general approach in utility mining. In this framework, discovered patterns have larger utility values than a minimum utility threshold. However, the number of high utility patterns mined from a huge database can be enormous depending on the threshold, and thereby it is hard for users to find and understand useful knowledge hidden in the database. Although the users can raise the threshold to reduce the number of generated high utility patterns, it is not an
Conclusions
In this paper, we proposed an efficient algorithm for mining top-k high utility patterns. We also suggested three strategies that can raise the minimum utility threshold effectively with exact and pre-evaluated utilities of itemsets of length 1 or 2 in Phase I, which made a considerable contribution to reducing the search space. Furthermore, we proposed a strategy to raise the threshold more efficiently with the exact and pre-calculated utilities for identifying a set of actual top-k high
Acknowledgements
This research was supported by the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF No. 2013005682), the MSIP (Ministry of Science, ICT & Future Planning), Korea, under ICT/SW Creative research program supervised by the NIPA (National ICT Industry Promotion Agency) (NIPA-2014-H0502-14-3008), and the Business for Cooperative R&D between Industry, Academy, and Research Institute funded Korea Small and Medium Business Administration in
References (41)
- et al. Effective utility mining with the measure of average utility. Expert Syst. Appl. (2011)
- et al. Sliding window based weighted maximal frequent pattern mining over data streams. Expert Syst. Appl. (2014)
- et al. An effective tree structure for mining high utility itemsets. Expert Syst. Appl. (2011)
- et al. An incremental mining algorithm for high utility itemsets. Expert Syst. Appl. (2012)
- et al. High utility pattern mining using the maximal itemset property and lexicographic tree structures. Inf. Sci. (2012)
- et al. Isolated items discarding strategy for discovering high utility itemsets. Data Knowl. Eng. (2008)
- et al. Efficient prime-based method for interactive mining of frequent patterns. Expert Syst. Appl. (2011)
- et al. Efficient frequent pattern mining based on linear prefix tree. Knowl.-Based Syst. (2014)
- et al. Efficient algorithms for mining maximal high utility itemsets from data streams with different models. Expert Syst. Appl. (2012)
- An adaptive approach to mining frequent itemsets efficiently. Expert Syst. Appl. (2012)
- A fast perturbation algorithm using tree structure for privacy preserving utility mining. Expert Syst. Appl.
- Mining maximal frequent patterns by considering weight conditions over data streams. Knowl.-Based Syst.
- High utility itemset mining with techniques for reducing overestimated utilities and pruning candidates. Expert Syst. Appl.
- Efficient tree structures for high utility pattern mining in incremental databases. IEEE Trans. Knowl. Data Eng.
- Mining frequent itemsets without support threshold: with and without item constraints. IEEE Trans. Knowl. Data Eng.
- Mining top-k frequent patterns in the presence of the memory constraint. VLDB J.
- Assessing data mining results via swap randomization. ACM Trans. Knowl. Discov. Data
- Fast algorithms for frequent itemset mining using FP-trees. IEEE Trans. Knowl. Data Eng.
Cited by (115)
- Targeted mining of top-k high utility itemsets. Engineering Applications of Artificial Intelligence (2023)
- An efficient biobjective evolutionary algorithm for mining frequent and high utility itemsets. Applied Soft Computing (2023)
- Efficient mining of top-k high utility itemsets through genetic algorithms. Information Sciences (2023)
- HLHUI: An improved version of local high utility itemset mining. Procedia Computer Science (2023)
- TKN: An efficient approach for discovering top-k high utility itemsets with positive or negative profits. Information Sciences (2022)