Abstract
The tasks of extracting (top-K) Frequent Itemsets (FI’s) and Association Rules (AR’s) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need to scan the entire dataset, possibly multiple times. High-quality approximations of FI’s and AR’s are sufficient for most practical uses, and a number of recent works have explored the application of sampling for fast discovery of approximate solutions to these problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. In this work we circumvent this issue by applying the statistical concept of Vapnik-Chervonenkis (VC) dimension to develop a novel technique for providing tight bounds on the sample size that guarantees approximation within user-specified parameters. Our technique applies both to absolute and to relative approximations of (top-K) FI’s and AR’s. The resulting sample size is linearly dependent on the VC-dimension of a range space associated with the dataset to be mined. The main theoretical contribution of this work is a characterization of the VC-dimension of this range space and a proof that it is upper bounded by an easy-to-compute characteristic quantity of the dataset which we call the d-index, namely the maximum integer d such that the dataset contains at least d transactions of length at least d. We show that this bound is strict for a large class of datasets. The resulting sample size for an absolute (resp. relative) (ε, δ)-approximation of the collection of FI’s is \(O(\frac{1}{\varepsilon^2}(d+\log\frac{1}{\delta}))\) (resp. \(O(\frac{2+\varepsilon}{\varepsilon^2(2-\varepsilon)\theta}(d\log\frac{2+\varepsilon}{(2-\varepsilon)\theta}+\log\frac{1}{\delta}))\)) transactions, which is a significant improvement over previously known results. We present an extensive experimental evaluation of our technique on real and artificial datasets, demonstrating the practicality of our methods and showing that they achieve even higher-quality approximations than what is guaranteed by the analysis.
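The d-index defined in the abstract can be computed in a single pass over the sorted transaction lengths, much like the h-index of a citation record. The sketch below is an illustration, not code from the paper: the helper names are hypothetical, and the `sample_size` function plugs the d-index into a bound of the form \((c/\varepsilon^2)(d+\log\frac{1}{\delta})\) with an illustrative constant c, since the abstract's O-notation leaves the constant unspecified.

```python
from math import ceil, log


def d_index(transactions):
    """Maximum integer d such that the dataset contains at least
    d transactions of length at least d (the abstract's d-index)."""
    lengths = sorted((len(t) for t in transactions), reverse=True)
    d = 0
    for i, length in enumerate(lengths, start=1):
        if length >= i:
            d = i
        else:
            break
    return d


def sample_size(d, eps, delta, c=0.5):
    """Sample size of the form (c / eps^2) * (d + log(1/delta)).
    The constant c = 0.5 is illustrative only: the abstract states
    just the O((1/eps^2)(d + log(1/delta))) dependence."""
    return ceil((c / eps ** 2) * (d + log(1.0 / delta)))


# Toy dataset of four transactions (sets of items).
dataset = [{"a", "b", "c"}, {"a", "b"}, {"b", "c", "d"}, {"d"}]
d = d_index(dataset)  # sorted lengths: 3, 3, 2, 1 -> d = 2
print(d)              # 2
print(sample_size(d, eps=0.05, delta=0.1))
```

Because the d-index depends only on transaction lengths, it is far cheaper to obtain than the exact VC-dimension of the associated range space, which is what makes it usable as a practical upper bound.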
Work was supported in part by NSF award IIS-0905553.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Riondato, M., Upfal, E. (2012). Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. In: Flach, P.A., De Bie, T., Cristianini, N. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2012. Lecture Notes in Computer Science, vol. 7523. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33460-3_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33459-7
Online ISBN: 978-3-642-33460-3