Abstract
The tasks of extracting (top-K) Frequent Itemsets (FI’s) and Association Rules (AR’s) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need to scan the entire dataset, possibly multiple times. High-quality approximations of FI’s and AR’s are sufficient for most practical uses, and a number of recent works have explored the application of sampling for fast discovery of approximate solutions to these problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. In this work we circumvent this issue by applying the statistical concept of Vapnik-Chervonenkis (VC) dimension to develop a novel technique for providing tight bounds on the sample size that guarantees approximation within user-specified parameters. Our technique applies both to absolute and to relative approximations of (top-K) FI’s and AR’s. The resulting sample size is linearly dependent on the VC-dimension of a range space associated with the dataset to be mined. The main theoretical contribution of this work is a characterization of the VC-dimension of this range space and a proof that it is upper bounded by an easy-to-compute characteristic quantity of the dataset which we call the d-index, namely the maximum integer d such that the dataset contains at least d transactions of length at least d. We show that this bound is strict for a large class of datasets. The resulting sample size for an absolute (resp. relative) (ε, δ)-approximation of the collection of FI’s is \(O(\frac{1}{\varepsilon^2}(d+\log\frac{1}{\delta}))\) (resp. \(O(\frac{2+\varepsilon}{\varepsilon^2(2-\varepsilon)\theta}(d\log\frac{2+\varepsilon}{(2-\varepsilon)\theta}+\log\frac{1}{\delta}))\)) transactions, which is a significant improvement over previously known results. We present an extensive experimental evaluation of our technique on real and artificial datasets, demonstrating the practicality of our methods and showing that they achieve even higher-quality approximations than what is guaranteed by the analysis.
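The d-index defined in the abstract can be computed in a single pass over the sorted transaction lengths, much like the h-index of a citation record. The sketch below is an illustration, not code from the paper: the helper names are hypothetical, and the `sample_size` function plugs the d-index into a bound of the form \((c/\varepsilon^2)(d+\log\frac{1}{\delta})\) with an illustrative constant c, since the abstract's O-notation leaves the constant unspecified.

```python
from math import ceil, log


def d_index(transactions):
    """Maximum integer d such that the dataset contains at least
    d transactions of length at least d (the abstract's d-index)."""
    lengths = sorted((len(t) for t in transactions), reverse=True)
    d = 0
    for i, length in enumerate(lengths, start=1):
        if length >= i:
            d = i
        else:
            break
    return d


def sample_size(d, eps, delta, c=0.5):
    """Sample size of the form (c / eps^2) * (d + log(1/delta)).
    The constant c = 0.5 is illustrative only: the abstract states
    just the O((1/eps^2)(d + log(1/delta))) dependence."""
    return ceil((c / eps ** 2) * (d + log(1.0 / delta)))


# Toy dataset of four transactions (sets of items).
dataset = [{"a", "b", "c"}, {"a", "b"}, {"b", "c", "d"}, {"d"}]
d = d_index(dataset)  # sorted lengths: 3, 3, 2, 1 -> d = 2
print(d)              # 2
print(sample_size(d, eps=0.05, delta=0.1))
```

Because the d-index depends only on transaction lengths, it is far cheaper to obtain than the exact VC-dimension of the associated range space, which is what makes it usable as a practical upper bound.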
Work was supported in part by NSF award IIS-0905553.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Riondato, M., Upfal, E. (2012). Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. In: Flach, P.A., De Bie, T., Cristianini, N. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2012. Lecture Notes in Computer Science, vol. 7523. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33460-3_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33459-7
Online ISBN: 978-3-642-33460-3