Identifying Approximate Itemsets of Interest in Large Databases

Zhang, Chengqi; Zhang, Shichao; Webb, Geoffrey I.

doi:10.1023/A:1020995206763

Published: January 2003

Volume 18, pages 91–104 (2003)

Applied Intelligence

Chengqi Zhang¹, Shichao Zhang¹,² & Geoffrey I. Webb³

Abstract

This paper presents a method for discovering approximate frequent itemsets of interest in large-scale databases. The method uses the central limit theorem to increase efficiency, enabling us to reduce the sample size by about half compared to previous approximations. Further efficiency is gained by pruning uninteresting frequent itemsets from the search space. In addition to improving efficiency, this pruning also reduces the number of itemsets that the user needs to consider. The model and algorithm have been implemented and evaluated using both synthetic and real-world databases. Our experimental results demonstrate the efficiency of the approach.
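The abstract does not give the sample-size formulas, so the following is only an illustrative sketch of why a central-limit-theorem argument can need roughly half as many samples as a Chernoff-style bound. It assumes the goal is to estimate an itemset's support to within an absolute error eps with failure probability at most delta, compares the standard two-sided Hoeffding/Chernoff bound with the usual worst-case normal approximation, and uses hypothetical function names. It is not the authors' derivation.

```python
import math
from statistics import NormalDist


def hoeffding_sample_size(eps: float, delta: float) -> int:
    """Samples needed by the two-sided Hoeffding/Chernoff bound:
    n >= ln(2/delta) / (2 * eps^2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def clt_sample_size(eps: float, delta: float) -> int:
    """Samples needed under the normal approximation, using the
    worst-case variance p(1-p) <= 1/4: n >= z^2 / (4 * eps^2),
    where z is the (1 - delta/2) quantile of the standard normal."""
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)
    return math.ceil(z ** 2 / (4.0 * eps ** 2))


if __name__ == "__main__":
    eps, delta = 0.01, 0.05  # assumed tolerance and failure probability
    n_bound = hoeffding_sample_size(eps, delta)
    n_clt = clt_sample_size(eps, delta)
    print(n_bound, n_clt, round(n_clt / n_bound, 2))
```

With these assumed settings the normal approximation asks for a little over half the sample required by the bound, which is in line with the "about half" reduction the abstract reports. The normal approximation is asymptotic rather than a strict guarantee, which is the trade-off behind the smaller sample.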

Author information

Authors and Affiliations

Faculty of Information Technology, University of Technology, Sydney, PO Box 123, Broadway, NSW, 2007, Australia
Chengqi Zhang & Shichao Zhang
School of Computing, Guangxi University, People's Republic of China
Shichao Zhang
School of Computing and Mathematics, Deakin University, Geelong, Vic, 3217, Australia
Geoffrey I. Webb

About this article

Cite this article

Zhang, C., Zhang, S. & Webb, G.I. Identifying Approximate Itemsets of Interest in Large Databases. Applied Intelligence 18, 91–104 (2003). https://doi.org/10.1023/A:1020995206763

Issue Date: January 2003
DOI: https://doi.org/10.1023/A:1020995206763
