Abstract
We present MiSoSouP, a suite of algorithms for extracting high-quality approximations of the most interesting subgroups, according to different popular interestingness measures, from a random sample of a transactional dataset. We describe a new formulation of these measures as functions of averages, which makes it possible to approximate them using sampling. We then discuss how pseudodimension, a key concept from statistical learning theory, relates to the sample size needed to obtain a high-quality approximation of the most interesting subgroups. We prove an upper bound on the pseudodimension of the problem at hand, which depends on characteristic quantities of the dataset and of the language of patterns of interest. This upper bound then leads to small sample sizes. Our evaluation on real datasets shows that MiSoSouP outperforms state-of-the-art algorithms offering the same guarantees, and that it vastly speeds up the discovery of subgroups compared to analyzing the whole dataset.
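To illustrate how a pseudodimension bound translates into a sample size, the following sketch computes the standard sufficient sample size for a range space of pseudodimension at most d: m ≥ (c/ε²)(d + ln(1/δ)) guarantees, with probability at least 1 − δ, that all averages are approximated within ε. The constant c is universal; the value 0.5 used here is an assumption for illustration, and the function name is hypothetical, not the paper's API.

```python
import math

def sample_size(d, eps, delta, c=0.5):
    """Sufficient sample size for an eps-approximation of all averages
    over a range space of pseudodimension at most d, with probability
    at least 1 - delta. The universal constant c is assumed to be 0.5
    here for illustration."""
    return math.ceil((c / eps**2) * (d + math.log(1.0 / delta)))

# Example: pseudodimension bound 5, accuracy 0.05, confidence 0.9.
m = sample_size(5, 0.05, 0.1)
```

Note how the sample size is independent of the dataset size: halving ε quadruples the required sample, while tightening δ only adds a logarithmic term, which is why a small upper bound on the pseudodimension directly yields small samples.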
MiSoSouP: Mining Interesting Subgroups with Sampling and Pseudodimension