Abstract
We present MiSoSouP, a suite of algorithms for extracting high-quality approximations of the most interesting subgroups, according to different popular interestingness measures, from a random sample of a transactional dataset. We describe a new formulation of these measures as functions of averages, which makes it possible to approximate them using sampling. We then discuss how pseudodimension, a key concept from statistical learning theory, relates to the sample size needed to obtain a high-quality approximation of the most interesting subgroups. We prove an upper bound on the pseudodimension of the problem at hand, which depends on characteristic quantities of the dataset and of the language of patterns of interest. This upper bound then leads to small sample sizes. Our evaluation on real datasets shows that MiSoSouP outperforms state-of-the-art algorithms offering the same guarantees, and that it vastly speeds up the discovery of subgroups compared to analyzing the whole dataset.
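To illustrate how a pseudodimension bound translates into a sample size, the following sketch computes the standard sufficient sample size for a range space of pseudodimension at most d: m ≥ (c/ε²)(d + ln(1/δ)) guarantees, with probability at least 1 − δ, that all averages are approximated within ε. The constant c is universal; the value 0.5 used here is an assumption for illustration, and the function name is hypothetical, not the paper's API.

```python
import math

def sample_size(d, eps, delta, c=0.5):
    """Sufficient sample size for an eps-approximation of all averages
    over a range space of pseudodimension at most d, with probability
    at least 1 - delta. The universal constant c is assumed to be 0.5
    here for illustration."""
    return math.ceil((c / eps**2) * (d + math.log(1.0 / delta)))

# Example: pseudodimension bound 5, accuracy 0.05, confidence 0.9.
m = sample_size(5, 0.05, 0.1)
```

Note how the sample size is independent of the dataset size: halving ε quadruples the required sample, while tightening δ only adds a logarithmic term, which is why a small upper bound on the pseudodimension directly yields small samples.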
MiSoSouP: Mining Interesting Subgroups with Sampling and Pseudodimension