
MiSoSouP: Mining Interesting Subgroups with Sampling and Pseudodimension

Published: 21 June 2020

Abstract

We present MiSoSouP, a suite of algorithms for extracting high-quality approximations of the most interesting subgroups, according to different popular interestingness measures, from a random sample of a transactional dataset. We describe a new formulation of these measures as functions of averages, which makes it possible to approximate them using sampling. We then discuss how pseudodimension, a key concept from statistical learning theory, relates to the sample size needed to obtain a high-quality approximation of the most interesting subgroups. We prove an upper bound on the pseudodimension of the problem at hand, which depends on characteristic quantities of the dataset and of the language of patterns of interest. This upper bound then leads to small sample sizes. Our evaluation on real datasets shows that MiSoSouP outperforms state-of-the-art algorithms offering the same guarantees, and it vastly speeds up the discovery of subgroups compared to analyzing the whole dataset.
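To make the sampling idea concrete, the following is a minimal Python sketch of estimating the interestingness of a subgroup on a random sample whose size follows from a generic uniform-convergence bound. It is not the MiSoSouP algorithm: the sample-size formula, the constant c, the pseudodimension bound d, the choice of Weighted Relative Accuracy (WRAcc) as the quality measure, and the toy data are all illustrative assumptions.

import math
import random


def sample_size(d, eps, delta, c=0.5):
    # Generic uniform-convergence sample size: m >= (c / eps^2) * (d + ln(1/delta)).
    # The constant c and the pseudodimension bound d are placeholders; MiSoSouP
    # derives its bound from quantities of the dataset and of the pattern language.
    return math.ceil((c / eps ** 2) * (d + math.log(1.0 / delta)))


def wracc(transactions, subgroup):
    # Weighted Relative Accuracy of a subgroup (a frozenset of items) w.r.t. a
    # binary target; `transactions` is a list of (items, label) pairs, label in {0, 1}.
    n = len(transactions)
    covered = [label for items, label in transactions if subgroup <= items]
    if not covered:
        return 0.0
    p_cover = len(covered) / n                            # fraction of transactions covered
    p_pos_given_cover = sum(covered) / len(covered)       # positive rate inside the cover
    p_pos = sum(label for _, label in transactions) / n   # overall positive rate
    return p_cover * (p_pos_given_cover - p_pos)


if __name__ == "__main__":
    random.seed(42)
    # Toy transactional dataset over items {a, b, c} with a binary target
    # correlated with item "a" (purely synthetic).
    dataset = []
    for _ in range(10_000):
        items = frozenset(i for i in "abc" if random.random() < 0.4)
        label = int(random.random() < (0.8 if "a" in items else 0.3))
        dataset.append((items, label))

    # eps, delta, and the pseudodimension bound d = 5 are illustrative values.
    m = sample_size(d=5, eps=0.05, delta=0.1)
    sample = random.sample(dataset, min(m, len(dataset)))

    subgroup = frozenset({"a"})
    print("sample size:", len(sample))
    print("WRAcc on sample :", round(wracc(sample, subgroup), 4))
    print("WRAcc on dataset:", round(wracc(dataset, subgroup), 4))

On the sample, the WRAcc estimate comes out close to its value on the full dataset; a uniform guarantee of this kind, holding simultaneously for all subgroups in the pattern language, is what the pseudodimension bound in the article is used to certify.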



            Published in

            ACM Transactions on Knowledge Discovery from Data, Volume 14, Issue 5
            Special Issue on KDD 2018, Regular Papers and Survey Paper
            October 2020, 376 pages
            ISSN: 1556-4681
            EISSN: 1556-472X
            DOI: 10.1145/3407672

            Copyright © 2020 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 21 June 2020
            • Online AM: 7 May 2020
            • Accepted: 1 February 2020
            • Revised: 1 October 2019
            • Received: 1 January 2019


            Qualifiers

            • research-article
            • Research
            • Refereed
