Abstract
The tuples in a generalized relation (i.e., a summary generated from a database) are unique and can therefore be considered a population whose structure can be described by a probability distribution. In this paper, we present and empirically compare sixteen heuristic measures that evaluate the structure of a summary and assign it a single real-valued index representing its interestingness relative to other summaries generated from the same database. The heuristics are based upon well-known measures of diversity, dispersion, dominance, and inequality used in several areas of the physical, social, ecological, management, information, and computer sciences. Their use for ranking summaries generated from databases is a new application area. All sixteen heuristics rank less complex summaries (i.e., those with few tuples and/or few non-ANY attributes) as most interesting. We demonstrate that, for sample data sets, the orders in which some of the measures rank summaries are highly correlated.
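As a rough illustration of treating a summary as a population with a probability distribution, the sketch below computes two well-known diversity measures, Shannon entropy and the Simpson index, for a few made-up summaries. The abstract does not name the sixteen heuristics, so the choice of these two measures is an assumption, as are the function names, the use of per-tuple counts as weights, and the sample values. It is not the authors' implementation.

```python
import math


def distribution(counts):
    """Turn per-tuple counts into probabilities that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def shannon_entropy(counts):
    """Shannon entropy in bits: -sum(p * log2(p)) over nonzero p."""
    return -sum(p * math.log2(p) for p in distribution(counts) if p > 0)


def simpson_index(counts):
    """Simpson concentration index: sum(p ** 2)."""
    return sum(p * p for p in distribution(counts))


# Hypothetical summaries (illustrative values only): each list holds the
# number of base-table rows assumed to be covered by each summary tuple.
summaries = {
    "A": [50, 50],
    "B": [25, 25, 25, 25],
    "C": [97, 1, 1, 1],
}

for name, counts in summaries.items():
    print(name, round(shannon_entropy(counts), 3), round(simpson_index(counts), 3))
```

Each function maps a summary to a single real value, so summaries from the same database can be sorted by it. Whether a higher or lower value should count as more interesting depends on the measure and is not decided by this sketch.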
Copyright information
© 1999 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hilderman, R.J., Hamilton, H.J. (1999). Heuristic Measures of Interestingness. In: Żytkow, J.M., Rauch, J. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 1999. Lecture Notes in Computer Science, vol 1704. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-48247-5_25
DOI: https://doi.org/10.1007/978-3-540-48247-5_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66490-1
Online ISBN: 978-3-540-48247-5
eBook Packages: Springer Book Archive