Abstract
In the discretization of a continuous variable, its numerical value range is divided into a few intervals that are then used in classification; Naïve Bayes, for example, can benefit from this preprocessing. A commonly used supervised discretization method is Fayyad and Irani’s recursive entropy-based splitting of a value range. The technique uses ent-mdl, a criterion based on the minimum description length principle, as a model selection criterion to decide whether to accept a proposed split.
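To make the recursion concrete, the following is a minimal Python sketch of the published Fayyad–Irani procedure: recursively choose the entropy-minimizing cut point and accept it only if the information gain passes the MDL threshold from their 1993 paper. The function names (entropy, mdl_accepts, discretize) are ours, and this illustrates the standard criterion rather than the implementation evaluated in this paper.

```python
# A minimal sketch of Fayyad and Irani's recursive entropy-based
# discretization with the ent-mdl stopping criterion. Names are
# illustrative; this is not the authors' own implementation.
import numpy as np

def entropy(y):
    """Class entropy (base 2) of the label array y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mdl_accepts(y, y1, y2):
    """Fayyad-Irani MDL criterion: accept the split iff the
    information gain exceeds (log2(N - 1) + Delta) / N."""
    n = len(y)
    gain = entropy(y) - (len(y1) / n) * entropy(y1) - (len(y2) / n) * entropy(y2)
    k, k1, k2 = (np.unique(v).size for v in (y, y1, y2))
    delta = (np.log2(3 ** k - 2)
             - (k * entropy(y) - k1 * entropy(y1) - k2 * entropy(y2)))
    return gain > (np.log2(n - 1) + delta) / n

def discretize(x, y, cuts=None):
    """Recursively split the value range of x, collecting accepted cut points."""
    if cuts is None:  # top-level call: sort once by attribute value
        order = np.argsort(x)
        x, y, cuts = x[order], y[order], []
    n = len(y)
    best_e, best_i = np.inf, None
    for i in range(1, n):  # candidate cuts lie between adjacent distinct values
        if x[i - 1] == x[i]:
            continue
        e = (i / n) * entropy(y[:i]) + ((n - i) / n) * entropy(y[i:])
        if e < best_e:
            best_e, best_i = e, i
    if best_i is not None and mdl_accepts(y, y[:best_i], y[best_i:]):
        cuts.append((x[best_i - 1] + x[best_i]) / 2)
        discretize(x[:best_i], y[:best_i], cuts)
        discretize(x[best_i:], y[best_i:], cuts)
    return sorted(cuts)
```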
We argue on theoretical grounds that the method is not always close to ideal for this application, and empirical experiments support this finding. We give a statistical rule that avoids the ad hoc rule Fayyad and Irani’s approach uses to boost its performance; this rule, however, is quite time-consuming to compute. We also demonstrate that a very simple Bayesian method performs better than ent-mdl as a model selection criterion.
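The abstract does not spell out the “very simple Bayesian method.” Purely as an illustrative assumption, one natural candidate for such a rule is a Bayes-factor comparison of the split and unsplit label sequences under a Dirichlet-multinomial model with a uniform prior; a sketch follows, and it could be dropped into the recursion above in place of mdl_accepts.

```python
# A hedged sketch of one simple Bayesian split criterion: compare the
# marginal likelihoods (evidence) of the labels with and without the
# split, under a multinomial model with a uniform Dirichlet(1,...,1)
# prior. Purely illustrative; not necessarily the rule from the paper.
import numpy as np
from scipy.special import gammaln

def log_evidence(y, k):
    """log marginal likelihood of label sequence y over k classes:
    Gamma(k) / Gamma(N + k) * prod_c Gamma(n_c + 1).
    Classes absent from y contribute Gamma(1) = 1, i.e. nothing."""
    _, counts = np.unique(y, return_counts=True)
    return gammaln(k) - gammaln(len(y) + k) + np.sum(gammaln(counts + 1.0))

def bayes_accepts(y, y1, y2, k):
    """Accept the split iff the two-interval model has higher evidence
    than the single-interval model (Bayes factor > 1)."""
    return log_evidence(y1, k) + log_evidence(y2, k) > log_evidence(y, k)
```

Here k is the total number of classes in the data; to use this inside the discretize sketch above, k would be fixed once from the full label set and threaded through the recursion.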
References
Dougherty, J., Kohavi, R., Sahami, M.: Supervised and unsupervised discretization of continuous features. In: Prieditis, A., Russell, S. (eds.) Proc. 12th International Conference on Machine Learning, pp. 194–202. Morgan Kaufmann, San Francisco, CA (1995)
Hsu, C.N., Huang, H.J., Wong, T.T.: Implications of the Dirichlet assumption for discretization of continuous variables in naive Bayesian classifiers. Machine Learning 53, 235–263 (2003)
Liu, H., Hussain, F., Tan, C.L., Dash, M.: Discretization: An enabling technique. Data Mining and Knowledge Discovery 6, 393–423 (2002)
Yang, Y., Webb, G.I.: A comparative study of discretization methods for naive-Bayes classifiers. In: Proc. Pacific Rim Knowledge Acquisition Workshop (PKAW), pp. 159–173 (2002)
Elomaa, T., Rousu, J.: Efficient multisplitting revisited: Optima-preserving elimination of partition candidates. Data Mining and Knowledge Discovery 8, 97–126 (2004)
Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued attributes for classification learning. In: Proc. 13th International Joint Conference on Artificial Intelligence, pp. 1022–1027. Morgan Kaufmann, San Francisco, CA (1993)
Wong, A., Chiu, D.: Synthesizing statistical knowledge from incomplete mixed-mode data. IEEE Transactions on Pattern Analysis and Machine Intelligence 9, 796–805 (1987)
Catlett, J.: On changing continuous attributes into ordered discrete attributes. In: Kodratoff, Y. (ed.) Machine Learning - EWSL-91. LNCS, vol. 482, pp. 164–178. Springer, Heidelberg (1991)
Hand, D.J., Yu, K.: Idiot’s Bayes: not so stupid after all? International Statistical Review 69, 385–398 (2001)
Rish, I.: An empirical study of the naive Bayes classifier. In: IJCAI-01 workshop on “Empirical Methods in AI” (2001)
Domingos, P., Pazzani, M.: On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29, 103–130 (1997)
Chlebus, B.S., Nguyen, S.H.: On finding optimal discretizations for two attributes. In: Polkowski, L., Skowron, A. (eds.) RSCTC 1998. LNCS (LNAI), vol. 1424, pp. 537–544. Springer, Heidelberg (1998)
Elomaa, T., Rousu, J.: On decision boundaries of naïve Bayes in continuous domains. In: Lavrač, N., Gamberger, D., Todorovski, L., Blockeel, H. (eds.) PKDD 2003. LNCS (LNAI), vol. 2838, pp. 144–155. Springer, Heidelberg (2003)
Călinescu, G., Dumitrescu, A., Karloff, H., Wan, P.J.: Separating points by axis-parallel lines. International Journal of Computational Geometry & Applications 15, 575–590 (2005)
Elomaa, T., Kujala, J., Rousu, J.: Approximation algorithms for minimizing empirical error by axis-parallel hyperplanes. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp. 547–555. Springer, Heidelberg (2005)
John, G., Langley, P.: Estimating continuous distributions in Bayesian classifiers. In: Proc. 11th Annual Conference on Uncertainty in Artificial Intelligence, pp. 338–345. Morgan Kaufmann, San Francisco (1995)
Fayyad, U.M., Irani, K.B.: On the handling of continuous-valued attributes in decision tree generation. Machine Learning 8, 87–102 (1992)
Kohavi, R., Sahami, M.: Error-based and entropy-based discretization of continuous features. In: Simoudis, E., Han, J.W., Fayyad, U. (eds.) Proc. 2nd International Conference on Knowledge Discovery and Data Mining, pp. 114–119. AAAI Press, Menlo Park, CA (1996)
Cover, T.M., Thomas, J.A.: Elements of Information Theory. John Wiley & Sons, New York (1991)
Kerber, R.: ChiMerge: Discretization of numeric attributes. In: Proc. 10th National Conference on Artificial Intelligence, pp. 123–128. MIT Press, Cambridge (1992)
Dietterich, T.G.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 10, 1895–1923 (1998)
Cite this paper
Kujala, J., Elomaa, T. (2007). Improved Algorithms for Univariate Discretization of Continuous Features. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenić, D., Skowron, A. (eds) Knowledge Discovery in Databases: PKDD 2007. Lecture Notes in Computer Science, vol. 4702. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74976-9_20