An Experimental and Theoretical Comparison of Model Selection Methods

Kearns, Michael; Mansour, Yishay; Ng, Andrew Y.; Ron, Dana

doi:10.1023/A:1007344726582

Published: April 1997

Machine Learning, Volume 27, pages 7–50 (1997)

Abstract

We investigate the problem of model selection in the setting of supervised learning of boolean functions from independent random examples. More precisely, we compare methods for finding a balance between the complexity of the hypothesis chosen and its observed error on a random training sample of limited size, when the goal is that of minimizing the resulting generalization error. We undertake a detailed comparison of three well-known model selection methods — a variation of Vapnik's Guaranteed Risk Minimization (GRM), an instance of Rissanen's Minimum Description Length Principle (MDL), and (hold-out) cross validation (CV). We introduce a general class of model selection methods (called penalty-based methods) that includes both GRM and MDL, and provide general methods for analyzing such rules. We provide both controlled experimental evidence and formal theorems to support the following conclusions:

• Even on simple model selection problems, the behavior of the methods examined can be both complex and incomparable. Furthermore, no amount of “tuning” of the rules investigated (such as introducing constant multipliers on the complexity penalty terms, or a distribution-specific “effective dimension”) can eliminate this incomparability.

• It is possible to give rather general bounds on the generalization error, as a function of sample size, for penalty-based methods. The quality of such bounds depends in a precise way on the extent to which the method considered automatically limits the complexity of the hypothesis selected.

• For any model selection problem, the additional error of cross validation compared to any other method can be bounded above by the sum of two terms. The first term is large only if the learning curve of the underlying function classes experiences a “phase transition” between (1-γ)m and m examples (where γ is the fraction saved for testing in CV). The second and competing term can be made arbitrarily small by increasing γ.

• The class of penalty-based methods is fundamentally handicapped in the sense that there exist two types of model selection problems for which every penalty-based method must incur large generalization error on at least one, while CV enjoys small generalization error on both.
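
The distinction between penalty-based methods and hold-out cross validation drawn above can be illustrated with a minimal Python sketch. It is not the paper's GRM or MDL rule: the function names, the example error values, and the square-root penalty are all assumptions chosen for illustration. A penalty-based rule picks the complexity d that minimizes training error plus a penalty depending only on d and the sample size m, while hold-out CV trains on (1-γ)m examples and picks the d with the smallest error on the remaining γm.

```python
import math
from typing import Callable, Sequence, Tuple


def select_by_penalty(
    train_errors: Sequence[float],
    m: int,
    penalty: Callable[[int, int], float],
) -> int:
    """Return the complexity d (1-based) minimizing train error + penalty(d, m).

    train_errors[d - 1] is the training error of the best hypothesis of
    complexity d found on all m examples.
    """
    scores = [err + penalty(d, m) for d, err in enumerate(train_errors, start=1)]
    return 1 + min(range(len(scores)), key=scores.__getitem__)


def select_by_holdout(holdout_errors: Sequence[float]) -> int:
    """Return the complexity d (1-based) with the smallest held-out error.

    holdout_errors[d - 1] is the error, measured on the held-out examples,
    of the hypothesis of complexity d trained on the remaining examples.
    """
    return 1 + min(range(len(holdout_errors)), key=holdout_errors.__getitem__)


def split_sizes(m: int, gamma: float) -> Tuple[int, int]:
    """Split m examples into (training, held-out) counts for a test fraction gamma."""
    n_test = int(round(gamma * m))
    return m - n_test, n_test


def sqrt_penalty(d: int, m: int) -> float:
    """Hypothetical placeholder penalty; NOT the GRM or MDL penalty from the paper."""
    return math.sqrt(d / m)


if __name__ == "__main__":
    # Made-up error values, for illustration only.
    train_errors = [0.30, 0.12, 0.10, 0.09, 0.08]
    holdout_errors = [0.31, 0.15, 0.14, 0.18, 0.22]
    print(split_sizes(100, 0.1))
    print(select_by_penalty(train_errors, 100, sqrt_penalty))
    print(select_by_holdout(holdout_errors))
```

Note that the penalty-based rule never looks at held-out data, so its choice is driven entirely by the training error and the fixed penalty, whereas the hold-out rule gives up a fraction γ of the sample in exchange for a direct error estimate.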

References

Barron, A. R., & Cover, T. M. (1991). Minimum complexity density estimation. IEEE Transactions on Information Theory, 37, 1034-1054.
Blum, A., & Rivest, R. L. (1989). Training a 3-node neural net is NP-Complete. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems I (pp. 494-501). Morgan Kaufmann, San Mateo, CA.
Cover, T., & Thomas, J. (1991). Elements of Information Theory. Wiley.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1994). Rigorous learning curve bounds from statistical mechanics. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory (pp. 76-87).
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58, 13-30.
Kearns, M. (1995). A bound on the error of cross validation, with consequences for the training-test split. In Advances in Neural Information Processing Systems 8. The MIT Press.
Kearns, M., Schapire, R., & Sellie, L. (1992). Toward efficient agnostic learning. In Proceedings of the 5th Annual Workshop on Computational Learning Theory (pp. 341-352).
Pitt, L., & Valiant, L. (1988). Computational limitations on learning from examples. Journal of the ACM, 35, 965-984.
Quinlan, J., & Rivest, R. (1989). Inferring decision trees using the minimum description length principle. Information and Computation, 80, 227-248.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465-471.
Rissanen, J. (1986). Stochastic complexity and modeling. Annals of Statistics, 14, 1080-1100.
Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry, volume 15 of Series in Computer Science. World Scientific.
Schaffer, C. (1994). A conservation law for generalization performance. In Proceedings of the Eleventh International Conference on Machine Learning (pp. 259-265).
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Physical Review A, 45, 6056-6091.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society B, 36, 111-147.
Stone, M. (1977). Asymptotics for and against cross-validation. Biometrika, 64, 29-35.
Vapnik, V. (1982). Estimation of Dependences Based on Empirical Data. Springer-Verlag.
Vapnik, V., & Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16, 264-280.
Wolpert, D. (1992). On the connection between in-sample testing and generalization error. Complex Systems, 6, 47-94.

Author information

Authors and Affiliations

AT&T Laboratories Research, Murray Hill, NJ
Michael Kearns
Department of Computer Science, Tel Aviv University, Tel Aviv, Israel
Yishay Mansour
Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA
Andrew Y. Ng
Laboratory of Computer Science, MIT, Cambridge, MA
Dana Ron

About this article

Cite this article

Kearns, M., Mansour, Y., Ng, A.Y. & Ron, D. An Experimental and Theoretical Comparison of Model Selection Methods. Machine Learning 27, 7–50 (1997). https://doi.org/10.1023/A:1007344726582

Issue Date: April 1997
DOI: https://doi.org/10.1023/A:1007344726582
