Abstract
Recently, ensemble methods such as ADABOOST have been applied successfully to many problems, while seemingly defying the problem of overfitting.
ADABOOST rarely overfits in the low-noise regime; however, we show that it clearly does so for higher noise levels. Central to the understanding of this fact is the margin distribution. ADABOOST can be viewed as a constrained gradient descent on an error function with respect to the margin. We find that ADABOOST asymptotically achieves a hard margin distribution, i.e., the algorithm concentrates its resources on a few hard-to-learn patterns that are, interestingly, very similar to Support Vectors. A hard margin is clearly a sub-optimal strategy in the noisy case, and regularization, in our case a “mistrust” in the data, must be introduced into the algorithm to alleviate the distortions that single difficult patterns (e.g. outliers) can cause to the margin distribution. We propose several regularization methods and generalizations of the original ADABOOST algorithm to achieve a soft margin. In particular, we suggest (1) regularized ADABOOSTREG, where the gradient descent is performed directly with respect to the soft margin, and (2) regularized linear and quadratic programming (LP/QP-)ADABOOST, where the soft margin is attained by introducing slack variables.
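For concreteness, the central quantities can be sketched as follows (the notation below is ours, chosen to be consistent with the standard boosting literature rather than quoted from the paper): for a combined hypothesis f(x) = sum_t alpha_t h_t(x) with non-negative hypothesis weights, the normalized margin of a training pattern (x_i, y_i) with y_i in {-1, +1}, the resulting hard margin, and a slack-variable linear program of the kind solved by LP-ADABOOST can be written as

% normalized margin of pattern (x_i, y_i) and the hard margin
\rho_i = \frac{y_i \sum_t \alpha_t h_t(x_i)}{\sum_t \alpha_t},
\qquad
\rho = \min_i \rho_i ,

% soft-margin linear program with slack variables \xi_i and constant C
\max_{\alpha,\,\xi,\,\rho} \; \rho - C \sum_i \xi_i
\quad \text{subject to} \quad
y_i \sum_t \alpha_t h_t(x_i) \ge \rho - \xi_i,\;\;
\xi_i \ge 0,\;\; \alpha_t \ge 0,\;\; \sum_t \alpha_t = 1 ,

where C controls the mistrust in individual patterns: letting C grow forces the slack variables towards zero and recovers the hard margin.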
Extensive simulations demonstrate that the proposed regularized ADABOOST-type algorithms are useful and yield competitive results for noisy data.
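As a reference point for the procedure discussed above, the following is a minimal, self-contained sketch of plain ADABOOST with decision stumps (the function names, data layout, and the constant T are illustrative assumptions, not the paper's notation). The regularized variants proposed in the paper modify the weight update or the final optimization over the hypothesis weights to realize a soft margin; this sketch only marks where such a "mistrust" term would enter.

import numpy as np

def train_stump(X, y, w):
    """Find the decision stump (single-feature threshold) with the
    smallest weighted 0/1 error under the pattern weights w."""
    n, d = X.shape
    best = (np.inf, 0, 0.0, 1)            # (error, feature, threshold, polarity)
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(X[:, j] <= thr, pol, -pol)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def stump_predict(X, j, thr, pol):
    return np.where(X[:, j] <= thr, pol, -pol)

def adaboost(X, y, T=50):
    """Plain ADABOOST; y must be in {-1, +1}. The regularized variants
    would additionally penalize patterns that accumulate too much weight
    (a 'mistrust' term); that term is deliberately omitted here."""
    n = len(y)
    w = np.full(n, 1.0 / n)                # distribution over training patterns
    stumps, alphas = [], []
    for t in range(T):
        err, j, thr, pol = train_stump(X, y, w)
        err = max(err, 1e-12)
        if err >= 0.5:                     # weak learner no better than chance
            break
        alpha = 0.5 * np.log((1.0 - err) / err)    # hypothesis weight
        pred = stump_predict(X, j, thr, pol)
        w *= np.exp(-alpha * y * pred)     # emphasize misclassified patterns
        w /= w.sum()                       # (a soft-margin variant would temper this step)
        stumps.append((j, thr, pol))
        alphas.append(alpha)
    return stumps, np.array(alphas)

def predict(X, stumps, alphas):
    f = sum(a * stump_predict(X, j, thr, pol)
            for a, (j, thr, pol) in zip(alphas, stumps))
    return np.sign(f)

From the returned hypothesis weights, the normalized margin of a training pattern can be read off as y_i * f(x_i) / alphas.sum(), which is exactly the quantity whose distribution is analyzed above.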
Cite this article
Rätsch, G., Onoda, T. & Müller, KR. Soft Margins for AdaBoost. Machine Learning 42, 287–320 (2001). https://doi.org/10.1023/A:1007618119488