Abstract
We describe a new incremental algorithm for training linear threshold functions: the Relaxed Online Maximum Margin Algorithm, or ROMMA. ROMMA can be viewed as an approximation to the algorithm that repeatedly chooses the hyperplane that classifies previously seen examples correctly with the maximum margin. It is known that such a maximum-margin hypothesis can be computed by minimizing the length of the weight vector subject to a number of linear constraints. ROMMA works by maintaining a relatively simple relaxation of these constraints that can be efficiently updated. We prove a mistake bound for ROMMA that is the same as that proved for the perceptron algorithm. Our analysis implies that the maximum-margin algorithm also satisfies this mistake bound; this is the first worst-case performance guarantee for this algorithm. We describe some experiments using ROMMA, and a variant that updates its hypothesis more aggressively, as batch algorithms to recognize handwritten digits. The computational complexity and simplicity of these algorithms are similar to those of the perceptron algorithm, but their generalization is much better. We show that a batch algorithm based on aggressive ROMMA converges to the fixed-threshold SVM hypothesis.
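The abstract does not spell out the update rule, so the following is only a minimal sketch of the idea of "maintaining a simple relaxation of the constraints". It assumes labels in {-1, +1}, a threshold fixed at zero, and a relaxation that keeps just two linear constraints: one for the newest example, y(w·x) ≥ 1, and one summarizing all earlier examples, w_old·w ≥ ||w_old||². On a mistake, the new weight vector is taken to be the shortest vector meeting both constraints with equality. The function name `romma_update` and the handling of the degenerate parallel case are assumptions of this sketch, not details taken from the abstract.

```python
import numpy as np

def romma_update(w, x, y):
    """One mistake-driven step of a ROMMA-style update (illustrative sketch).

    w : current weight vector, x : example, y : label in {-1, +1}.
    Returns the (possibly unchanged) weight vector.
    """
    wx = np.dot(w, x)
    if y * wx > 0:
        return w  # prediction was correct: this variant leaves w unchanged

    xx = np.dot(x, x)
    if not np.any(w):
        # First mistake: shortest w with y * (w . x) = 1.
        return y * x / xx

    ww = np.dot(w, w)
    denom = xx * ww - wx ** 2
    if denom <= 1e-12 * xx * ww:
        # x is (numerically) parallel to w, so the two equalities cannot
        # both hold. Restarting from this example is this sketch's own
        # choice, not something stated in the source.
        return y * x / xx

    # Solve for w_new = c * w + d * x such that
    #   w . w_new = ||w||^2   (summary of past constraints)
    #   w_new . x = y         (new example on the margin)
    c = (xx * ww - y * wx) / denom
    d = ww * (y - wx) / denom
    return c * w + d * x

# Usage: two examples processed online.
w = np.zeros(2)
w = romma_update(w, np.array([1.0, 0.0]), +1)
w = romma_update(w, np.array([0.0, 2.0]), -1)
# After the second step, w satisfies y * (w . x) = 1 on that example.
```

Each step costs a handful of dot products, which is consistent with the abstract's remark that the cost is comparable to the perceptron's. The "aggressive" variant mentioned in the abstract would also update on some correctly classified examples; that condition is not given here and is not sketched.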
Cite this article
Li, Y., Long, P.M. The Relaxed Online Maximum Margin Algorithm. Machine Learning 46, 361–387 (2002). https://doi.org/10.1023/A:1012435301888
DOI: https://doi.org/10.1023/A:1012435301888