Innovative Applications of O.R.
Optimization approaches to Supervised Classification

https://doi.org/10.1016/j.ejor.2017.02.020

Highlights

  • A review of optimization models for Supervised Classification.

  • Covers models based on geometrical arguments and Support Vector Machines.

  • Considers two-group and general multi-group problems.

  • Discusses both statistical as well as algorithmic issues.

  • Includes a historical perspective and recent developments.

Abstract

The Supervised Classification problem, one of the oldest and most recurrent problems in applied data analysis, has always been analyzed from many different perspectives. When the emphasis is placed on its overall goal of developing classification rules with minimal classification cost, Supervised Classification can be understood as an optimization problem. On the other hand, when the focus is on modeling the uncertainty involved in the classification of future unknown entities, it can be formulated as a statistical problem. Other perspectives that pay particular attention to the pattern recognition and machine learning aspects of Supervised Classification also have a long history, one that has led to influential insights and different methodologies.

In this review, two approaches to Supervised Classification strongly related to optimization theory will be discussed and compared. In particular, we will review methodologies based on Mathematical Programming models that optimize observable criteria linked to the true objective of misclassification error (or cost) minimization, and approaches derived from the minimization of known bounds on the true misclassification error. The former approach is known as the Mathematical Programming approach to Supervised Classification, while the latter lies at the origin of the well-known Classification Support Vector Machines.

Throughout the review two-group as well as general multi-group problems will be considered, and the review will conclude with a discussion of the most promising research directions in this area.

Introduction

One of the most important problems in data analysis is Supervised Classification (SC), which involves assigning entities to one of several a priori defined groups based on their characteristics on a set of relevant attributes. Examples of SC applications can be found in almost every area of human activity. For instance, in medicine (Mangasarian, Street, & Wolberg, 1995), psychology (Ahmadi & Raiszadeh, 1990) and genetics (Guo, Hastie, & Tibshirani, 2007) SC methodologies have been used for a long time. In business, applications of SC include, among others, credit scoring (Lam, Choo, & Moy, 1996), early warning systems for financial distress (Altman, Alvery, Eisenbeis, & Sinkey, 1980; du Jardin & Séverin, 2012), and the assessment of brand loyalty (Rao, 1973).

Conceptually, SC can be formalized as an optimization problem. Let $\mathcal{X} \subseteq \mathbb{R}^n$ be the attribute domain and $\mathcal{G} = \{G_1, G_2, \ldots, G_k\}$ be the set of group labels. The main goal of SC is the establishment of a map $h(\cdot): \mathcal{X} \rightarrow \mathcal{G}$, known as a classifier or classification rule, that optimizes an appropriate measure of prediction ability when applied to entities with unknown group membership. However, conceptual prediction measures are not directly observable, and many different approaches have been proposed in the literature to tackle this problem.
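
As a small, self-contained illustration (our own, with placeholder names), the sketch below evaluates the kind of observable proxy, namely the training-sample misclassification cost of a candidate rule $h$, that is optimized in place of the unobservable conceptual measures.

```python
# A minimal sketch (our own, not from the paper) of an observable proxy for the
# conceptual objective: the average misclassification cost of a candidate rule h
# on a labelled training sample. C[a, b] is the (assumed) cost of assigning to
# group a an entity that truly belongs to group b.
import numpy as np

def empirical_cost(h, X, y, C):
    """Average cost of rule h over the sample (X, y) with cost matrix C."""
    predictions = np.array([h(x) for x in X])
    return float(np.mean([C[p, t] for p, t in zip(predictions, y)]))

# Example usage with an arbitrary linear rule on synthetic two-group data.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -1.0, 0.5]) > 0).astype(int)
C = np.array([[0.0, 1.0], [2.0, 0.0]])      # unequal misclassification costs
rule = lambda x: int(x[0] > 0)               # a crude candidate classifier
print(empirical_cost(rule, X, y, C))
```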

The classical statistical approach to SC derives optimal classification rules assuming that the attribute distributions are known, and then converts these theoretical rules into empirical equivalents using estimates based on a training sample. Other approaches establish classification rules directly without explicit reference to a statistical model or a theoretical classification rule. In this review we will focus on two streams of SC literature where optimization theory plays a central role.

The first stream to be discussed considers classification rules based on Mathematical Programming (MP) models derived from geometric arguments that optimize measures based on deviations from classification boundaries. In this approach, a restricted functional form is assumed for the classification rule, and its parameters are found by optimizing an observable accuracy measure that can be evaluated in the training sample. This approach has been proposed and discussed mostly within the Operations Research literature and can be understood as a natural extension of Fisher’s original derivation (Fisher, 1936) of the classical Linear Discriminant Function (LDF) for two-group SC. In fact, the LDF was originally derived as the linear function that optimizes a particular measure of group discrimination in a training sample, namely the ratio of between-group to within-group sample variances. The LDF is also known to be an estimate of the classification function that minimizes the misclassification probability, or the expected misclassification cost, under a homoscedastic Gaussian model. More generally, it can be shown that for Gaussian data $\ell_2$-norm measures, such as variance ratios, have optimal properties. However, for other distributions $\ell_2$-norm measures are no longer optimal, and have a tendency to overemphasize outlying entities. In contrast, measures based on sums of absolute values ($\ell_1$-norm measures) are less influenced by outliers, and may be more appropriate for data originating from distributions that do not necessarily share the light-tail properties of the Gaussian distribution (see, e.g., Dodge, 1987; Huber, 1987).
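
For concreteness, the sketch below (our notation and variable names) computes the two-group LDF direction by solving $\mathbf{S}_W \mathbf{w} = \bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2$, which is proportional to the direction maximizing the ratio of between-group to within-group sample variance referred to above.

```python
# A minimal sketch of Fisher's Linear Discriminant Function for two groups:
# the direction maximizing the between-group to within-group variance ratio
# is proportional to S_W^{-1} (mean_1 - mean_2). Names and cutoff are ours.
import numpy as np

def fisher_ldf(X1, X2):
    """Return (w, cutoff) for the rule: assign to group 1 if w @ x > cutoff."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled within-group covariance matrix
    S_W = (np.cov(X1, rowvar=False) * (len(X1) - 1)
           + np.cov(X2, rowvar=False) * (len(X2) - 1)) / (len(X1) + len(X2) - 2)
    w = np.linalg.solve(S_W, m1 - m2)      # w proportional to S_W^{-1}(m1 - m2)
    cutoff = w @ (m1 + m2) / 2             # midpoint cutoff: equal priors and costs
    return w, cutoff
```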

Classification rules for two-group problems that optimize $\ell_1$-norm measures in the training sample were first proposed by Smith (1968), following early work of Rosen (1965) and Mangasarian (1965; 1968) on the use of optimization models to find perfect separation boundaries in training samples. Several extensions that consider alternative measures and functional forms for the classification boundaries, and different normalization schemes, received renewed attention from the early 1980s, following the efforts of Fred Glover (Freed & Glover, 1981a; 1981b; 1986; Glover, 1990), Antonie Stam (Asparouhov & Stam, 1997; Duarte Silva & Stam, 1994; 1997; Stam & Joachimsthaler, 1989; Stam & Ragsdale, 1992), and John Glen (Glen, 1999; 2003; 2004; 2005; 2006; 2008). Furthermore, Mangasarian continued to develop influential work in this area, in particular with new insights on NP-hard models that try to minimize the number of training sample misclassifications (Mangasarian, 1994), and with new measures based on geometric distances (Mangasarian, 1999). Recent developments focus on general k-group (k > 2) problems (Sun, 2011; 2012), on improved algorithms to optimize the training sample misclassification rate or misclassification cost (Fréville, Hanafi, Semet, & Yanev, 2010; Hanafi & Yanev, 2011; Maskooki, 2013; Pfetsch, 2008; Xu & Papageorgiou, 2009), and on models that address simultaneously the twin problems of selecting the set of relevant attributes and establishing classification rules (Chinneck, 2012; Falangis & Glen, 2010). Since this approach relies heavily on MP optimization models, it is often referred to as the MP approach to classification in Discriminant Analysis.
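
To make this stream concrete, the sketch below (our own construction, not taken from any specific paper) sets up an MSD-type model, minimizing the sum of external deviations from a linear classification boundary, as a linear program; the fixed unit gap between the two sides of the cutoff is one of several normalizations used in this literature to exclude the trivial null solution.

```python
# A hedged sketch of an MSD-type ("minimize the sum of deviations") linear
# program for two-group classification, in the spirit of the Freed & Glover
# models. Requires SciPy >= 1.6 for the "highs" solver.
import numpy as np
from scipy.optimize import linprog

def msd_classifier(X1, X2):
    """Return (w, c) defining the rule: assign to group 1 if w @ x <= c."""
    m1, n = X1.shape
    m2 = X2.shape[0]
    m = m1 + m2
    # Decision variables: w (n, free), c (1, free), d (m deviations, >= 0)
    cost = np.concatenate([np.zeros(n + 1), np.ones(m)])        # minimize sum of d_i
    A1 = np.hstack([X1, -np.ones((m1, 1)), np.zeros((m1, m))])  # w@x_i - c - d_i <= -1
    A2 = np.hstack([-X2, np.ones((m2, 1)), np.zeros((m2, m))])  # -w@x_i + c - d_i <= -1
    A = np.vstack([A1, A2])
    A[np.arange(m), n + 1 + np.arange(m)] = -1.0                # subtract each d_i
    b = -np.ones(m)
    bounds = [(None, None)] * (n + 1) + [(0, None)] * m
    res = linprog(cost, A_ub=A, b_ub=b, bounds=bounds, method="highs")
    return res.x[:n], res.x[n]

# Example usage on synthetic, partially overlapping groups.
rng = np.random.default_rng(0)
w, c = msd_classifier(rng.normal(-1.0, 1.0, (40, 2)), rng.normal(1.0, 1.0, (40, 2)))
print(w, c)
```

Replacing the continuous deviations by binary indicators of misclassification yields the NP-hard mixed-integer models mentioned above.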

A second stream of SC literature that will be reviewed addresses methods based on the derivation of distribution-free bounds for misclassification probabilities, which then optimize training sample accuracy measures closely related to these bounds. This approach has its origins in the field of Computer Science, and can be largely traced back to early work of Valiant (1984); Vapnik and Chervonenkis (1971) and Blumer, Ehrenfeucht, Haussler, and Warmuth (1989), on the uniform convergence of empirical measures of classification accuracy to their population counterparts. This research motivated the development of Support Vector Machines (SVMs) (Boser, Guyon, & Vapnik, 1992), which quickly established themselves as one of the most successful approaches to SC. The first proposals in this area led to the development of the Maximal Margin Classifier (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1964), which is a two-group classifier that assumes that the groups are linearly separable, and implements a linear classification rule maximizing the training sample margin between the regions assigned to each group in the attribute space. It can be shown that the misclassification probability of future entities can be bounded by monotone functions of this margin. Modern SVMs build on these ideas, but introduce implicit non-linear mappings and relax perfect separation constraints. These methods rely extensively on optimization theory and techniques. In particular, the classification rules are found by solving optimization problems, the implicit non-linear mappings are based on dual formulations of these problems, and specialized implementations for large data sets exploit the Karush–Kuhn–Tucker conditions of these formulations. These issues will be discussed in Section 4 of this review.
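
For a quick impression of how these ideas are used in practice, the following sketch fits a soft-margin SVM with an implicit non-linear (RBF kernel) mapping; scikit-learn is assumed to be available, and the data and parameter values are arbitrary illustrations rather than anything taken from the paper.

```python
# A brief usage sketch of a soft-margin SVM with an RBF kernel, using
# scikit-learn (assumed available). C controls the trade-off between margin
# width and training-sample margin violations; the data here are synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # groups not linearly separable

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.n_support_)     # support vectors identified via the dual formulation
print(clf.score(X, y))    # training-sample accuracy
```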

Two other reviews with similar aims, but different focus, are given in Smaoui, Chabchoub, and Aouni (2009) and Carrizosa and Morales (2013). However, these reviews consider only two-group problems, and the former only discusses the MP approach, while the latter concentrates on SVMs. In contrast, this review considers both two-group and general multi-group problems, and discusses and compares the two approaches referred to above.

The role of optimization in SC extends beyond the methods to be discussed in this review. In particular, here we will assume that the set of relevant attributes for classification is known beforehand. In the practice of SC, this is usually not the case, and it is common to initially collect data on a large set of attributes and let the data analysis recommend the final ones used for classification. This is usually done either by performing a combinatorial search for relevant attributes (Duarte Silva, 2001; Duarte Silva, Rizzi, Vichi, Bock, 1998), or by adapting parameter estimation so that the selection of attributes is performed automatically. Both routes require special care to avoid an undesirable optimistic selection bias in the prediction measures, and rely extensively on optimization tools (see, e.g., Fan, Feng, & Tong, 2012; Guyon, Weston, Barnhill, & Vapnik, 2002; Ma & Huang, 2008; Mai, 2013).
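
As one illustration, the following hedged sketch performs recursive feature elimination with a linear SVM in the spirit of Guyon, Weston, Barnhill, and Vapnik (2002), keeping the selection step inside cross-validation so that the reported accuracy does not suffer from the optimistic selection bias mentioned above; scikit-learn is assumed to be available, and the data are synthetic.

```python
# A hedged sketch of attribute selection wrapped around an SVM, in the spirit
# of SVM-RFE (Guyon et al., 2002), using scikit-learn. Selection is embedded
# in a pipeline so that cross-validation re-selects attributes on each fold.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)
pipe = make_pipeline(RFE(LinearSVC(dual=False), n_features_to_select=5),
                     LinearSVC(dual=False))
print(cross_val_score(pipe, X, y, cv=5).mean())   # selection-bias-free estimate
```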

Finally, the relation between SC and optimization has also been exploited in the other direction, trying to improve the performance of optimization algorithms by a smart use of SC methods. In particular, Cassioli, Di Lorenzo, Locatelli, Schoen, and Sciandrone (2012) have shown how SVMs could be used to design self-learning algorithms for solving global optimization problems. This approach has recently been successfully applied in the development of new smart heuristics for the standard quadratic problem (Dellepiane & Palagi, 2015).

The remainder of this review is structured as follows. The next section will review several frameworks adopted for the analysis of SC problems. Although not directly related to optimization, this section will introduce the notation used in the review and will present several concepts to be used in the discussion of subsequent sections. Section 3 will describe the MP approach to SC. Section 4 will discuss SVMs and Section 5 will conclude the review.

The following conventions will be adopted in this review. Vectors will be denoted by bold lower case letters, matrices by bold upper case letters, scalars by italic lower case letters, and matrix and vector components by subscript indices. All vectors will be assumed to be column vectors, unless denoted as transposed by a $T$ superscript. The null and unit vectors of dimension $d$ will be denoted by $\mathbf{0}_d$ and $\mathbf{1}_d$, respectively, and vector inequalities will be interpreted as holding componentwise for all vector elements. So, for instance, $\mathbf{b} \in \mathbb{R}^n$ will denote an $n$-dimensional column vector, $\mathbf{b}^T$ the corresponding row vector, $\mathbf{d} \in \mathbb{R}^m, \mathbf{d} \geq \mathbf{0}_m$ an $m$-dimensional column vector with nonnegative elements, $d_i$ its $i$th element, $\mathbf{B} \in \mathbb{R}^{n \times k}$ an $n$ by $k$ matrix, $B_{ij}$ the entry in the $i$th row and $j$th column of $\mathbf{B}$, and $\mathbf{B}_{i.}$ and $\mathbf{B}_{.j}$ respectively the $i$th row and the $j$th column of $\mathbf{B}$. The indicator function $I(E)$ will denote the map that takes the value 1 if the event $E$ is true, and 0 otherwise.

Section snippets

The statistical perspective

Suppose that entities belonging to one of $k$ mutually exclusive groups are described by $n$-dimensional attribute vectors. The statistical approach to the study of classification rules is based on models for the relevant probability distributions. Denote the attribute vector of entity $i$ by $\mathbf{x}_i$, the label of group $a$ by $G_a$, the probability or probability density of $\mathbf{x}_i$ given membership in $G_a$ by $p(\mathbf{x}_i\,|\,G_a)$, and the prior probability of membership in $G_a$ by $\pi_a$. Then, the Bayes rule that minimizes the
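
For reference, the minimum-expected-cost Bayes rule that this snippet refers to can be written, in the notation above and with $C(a\,|\,a')$ denoting the cost of assigning to $G_a$ an entity that belongs to $G_{a'}$, as follows (our reconstruction of a standard result, not a quotation from the paper):

$$h_{\text{Bayes}}(\mathbf{x}_i) = \operatorname*{arg\,min}_{a \in \{1,\ldots,k\}} \; \sum_{a' \neq a} C(a\,|\,a')\, \pi_{a'}\, p(\mathbf{x}_i\,|\,G_{a'}).$$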

The Mathematical Programming approach to SC

An intuitive approach to SC searches directly for boundaries of the region of the attribute domain for which $C(a'\,|\,a)\,\pi_a\, p(\mathbf{x}_i\,|\,G_a) > C(a\,|\,a')\,\pi_{a'}\, p(\mathbf{x}_i\,|\,G_{a'})$, without making any assumptions about the attribute distributions. This approach assumes that these boundaries can be described by equations of the form $f_s(\mathbf{b}_s, \mathbf{x}_i) = c_s$, where the index $s$ denotes a particular branch of the boundary line (e.g., the boundary separating a particular group pair), $\mathbf{b}_s = (\mathbf{b}_{1s}^T, \mathbf{b}_{2s}^T, \ldots, \mathbf{b}_{ts}^T)^T$ are vectors of unknown parameters, the

Statistical Learning foundations of SVMs

Classification Support Vector Machines (Boser, Guyon, & Vapnik, 1992; Cortes & Vapnik, 1995) are optimization-based classifiers with a firm theoretical basis rooted in Statistical Learning Theory (Vapnik, 1998; 2013). The goal of Statistical Learning Theory is the development and study of predictive algorithms with guaranteed good generalization ability. When applied to SC problems, Statistical Learning Theory deals with classifiers with known distribution-free bounds on the

Conclusions and perspectives

The theory and the tools of optimization have always played an important role in Supervised Classification (SC). The SC problem itself can be conceptually formulated as an optimization problem, albeit one defined over unobservable quantities such as the true probabilities, or the expected costs, of incorrect classifications for some future unknown entities. The Mathematical Programming (MP) approach to SC relies on this theoretical formulation, but replaces all unknowns by observed proxies. As

Acknowledgments

The author thanks four anonymous referees for their suggestions and constructive criticism that helped to improve the quality of this review. Financial support from Fundação para a Ciência e Tecnologia (through project UID/GES/00731/2013) is gratefully acknowledged.

References (181)

  • A.P. Duarte Silva, Efficient variable screening for multivariate analysis, Journal of Multivariate Analysis (2001)
  • N. Freed et al., Simple but powerful goal programming formulations for the discriminant problem, European Journal of Operational Research (1981)
  • A. Fréville et al., A tabu search with an oscillation strategy for the discriminant analysis problem, Computers and Operations Research (2010)
  • U.M. García-Palomares et al., Novel linear programming approach for building a piecewise nonlinear binary classifier with a priori accuracy, Decision Support Systems (2012)
  • W.V. Gehrlein, General mathematical programming formulations for the statistical classification problem, Operations Research Letters (1986)
  • J.J. Glen, An iterative mixed integer programming method for classification accuracy maximizing discriminant analysis, Computers and Operations Research (2003)
  • J.J. Glen, A comparison of standard and two-stage mathematical programming discriminant analysis methods, European Journal of Operational Research (2006)
  • C.J. Huberty et al., Assessing predictive accuracy in discriminant analysis, Multivariate Behavioral Research (1987)
  • A. Karam et al., Arbitrary-norm hyperplane separation by variable neighbourhood search, IMA Journal of Management Mathematics (2007)
  • K.F. Lam et al., Minimizing deviations from the group mean: A new linear programming approach for the two-group classification problem, European Journal of Operational Research (1996)
  • C. Leslie et al., Fast kernels for inexact string matching
  • M. Ahmadi et al., Predicting underachievement in business statistics, Educational and Psychological Measurement (1990)
  • E. Altman et al., Applications of classification techniques in business, banking and finance (1980)
  • E. Amaldi et al., On the maximum feasible subsystem problem, IISs and IIS-hypergraphs, Mathematical Programming (2003)
  • J.A. Anderson, Separate sample logistic discrimination, Biometrika (1972)
  • O.K. Asparouhov et al., Oscillation heuristics for the two-group classification problem, Journal of Classification (2004)
  • O.K. Asparouhov et al., Mathematical programming formulations for two-group classification with binary variables, Annals of Operations Research (1997)
  • A. Astorino et al., Polyhedral separability through successive LP, Journal of Optimization Theory and Applications (2002)
  • A. Astorino et al., Conic separation of finite sets I. The homogeneous case, Journal of Convex Analysis (2014)
  • C. Audet et al., Exact l2 norm plane separation, Optimization Letters (2008)
  • A.M. Bagirov, Max-min separability, Optimization Methods and Software (2005)
  • A.M. Bagirov et al., Piecewise linear classifiers based on nonsmooth optimization approaches
  • S.M. Bajgier et al., An experimental comparison of statistical and linear programming approaches to the discriminant problem, Decision Sciences (1982)
  • W.J. Banks et al., An efficient optimal solution algorithm for the classification problem, Decision Sciences (1991)
  • A.E. Barón, Misclassification among methods used for multiple group discrimination - the effects of distributional properties, Statistics in Medicine (1991)
  • P. Bartlett et al., Generalization performance of support vector machines and other pattern classifiers, Advances in Kernel Methods – Support Vector Learning (1999)
  • K.P. Bennett et al., A parametric optimization method for machine learning, INFORMS Journal on Computing (1997)
  • K.P. Bennett et al., Robust linear programming discrimination of two linearly inseparable sets, Optimization Methods and Software (1992)
  • K.P. Bennett et al., Multicategory discrimination via linear programming, Optimization Methods and Software (1994)
  • A. Blumer et al., Learnability and the Vapnik-Chervonenkis dimension, Journal of the ACM (1989)
  • B.E. Boser et al., A training algorithm for optimal margin classifiers, Proceedings of the Fifth Annual Workshop on Computational Learning Theory (1992)
  • P.S. Bradley et al., Massive data discrimination via linear support vector machines, Optimization Methods and Software (2000)
  • E.J. Bredensteiner et al., Multicategory classification by support vector machines, Computational Optimization and Applications (1999)
  • L. Breiman, Random forests, Machine Learning (2001)
  • J.P. Brooks, Support vector machines with the ramp loss and the hard margin loss, Operations Research (2011)
  • E. Carrizosa et al., On the selection of the globally optimal prototype subset for nearest-neighbor classification, INFORMS Journal on Computing (2007)
  • A. Cassioli et al., Machine learning for global optimization, Computational Optimization and Applications (2012)
  • C.C. Chang et al., LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology (TIST) (2011)
  • J.W. Chinneck, An effective polynomial-time heuristic for the minimum-cardinality IIS set-covering problem, Annals of Mathematics and Artificial Intelligence (1996)
  • J.W. Chinneck, Finding a useful subset of constraints for analysis in an infeasible linear program, INFORMS Journal on Computing (1997)