Innovative Applications of O.R.
Optimization approaches to Supervised Classification
Introduction
One of the most important problems in data analysis is Supervised Classification (SC), which involves assigning entities to one of several a priori defined groups based on their characteristics on a set of relevant attributes. Examples of SC applications can be found in almost every area of human activity. For instance, in medicine (Mangasarian, Street, & Wolberg, 1995), psychology (Ahmadi & Raiszadeh, 1990) and genetics (Guo, Hastie, & Tibshirani, 2007) SC methodologies have been used for a long time. In business, applications of SC include, among others, credit scoring (Lam, Choo, & Moy, 1996), early warning systems for financial distress (Altman, Alvery, Eisenbeis, & Sinkey, 1980; du Jardin & Séverin, 2012), and the assessment of brand loyalty (Rao, 1973).
Conceptually, SC can be formalized as an optimization problem. Let A be the attribute domain and G = {G1, …, Gk} be the set of group labels. The main goal of SC is the establishment of a map C: A → G, known as a classifier or classification rule, that optimizes an appropriate measure of prediction ability when applied to entities with unknown group membership. However, conceptual prediction measures are not directly observable, and many different approaches have been proposed in the literature to tackle this problem.
The classical statistical approach to SC derives optimal classification rules assuming that the attribute distributions are known, and then converts these theoretical rules into empirical equivalents using estimates based on a training sample. Other approaches establish classification rules directly without explicit reference to a statistical model or a theoretical classification rule. In this review we will focus on two streams of SC literature where optimization theory plays a central role.
The first stream to be discussed considers classification rules based on Mathematical Programming (MP) models derived from geometric arguments that optimize measures based on deviations from classification boundaries. In this approach a restricted functional form is assumed for the classification rule, and its parameters are found by optimizing an observable accuracy measure that can be evaluated in the training sample. This approach has been proposed and discussed mostly within the Operations Research literature, and can be understood as a natural extension of Fisher’s original derivation (Fisher, 1936) of the classical Linear Discriminant Function (LDF) for two-group SC. In fact, the LDF was originally derived as the linear function that optimizes a particular measure of group discrimination in a training sample, namely the ratio of between-group to within-group sample variances. The LDF can also be derived as an estimate of the classification function that minimizes the misclassification probability, or the expected misclassification cost, under a homoscedastic Gaussian model. More generally, it can be shown that for Gaussian data l2-norm measures, such as variance ratios, have optimal properties. However, for other distributions l2-norm measures are no longer optimal, and have a tendency to overemphasize outlying entities. In contrast, measures based on sums of absolute values (l1-norm measures) are less influenced by outliers, and may be more appropriate for data originating from distributions that do not necessarily share the light-tail properties of the Gaussian distribution (see, e.g., Dodge, 1987; Huber, 1987).
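To make the variance-ratio criterion concrete, the following minimal NumPy sketch computes Fisher’s discriminant direction for a hypothetical two-group sample. The closed-form optimum of the between- to within-group variance ratio is proportional to the inverse pooled scatter matrix applied to the mean difference; the data, seed, and helper names here are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

# Hypothetical two-group training sample (n = 2 attributes).
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))   # group 1
X2 = rng.normal(loc=[2.0, 1.0], scale=1.0, size=(50, 2))   # group 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Pooled within-group scatter matrix S_w.
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher's direction maximizes the between- to within-group variance ratio;
# its closed-form optimum is proportional to Sw^{-1}(m1 - m2).
w = np.linalg.solve(Sw, m1 - m2)

def variance_ratio(v, X1, X2):
    """Between-group over within-group variance of the projections X v."""
    p1, p2 = X1 @ v, X2 @ v
    between = (p1.mean() - p2.mean()) ** 2
    within = p1.var(ddof=1) + p2.var(ddof=1)
    return between / within

# Fisher's direction scores at least as well as any other direction.
assert variance_ratio(w, X1, X2) >= variance_ratio(np.array([1.0, 0.0]), X1, X2)
```

Because the criterion is invariant to rescaling of the projection direction, only the direction of w matters, which is why any normalization of w yields the same classifier.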
Classification rules for two-group problems that optimize l1-norm measures in the training sample were first proposed by Smith (1968), following early work of Rosen (1965) and Mangasarian (1965; 1968) on the use of optimization models to find perfect separation boundaries in training samples. Several extensions that consider alternative measures and functional forms for the classification boundaries, and different normalization schemes, received renewed attention from the early 1980s, following the efforts of Fred Glover (Freed & Glover, 1981a; 1981b; 1986; Glover, 1990), Antonie Stam (Asparouhov & Stam, 1997; Duarte Silva & Stam, 1994; 1997; Stam & Joachimsthaler, 1989; Stam & Ragsdale, 1992), and John Glen (Glen, 1999; 2003; 2004; 2005; 2006; 2008). Furthermore, Mangasarian continued to develop influential work in this area, in particular with new insights on NP-hard models that try to minimize the number of training sample misclassifications (Mangasarian, 1994), and proposing new measures based on geometric distances (Mangasarian, 1999). Recent developments focus on general k-group (k > 2) problems (Sun, 2011; 2012), improved algorithms to optimize the training sample misclassification rate or misclassification cost (Fréville, Hanafi, Semet, & Yanev, 2010; Hanafi & Yanev, 2011; Maskooki, 2013; Pfetsch, 2008; Xu & Papageorgiou, 2009), and on models that address simultaneously the twin problems of selecting the set of relevant attributes and establishing classification rules (Chinneck, 2012; Falangis & Glen, 2010). Since this approach relies heavily on MP optimization models, it is often referred to as the MP approach for classification in Discriminant Analysis.
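A minimal sketch of one such l1-norm model is the minimize-sum-of-deviations (MSD) formulation in the spirit of Freed and Glover (1981a): deviations of misplaced entities from a linear boundary w'x = c are minimized by linear programming. The toy data and the mean-difference normalization (added to rule out the trivial null solution w = 0) are illustrative assumptions; other normalization schemes appear in the literature.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical two-group training sample (2 attributes).
X1 = np.array([[1.0, 2.0], [2.0, 3.0], [0.5, 1.5]])   # group 1
X2 = np.array([[3.0, 0.5], [4.0, 1.0], [3.5, -0.5]])  # group 2
n = X1.shape[1]
m1, m2 = len(X1), len(X2)

# Decision variables: [w (n), c (1), d (m1 + m2 nonnegative deviations)].
obj = np.concatenate([np.zeros(n + 1), np.ones(m1 + m2)])  # minimize sum of d

# Group 1 should satisfy w'x <= c:   w'x - c - d_i <= 0.
A1 = np.hstack([X1, -np.ones((m1, 1)), -np.eye(m1), np.zeros((m1, m2))])
# Group 2 should satisfy w'x >= c:  -w'x + c - d_j <= 0.
A2 = np.hstack([-X2, np.ones((m2, 1)), np.zeros((m2, m1)), -np.eye(m2)])
A_ub = np.vstack([A1, A2])
b_ub = np.zeros(m1 + m2)

# Normalization w'(mean2 - mean1) = 1 excludes the trivial solution w = 0.
A_eq = np.concatenate([X2.mean(0) - X1.mean(0), [0.0],
                       np.zeros(m1 + m2)])[None, :]
b_eq = [1.0]

bounds = [(None, None)] * (n + 1) + [(0, None)] * (m1 + m2)
res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
w, c = res.x[:n], res.x[n]

# These two groups are linearly separable, so all deviations vanish.
assert res.status == 0 and res.fun < 1e-6
```

Replacing the linear objective with a count of strictly positive deviations yields the NP-hard misclassification-minimization models discussed above, which require mixed integer programming.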
A second stream of SC literature that will be reviewed addresses methods based on the derivation of distribution-free bounds for misclassification probabilities, and then optimizes training sample accuracy measures closely related to these bounds. This approach has its origins in the field of Computer Science, and can be largely traced back to early work of Valiant (1984), Vapnik and Chervonenkis (1971), and Blumer, Ehrenfeucht, Haussler, and Warmuth (1989) on the uniform convergence of empirical measures of classification accuracy to their population counterparts. This research motivated the development of Support Vector Machines (SVMs) (Boser, Guyon, & Vapnik, 1992), which quickly established themselves as one of the most successful approaches to SC. The first proposals in this area led to the development of the Maximal Margin Classifier (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1964), a two-group classifier that assumes that the groups are linearly separable, and implements a linear classification rule maximizing the training sample margin between the regions assigned to each group in the attribute space. It can be shown that the misclassification probability of future entities can be bounded by monotone functions of this margin. Modern SVMs build from these ideas, but introduce implicit non-linear mappings and relax perfect separation constraints. These methods rely extensively on optimization theory and techniques. In particular, the classification rules are found by solving optimization problems, the implicit non-linear mappings are based on dual formulations of these problems, and specialized implementations for large data sets exploit the Karush–Kuhn–Tucker conditions of these formulations. These issues will be discussed in Section 4 of this review.
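The Maximal Margin Classifier can be approximated in practice by a linear soft-margin SVM with a very large penalty parameter, since for separable data the slack variables vanish at the optimum. The following sketch uses scikit-learn on a hypothetical separable sample; the data and the choice C = 1e6 are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable two-group sample.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin (Maximal Margin) classifier.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
# Geometric margin between the supporting hyperplanes w'x + b = -1 and +1.
margin = 2.0 / np.linalg.norm(w)

# Every training entity is correctly classified with functional margin >= 1.
assert np.all(y * (X @ w + b) >= 1 - 1e-3)
assert margin > 0
```

The generalization bounds mentioned above decrease monotonically as this margin grows, which is the theoretical motivation for maximizing it.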
Two other reviews with similar aims, but different focus, are given in Smaoui, Chabchoub, and Aouni (2009) and Carrizosa and Morales (2013). However, these reviews consider only two-group problems; moreover, the former discusses only the MP approach, while the latter concentrates on SVMs. In contrast, this review considers both two-group and general multi-group problems, and discusses and compares the two approaches referred to above.
The role of optimization in SC extends beyond the methods to be discussed in this review. In particular, here we will assume that the set of relevant attributes for classification is known beforehand. In the practice of SC this is usually not the case, and it is common to initially collect data on a large set of attributes and let the data analysis recommend the final ones used for classification. This is usually done either by performing a combinatorial search of relevant attributes (Duarte Silva, 2001; Duarte Silva, Rizzi, Vichi, & Bock, 1998), or by adapting parameter estimation so that the selection of attributes is performed automatically. Both routes require special care to avoid undesirable optimistic selection bias of the prediction measures, and rely extensively on optimization tools (see, e.g., Fan, Feng, & Tong, 2012; Guyon, Weston, Barnhill, & Vapnik, 2002; Ma & Huang, 2008; Mai, 2013).
Finally, the relation between SC and optimization has also been exploited in the other direction, trying to improve the performance of optimization algorithms by a smart use of SC methods. In particular, Cassioli, Di Lorenzo, Locatelli, Schoen, and Sciandrone (2012) have shown how SVMs can be used to design self-learning algorithms for solving global optimization problems. This approach has recently been successfully applied in the development of new smart heuristics for the standard quadratic problem (Dellepiane & Palagi, 2015).
The remainder of this review is structured as follows. The next section will review several frameworks adopted for the analysis of SC problems. Although not directly related to optimization, this section will introduce the notation used in the review and will present several concepts to be used in the discussion of subsequent sections. Section 3 will describe the MP approach to SC. Section 4 will discuss SVMs and Section 5 will conclude the review.
The following conventions will be adopted in this review. Vectors will be denoted by bold lower case letters, matrices by bold upper case letters, scalars by italic lower case letters, and matrix and vector components by subscript indices. All vectors will be assumed to be column vectors, unless denoted as transposed by a T superscript. The null and unit vectors of dimension d will be denoted respectively by 0_d and 1_d, and vector inequalities will be interpreted as holding componentwise for all vector elements. So, for instance, x will denote an n-dimensional column vector, x^T the corresponding row vector, y ≥ 0_m an m-dimensional column vector with nonnegative elements, y_i its ith element, A an n by k matrix, a_ij the entry in the ith row, jth column of A, and a_i. and a_.j respectively the ith row and the jth column of A. The indicator function I(E) will be used to denote the map that assigns the value 1 to I(E) if the event E is true, and 0 otherwise.
Section snippets
The statistical perspective
Suppose that entities belonging to one of k mutually exclusive groups are described by n-dimensional attribute vectors. The statistical approach to the study of classification rules is based on models for the relevant probability distributions. Denote the attribute vector of entity i by x_i, the label of group a by G_a, the probability or probability density of x given membership in G_a by f_a(x), and the prior probability of membership in G_a by π_a. Then, the Bayes rule that minimizes the
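The Bayes rule introduced here assigns an entity to the group with the largest posterior weight π_a f_a(x). The sketch below instantiates it for a hypothetical two-group homoscedastic Gaussian model, the setting under which it reduces to the LDF; the priors, means, and covariance are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-group homoscedastic Gaussian model.
priors = np.array([0.5, 0.5])                          # pi_a
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]   # group means
cov = np.eye(2)                                        # common covariance

def bayes_rule(x):
    """Assign x to the group maximizing pi_a * f_a(x) (minimum-error Bayes rule)."""
    scores = [p * multivariate_normal.pdf(x, mean=m, cov=cov)
              for p, m in zip(priors, means)]
    return int(np.argmax(scores))

# With equal priors and a common covariance, the rule picks the nearest mean.
assert bayes_rule(np.array([-1.0, 0.0])) == 0
assert bayes_rule(np.array([3.0, 2.0])) == 1
```

In practice the densities f_a and priors π_a are unknown, which is precisely why the empirical approaches surveyed in this review replace them with training-sample estimates or bypass them altogether.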
The Mathematical Programming approach to SC
An intuitive approach to SC searches directly for boundaries between the regions of the attribute domain assigned to each group, without making any assumptions about the attribute distributions. This approach assumes that these boundaries can be described by equations of the form f_s(x; w_s) = 0, where the index s denotes a particular branch of the boundary line (e.g., the boundary separating a particular group pair), w_s are vectors of unknown parameters, the
Statistical Learning foundations of SVMs
Classification Support Vector Machines (Boser, Guyon, & Vapnik, 1992; Cortes & Vapnik, 1995) are optimization-based classifiers with a firm theoretical basis rooted in Statistical Learning Theory (Vapnik, 1998; 2013). The goal of Statistical Learning Theory is the development and study of predictive algorithms with guaranteed good generalization ability. When applied to SC problems, Statistical Learning Theory deals with classifiers with known distribution-free bounds on the
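Two features of SVMs mentioned in this review, the implicit non-linear mapping via the dual formulation and the role of the Karush–Kuhn–Tucker conditions, can be illustrated with a small kernel SVM. By KKT complementary slackness, only support vectors carry non-zero dual coefficients, so the fitted rule depends on a subset of the training sample. The ring-shaped sample, seed, and hyperparameters below are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical sample: one group inside a ring formed by the other group.
rng = np.random.default_rng(1)
angles = rng.uniform(0, 2 * np.pi, 40)
inner = rng.normal(scale=0.3, size=(40, 2))                  # group -1
outer = np.c_[2 * np.cos(angles), 2 * np.sin(angles)] \
        + rng.normal(scale=0.1, size=(40, 2))                # group +1
X = np.vstack([inner, outer])
y = np.r_[-np.ones(40), np.ones(40)]

# The RBF kernel realizes an implicit non-linear mapping through the dual,
# so a "linear" separator in feature space becomes non-linear in x.
clf = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(X, y)

# KKT conditions: only support vectors have non-zero dual coefficients.
assert len(clf.support_) < len(X)
assert clf.score(X, y) > 0.9
```

This sparsity in the dual is exactly what specialized large-scale implementations exploit: entities whose KKT conditions are already satisfied can be ignored while the working set is optimized.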
Conclusions and perspectives
The theory and the tools of optimization have always played an important role in Supervised Classification (SC). The SC problem itself can be conceptually formulated as an optimization problem, albeit one defined over unobservable quantities such as the true probabilities, or the expected costs, of incorrect classifications for some future unknown entities. The Mathematical Programming (MP) approach to SC relies on this theoretical formulation, but replaces all unknowns by observed proxies. As
Acknowledgments
The author thanks four anonymous referees for their suggestions and constructive criticism that helped to improve the quality of this review. Financial support from Fundação para a Ciência e Tecnologia (through project UID/GES/00731/2013) is gratefully acknowledged.
References (181)
- et al. (2006). Mathematical programming based heuristics for improving LP-generated classifiers for the multiclass supervised classification problem. European Journal of Operational Research.
- et al. (1995). The complexity and approximability of finding maximum feasible subsystems of linear relations. Theoretical Computer Science.
- et al. (1994). On the performance of linear programming heuristics applied on a quadratic transformation in the classification problem. European Journal of Operational Research.
- et al. (2010). Classification by vertical and cutting multi-hyperplane decision tree induction. Decision Support Systems.
- et al. (2011). Detecting relevant variables and interactions in supervised classification. European Journal of Operational Research.
- et al. (2014). A nested heuristic for parameter tuning in support vector machines. Computers and Operations Research.
- et al. (2013). Supervised classification and mathematical optimization. Computers and Operations Research.
- (2012). Integrated classifier hyperplane placement and feature selection. Expert Systems with Applications.
- et al. (2015). Using SVM to combine global heuristics for the standard quadratic problem. European Journal of Operational Research.
- (1987). An introduction to l1-norm based statistical data analysis. Computational Statistics and Data Analysis.
- Efficient variable screening for multivariate analysis. Journal of Multivariate Analysis.
- Simple but powerful goal programming formulations for the discriminant problem. European Journal of Operational Research.
- A tabu search with an oscillation strategy for the discriminant analysis problem. Computers and Operations Research.
- Novel linear programming approach for building a piecewise nonlinear binary classifier with a priori accuracy. Decision Support Systems.
- General mathematical programming formulations for the statistical classification problem. Operations Research Letters.
- An iterative mixed integer programming method for classification accuracy maximizing discriminant analysis. Computers and Operations Research.
- A comparison of standard and two-stage mathematical programming discriminant analysis methods. European Journal of Operational Research.
- Assessing predictive accuracy in discriminant analysis. Multivariate Behavioral Research.
- Arbitrary-norm hyperplane separation by variable neighbourhood search. IMA Journal of Management Mathematics.
- Minimizing deviations from the group mean: A new linear programming approach for the two-group classification problem. European Journal of Operational Research.
- Fast kernels for inexact string matching.
- Predicting underachievement in business statistics. Educational and Psychological Measurement.
- Applications of classification techniques in business, banking and finance.
- On the maximum feasible subsystem problem, IISs and IIS-hypergraphs. Mathematical Programming.
- Separate sample logistic discrimination. Biometrika.
- Oscillation heuristics for the two-group classification problem. Journal of Classification.
- Mathematical programming formulations for two-group classification with binary variables. Annals of Operations Research.
- Polyhedral separability through successive LP. Journal of Optimization Theory and Applications.
- Conic separation of finite sets I. The homogeneous case. Journal of Convex Analysis.
- Exact l2 norm plane separation. Optimization Letters.
- Max-min separability. Optimization Methods and Software.
- Piecewise linear classifiers based on nonsmooth optimization approaches.
- An experimental comparison of statistical and linear programming approaches to the discriminant problem. Decision Sciences.
- An efficient optimal solution algorithm for the classification problem. Decision Sciences.
- Misclassification among methods used for multiple group discrimination: the effects of distributional properties. Statistics in Medicine.
- Generalization performance of support vector machines and other pattern classifiers. Advances in Kernel Methods – Support Vector Learning.
- A parametric optimization method for machine learning. INFORMS Journal on Computing.
- Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software.
- Multicategory discrimination via linear programming. Optimization Methods and Software.
- Learnability and the Vapnik–Chervonenkis dimension. Journal of the ACM.
- A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory.
- Massive data discrimination via linear support vector machines. Optimization Methods and Software.
- Multicategory classification by support vector machines. Computational Optimization and Applications.
- Random forests. Machine Learning.
- Support vector machines with the ramp loss and the hard margin loss. Operations Research.
- On the selection of the globally optimal prototype subset for nearest-neighbor classification. INFORMS Journal on Computing.
- Machine learning for global optimization. Computational Optimization and Applications.
- LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST).
- An effective polynomial-time heuristic for the minimum-cardinality IIS set-covering problem. Annals of Mathematics and Artificial Intelligence.
- Finding a useful subset of constraints for analysis in an infeasible linear program. INFORMS Journal on Computing.