Pattern Recognition

Volume 44, Issue 7, July 2011, Pages 1426-1434

Inference on the prediction of ensembles of infinite size

https://doi.org/10.1016/j.patcog.2010.12.021

Abstract

In this paper we introduce a framework for making statistical inference on the asymptotic prediction of parallel classification ensembles. The validity of the analysis is fairly general. It only requires that the individual classifiers are generated in independent executions of some randomized learning algorithm, and that the final ensemble prediction is made via majority voting. Given an unlabeled test instance, the predictions of the classifiers in the ensemble are obtained sequentially. As the individual predictions become known, Bayes' theorem is used to update an estimate of the probability that the class predicted by the current ensemble coincides with the classification of the corresponding ensemble of infinite size. Using this estimate, the voting process can be halted when the confidence in the asymptotic prediction is sufficiently high. An empirical investigation in several benchmark classification problems shows that most of the test instances require querying only a small number of classifiers to converge to the infinite ensemble prediction with a high degree of confidence. For these instances, the difference between the generalization error of the finite ensemble and the infinite ensemble limit is very small, often negligible.

Introduction

Ensembles are among the most successful methods used to address supervised learning problems [1], [2], [3], [4], [5], [6], [7]. The prediction of an ensemble is obtained by combining the individual predictions of a collection of diverse classifiers. Provided that these predictions are complementary, ensembles are an effective mechanism for achieving better generalization performance. In this work we consider parallel ensembles of classifiers of the same type. The individual classifiers in the ensemble are generated in independent executions of a randomized learning algorithm. This procedure takes advantage of instabilities in the base learning algorithm to generate a collection of diverse classifiers [3], [6]. Finally, the prediction of the ensemble is computed by majority voting. Bagging [1], random forest [2], extra-trees [7], subagging [4], rotation forest [6] and class-switching ensembles [5] are representative ensembles of this kind.

In these types of ensembles the generalization error typically decreases as the size of the ensemble increases [1], [2], [8], [7], [5]. In general, the larger the ensemble, the more accurate its prediction. However, the rate of improvement in performance becomes smaller as the size of the ensemble increases. Furthermore, the computational costs of generation, storage and prediction increase linearly with the number of classifiers included in the ensemble. Therefore, it is important to determine whether it is possible to estimate the prediction of an ensemble of very large size (ideally, of infinite size) using only the predictions of a finite collection of classifiers or, alternatively, to quantify how confident one can be that the prediction of an ensemble of finite size coincides with the prediction of the corresponding ensemble of infinite size. In this work we show that the answer to these questions strongly depends on the particular instance that is being classified. For most instances, the infinite ensemble prediction can be estimated with a very high degree of confidence using the predictions of only a small number of classifiers. By contrast, instances that are close to classification frontiers (usually a small fraction of the instances considered) require querying a very large number of classifiers to converge to the asymptotic (infinite) ensemble prediction.

These questions can be addressed by analyzing the convergence of majority voting in the infinite ensemble limit. The probabilistic framework described in [9], [10] is particularly suited for this purpose. For a given instance, the asymptotic prediction of the ensemble can be expressed in terms of the set of probabilities that an individual classifier assigns a particular class label to that instance. The difficulty is that these class probabilities, which depend on the particular instance considered, are initially unknown. Nevertheless, the voting process provides information that can be used to estimate their distribution. Starting from a uniform prior, Bayes' theorem is used to compute a posterior that incorporates the evidence given by the predictions of the individual classifiers as they become known. The posterior distribution describes the uncertainty of the provisional estimates of the class probabilities. This distribution is then used to compute the probability that the class label currently predicted by the finite ensemble (the current majority class) coincides with the class label that an ensemble of infinite size would predict. Provided that a small amount of uncertainty in the final prediction is acceptable, the voting process can be stopped when the probability estimate exceeds some specified threshold, π. This stopping strategy guarantees that the differences between the classification error of the finite ensemble and of the infinite ensemble are at most 1−π. This is because the differences in error are necessarily smaller than the differences in class predictions. In particular, if the changes in the class labels affect correctly and incorrectly classified instances in approximately equal numbers, the differences in classification error should be much smaller than this upper bound. The validity of this analysis is illustrated in extensive experiments in benchmark classification problems. In these problems, most of the test instances require knowing the output of only a few classifiers to produce a reliable estimate of the asymptotic ensemble prediction. Furthermore, the error of the ensemble in this subset of test instances is very close to the asymptotic (infinite ensemble) limit.
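The bound can be made explicit with a one-line argument: the finite and the infinite ensemble can differ in error on an instance only if they assign it different class labels. In symbols (our notation; the paper's derivation may be stated differently):

```latex
% \hat{y}_t(x): finite-ensemble prediction, \hat{y}_\infty(x): infinite-
% ensemble prediction, y: true label. If voting halts only when
% P(\hat{y}_t(x) = \hat{y}_\infty(x)) \ge \pi, then
\bigl| P\bigl(\hat{y}_t(x) \neq y\bigr) - P\bigl(\hat{y}_\infty(x) \neq y\bigr) \bigr|
  \;\le\; P\bigl(\hat{y}_t(x) \neq \hat{y}_\infty(x)\bigr)
  \;\le\; 1 - \pi .
```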

The organization of the manuscript is as follows: In Section 2 we analyze the prediction process of the ensemble by majority voting. This analysis is used to make inference about the prediction of the ensemble in the infinite-size limit. Section 3 discusses the relation of the present work with analyses found in the literature. In Section 4 the results of experiments in a wide range of classification problems are used to illustrate the validity of the proposed framework. Finally, the results and conclusions of this investigation are summarized in Section 5.

Section snippets

Inference on the asymptotic ensemble prediction

Consider an ensemble $\{h_i(\mathbf{x})\}_{i=1}^{t}$ composed of $t$ classifiers. Assuming that majority voting is used to combine the decisions of the individual predictors, the class label assigned by the ensemble to an unlabeled instance described by the vector of attributes $\mathbf{x}$ is

$$\hat{y}_t = \underset{y \in \mathcal{Y}}{\operatorname{arg\,max}} \; \sum_{i=1}^{t} I\left(h_i(\mathbf{x}) = y\right),$$

where $I$ is an indicator function and $\mathcal{Y} = \{y_1, \ldots, y_l\}$ is the set of possible class labels.
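As an illustration only (not code from the paper), this voting rule can be implemented directly; the `predict` method of the base classifiers is a hypothetical interface:

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Class label assigned by majority voting: the y that maximizes
    the number of individual classifiers predicting y for instance x."""
    votes = Counter(h.predict(x) for h in classifiers)
    return votes.most_common(1)[0][0]  # argmax over vote counts
```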

As described in [10], if the individual classifiers of the ensemble are built independently when conditioned on the …
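The snippet is cut short here, but the inference step it introduces can be sketched as follows. Under the framework of [9], [10], the votes on a given instance behave as draws from a multinomial distribution over the l class labels. Assuming a uniform Dirichlet prior over the unknown class probabilities (our reading of the setup; the exact computation in the paper may differ), the posterior after observing the vote counts is again Dirichlet, and the probability that the current majority class is also the asymptotic winner can be estimated by Monte Carlo sampling:

```python
import numpy as np

def prob_asymptotic_agreement(vote_counts, n_samples=100_000, rng=None):
    """Estimate P(current majority class == infinite-ensemble prediction).

    vote_counts: per-class numbers of votes observed so far.
    Assumes a multinomial vote model with a uniform Dirichlet prior,
    so the posterior is Dirichlet(1 + vote_counts). Illustrative sketch,
    not the authors' implementation.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(vote_counts, dtype=float)
    current_winner = int(np.argmax(counts))
    # Sample class-probability vectors from the Dirichlet posterior.
    p = rng.dirichlet(1.0 + counts, size=n_samples)
    # An infinite ensemble predicts the class with the largest probability.
    return float(np.mean(np.argmax(p, axis=1) == current_winner))
```

Querying classifiers one at a time and halting as soon as this estimate exceeds the threshold π reproduces the dynamic stopping strategy described in the abstract.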

Related work

The analysis of the prediction process presented in the previous section applies to any kind of parallel ensemble in which the individual classifiers are generated in independent executions of a randomized learning algorithm. These include bagging [1], random forest [2], extra-trees [7], subagging [4], rotation forest [6] and class-switching ensembles [5], among others. By contrast, the analysis cannot be directly applied to sequential ensemble algorithms, such as boosting [13], [14]. In …

Experiments

The application of the probabilistic framework for inference on the asymptotic prediction of parallel ensembles is illustrated in a variety of classification problems from the UCI repository [18]. The ensembles used for the empirical validation of this analysis are bagging [1] and random forest [2]. In bagging, the individual classifiers are built by applying the same learning algorithm to independent bootstrap samples of the training set [1]. Each bootstrap sample has the same size as the …
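For concreteness, a minimal sketch of the bootstrap resampling step in bagging (using scikit-learn decision trees as stand-in base learners; the paper's exact learner and settings are not reproduced here):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_bagging_ensemble(X, y, t, rng=None):
    """Train t classifiers, each on a bootstrap sample drawn with
    replacement and of the same size as the original training set.
    X, y are numpy arrays of attributes and class labels."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X)
    ensemble = []
    for _ in range(t):
        idx = rng.integers(0, n, size=n)  # indices of one bootstrap sample
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble
```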

Conclusions

In this paper we have introduced a probabilistic framework for making inference on the asymptotic (infinite) ensemble prediction. For this, we have used the evidence given by the output of a finite set of ensemble classifiers. The analysis presented is based on the estimation of the class prediction probabilities of a single classifier for a given test instance. To estimate these probabilities, the classifiers in the ensemble are queried sequentially. Starting from a uniform prior, Bayes' …

Acknowledgments

The authors acknowledge support from the Spanish Ministerio de Ciencia e Innovación, projects TIN2007-66862-C02-02 and TIN2010-21575-C02-02.

Daniel Hernández-Lobato received the degree of Engineer in Computer Science and the M.Sc. and Ph.D. degrees in Computer Science from Universidad Autónoma de Madrid, Spain, in 2004, 2007 and 2009, respectively. Daniel was the recipient of an FPI grant from Consejería de Educación de la Comunidad de Madrid in 2005. Currently, he is a postdoc researcher at Université catholique de Louvain, Belgium. His research interests include pattern recognition, machine learning methods and Bayesian inference.

Gonzalo Martínez-Muñoz received the university degree in Physics (1995) and Ph.D. degree in Computer Science (2006) from the Universidad Autónoma de Madrid (UAM). From 1996 to 2002, he worked in industry. Until 2008 he was an interim assistant professor in the Computer Science Department of the UAM. During 2008/2009, he worked as a Fulbright postdoc researcher at Oregon State University in the group of Professor Thomas G. Dietterich. He is currently a professor in the Computer Science Department at UAM. His research interests include machine learning, computer vision, pattern recognition, neural networks, decision trees, and ensemble learning.

Alberto Suárez received the degree of Licenciado in Chemistry from the Universidad Autónoma de Madrid (Spain), in 1988, and the Ph.D. degree in Physical Chemistry from the Massachusetts Institute of Technology (Cambridge, USA) in 1993. After holding postdoctoral positions at Stanford University (USA), the Université Libre de Bruxelles (Belgium), and the Katholieke Universiteit Leuven (Belgium), he is currently a professor in the Computer Science Department of the Universidad Autónoma de Madrid (Spain). He has worked on relaxation theory in condensed media, stochastic and thermodynamic theories of non-equilibrium systems, lattice-gas automata, and decision tree induction. His current research interests include machine learning, computational finance, modeling and analysis of time-series, and information processing in the presence of noise.
