Bayesian partition modelling

https://doi.org/10.1016/S0167-9473(01)00073-1

Abstract

This paper reviews recent ideas in Bayesian classification modelling via partitioning. These methods provide predictive estimates for class assignments using averages of a sample of models generated from the posterior distribution of the model parameters. We discuss modifications to the basic approach more suitable for problems when there are many predictor variables and/or a large training sample.

Introduction

The general classification problem involves predicting the class conditional probabilities at any location x within some domain of interest X. This is nearly always achieved by “training” a model with a set of historic data pairings, $D=\{y_i,x_i\}_{i=1}^{n}$. Adopting a Bayesian approach involves determining the posterior predictive distribution, $p(y=k\mid x,D)$, for $x\in X$, where $k\in\{1,\ldots,K\}$ denotes the class label. A model, indexed by parameters θ, is often introduced to capture our beliefs about the underlying probability surface. This allows us to write the (predictive) distribution of interest as
$$p(y\mid x,D)=\int_{\Theta}p(y\mid x,D,\theta)\,p(\theta\mid D)\,d\theta\;\propto\;\int_{\Theta}p(y\mid x,\theta)\,p(D\mid\theta)\,p(\theta)\,d\theta,\tag{1}$$
where Θ is the space defined by all the allowable models.

In essence, predictive Bayesian modelling is straightforward. Once a set of models to perform the classification is identified, all that is required is a prior for θ and the definition of a likelihood p(D|θ). The likelihood is usually defined naturally by the context of the problem, so the choice of prior, and hence the definition of the model space Θ, is critical to the performance of the data analysis. Even though (1) compensates to some degree for model misspecification, by integrating out the model parameters, poor out-of-sample predictive results will ensue if the model space has no support over models close to the truth. Thus, taking Θ to cover a wide variety of possible models is desirable, but this often means that the integral in (1) cannot be performed analytically. Approximations to the integral are readily available by simulation from p(θ|D), most commonly with Markov chain Monte Carlo methods (e.g. Gelfand and Smith, 1990); a schematic sketch of this averaging is given below. However, depending on the form of Θ, the model space can be difficult to sample from, often due to correlation among the components of θ. As efficient estimation of p(y|x,D) relies on an approximation to the posterior p(θ|D), forms for θ which allow efficient sampling should be preferred. One such type of model is the Bayesian partition model of Holmes et al. (1999).
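To make the model averaging in (1) concrete, the following is a minimal sketch of the usual Monte Carlo approximation, assuming posterior draws of θ are already available from some sampler; the names `theta_samples` and `class_prob` are hypothetical and are not taken from the paper.

```python
import numpy as np

def posterior_predictive(x, theta_samples, class_prob):
    """Monte Carlo estimate of p(y | x, D), cf. equation (1).

    theta_samples : sequence of posterior draws theta^(s) ~ p(theta | D),
                    e.g. the output of an MCMC run (hypothetical here).
    class_prob    : function (x, theta) -> length-K vector of p(y = k | x, theta).
    """
    probs = np.array([class_prob(x, theta) for theta in theta_samples])
    # Averaging the per-draw class probabilities approximates the integral
    # over the model space Theta in equation (1).
    return probs.mean(axis=0)
```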

In this paper, we think of partition models as those which explain the data by fitting local distributions to local areas of the predictor space X. Further, we make the extra restriction that the parameters of the local distributions are mutually independent. Partition models of this type are well established in the literature, dating back at least to the work on recursive partitioning by Morgan and Sonquist (1963), which was further developed by Friedman (1979), leading to the well-known book on classification and regression trees by Breiman et al. (1984). Classification trees involve determining partitions of the data which group the datapoints into terminal nodes that are, ideally, dominated by one class. Classification trees are highly interpretable models but suffer from relatively poor out-of-sample predictive performance. This is because each partition of the predictor space is forced to be made on a single predictor, so is necessarily axis-parallel, and the partitioning proceeds greedily, with the locally “best” split made at each step (a sketch of this search is given below). The machine learning community, in particular, has continued to research generalisations of (recursive) partitioning models, most notably improvements in the space of possible splits (e.g. Murthy et al., 1994).
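For contrast with the more flexible partitioning scheme discussed below, the following is a minimal sketch of the greedy, axis-parallel split search that underlies CART-style classification trees; it is an illustration under the stated assumptions, not code from the paper.

```python
import numpy as np

def gini(y):
    """Gini impurity of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_axis_parallel_split(X, y):
    """Greedy search for the single best split 'x_j <= t' over all predictors."""
    n, P = X.shape
    best = (np.inf, None, None)          # (weighted impurity, feature index, threshold)
    for j in range(P):
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            score = (left.sum() * gini(y[left]) +
                     (~left).sum() * gini(y[~left])) / n
            if score < best[0]:
                best = (score, j, t)
    return best
```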

The Bayesian partition model (BPM) is motivated by the same basic principles as earlier models of this type, i.e. points close by in predictor space come from the same local distribution. However, a more flexible partitioning scheme is proposed. Further, Bayesian model averaging ideas (Raftery et al., 1996) allow for prediction using averages (or ensembles) of models, rather than single models, which have a somewhat unappealing discontinuous nature. As well as describing the BPM of Holmes et al. (1999) in detail, we comment on both its strengths and weaknesses. In particular, we introduce a new model which is based on the partition model but is more suited to data-mining applications when there are a large number of training data points and/or a large number of predictor variables (or features). This new model, known as the product partition model (PPM), utilises a modified form of the standard naive Bayes formula (e.g. Domingos and Pazzani, 1997), which assumes
$$p(y\mid x)\;\propto\;p(x\mid y)\,p(y)\;=\;\prod_{i=1}^{P}p(x_i\mid y)\times p(y),$$
where $x=(x_1,\ldots,x_P)$ is a general feature vector assuming P predictors. The main advantage of this model is the speed with which the MCMC algorithm can be run, allowing fast computation of the posterior. This greater speed comes at the cost of slightly poorer predictive power, but the degradation appears acceptable for the large datasets typically encountered in modern data-mining. Further, the method allows the user to identify important variables and gives an idea of how the class assignment is affected by each predictor.
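As a concrete illustration of the naive Bayes factorisation above, the following minimal sketch computes the class posterior from per-feature conditional densities; `cond_densities` and `prior` are hypothetical containers, and the PPM itself modifies this basic formula in ways not shown here.

```python
import numpy as np

def naive_bayes_posterior(x, cond_densities, prior):
    """Class posterior under the naive Bayes factorisation p(y|x) ∝ p(y) Π_i p(x_i|y).

    cond_densities[k][i] : callable returning p(x_i | y = k) for feature i,
                           e.g. a fitted one-dimensional density (hypothetical).
    prior[k]             : p(y = k).
    """
    K = len(prior)
    unnorm = np.array([
        prior[k] * np.prod([cond_densities[k][i](x[i]) for i in range(len(x))])
        for k in range(K)
    ])
    # Normalise the unnormalised class scores to obtain probabilities.
    return unnorm / unnorm.sum()
```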

The paper is arranged as follows. In the next section, we review the BPM of Holmes et al. (1999) and Section 3 then introduces the PPM. Section 4 contains a comparison of the two approaches in both classification error and computational time and lastly Section 5 contains a brief discussion.

Section snippets

Bayesian partition models

Bayesian partition models are motivated by the basic premise that points nearby in predictor space come from the same local distribution. This idea is well known in many statistical models including local polynomial regression (e.g. Fan and Gijbels, 1996) and classification trees (Breiman et al., 1984). Using this idea, we partition the design space X into a number of disjoint regions, where within each region the data are assumed exchangeable. We choose to construct these regions through the
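To illustrate the general mechanism, the following is a minimal sketch assuming regions formed by nearest-centre (Voronoi-style) assignment and independent Dirichlet-multinomial class counts within each region; this is an illustration of the idea rather than the exact construction of Holmes et al. (1999), and all names are hypothetical.

```python
import numpy as np

def region_assignment(X, centres):
    """Assign each point to its nearest centre (a Voronoi-style partition,
    used here purely for illustration)."""
    d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def region_class_probs(y, regions, M, K, alpha=1.0):
    """Posterior mean class probabilities within each of the M regions,
    under independent Dirichlet(alpha) priors on the K class proportions."""
    probs = np.empty((M, K))
    for m in range(M):
        counts = np.bincount(y[regions == m], minlength=K)
        probs[m] = (counts + alpha) / (counts.sum() + K * alpha)
    return probs
```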

Product partition models

Bayesian partition models are particularly effective when the number of predictors is small, but can suffer from (relatively) poor predictive results, as well as becoming computationally expensive, when either the number of predictors or the number of training data examples becomes large. A number of factors are likely to account for this, including:

(a) for every extra predictor variable the dimension of θ increases by M, leading to more computational effort being required to simulate

Collections data example

This example is concerned with retail credit, in particular the conduct of a bank loan, which is monitored throughout its lifetime. Accounts that are deemed to be conducted badly enter a process called collections, where more vigorous effort is made to correct the problem. The organisation that provided this data set is especially interested in being able to identify customers who spend less than 30 days in collections. Thus, we have a two-class classification problem. The bank has a vast

Discussion

In this paper, we have outlined two methods for performing classification using models that partition the predictor space into disjoint regions. These models try to capture local features in the data to aid prediction. The BPM has been shown to be unsuitable when there are many predictors and/or a large number of datapoints, so a new model has been introduced which alleviates some of these problems. This PPM has been shown to be competitive with other standard classification procedures on

Acknowledgments

The work of the first author was supported by a grant from the Nuffield Foundation.

References (17)

  • Freund, Y., 1995. Boosting a weak learning algorithm by majority. Inform. Comput.
  • Breiman, L., et al., 1984. Classification and Regression Trees.
  • Clark, L.A., Pregibon, D., 1992. Tree based models, Statistical Models in S. Wadsworth, Pacific...
  • Denison, D.G.T., Adams, N.A., Hand, D.J., Holmes, C.C., 2000. Product partitioned models. Technical Report, Department...
  • Domingos, P., et al., 1997. On the optimality of the simple Bayesian classifier under zero-one loss. Mach. Learning.
  • Fan, J., et al., 1996. Local Polynomial Modelling and its Applications.
  • Friedman, J.H., 1979. A tree-structured approach to nonparametric multiple regression.
  • Gelfand, A.E., et al., 1990. Sampling-based approaches to calculating marginal densities. J. Amer. Statist. Assoc.
