Bayesian partition modelling

https://doi.org/10.1016/S0167-9473(01)00073-1

Abstract

This paper reviews recent ideas in Bayesian classification modelling via partitioning. These methods provide predictive estimates for class assignments using averages of a sample of models generated from the posterior distribution of the model parameters. We discuss modifications to the basic approach more suitable for problems when there are many predictor variables and/or a large training sample.

Introduction

The general classification problem involves predicting the class conditional probabilities at any location x within some domain of interest X. This is nearly always achieved by “training” a model with a set of historic data pairings, $D=\{y_i,x_i\}_{i=1}^{n}$. Adopting a Bayesian approach involves determining the posterior predictive distribution, $p(y=k\mid x,D)$, for $x\in X$, where $k\in\{1,\ldots,K\}$ denotes the class label. A model, indexed by parameters θ, is often introduced to capture our beliefs about the underlying probability surface. This allows us to write the (predictive) distribution of interest as
$$p(y\mid x,D)=\int_{\Theta}p(y\mid x,D,\theta)\,p(\theta\mid D)\,d\theta\;\propto\;\int_{\Theta}p(y\mid x,\theta)\,p(D\mid\theta)\,p(\theta)\,d\theta,\tag{1}$$
where Θ is the space defined by all the allowable models.

In essence, predictive Bayesian modelling is straightforward. Once a set of models to perform the classification is identified, all that is required is a prior for θ and the definition of a likelihood p(D|θ). The likelihood is usually defined naturally by the context of the problem, so the choice of prior, and hence the definition of the model space Θ, is critical to the performance of the data analysis. Even though (1) compensates to some degree for model misspecification, by integrating out the model parameters, poor out-of-sample predictive results will ensue if the model space has no support over models close to the truth. Thus, taking Θ to cover a wide variety of possible models is desirable, but this often means that the integral in (1) cannot be performed analytically. Approximations to the integral are readily available by simulation from p(θ|D), most commonly with Markov chain Monte Carlo methods (e.g. Gelfand and Smith, 1990); a schematic sketch of this averaging is given below. However, depending on the form of Θ, the model space can be difficult to sample from, often due to correlation among the components of θ. As efficient estimation of p(y|x,D) relies on an approximation to the posterior p(θ|D), forms for θ which allow efficient sampling should be preferred. One such type of model is the Bayesian partition model of Holmes et al. (1999).
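To make the model averaging in (1) concrete, the following is a minimal sketch of the usual Monte Carlo approximation, assuming posterior draws of θ are already available from some sampler; the names `theta_samples` and `class_prob` are hypothetical and are not taken from the paper.

```python
import numpy as np

def posterior_predictive(x, theta_samples, class_prob):
    """Monte Carlo estimate of p(y | x, D), cf. equation (1).

    theta_samples : sequence of posterior draws theta^(s) ~ p(theta | D),
                    e.g. the output of an MCMC run (hypothetical here).
    class_prob    : function (x, theta) -> length-K vector of p(y = k | x, theta).
    """
    probs = np.array([class_prob(x, theta) for theta in theta_samples])
    # Averaging the per-draw class probabilities approximates the integral
    # over the model space Theta in equation (1).
    return probs.mean(axis=0)
```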

In this paper, we think of partition models as those which explain the data by fitting local distributions to local areas of the predictor space X. Further, we make the extra restriction that the parameters of the local distributions are mutually independent. Partition models of this type are well established in the literature, dating back at least to the work on recursive partitioning by Morgan and Sonquist (1963), which was further developed by Friedman (1979), leading to the well-known book on classification and regression trees by Breiman et al. (1984). Classification trees involve determining partitions of the data which group the datapoints into terminal nodes that are, ideally, dominated by one class. Classification trees are highly interpretable models but suffer from relatively poor out-of-sample predictive performance. This is because each partition of the predictor space is forced to be made on a single predictor, so is necessarily axis-parallel, and the partitioning proceeds greedily, with the locally “best” split made at each step (a sketch of this search is given below). The machine learning community, in particular, has continued to research generalisations of (recursive) partitioning models, most notably improvements in the space of possible splits (e.g. Murthy et al., 1994).
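For contrast with the more flexible partitioning scheme discussed below, the following is a minimal sketch of the greedy, axis-parallel split search that underlies CART-style classification trees; it is an illustration under the stated assumptions, not code from the paper.

```python
import numpy as np

def gini(y):
    """Gini impurity of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_axis_parallel_split(X, y):
    """Greedy search for the single best split 'x_j <= t' over all predictors."""
    n, P = X.shape
    best = (np.inf, None, None)          # (weighted impurity, feature index, threshold)
    for j in range(P):
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            score = (left.sum() * gini(y[left]) +
                     (~left).sum() * gini(y[~left])) / n
            if score < best[0]:
                best = (score, j, t)
    return best
```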

The Bayesian partition model (BPM) is motivated by the same basic principles as earlier models of this type, i.e. points close by in predictor space come from the same local distribution. However, a more flexible partitioning scheme is proposed. Further, Bayesian model averaging ideas (Raftery et al., 1996) allow for prediction using averages (or ensembles) of models, rather than single models, which have a somewhat unappealing discontinuous nature. As well as describing the BPM of Holmes et al. (1999) in detail, we comment on both its strengths and weaknesses. In particular, we introduce a new model which is based on the partition model but is more suited to data-mining applications when there are a large number of training data points and/or a large number of predictor variables (or features). This new model, known as the product partition model (PPM), utilises a modified form of the standard naive Bayes formula (e.g. Domingos and Pazzani, 1997), which assumes
$$p(y\mid x)\;\propto\;p(x\mid y)\,p(y)\;=\;\prod_{i=1}^{P}p(x_i\mid y)\times p(y),$$
where $x=(x_1,\ldots,x_P)$ is a general feature vector assuming P predictors. The main advantage of this model is the speed with which the MCMC algorithm can be run, allowing fast computation of the posterior. This greater speed comes at the cost of slightly poorer predictive power, but the degradation appears acceptable for the large datasets typically encountered in modern data-mining. Further, the method allows the user to identify important variables and gives an idea of how the class assignment is affected by each predictor.
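As a concrete illustration of the naive Bayes factorisation above, the following minimal sketch computes the class posterior from per-feature conditional densities; `cond_densities` and `prior` are hypothetical containers, and the PPM itself modifies this basic formula in ways not shown here.

```python
import numpy as np

def naive_bayes_posterior(x, cond_densities, prior):
    """Class posterior under the naive Bayes factorisation p(y|x) ∝ p(y) Π_i p(x_i|y).

    cond_densities[k][i] : callable returning p(x_i | y = k) for feature i,
                           e.g. a fitted one-dimensional density (hypothetical).
    prior[k]             : p(y = k).
    """
    K = len(prior)
    unnorm = np.array([
        prior[k] * np.prod([cond_densities[k][i](x[i]) for i in range(len(x))])
        for k in range(K)
    ])
    # Normalise the unnormalised class scores to obtain probabilities.
    return unnorm / unnorm.sum()
```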

The paper is arranged as follows. In the next section, we review the BPM of Holmes et al. (1999) and Section 3 then introduces the PPM. Section 4 contains a comparison of the two approaches in both classification error and computational time and lastly Section 5 contains a brief discussion.

Section snippets

Bayesian partition models

Bayesian partition models are motivated by the basic premise that points nearby in predictor space come from the same local distribution. This idea is well known in many statistical models including local polynomial regression (e.g. Fan and Gijbels, 1996) and classification trees (Breiman et al., 1984). Using this idea, we partition the design space X into a number of disjoint regions, where within each region the data are assumed exchangeable. We choose to construct these regions through the
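To illustrate the general mechanism, the following is a minimal sketch assuming regions formed by nearest-centre (Voronoi-style) assignment and independent Dirichlet-multinomial class counts within each region; this is an illustration of the idea rather than the exact construction of Holmes et al. (1999), and all names are hypothetical.

```python
import numpy as np

def region_assignment(X, centres):
    """Assign each point to its nearest centre (a Voronoi-style partition,
    used here purely for illustration)."""
    d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def region_class_probs(y, regions, M, K, alpha=1.0):
    """Posterior mean class probabilities within each of the M regions,
    under independent Dirichlet(alpha) priors on the K class proportions."""
    probs = np.empty((M, K))
    for m in range(M):
        counts = np.bincount(y[regions == m], minlength=K)
        probs[m] = (counts + alpha) / (counts.sum() + K * alpha)
    return probs
```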

Product partition models

Bayesian partition models are particularly effective when the number of predictors is small, but can suffer from (relatively) poor predictive results, as well as becoming computationally expensive, when either the number of predictors or the number of training data examples becomes large. A number of factors are likely to account for this, including:

(a) for every extra predictor variable the dimension of θ increases by M, leading to more computational effort being required to simulate

Collections data example

This example is concerned with retail credit, in particular the conduct of a bank loan, which is monitored throughout its lifetime. Accounts that are deemed to be conducted badly enter a process called collections, where more vigorous effort is made to correct the problem. The organisation that provided this data set is especially interested in being able to identify customers who spend less than 30 days in collections. Thus, we have a two-class classification problem. The bank has a vast

Discussion

In this paper, we have outlined two methods for performing classification using models that partition the predictor space into disjoint regions. These models try to capture local features in the data to aid prediction. The BPM has been shown to be unsuitable when there are many predictors and/or a large number of datapoints, so a new model has been introduced which alleviates some of these problems. This PPM has been shown to be competitive with other standard classification procedures on

Acknowledgments

The work of the first author was supported by a grant from the Nuffield Foundation.

References (17)

  • Freund, Y., 1995. Boosting a weak learning algorithm by majority. Inform. Comput.
  • Breiman, L., et al., 1984. Classification and Regression Trees.
  • Clark, L.A., Pregibon, D., 1992. Tree based models, Statistical Models in S. Wadsworth, Pacific...
  • Denison, D.G.T., Adams, N.A., Hand, D.J., Holmes, C.C., 2000. Product partitioned models. Technical Report, Department...
  • Domingos, P., et al., 1997. On the optimality of the simple Bayesian classifier under zero-one loss. Mach. Learning.
  • Fan, J., et al., 1996. Local Polynomial Modelling and its Applications.
  • Friedman, J.H., 1979. A tree-structured approach to nonparametric multiple regression.
  • Gelfand, A.E., et al., 1990. Sampling-based approaches to calculating marginal densities. J. Amer. Statist. Assoc.
