Fitting multiple change-point models to data☆
Introduction
The change-point model is appropriate for some data sets with a natural ordering. Under this model, the sequence of data can be broken down into segments, with the observations following the same statistical model within each segment but different models in different segments. One example of a change-point model is that in which the data follow a common distributional form (for example, normal) whose parameters (mean, variance or both) change from one segment to another. A more complex example is the discontinuous segmented regression model, in which the observations in each segment follow a linear regression but the parameters of this regression (slopes and/or intercept) change from one segment to the next.
Change-point models involve three issues: the choice of suitable parametric forms for the within-segment models; the choice of segment boundaries, or change-points; and the determination of the appropriate number of change-points to use in modeling the specific data set. Our discussion focuses on the second of these issues. The third is outside the scope of this article but will be commented upon.
The best-known application of change-point modeling in data analysis is that of regression trees. In the most widely used implementation (Breiman et al., 1984), the data set is ordered by a continuous or ordinal predictor and then split into two subsequences – those cases whose predictor value falls below some change-point and those whose predictor value is above the change-point. The change-point is chosen to maximize the separation between the two subsequences. The same binary splitting algorithm is then applied to each of the subsequences, and repeated recursively until the subsequences can no longer be usefully subdivided. This is a “greedy” algorithm – it seeks to select each change-point to maximize an immediate return. As is generally the case with greedy algorithms (and as we shall see later by example) this hierarchic binary splitting, though fast, usually fails to give the optimum splits if there are two or more of them.
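The hierarchic binary splitting described above can be sketched as follows for the normal-mean case. This is a minimal illustration, not the Breiman et al. implementation: the function names (`sse`, `best_binary_split`, `greedy_segmentation`) are hypothetical, the separation criterion is assumed to be the within-segment sum of squares, and the best-first choice of which segment to split next is an assumption made so that the routine stops at exactly k segments.

```python
import numpy as np

def sse(y):
    """Sum of squared deviations of y about its own mean."""
    y = np.asarray(y, dtype=float)
    return float(((y - y.mean()) ** 2).sum())

def best_binary_split(y):
    """Best single split of y into y[:t] and y[t:]; returns (t, pooled SSE)."""
    best_t, best_cost = None, float("inf")
    for t in range(1, len(y)):
        cost = sse(y[:t]) + sse(y[t:])
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

def greedy_segmentation(y, k):
    """Greedy (hierarchic) binary splitting into at most k segments.

    Returns the sorted interior boundaries t, meaning a new segment starts at y[t].
    Assumption: at each step the segment whose best split gives the largest
    reduction in SSE is the one that gets split.
    """
    y = np.asarray(y, dtype=float)
    segments = [(0, len(y))]
    while len(segments) < k:
        best = None
        for a, b in segments:
            if b - a < 2:          # a single observation cannot be split
                continue
            t, cost = best_binary_split(y[a:b])
            gain = sse(y[a:b]) - cost
            if best is None or gain > best[0]:
                best = (gain, a, b, a + t)
        if best is None:           # nothing left to split
            break
        _, a, b, t = best
        segments.remove((a, b))
        segments.extend([(a, t), (t, b)])
    return sorted(a for a, _ in segments if a > 0)
```

Each step commits to its split before later splits are considered, which is what makes the procedure greedy: an early boundary is never revisited.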
In this paper, we provide an exact and reasonably fast algorithm for performing a multiway split. We do this not only for the case of a normal mean (as used in regression trees) but also for an arbitrary parameter in an exponential family model.
In the following sections, we derive the likelihood equations for optimal multiway splitting of data following an exponential-family distribution. We then show that this problem satisfies Bellman's ‘Principle of Optimality’, from which it follows that the optimal splits can be found with a dynamic programming algorithm. Finally, we work out the details for a number of common data modeling distributions and illustrate them with actual data sets.
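To make the dynamic programming idea concrete, here is a minimal sketch for the simplest case, a change in a normal mean with constant variance, where maximizing the likelihood is equivalent to minimizing the pooled within-segment sum of squares. The function name `optimal_segmentation`, the prefix-sum cost helper and the 0-based boundary convention are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def optimal_segmentation(y, k):
    """Exact least-squares split of y into k contiguous segments.

    Returns (boundaries, cost). Each boundary t means a new segment starts
    at y[t]; cost is the pooled within-segment sum of squares.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(y)")

    # Prefix sums let any segment cost be evaluated in O(1).
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def seg_cost(a, b):
        """Sum of squares about the mean for y[a:b]."""
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    # F[r, m]: best cost of splitting the first m observations into r segments.
    F = np.full((k + 1, n + 1), np.inf)
    back = np.zeros((k + 1, n + 1), dtype=int)
    F[0, 0] = 0.0
    for r in range(1, k + 1):
        for m in range(r, n + 1):
            for h in range(r - 1, m):      # h = start of the last segment
                c = F[r - 1, h] + seg_cost(h, m)
                if c < F[r, m]:
                    F[r, m] = c
                    back[r, m] = h

    # Trace the optimal boundaries back from the full sequence.
    bounds, m = [], n
    for r in range(k, 1, -1):
        m = back[r, m]
        bounds.append(int(m))
    return bounds[::-1], float(F[k, n])
```

For example, `optimal_segmentation([0, 0, 0, 5, 5, 5, 1, 1, 1], 3)` recovers the boundaries `[3, 6]`. Written this way the work grows linearly in the number of segments k and quadratically in the sequence length n.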
The exponential family provides a rich set of models for data. Familiar members of the family are the normal distribution, the exponential, the gamma, the binomial and the Poisson. The family also includes normal-error linear regression and some generalized linear models. Starting with the simpler (non-regression) models, the canonical form of the exponential family distribution or density function is
The parameter and the data may each be either scalar or vector-valued. If they are vectors, they must be of the same dimension. Given a random sample of size n, with all observations mutually independent, the sufficient statistic for the parameter is the sample mean.
This statistic is the maximum likelihood estimator (MLE) of the corresponding parametric function, the expected value of an observation, for which it is unbiased. Solving the resulting likelihood equation gives the MLE of the parameter. Substituting this back into the likelihood gives the maximized likelihood.
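For orientation, one common textbook way of writing a canonical exponential family, together with the quantities mentioned above, is sketched below. The symbols $\theta$, $b(\cdot)$, $c(\cdot)$ and $\bar{X}$ are assumed notation chosen for illustration; the paper's own display may use a different but equivalent parameterization.

```latex
\begin{align*}
f(x \mid \theta) &= \exp\{\theta^{\top} x - b(\theta) + c(x)\}
  && \text{(canonical form, assumed notation)} \\
\bar{X} &= \frac{1}{n}\sum_{i=1}^{n} X_i
  && \text{(sufficient statistic)} \\
E(X) &= \nabla b(\theta), \qquad \nabla b(\hat{\theta}) = \bar{X}
  && \text{(likelihood equation for } \hat{\theta}\text{)} \\
\log L(\hat{\theta}) &= n\{\hat{\theta}^{\top}\bar{X} - b(\hat{\theta})\} + \sum_{i=1}^{n} c(X_i)
  && \text{(maximized log-likelihood)}
\end{align*}
```

In this form the inner product $\theta^{\top} x$ is what requires the parameter and the data to have the same dimension.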
The change-point model
Now extend the formulation to the change-point model. In this model, there are a number of change-points, τ1,τ2,…,τk−1, such that the observations with τj−1<i≤τj follow the particular exponential family model with a parameter specific to segment j. In other words, the distributional form remains the same for all segments, but the parameter changes whenever one crosses one of the change-points τj.
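As a sketch of how this model leads to a dynamic program, the segmented log-likelihood and a Bellman-type recursion can be written as below. The conventions $\tau_0 = 0$ and $\tau_k = n$, the segment parameters $\theta_j$ and the segment cost $Q(h, m)$ are assumed notation introduced for illustration; $F(r, m)$ denotes minus twice the maximized log-likelihood for $r$ segments fitted to the first $m$ observations.

```latex
\begin{align*}
\log L(\theta_1,\dots,\theta_k;\ \tau_1,\dots,\tau_{k-1})
  &= \sum_{j=1}^{k}\ \sum_{i=\tau_{j-1}+1}^{\tau_j} \log f(X_i \mid \theta_j),
  \qquad \tau_0 = 0,\ \tau_k = n, \\
Q(h, m) &= -2 \max_{\theta} \sum_{i=h+1}^{m} \log f(X_i \mid \theta), \\
F(1, m) &= Q(0, m), \\
F(r, m) &= \min_{r-1 \le h < m} \{ F(r-1, h) + Q(h, m) \}, \qquad r = 2,\dots,k .
\end{align*}
```

The recursion works because the likelihood is a sum of independent per-segment contributions, so the best fit of $r$ segments to the first $m$ observations must contain the best fit of $r-1$ segments to the observations before its last change-point.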
As there are k−1 change-points, there are a total of k segments in this model. To simplify notation, we will
Particular applications
Change-point in normal mean: We start with the familiar example of scalar normal data with constant variance, where the mean may change from one segment to the next. This problem and the DP solution are discussed in more detail in Hawkins and Merriam (1973, 1975). As this is the problem addressed by regression trees (Breiman et al., 1984), it is particularly interesting to compare their implementation with exact optimization.
Turning the normal density into canonical
Formal testing for the number of segments
F(k,n) is minus twice the maximized log-likelihood of the model fitting k segments to the full sequence of data. It therefore gives rise to generalized likelihood ratio (GLR) tests:
To test the null hypothesis of a single segment versus the alternative of k segments, the GLR statistic is F(1,n)−F(k,n).
To test the null hypothesis of at most (k−1) segments against the alternative of k, the GLR statistic is F(k−1,n)−F(k,n).
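The two statistics above are simple differences of the fitted F(k,n) values, as the following sketch shows. The function name `glr_statistics` and the input layout are hypothetical, and the numbers in the usage note are illustrative values, not results from the paper. No reference distribution is attached to the statistics here.

```python
def glr_statistics(F):
    """Compute GLR statistics from fitted values F(1,n), ..., F(K,n).

    F is a sequence with F[k-1] = F(k, n). Returns two dicts keyed by k = 2..K:
    vs_single[k]   = F(1,n)   - F(k,n)   (one segment versus k segments)
    incremental[k] = F(k-1,n) - F(k,n)   (at most k-1 segments versus k)
    """
    K = len(F)
    vs_single = {k: F[0] - F[k - 1] for k in range(2, K + 1)}
    incremental = {k: F[k - 2] - F[k - 1] for k in range(2, K + 1)}
    return vs_single, incremental
```

With the illustrative input `[42.0, 24.0, 0.0]`, the statistic against a single segment is 18.0 for k = 2 and 42.0 for k = 3, while the incremental statistics are 18.0 and 24.0.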
On the face of it, the incremental change F(k−1,n)−F(k,n) should follow an
A regression-tree-type example
We start with a data set showing a non-linear relationship between a predictor and a dependent variable. In the absence of a parametric model, this data set might be analyzed with a regression tree. The data set is shown in Fig. 1a, and the optimal segmentations into 2,3,…,6 segments are shown in Fig. 1b. Table 1 shows the optimal segment boundaries, the pooled residual sum of squares F(r,n), and the change in residual sum of squares as we go from one value of r to the next. Note that the
Conclusion
The change-point model for the general exponential family can be thought of as a generalized non-linear model. As such, it would seem to be computationally intensive in the number of non-linear parameters, the change-points. On the contrary, the model can be fitted in a time linear in the number of change-points using a dynamic programming formulation, making it quite a small task for moderate-size data sets.
We have discussed the single-parameter exponential family in some detail. The
Acknowledgements
The author is grateful to the referees for several suggestions for improving the paper.
References (19)
- et al., 1979. Zonation of sequences of heteroscedastic multivariate data. Comput. Geosci.
- et al., 1996. Finding multiple abrupt change points. Comput. Statist. Data Anal.
- et al., 1962. Applied Dynamic Programming.
- et al., 1969. Curve fitting by segmented straight lines. J. Amer. Statist. Assoc.
- Bhattacharya, P.K., 1994. Some aspects of change-point analysis. In: Carlstein, E., Muller, H.G., Siegmund, D. (Eds.),...
- Breiman et al., 1984. Classification and Regression Trees.
- et al., 1997. Testing and locating variance change-points with applications to stock prices. J. Amer. Statist. Assoc.
- 2000. Multiple-changepoint testing for an alternating segments model of a binary sequence. Biometrics.
- 1972. On the choice of segments in piecewise approximation. J. Inst. Math. Appl.
☆ Work supported by the National Science Foundation under grant DMS 9803622.