
Semiparametric maximum likelihood probability density estimation

Abstract

A comprehensive methodology for semiparametric probability density estimation is introduced and explored. The probability density is modelled by sequences of mostly regular or steep exponential families generated by flexible sets of basis functions, possibly including boundary terms. Parameters are estimated by global maximum likelihood without any roughness penalty. A statistically orthogonal formulation of the inference problem and a numerically stable and fast convex optimization algorithm for its solution are presented. Automatic model selection over the type and number of basis functions is performed with the Bayesian information criterion. The methodology can naturally be applied to densities supported on bounded, infinite or semi-infinite domains without boundary bias. Relationships to the truncated moment problem and the moment-constrained maximum entropy principle are discussed and a new theorem on the existence of solutions is contributed. The new technique compares very favourably to kernel density estimation, the diffusion estimator, finite mixture models and local likelihood density estimation across a diverse range of simulation and observation data sets. The semiparametric estimator combines a very small mean integrated squared error with a high degree of smoothness which allows for a robust and reliable detection of the modality of the probability density in terms of the number of modes and bumps.

1 Introduction

Estimating a probability density function from a given data sample is one of the oldest and most fundamental smoothing problems in statistics with a vast literature on the underlying theoretical issues and numerous applications in almost all areas of science, for example, in astronomy, bioinformatics, climate science, finance, economics, engineering, geoscience and medicine. Methods for probability density estimation can be classified into three groups: parametric, nonparametric and semiparametric approaches.

In the parametric approach, a small number of parameters are fitted by maximum likelihood or the method of moments within some prescribed model family such as, for example, the Gaussian, gamma, log-normal or beta distribution. Obviously, the flexibility of this approach is limited as the chosen distribution family imposes quite a strong constraint on the range of probability densities which can be covered.

Nonparametric probability density estimation offers a far greater flexibility in modelling a given data sample. The most widely used nonparametric technique is kernel density estimation [1–3]. Any kernel density estimate depends crucially on the choice of bandwidth [1, 4]. Basically, two different data-driven bandwidth selection techniques are available: the classical cross-validation approach [1] and the plug-in method [5]. Until relatively recently the plug-in method was based on the normal reference rule [6], which can adversely affect the density estimate if the normality assumption does not hold very well, particularly for bi- or multimodal densities. An improved plug-in bandwidth selection technique that drops the normal reference rule was proposed in [7], and its superior practical performance was demonstrated on a host of different simulation examples.

Standard kernel density estimation with second-order kernels has a couple of limitations. Firstly, in one dimension no positive kernel can have a mean integrated squared error decaying faster than N^{−4/5} with N being the sample size. In higher dimensions the speed of convergence is even slower. Secondly, kernel density estimates on bounded or semi-infinite domains exhibit considerable boundary bias as the kernel does not take account of information about the support of the density. Thirdly, standard kernels lack local adaptivity which often leads to the occurrence of spurious bumps and a tendency to flatten the peaks and valleys of the density.

Higher-order kernels have been used to improve accuracy (for example, [8, 9]), but these have the drawback of not guaranteeing to provide a proper non-negative density estimate. Boundary bias has been tackled by boundary kernel estimators [10] and boundary correction schemes [11]. The lack of local adaptivity has been addressed by adaptive kernel estimators [12–14]. The most systematic and comprehensive solution attempt to all of the three issues is the diffusion estimator [7] which has improved asymptotic accuracy, deals naturally with domain boundaries, features data-adaptive smoothing and always provides a bona fide probability density, that is, it is non-negative and integrates to unity.

Semiparametric density estimation methodologies lie in between parametric and nonparametric approaches, combining advantages of both. While technically they are parametric the number of parameters and thus the complexity of the model may vary in a data-driven manner, usually increasing when more data are available. Semiparametric density estimates are considerably more flexible than the parametric approach but usually smoother than kernel density estimates. This is particularly advantageous when assessing the number of modes or bumps of a distribution. In the following four different semiparametric approaches are discussed.

Finite mixture models [15] are convex linear combinations of component densities taken from a standard family of distributions. Gaussian mixture models are most common; on bounded or semi-infinite domains also beta, gamma, Weibull or other mixture models are possible. The number of components can be treated as a hyperparameter and determined by cross-validation or with an information criterion. Alternatively, it can be treated as a random variable and estimated in a Bayesian framework.

In local likelihood density estimation [16, 17] the probability density at a given point is estimated by assuming a simple local model for the log-density, typically constant, linear or quadratic, in the neighbourhood of that point and inferring its parameters by maximum likelihood. The domain of influence of the local model is described by a kernel function with a bandwidth parameter. The method always gives a non-negative density estimate but exhibits boundary bias on bounded or semi-infinite intervals.

Another approach is to expand the density into a system of orthogonal functions. The idea goes back to the classical Edgeworth and Gram–Charlier type A series where the density is expanded into Hermite polynomials about a Gaussian reference density and expansion coefficients are determined by moment matching. These classical expansions are usually limited to just a few terms and can cover only densities relatively close to a Gaussian. The approach can be generalised [18, 19] to arbitrary domains of support and boundary conditions by choosing an appropriate set of orthogonal functions, either polynomial or trigonometric, and determining the expansion coefficients by projection of the basis functions onto the data sample, which is particularly simple. See [20] for an extensive review. However, the density estimates obtained from a truncated set of basis functions are not necessarily non-negative. A similar method which always provides a bona fide density estimate expands a density on a bounded interval into Bernstein polynomials and derives expansion coefficients from a discretised version of the empirical distribution function [21]. For all of these series expansions the inference connects neither to maximum likelihood nor to the method of moments, which is somewhat unsatisfactory.

A more principled methodology may be obtained from expanding the logarithm of the density rather than the density itself into a set of basis functions. While being technically more involved this approach automatically yields a bona fide probability density for any truncation of the set of basis functions. The inference is rooted in maximum likelihood estimation in sequences of exponential families which is very well understood and has nice properties such as strict concavity of the log-likelihood function and convergence in relative entropy [22, 23]. The idea first occurred with the Gram–Charlier type C series where the log-density on the infinite domain is expanded into Hermite polynomials. It was later proposed as maximum penalized likelihood density estimation [24–27] where the log-likelihood function is augmented with a roughness penalty functional as regularisation term to enforce smoothness of the density. Log-density estimation without a roughness penalty has been done using polynomials [28], splines [29–31] and wavelets [32] as basis functions. Log-density estimation with polynomial basis functions was used to estimate large-deviation rate functions [33] and was extended to a non-stationary setting in order to detect and predict critical transitions in dynamical systems [34]. Maximum likelihood log-density estimation is equivalent to the maximum entropy method [35–38]. A nonparametric extension of the maximum entropy method was proposed recently [39].

The present paper studies one-dimensional semiparametric log-density estimation from a theoretical as well as numerical point of view. It extends previous work by including the following novel elements:

  1. (i) Bounded, infinite and semi-infinite domains of support of the density are considered.
  2. (ii) Different types of bulk basis functions, namely polynomial, spline and trigonometric [23], are considered at the same time; systematic and automatic model selection over the type and number of basis functions with the Bayesian information criterion is proposed. While not theoretically deep, this has, to the best knowledge of the author, not been done before, and it proves very beneficial in terms of the practical performance of the estimator.
  3. (iii) Linearly extrapolated polynomial basis functions are derived for infinite and semi-infinite domains which have better inference properties than the standard polynomials.
  4. (iv) The spline basis functions are here not restricted to natural splines; they are more flexible as they have no curvature condition at the endpoints. Moreover, novel knot placement and deletion schemes are proposed.
  5. (v) The bulk basis functions are augmented with logarithmic or rational boundary terms, which increases the efficiency of the method for domains with boundaries. For example, logarithmic terms allow the estimator to capture power-law decay of the probability density to zero at a boundary point.
  6. (vi) A statistically orthogonal formulation of the inference problem is introduced based on orthogonalisation of the basis functions with respect to a data-adaptive scalar product. While the use of classical orthogonal polynomials such as Legendre polynomials [28] or B-splines [29, 31] is standard in log-density estimation, here basis functions are constructed which are exactly orthogonal over the data set under consideration, providing very good conditioning of the optimization problem close to the solution even for rather high-dimensional models.
  7. (vii) A comprehensive comparison of the new method with standard methods such as kernel density estimation and finite mixture models is performed.

The remainder of the paper is organised as follows: In Section 2 the general methodology is developed. Bounded, infinite and semi-infinite domains of support are discussed in detail in Sections 3, 4 and 5. Section 6 explains the algorithm for maximizing the likelihood function; Section 7 details the model selection procedure. The results are presented and discussed in Section 8. Some conclusions are drawn in Section 9.

2 Semiparametric log-density estimation

We address the problem of estimating the univariate probability density function fX(x) of the random variable X from a given data sample. The domain of support of the probability density, DX, may be bounded (DX = [a, b]), infinite (DX = (−∞, ∞)) or semi-infinite (DX = [a, ∞)). Often it is numerically convenient to introduce a linear rescaling Y = (X − c)/s and y = (x − c)/s with suitable constants c and s > 0. The domain DX is mapped onto a domain DY. The scaled probability density fY(y) is linked to fX(x) by fY((x − c)/s) = sfX(x). In the following, for ease of notation, the subscript Y when referring to quantities associated with the rescaled variable Y will be dropped and only the subscript X will be kept when referring to quantities associated with the original variable X.

2.1 The probability density model

The probability density function is modelled as (1) where the potential function U is expanded as (2) with a given function ψ, a set of prescribed basis functions {ϕ1, …, ϕJ} and expansion coefficients α = (α1, …, αJ)T. The normalization constant or partition function is given by (3) The potential function of the normalised probability density is (4) with the log-partition function (5) For a well-specified and non-redundant model we require the set of functions {1, ψ, ϕ1, …, ϕJ} to be linearly independent on D. The potential function U is only determined up to an arbitrary additive constant.
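As a concrete illustration, the following minimal Python sketch evaluates the normalised density model of Eqs (1)–(3) for a given coefficient vector on a bounded domain; the function names, the default choice ψ = 0 and the use of Gauss–Legendre quadrature are assumptions of this illustration rather than the implementation used in the paper.

import numpy as np

def density_model(alpha, basis, psi=lambda y: 0.0, domain=(-1.0, 1.0), n_quad=200):
    # p(y) = exp[psi(y) + sum_j alpha_j * phi_j(y)] / Z on a bounded domain,
    # with the partition function Z (Eq (3)) computed by Gauss-Legendre quadrature.
    nodes, weights = np.polynomial.legendre.leggauss(n_quad)
    a, b = domain
    y_q = 0.5 * (b - a) * nodes + 0.5 * (b + a)    # quadrature nodes mapped to [a, b]
    w_q = 0.5 * (b - a) * weights

    def potential(y):                              # potential U(y), Eq (2)
        U = psi(y) + np.zeros_like(np.asarray(y, dtype=float))
        for a_j, phi_j in zip(alpha, basis):
            U = U + a_j * phi_j(y)
        return U

    Z = np.sum(w_q * np.exp(potential(y_q)))       # partition function, Eq (3)
    return (lambda y: np.exp(potential(y)) / Z), Z

# Example: a quadratic potential on [-1, 1], giving a truncated-Gaussian-like density
p, Z = density_model(alpha=[-0.5, -2.0], basis=[lambda y: y, lambda y: y ** 2])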

The probability density model is a Gibbs or Boltzmann distribution. In statistical terms, it is an exponential family [40–42] of order J in canonical form with canonical parameters α = (α1, …, αJ)T and canonical sufficient statistics ϕ(y) = (ϕ1(y), …, ϕJ(y))T. The exponential family is minimal and full. The canonical parameter space is a convex subset of ℝ^J. The exponential family is called regular if the canonical parameter space is an open set. The cumulant generating function of the sufficient statistics is K(t|α) = A(α + t) − A(α).

The choice of basis functions is very flexible which makes the modelling framework powerful. Depending on the support of the density and the nature of the data to be investigated the basis functions could be polynomials, spline functions, trigonometric functions or wavelets, possibly augmented with logarithmic or rational boundary terms.

The base measure of the exponential family is h(y) = exp[ψ(y)]. We here mostly have ψ(y) = 0, but the option allows some well-known distributions to be included in the modelling framework, such as the Rayleigh distribution (U(y) = log y + α1 y^2), the Weibull distribution with known shape parameter (U(y) = (k − 1) log y + α1 y^k) or the inverse Gaussian distribution (U(y) = −(3/2) log y + α1(1/y) + α2 y). Another application could be a choice ψ(y) = log Π(y) = log s + log ΠX(c + sy) where ΠX(x) is some pilot probability density of X obtained from the data, for example, via kernel density estimation, which is then post-processed or refined by the exponential family model. This was proposed in the literature [43] and Poisson regression used for the inference. As such the present work contributes a new algorithm for this type of problem but we will not pursue this here.

Ideally, the model class for the probability densities of X and Y should be the same, that is, there is a bijective mapping between the parameter vector α for p(y) and a parameter vector for pX(x) such that the corresponding potential functions agree under the rescaling up to an additive constant C which may depend on α. Then maximum likelihood estimation for p(y) is equivalent to maximum likelihood estimation directly for pX(x). This invariance of the exponential family under linear scaling is satisfied in all of the models presented here.

2.2 Parameter estimation

Given a data sample of size N, {x1, …, xN}, the data points are rescaled as yn = (xn − c)/s; n = 1, …, N. The sample mean of any function e(y) on D is introduced as (6) The log-likelihood function is (7) the log-likelihood function of the original data sample is lX = l − N log s. In the interior of the parameter space, the population means of the basis functions, m = (m1, …, mJ)T, are (8) the second population moments are (9) The gradient of the log-likelihood function, the score function, ∇αl = g = (g1, …, gJ)T, is given by (10) and the Hessian matrix of second derivatives H is (11) A stationary point of the log-likelihood function is a zero of the score function and is characterised by the likelihood equations (12) For a minimal and full exponential family the log-partition function is real analytic and strictly convex, the log-likelihood function is strictly concave and the Hessian matrix is negative definite on the interior of the parameter space [40–42]. This holds for any set of basis functions and any truncation level J. If a solution to the likelihood equations exists it is unique and corresponds to the unique global maximum of the likelihood function. The maximum likelihood estimate of p is then obtained, and the maximum likelihood estimate of fX is given by (13) Eq (13) constitutes the formal definition of the semiparametric density estimator to which we will refer later.

The likelihood equations have an intuitive interpretation. At the solution the population means of the basis functions match the sample means. For polynomial basis functions maximum likelihood estimation is identical to the classical method of moments; for other basis functions it corresponds to an extended method of moments with general moment functions.
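In a canonical exponential family the score and Hessian of Eqs (10) and (11) reduce to moment differences and (negative) covariances of the sufficient statistics. The following sketch computes them with the population moments approximated by quadrature; the overall scaling by N and the helper signatures are assumptions of this illustration.

import numpy as np

def score_and_hessian(alpha, basis, y_sample, y_q, w_q, psi=None):
    # Score g and Hessian H of the log-likelihood:
    #   g_j = N * (sample mean of phi_j - population mean m_j)   (cf. Eqs (8), (10))
    #   H   = -N * (Q - m m^T)                                   (cf. Eqs (9), (11))
    # with population moments evaluated by quadrature on the nodes/weights (y_q, w_q).
    N = len(y_sample)
    Phi_s = np.column_stack([phi(y_sample) for phi in basis])    # N x J basis values at the data
    Phi_q = np.column_stack([phi(y_q) for phi in basis])         # n_quad x J at quadrature nodes
    U_q = Phi_q @ np.asarray(alpha)
    if psi is not None:
        U_q = U_q + psi(y_q)
    w_p = w_q * np.exp(U_q)
    w_p /= w_p.sum()                                             # quadrature weights of p(y)
    m = Phi_q.T @ w_p                                            # population means
    Q = (Phi_q * w_p[:, None]).T @ Phi_q                         # second population moments
    g = N * (Phi_s.mean(axis=0) - m)
    H = -N * (Q - np.outer(m, m))
    return g, H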

The question of existence of the maximum likelihood estimator is linked to the regularity or steepness of the exponential family. For regular or steep exponential families the maximum likelihood estimator exists for any generic data sample. Almost all models considered here, in particular the most relevant ones in practice, are regular. This includes all models generated by (linearly extrapolated) polynomials and splines as well as trigonometric functions, possibly augmented with logarithmic boundary terms, on bounded, infinite and semi-infinite domains. See Sections 3, 4 and 5 for details on the basis functions for the different domains. A detailed discussion of the existence of the maximum likelihood estimator is given in S1 Appendix.

There are relationships between maximum likelihood estimation in exponential families and the truncated moment problem [44–46] as well as the moment-constrained maximum entropy principle [35, 47, 48]. These relationships are not new, yet they appear not to have been fully exploited as the different problems are studied in disjoint scientific communities. In statistics, exponential families generated by polynomials are rarely discussed beyond the normal, exponential and truncated normal distributions; on the other hand, the non-statistical community seems to be often unaware of the statistical link. The different problems provide complementary points of view on the existence of solutions. A discussion is given in S2 Appendix. In particular the statements in Theorem 2 and subsequently Theorem 3 therein have, to the best knowledge of the author, not been given in the literature before. The proposition is that the conditions of positive definiteness of the relevant Hankel matrices which in most settings guarantee the existence of a unique moment-constrained maximum entropy solution are always satisfied provided that the moment estimates are given as sample moments of a generic data sample. The proposition equally holds for bounded, infinite and semi-infinite intervals. Note that here we are using newly introduced linearly extrapolated polynomials rather than standard polynomials for infinite and semi-infinite domains because of their superior inference properties (see Subsections 4.1 and 5.1).

2.3 Parameter uncertainty

The expected Fisher information matrix for exponential families is equal to the observed Fisher information matrix and both are given as (14) where the Hessian of the log-likelihood is evaluated at the maximum likelihood solution. Parameter errors are asymptotically normal with zero mean and covariance matrix C = J^{−1}.

For dependent data, for example, if the data sample comes as a time series with temporal correlation, we use an error covariance inflation technique [49]. The general expression for the error covariance matrix is (15) In the case of independent data we have V = J and recover the result above. The contribution of data point yn to the log-likelihood is ln = U(yn) − log Z. A block length lb is chosen such that data in disjoint blocks of lb consecutive data points can be considered independent. We assume that N is an integer multiple of lb. The block variables (16) are formed for i = 1, …, N/lb. Their covariance matrix is estimated empirically from the sample and we obtain (17) If N is not an integer multiple of lb the number of blocks is [N/lb], the largest integer smaller than or equal to N/lb. Then V is multiplied by N/([N/lb]lb) to get the final estimate.

An ensemble of size M of parameter vectors is generated by adding perturbations Lvi to the maximum likelihood estimate, where the vi are column vectors of length J of independent standard normal random variables and L is obtained from the Cholesky decomposition C = LL^T. Only samples satisfying the slope inequality constraints (see Subsections 4.1 and 5.1), if any, are accepted. This yields an ensemble of probability densities consistent with the parameter uncertainty which can be used to construct confidence intervals for quantities of interest, for example, quantiles of the distribution.
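A minimal sketch of this ensemble generation, assuming the Fisher information is supplied as the negative Hessian at the solution and that a user-provided callback encodes the slope inequality constraints; for dependent data the covariance C would be the inflated estimate of Eq (15) instead.

import numpy as np

def parameter_ensemble(alpha_hat, H_hat, M=1000, slope_ok=lambda a: True, rng=None):
    # Draw M parameter vectors from N(alpha_hat, C) with C = J^{-1}, J = -H_hat,
    # rejecting samples that violate the slope constraints (hypothetical callback).
    rng = np.random.default_rng() if rng is None else rng
    C = np.linalg.inv(-np.asarray(H_hat))        # error covariance matrix
    L = np.linalg.cholesky(C)                    # C = L L^T
    ensemble = []
    while len(ensemble) < M:
        v = rng.standard_normal(len(alpha_hat))  # independent standard normal vector
        a = np.asarray(alpha_hat) + L @ v
        if slope_ok(a):
            ensemble.append(a)
    return np.array(ensemble)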

An alternative method for assessing parameter uncertainty is to apply bootstrap resampling, or block bootstrap resampling for dependent data, on the entire model estimation. This approach guarantees that any resampled model satisfies the inequality constraints, if any. If the exponential family is not steep it is possible that the maximum likelihood estimator does not exist for some resamples but this should not really be a problem in practice as it happens only if the model is close to the solution boundary which is an indication that it is not appropriate anyway for the data under consideration. Moreover, almost all relevant models here are steep. Bootstrap resampling is considerably more computationally costly than the method above as the optimisation has to be repeated M times but it is still feasible in many applications.

2.4 Convergence properties of the estimator

An alternative characterisation of the maximum likelihood probability density estimate is that it minimises the relative entropy or Kullback–Leibler divergence of the true probability density f(y) with respect to p(y), (18) within the chosen exponential family [22, 23]. Note that I(fX|pX) = I(f|p). There is the Pythagorean-like decomposition (19) with p* being the information projection of f, that is, the density from the exponential family which has the same population means of the sufficient statistics as the true density: ∫D ϕj(y)p*(y)dy = ∫D ϕj(y)f(y) dy; j = 1, …, J. The first term on the right-hand side of Eq (19) is the approximation error due to the truncation of the exponential family; it decreases with increasing J. The second term is the estimation error due to the finite sample size; it decreases with increasing N. Convergence of the relative entropy to zero as J → ∞ and N → ∞ has been established for many practically relevant model classes such as polynomials, splines and trigonometric functions on a bounded interval [22, 23, 28] as well as under certain conditions for polynomials on infinite and semi-infinite intervals [50]. Convergence in relative entropy implies convergence in the L1-norm [51].

For bounded intervals, it has been shown [23] that if the log-density has r square-integrable derivatives the Kullback–Leibler divergence converges to zero at rate 1/J^{2r} + J/N as J → ∞ and J^3/N → 0 in the polynomial case and J^2/N → 0 in the spline and trigonometric cases. By setting J = N^{1/(2r+1)} this specializes to the rate N^{−2r/(2r+1)}, thus approaching the parametric rate N^{−1} for well-behaved log-densities. For splines this rate can only be achieved when also increasing the order of the spline with the sample size. With cubic splines it is not possible to take advantage of smoothness r > 4 and the convergence rate is limited to N^{−8/9}. Convergence rates for infinite and semi-infinite intervals are harder to obtain and depend on the tail behaviour of the true density [23].

In practice, the smoothness class of the true density is not known and the dimension J of the exponential family should be chosen automatically from the data. When selecting the dimension by a minimum description length information criterion similar to those proposed by Schwarz [52] or Rissanen [53] the density estimator on bounded intervals converges in squared Hellinger distance at rate (N^{−1} log N)^{2r/(2r+1)} for polynomials, approaching the parametric rate N^{−1} up to a logarithmic factor, and at rate (N^{−1} log N)^{8/9} for cubic splines [54].

For densities which go to zero or have an integrable singularity at a boundary point convergence may be rather slow because of the singularity in the log-density. The inclusion of boundary terms in the estimator may then be very beneficial. For example, if there is power-law behaviour of the density close to a domain boundary as with, for example, the gamma distribution at zero a logarithmic boundary term absorbs this behaviour and the smoothness parameter r for the remainder of the log-density may increase significantly.

It should generally be kept in mind that the asymptotic behaviour of density estimators does not necessarily carry over to finite sample sizes [2]. The practical performance can only be assessed and compared through extensive numerical experiments with different types of densities and realistic finite sample sizes. Moreover, the assessment of a density estimator goes beyond integrated pointwise error measures; in particular applications interest may lie, for example, in detecting the modality of a density and changes in it over time [55, 56].

2.5 Statistically orthogonal basis functions

There is a gauge freedom in the representation of the potential function U(y) in the linear subspace spanned by a given set of basis functions {ϕ1, …, ϕJ}. When transforming the basis functions with any invertible J × J matrix and at the same time transforming the vector of expansion coefficients with the inverse matrix the potential function is invariant. Moreover, the potential function is only determined up to an arbitrary additive constant.

An empirical scalar product over the data points is defined for two functions e1(y) and e2(y) as (20) It can be interpreted as a Monte Carlo approximation to the integral ∫D e1(y)e2(y)f(y)dy. The corresponding norm is (21)

We start with a base measure exp[β(y)] and a set of basis functions {γ1, …, γJ} such that the set {1, β, γ1, …, γJ} is linearly independent. The sample means of β and {γ1, …, γJ} are removed and the basis functions are orthonormalised with respect to the empirical scalar product using the Gram–Schmidt procedure. We here actually use the modified Gram–Schmidt algorithm [57] which is numerically more stable. A pseudocode is as follows:

ψ ← β − 〈β〉

for j = 1: J

  ϕj ← γj − 〈γj〉

end

for j = 1: J

  ϕj ← ϕj/||ϕj||

  for k = j + 1: J

    ϕk ← ϕk − (ϕj, ϕk)ϕj

  end

end

The new basis functions {ϕ1, …, ϕJ} satisfy (22) and (23) for j, k = 1, …, J. We also have 〈ψ〉 = 0 and thus 〈U〉 = 0 for all α. The log-likelihood function, the score function and the likelihood equations now simplify to (24) (25) and (26) Also Eq (16) simplifies using Eq (22). This constitutes an approximately statistically orthogonal formulation of the inference problem as the Fisher information matrix now is (27) We still use the exact Fisher information matrix to calculate the parameter error covariance matrix. The orthonormality of the basis over the data set guarantees a very low condition number of the log-likelihood maximisation close to the solution for any reasonable density model. This is an improvement on the use of classical orthogonal polynomials such as Legendre polynomials [28] or B-splines [29, 31] as well-conditioned bases, particularly for rather high-dimensional models. In the context of the moment-constrained maximum entropy problem the present technique generalises the ideas of constraint rescaling and orthogonalisation of the basis functions [37]. Note that the orthogonalisation is here performed only once before starting the maximisation of the likelihood as opposed to the repeated reorthogonalisation with respect to the current modelled probability density proposed in [37].
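The following Python sketch implements the pseudocode above, assuming the empirical scalar product carries a 1/N normalisation (consistent with its Monte Carlo interpretation). Representing the basis functions by their values at the data points and at the quadrature nodes, and applying identical operations to both value matrices, is an implementation convenience of this illustration rather than the paper's bookkeeping.

import numpy as np

def orthonormalise(G_data, G_quad):
    # Modified Gram-Schmidt with respect to the empirical scalar product
    # (e1, e2) = (1/N) * sum_n e1(y_n) e2(y_n)   (1/N normalisation assumed).
    # G_data: N x J values gamma_j(y_n) at the data points.
    # G_quad: the same functions evaluated at the quadrature nodes, so that the
    #         orthonormalised functions can later be evaluated off the sample.
    mean = G_data.mean(axis=0)
    Phi_d = G_data - mean                              # phi_j <- gamma_j - <gamma_j>
    Phi_q = G_quad - mean
    J = Phi_d.shape[1]
    for j in range(J):
        nrm = np.sqrt(np.mean(Phi_d[:, j] ** 2))       # ||phi_j||
        Phi_d[:, j] /= nrm
        Phi_q[:, j] /= nrm
        for k in range(j + 1, J):
            proj = np.mean(Phi_d[:, j] * Phi_d[:, k])  # (phi_j, phi_k)
            Phi_d[:, k] -= proj * Phi_d[:, j]
            Phi_q[:, k] -= proj * Phi_q[:, j]
    return Phi_d, Phi_q                                # <phi_j> = 0, (phi_j, phi_k) = delta_jk

The base measure function ψ = β − 〈β〉 is centred analogously but not included in the orthonormalisation loop.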

2.6 Possible regularisation of the log-likelihood function

We remark in passing that a regularisation term could be added to the log-likelihood function in line with the idea of maximum penalized likelihood [24–27]. One would then, for example, minimise the (convex) function −l(α) + (η/2)||α − α0||^2 rather than maximise l. Here, α0 corresponds to a reference probability density which could be uniform, normal and exponential or gamma for the bounded, infinite and semi-infinite domain, respectively. Other choices are possible if there is prior knowledge about the data supporting them. The non-negative regularisation parameter η could be determined by cross-validation. This regularisation is equivalent to a Bayesian approach which puts a Gaussian prior on α, centred at α0, with diagonal covariance matrix and equal variance for all components inversely proportional to η. This line is not pursued in the present paper as we focus on enforcing parsimony with the Bayesian information criterion without any penalty term.

3 Bounded interval

We first model probability densities supported on a known bounded interval DX = [a, b]. This includes cases where the potential function or the density is actually defined only on [a, b), (a, b] or (a, b). The domain DX is mapped onto the domain D = [−1, 1] by the rescaling y = (2x − a − b)/(b − a) and the data sample is transformed accordingly. The probability density f(y) is related to fX(x) by f((2x − a − b)/(b − a)) = [(b − a)/2]fX(x).

We have (28) (29) and (30) The integrals are evaluated using Gauss–Legendre quadrature.

3.1 Polynomial basis functions

The basis functions are γj(y) = y^j or, more generally, γj(y) = y^{kj} with distinct positive integers kj if there is prior knowledge to justify the use of particular powers. There is no point here in using any orthogonal polynomials as a statistically orthogonal basis is constructed.

If the density f(y) is known to be symmetric about zero, only even powers would be used. If the density is periodic, for example, a density over angles (DX = [0, 2π]), the constraint U(−1) = U(1) would be imposed. There are not necessarily any constraints on the derivatives of U. Given a polynomial basis, a basis satisfying γj(−1) = γj(1) can be constructed from any basis of the null space of the 1 × (J + 1) constraint matrix, for example the one spanned by the right singular vectors corresponding to the zero singular values.
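Both here and for the splines in Subsection 3.2, a constrained basis is obtained as a basis of the null space of a constraint matrix via the singular value decomposition. A generic Python sketch follows; the example constraint row for the periodicity condition on the monomials {1, y, y^2, y^3} is purely illustrative.

import numpy as np

def nullspace_basis(C, tol=1e-10):
    # Orthonormal basis of the null space of the constraint matrix C: the columns
    # are the right singular vectors belonging to (numerically) zero singular
    # values, so any linear combination of them satisfies C @ coeffs = 0.
    U, s, Vt = np.linalg.svd(C)
    rank = int(np.sum(s > tol * s.max())) if s.size else 0
    return Vt[rank:].T                    # shape: (number of coefficients, null-space dimension)

# Example (illustrative): periodicity constraint gamma(1) = gamma(-1) on the
# coefficients of {1, y, y^2, y^3}; only the odd powers contribute to the constraint.
C = np.array([[0.0, 2.0, 0.0, 2.0]])      # gamma(1) - gamma(-1) for each monomial
B = nullspace_basis(C)                    # each column defines one admissible basis function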

3.2 Spline basis functions

The construction of splines on the interval [yl, yu] is described. Here, the case yl = −1 and yu = 1 is relevant.

Given a sequence of (simple) knots yl < ζ1 < … < ζK < yu with K ≥ 1, a spline function is a twice continuously differentiable function whose restriction on each of the K + 1 sections [yl, ζ1], [ζ1, ζ2], …, [ζK−1, ζK], [ζK, yu] is a cubic polynomial. A spline function can be represented as (31) with ξik(y) = (y − ȳk)^i for i = 0, …, 3 on the kth section and ξik(y) = 0 otherwise, where ȳk is the midpoint of the kth section. The expansion coefficients {βik} are constrained by the requirement of continuity of γ, γ′ and γ″ at the knots. It is convenient to also immediately include here the constraint 〈γ〉 = ∑k,i βik〈ξik〉 = 0 as a constant function on D = [−1, 1] is still contained in the space spanned by the functions {ξik}. These are 3K + 1 homogeneous linear constraints on the 4K + 4 expansion coefficients {βik} which are arranged in a sparse (3K + 1) × (4K + 4) matrix. The spline functions form a linear space of dimension K + 3 spanned by a linearly independent set of K + 3 spline basis functions. A basis can be constructed from any basis of the null space of the constraint matrix, spanned by the right singular vectors corresponding to the zero singular values. A more localised basis leading to a sparser Hessian matrix H could be obtained by referring to B-splines [31], but sparsity will be lost anyway when going to the statistically orthogonal basis functions and the overall computation time of the density estimation is not significantly affected by this. Note that the splines here are not natural splines; there are no constraints at the end points y = yl and y = yu, which are not knots, and the curvature there may be different from zero. The model is more flexible this way and there is no reason for a zero curvature constraint in the context of density estimation.

For the position of the knots we study two canonical choices: equally spaced knots in y-space, (32) and equally spaced knots with respect to the empirical distribution function, (33) where the data sample is sorted into non-decreasing order and nk is the largest integer smaller than or equal to kN/(K + 1).
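A minimal sketch of the two knot placement schemes; since Eqs (32) and (33) are not reproduced above, the exact conventions (interior knots only, 1-based ranks, sample size much larger than the number of knots) are assumptions of this illustration.

import numpy as np

def knots_equispaced(K, yl=-1.0, yu=1.0):
    # K equally spaced interior knots in y-space (cf. Eq (32)).
    return yl + (yu - yl) * np.arange(1, K + 1) / (K + 1)

def knots_empirical(y_sample, K):
    # K knots equally spaced with respect to the empirical distribution function
    # (cf. Eq (33)): the k-th knot is the data value of rank floor(k*N/(K+1)) in
    # the sorted sample; assumes N considerably larger than K.
    y_sorted = np.sort(np.asarray(y_sample))
    N = len(y_sorted)
    ranks = (np.arange(1, K + 1) * N) // (K + 1)    # 1-based ranks n_k
    return y_sorted[ranks - 1]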

For symmetric probability densities the knots are chosen symmetric and the symmetry conditions are added to the constraint matrix. The symmetry conditions are not all independent from the knot conditions and the singular value decomposition is used to determine the rank of the constraint matrix and thus the dimension of the spline space and a basis of it. If the density is periodic the condition γ(yl) = γ(yu) is added to the constraint matrix.

3.3 Trigonometric basis functions

A linearly independent set of basis functions of size J = 2K is . For a symmetric probability density an appropriate set is . If the density is periodic a basis can be constructed analogously to the polynomial case.

3.4 Boundary basis functions

For probability densities with a zero value or an integrable singularity at any of the boundary points y = −1 and y = 1 the model can be considerably improved by including an appropriate subset of the boundary basis functions {log(y + 1), log(1 − y), 1/(y + 1), 1/(1 − y)}. In practice, the logarithmic terms are most often used to model power-law decay of the density to zero near a boundary point as, for example, in the beta distribution. In case of symmetry the boundary terms would be introduced simultaneously at both boundaries.

4 Infinite interval

We now look at probability densities on the infinite interval DX = (−∞, ∞). The domain DX is mapped onto itself with the rescaling y = (x − μ)/σ where μ is the sample mean and σ is the sample standard deviation. The probability density f(y) is linked to fX(x) by f((x − μ)/σ) = σfX(x). If fX(x) is known to be symmetric about x* the rescaling y = (x − x*)/σ is applied with σ being the standard deviation about x*.

4.1 Linearly extrapolated basis functions

Given a set of basis functions we introduce linearly extrapolated basis functions as (34) where ya and yb are the smallest and largest data points in the data sample, respectively. We still have U(y) = ∑j αj ϕj(y) for all y and impose the slope conditions (35) and (36) in order to have a proper probability density. The underlying functions may be polynomials or splines. Trigonometric functions are probably less efficient here as the density usually strongly decays at the points y = ya and y = yb given the infinite domain of support.

We do not aim at any rigorous extreme value modelling here. By construction there are no data points in the exponential tails and the tail model is just the lowest order extrapolation of the bulk model. The probability mass accounted for by the tails is of order 1/N. The probability density has only one continuous derivative at y = ya and y = yb. Thus the coupling of the bulk model to the tail is rather weak and so is also the influence of the particular choice of tail model on the overall model.

The partition function is (37) with (38) (39) (40) We have (41) with (42) (43) (44) and (45) with (46) (47) (48) (49) (50) The integrals in Eqs (38), (42) and (46) are calculated by Gauss–Legendre quadrature.
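The tail contributions to these integrals are elementary, since the extrapolated potential is linear outside [ya, yb], while the bulk part is integrated numerically. The sketch below reconstructs the partition function from this description rather than reproducing Eqs (37)–(40) verbatim; the analogous tail terms for the moments mj and Qjk are omitted.

import numpy as np

def partition_function_extrapolated(U, dU, ya, yb, n_quad=200):
    # Partition function for a potential linearly extrapolated outside [ya, yb].
    # U, dU: callables for the bulk potential and its derivative; the slope
    # conditions dU(ya) > 0 and dU(yb) < 0 are required for normalisability.
    nodes, weights = np.polynomial.legendre.leggauss(n_quad)
    y_q = 0.5 * (yb - ya) * nodes + 0.5 * (yb + ya)
    w_q = 0.5 * (yb - ya) * weights
    Z_bulk = np.sum(w_q * np.exp(U(y_q)))       # Gauss-Legendre on [ya, yb]
    Z_left = np.exp(U(ya)) / dU(ya)             # integral of exp(U(ya) + U'(ya)(y - ya)) over (-inf, ya]
    Z_right = np.exp(U(yb)) / (-dU(yb))         # integral of exp(U(yb) + U'(yb)(y - yb)) over [yb, inf)
    return Z_left + Z_bulk + Z_right

# Example: quadratic bulk potential U(y) = -y^2 on [ya, yb] = [-3, 3]
Z = partition_function_extrapolated(lambda y: -np.asarray(y) ** 2, lambda y: -2.0 * y, -3.0, 3.0)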

Linearly extrapolated polynomials as basis functions have a couple of advantages over the standard polynomials as the topology of the parameter space is more favourable:

  1. (i) Linearly extrapolated polynomials generate regular exponential families as opposed to standard polynomials generating models which are generally not regular and not steep (see S1 Appendix). The maximum likelihood estimator is guaranteed to exist for any generic data sample. As a simple example for lack of steepness, the model U(y) = α1 y^2 + α2 y^4 with standard polynomials has an upper bound on the moments (see Theorem 3 in S2 Appendix). This upper boundary is formed by the Gaussian distributions corresponding to α1 < 0 and α2 = 0. The model cannot cover any leptokurtic distributions. With linearly extrapolated polynomials there is no such restriction as the model can use a positive coefficient α2 as long as the slope conditions (35) and (36) are met.
  2. (ii) The standard polynomials require the highest power in the model to be even in order to get a proper probability density. There is no restriction with linearly extrapolated polynomials. The number of parameters can be increased in steps of one which may lead to more parsimonious models.
  3. (iii) The numerical evaluation of the integrals involved in the log-likelihood function and its derivatives is more stable for the linearly extrapolated polynomials as the tail parts are treated analytically. With standard polynomials one would use Gauss–Hermite quadrature which may become inaccurate and unreliable close to the boundary of the parameter space, that is, if the leading-order coefficient is close to zero. As a consequence the numerical maximization of the likelihood function might not converge to the solution even if it still exists.
  4. (iv) With standard polynomials solutions close to the boundary of the parameter space occur naturally. When doing the model selection one would gradually increase the number of basis functions. At some point, additional terms are not needed any more which manifests itself in the leading-order coefficient being close to zero. With linearly extrapolated polynomials typical solutions are located away from the boundary of the parameter space as on the infinite domain probability densities are usually clearly decaying at both extreme data points y = ya and y = yb. Solutions with one or both of the slopes U′(ya) and U′(yb) close to zero are possible, particularly for small and slightly atypical data samples, but they are very unlikely.

In the spline case, a spline basis on the interval [ya, yb] is constructed as described in Subsection 3.2 and then linearly extrapolated as described above. We do not use natural splines as opposed to [31]. A zero curvature condition at the extremal points y = ya and y = yb appears to be artificial or even detrimental to the model. It is only justified if the density actually has an exponential tail. The present setting is more flexible; the extremal points are not knots and the density has generally only one continuous derivative there as opposed to two at the knots. We also do not use any preliminary tail transformation here.

For a symmetric probability density a linearly extrapolated polynomial or spline basis is set up as before on the symmetric interval [-max(|ya|, |yb|), max(|ya|, |yb|)].

5 Semi-infinite interval

Finally, we model probability densities supported on the semi-infinite interval DX = [a, ∞) with known lower bound a. This includes cases where the potential function or the density is actually defined only on (a, ∞). The case DX = (−∞, b] is readily referred back to the case DX = [a, ∞) with a = −b by considering the variable −X instead of X. The interval DX is mapped onto the interval D = [0, ∞) with the rescaling y = (x − a)/μ where μ is the sample mean of X − a. The probability density f(y) is linked to fX(x) by f((x − a)/μ) = μfX(x).

5.1 Linearly extrapolated basis functions

Again, linearly extrapolated basis functions are introduced as (51) where yb is the largest data point in the data sample. The slope condition (52) guarantees a normalisable probability density. The functions are polynomials or splines, now constructed on the interval [0, yb].

The partition function is (53) with (54) and Z1 given by Eq (39). We have (55) with (56) and m1,j as in Eq (43) as well as (57) with (58) and Q1,jk as in Eq (48). The integrals in Eqs (54), (56) and (58) are again evaluated using Gauss–Legendre quadrature.

The advantages of linearly extrapolated polynomials over standard polynomials carry over accordingly from the infinite to the semi-infinite domain. A simple example for lack of steepness is the model U(y) = α1 y + α2 y^2, the truncated normal distribution, which has an upper bound on the moments (see Theorem 3 in S2 Appendix). The boundary is formed by the exponential distributions corresponding to α1 < 0 and α2 = 0. The model can only cover distributions with a coefficient of variation less than or equal to 1. The corresponding linearly extrapolated model is regular and has no such restriction as it can use a positive coefficient α2 as long as the slope condition α1 + 2α2 yb < 0 is met.

5.2 Boundary basis functions

The basis may be augmented (prior to orthonormalisation) with an appropriate subset of the functions {log y, 1/y, log^2 y}, all linearly extrapolated at yb. We do not include the function log^2 y without also including log y as then the invariance of the exponential family under linear rescaling (see Subsection 2.1) would be violated. The functions log y and 1/y are genuine boundary terms modelling decay to zero or an integrable singularity of the density at y = 0 whereas the function log^2 y may be useful for densities close to a log-normal distribution. In practice, most often just the boundary term log y is used to model power-law decay to zero at y = 0 as, for example, in the gamma distribution.

6 Maximising the likelihood function

The log-likelihood function is iteratively maximised with respect to the parameters α using a modified Newton–Raphson method which combines the rapid quadratic convergence of the Newton–Raphson method close to the solution with a guarantee of global convergence by ensuring an increase of the likelihood function at each iteration. Having arrived after k iterations at an estimate α(k) with gradient vector g(k) and Hessian matrix H(k), the search direction p(k+1) for the next iteration is given by (59) with M = −N^{−1} H, where η ≥ 0 is a regularisation parameter. The vector p(k+1) is an ascent direction of the likelihood function for any choice of η(k+1); we have (dropping the iteration superscripts for convenience) (60) for m ≠ 0 as M is positive definite. An inexact line search along the regularised Newton–Raphson direction p(k+1) is performed using a simple backtracking of the step size. The parameter vector is updated from α(k) to (61) with ω(k+1) = 2^{−i} where i is the smallest non-negative integer such that l(α(k+1)) > l(α(k)) and α(k+1) satisfies the slope condition(s) (see Subsections 4.1 and 5.1), if any.

The regularisation parameter η is introduced to ward off possible ill-conditioning of the Hessian, particularly in the early iterations of the maximisation when the modelled probability density is still far away from the true probability density. It is here chosen automatically such that the condition number of the regularised Hessian does not exceed a prescribed threshold κth. The spectral condition number of the Hessian is equal to the Rayleigh ratio κ = λmax/λmin where λmax and λmin are the largest and smallest eigenvalues, respectively, of M. If κ ≤ κth then η = 0; otherwise η = (λmax − κth λmin)/(κth − 1) ≈ λmax/κth − λmin. We set κth = 10^5; the choice of κth influences only the optimisation path but not the solution. Spline models tend to have better condition than polynomial models. Due to the approximate statistical orthonormality of the basis functions (see Subsection 2.5) the inference is well-conditioned close to the solution for any reasonable parsimonious model.

The iteration is started with an initial guess parameter vector α(0) satisfying the slope condition(s) (see Subsections 4.1 and 5.1), if any. For a bounded interval a simple and robust choice is α(0) = 0. For the infinite interval, α(0) is chosen to minimise a quadratic objective subject to U′(ya) = 1 and U′(yb) = −1; for a semi-infinite interval α(0) is chosen to minimise a quadratic objective subject to U′(yb) = −1. These are readily found as solutions to standard least-squares problems. For nested models of increasing order, for example, polynomial or trigonometric, the iteration can be started with the solution of the previous model.

The iteration is terminated as soon as |m(k)| < ε with ε = 10^{−5}. Only models satisfying this criterion within a prescribed maximum number of iterations are included in the model selection described in Section 7. Failure to reach a solution may occur if the exponential family is not steep and the maximum likelihood estimator does not exist (see S1 Appendix), if the solution lies extremely close to the boundary of the parameter space or if the problem is very ill-conditioned. All these reasons indicate that the model is not adequate anyway for the particular data. We here set the maximum number of iterations to 50; adequate models are usually found with fewer than 15 iterations.
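A compact sketch of the optimisation loop described in this section is given below; the normalisation of the search direction in Eq (59), the use of |g|/N as the convergence measure |m| (valid with the statistically orthogonal basis, for which g = −Nm) and the interfaces of the helper callbacks are assumptions of this illustration.

import numpy as np

def fit_exponential_family(alpha0, grad_hess, loglik, slope_ok=lambda a: True,
                           kappa_th=1e5, eps=1e-5, max_iter=50):
    # Regularised Newton-Raphson with backtracking line search.
    # grad_hess(alpha) -> (g, H, N): score, Hessian, sample size
    # loglik(alpha)    -> float    : log-likelihood l(alpha)
    alpha = np.asarray(alpha0, dtype=float)
    for _ in range(max_iter):
        g, H, N = grad_hess(alpha)
        if np.linalg.norm(g) / N < eps:                   # |m| < eps (orthogonal basis assumed)
            return alpha
        M = -H / N
        lam = np.linalg.eigvalsh(M)                       # ascending eigenvalues of M
        lam_min, lam_max = lam[0], lam[-1]
        if lam_max / lam_min <= kappa_th:
            eta = 0.0                                     # Hessian sufficiently well conditioned
        else:
            eta = (lam_max - kappa_th * lam_min) / (kappa_th - 1.0)
        p = np.linalg.solve(M + eta * np.eye(len(alpha)), g / N)   # regularised ascent direction
        l_old = loglik(alpha)
        omega = 1.0
        for _ in range(30):                               # backtracking: omega = 2^{-i}
            trial = alpha + omega * p
            if slope_ok(trial) and loglik(trial) > l_old:
                alpha = trial
                break
            omega *= 0.5
        else:
            break                                         # line search failed; give up
    return alpha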

The calculation of the log-likelihood, the gradient and the Hessian does not require any reference to the data sample and thus the computation time does not significantly increase with the sample size N for common sample sizes. It increases linearly with the number of quadrature nodes used in the numerical integrations, which is usually much smaller than the sample size. Only for very large sample sizes does the initial orthonormalisation become dominant, making the algorithm scale linearly with N.

7 Model selection

Table 1 summarises the possible configurations of basis functions for the different domains. Model selection is performed with the Bayesian information criterion (BIC) [52]: (62) The number of basis functions in the exponential family is J = Jb + J0 where Jb is the number of boundary terms and J0 is the number of bulk basis functions. The BIC is minimised over both the number and type of basis functions taking into account models up to a prescribed maximum number of basis functions, reflecting the maximum complexity of the density estimate. The BIC has been shown to pick the correct exponential family with probability tending to 1 as N → ∞ if the true density is in one of the exponential families [58]. Usually, the complexity of the chosen model increases with increasing sample size N.
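A minimal sketch of the BIC-based selection step, assuming the standard Schwarz form −2l̂ + J log N for Eq (62) and a hypothetical list of fitted candidate models:

import numpy as np

def bic(loglik_hat, J, N):
    # Bayesian information criterion (standard Schwarz form assumed for Eq (62)).
    return -2.0 * loglik_hat + J * np.log(N)

def select_model(candidates, N):
    # Pick the minimum-BIC model from fitted candidates, each a (hypothetical)
    # dict with keys 'loglik' (maximised log-likelihood), 'J' (number of basis
    # functions, boundary terms included) and 'model'.
    scored = [(bic(c['loglik'], c['J'], N), c) for c in candidates]
    return min(scored, key=lambda t: t[0])[1]['model']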

Table 1. Overview of domains and basis functions for semiparametric density estimation.

https://doi.org/10.1371/journal.pone.0259111.t001

7.1 Spline knot deletion scheme

The spline models are refined with a knot deletion strategy in two steps:

  1. (i) Firstly, only models with at least 4 data points in every section are considered, regardless of the BIC. Going from left to right, if a section has fewer than 4 data points its left knot is removed, and the procedure is repeated until each section has at least 4 data points. It is possible, though highly unlikely, that all knots are removed; then the model is dropped. Note that this first knot deletion occurs only for equally spaced knots (Eq (32)); for knots placed according to the empirical distribution (Eq (33)), there are by construction always more than 4 data points in any section for any reasonable number of knots.
  2. (ii) Secondly, a simple greedy knot deletion algorithm is applied: We consider sequences of knots with the number of knots increasing from K = 1 to a maximum number of knots K = Km, placed according to one of the two knot placement schemes introduced in Subsection 3.2. For each sequence, the knot deletion scheme (i) is applied and the maximum likelihood estimator is found, if any. The minimum BIC model out of these at most Km models is picked. Each of the knots in the model is removed in turn and the reduced-order model with the lowest BIC is kept. This is repeated until removal of any of the knots does not provide a decrease of the BIC any more. As opposed to [31], we do not prescribe a minimum number of knots as a function of sample size to start the knot deletion from. There is no convincing reason for such a restriction, and starting from K = 1 and increasing the number of knots often yields more parsimonious models.

7.2 The semiparametric density estimator (SPDE)

The proposed semiparametric density estimation algorithm is as follows. This refers to general densities without symmetries or periodic behaviour; otherwise obvious modifications apply.

  • Polynomials (POL):
    For bounded and semi-infinite intervals, choose the subsets of boundary terms to consider, if any. Choose the maximum polynomial order Pm. For J0 = 0, 1, …, Pm (bounded interval), J0 = 2, 3, …, Pm (infinite interval) or J0 = 1, 2, …, Pm (semi-infinite interval) find the maximum likelihood estimator, separately for each subset of boundary terms. The minimum BIC model out of all these models is the polynomial density estimator POL.
  • Splines with equidistant knots (SPL1):
    For bounded and semi-infinite intervals, choose the subsets of boundary terms to consider, if any. Choose the maximum number of knots Km, corresponding to a maximum number of bulk basis functions of Km + 3. Separately for each subset of boundary terms, do the spline model selection with knot deletion as described above, placing the knots as given in Eq (32). The minimum BIC model over the subsets of boundary terms is the density estimator SPL1.
  • Splines with equidistant knots with respect to the empirical distribution (SPL2):
    For bounded and semi-infinite intervals, choose the subsets of boundary terms to consider, if any. Choose the maximum number of knots Km, corresponding to a maximum number of bulk basis functions of Km + 3. Separately for each subset of boundary terms, do the spline model selection with knot deletion as described above, placing the knots as given in Eq (33). The minimum BIC model over the subsets of boundary terms is the density estimator SPL2.
  • Trigonometric functions (TRIG), only on bounded intervals:
    Choose the subsets of boundary terms to consider, if any. Choose the maximum wavenumber Lm, corresponding to a maximum number of bulk basis functions of 2Lm. For J0 = 0, 1, …, 2Lm find the maximum likelihood estimator, separately for each subset of boundary terms. The minimum BIC model out of all these models is the trigonometric density estimator TRIG.
  • Semiparametric density estimator (SPDE):
    The semiparametric density estimator, referred to as SPDE, is the minimum BIC model out of POL, SPL1, SPL2 and TRIG.

The hyperparameters Pm, Km and Lm can usually be chosen generously high such that no model with a lower BIC is found any more when increasing them further. The choice of subsets of boundary terms is slightly subjective and motivated by visual inspection of and prior knowledge about the data. We anticipate that most often the logarithmic terms will be used to model power-law behaviour near a boundary point. Usually, the subsets of boundary terms considered would be the same for all types of bulk basis functions.

We remark in passing that a more extensive model selection would be conceivable involving arbitrary subsets of the bulk basis functions rather than just adding them one by one. If the true density is not too complex this would still be feasible in many applications but we do not discuss this here for simplicity.

8 Results

The semiparametric density estimation approach is applied to a range of simulation and observation data sets and compared to established alternative methodologies. These are the Gaussian kernel density estimator with three different bandwidth selection rules [1–3, 7], two variants of the diffusion estimator [7], finite mixture models [15] and three variants of local likelihood density estimation [16, 17]. An overview of the density estimation techniques is given in Table 2 and all of the details are described in S3 Appendix.

The Gaussian kernel density estimators include a bandwidth selection technique proposed relatively recently by Botev et al. [7] which is an improvement on the classical Sheather–Jones solve-the-equation rule [5]. It is free from the normal reference rule and shows superior practical performance compared to existing plug-in implementations [7].

For bounded and semi-infinite domains, all of the Gaussian kernel density estimators are always combined with the reflection method applied at any boundary point as a standard boundary correction scheme. We here do not compare against more sophisticated boundary correction schemes (for example, [11]) as we compare against the diffusion estimator which has been shown to be superior to common boundary correction schemes [7]. The diffusion estimator is an interesting competitor, particularly for domains with boundaries, as it handles any boundary point naturally.

8.1 Assessment of the probability density estimates

If the true probability density fX is known a commonly used measure of error of a density estimate is the mean integrated squared error (MISE): (63) We also adopt this metric here. The expectation with respect to fX is evaluated as the mean over 100 simulations.
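Reading Eq (63) as the expected integrated squared error, a minimal sketch of its Monte Carlo estimate over independent realisations is as follows; the evaluation grid and the trapezoidal integration are implementation choices of this illustration.

import numpy as np

def integrated_squared_error(f_hat, f_true, grid):
    # ISE of a single density estimate via the trapezoidal rule on a fixed grid.
    d2 = (f_hat(grid) - f_true(grid)) ** 2
    return float(np.sum(0.5 * (d2[1:] + d2[:-1]) * np.diff(grid)))

def mise(estimates, f_true, grid):
    # MISE estimated as the mean ISE over independent fitted realisations
    # (here the 100 simulations), replacing the expectation in Eq (63).
    return float(np.mean([integrated_squared_error(f, f_true, grid) for f in estimates]))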

Often there is particular interest in detecting the large-scale structure of a probability density function as characterised by the number of modes or bumps [1] as these may be associated with distinct dynamical states or regimes of the underlying system (for example, [55, 56]). A mode is defined as a local maximum of the density; a bump is characterised by an interval on which the density has negative curvature but not necessarily a local maximum. As opposed to [56] we here refer to the curvature of the density itself and not the curvature of the potential function. The number of bumps is always greater than or equal to the number of modes.
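A simple grid-based counter consistent with these definitions is sketched below; the grid resolution and the finite-difference derivatives are assumptions of this illustration, as the paper does not specify how the counting is implemented.

import numpy as np

def count_modes_and_bumps(density, interval, n_grid=2001):
    # Count modes (local maxima of the density) and bumps (maximal intervals of
    # negative curvature of the density) on the given interval.
    y = np.linspace(interval[0], interval[1], n_grid)
    f = density(y)                                    # density must accept an array argument
    df = np.gradient(f, y)
    d2f = np.gradient(df, y)
    modes = int(np.sum((df[:-1] > 0) & (df[1:] <= 0)))               # sign changes + to - of f'
    neg = d2f < 0
    bumps = int(np.sum(neg[1:] & ~neg[:-1]) + (1 if neg[0] else 0))  # maximal runs with f'' < 0
    return modes, bumps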

8.2 Simulated data

We start the investigation with simulated data sets where the underlying probability density is known. The test cases are listed in Table 3. Many of them are taken from [59]; some are slightly modified. Table 4 gives the MISE for the various density estimators and different sample sizes N. Table 5 lists for selected test cases how often the number of modes and bumps of the density is correctly identified.

Table 3. Test cases of probability densities: Densities 1–12 are supported on (−∞, ∞), density 13 on [−2, 2], density 14 on [0, 1] and density 15 on [0, ∞).

N(μ, σ^2) denotes the normal density with mean μ and standard deviation σ, Beta(α, β) denotes the beta density with shape parameters α and β, and Gam(α, β) denotes the gamma density with shape parameter α and rate parameter β.

https://doi.org/10.1371/journal.pone.0259111.t003

Table 4. Mean integrated squared error (scaled by 10^4) estimated as the average over 100 realisations for the test cases in Table 3.

For test cases 14 and 15, the semiparametric estimators allow for logarithmic boundary terms at any boundary point. The best method is indicated in bold. For test cases 9 and 10, the diffusion estimators are unstable; for test cases 4, 8 and 11, they give reasonable density estimates in most realisations but are unstable in some realisations. See S3 Appendix for further explanation.

https://doi.org/10.1371/journal.pone.0259111.t004

Table 5. Detection of modality: Number of realisations out of 100 in which the modality of the density is correctly identified.

The upper row refers to modes, the lower row to bumps. Modes and bumps are counted on the indicated interval. For test cases 14 and 15, the semiparametric estimators allow for logarithmic boundary terms at any boundary point. The best method is indicated in bold. For test case 8, the diffusion estimators give reasonable density estimates in most realisations but are unstable in some realisations. See S3 Appendix for further explanation.

https://doi.org/10.1371/journal.pone.0259111.t005

The density estimator SPDE has an excellent overall performance in the MISE. It is sometimes the best model and is always close to the best model, except for test case 12. At the same time, the SPDE model exhibits a high degree of global smoothness which enables a top performance in modality detection. It often has the highest detection rate and is always close to the top model. The number of modes and bumps depends on the interval over which they are counted. The interval is chosen by eye around the main body of the distribution, but an extension further out into the tails was also investigated. The estimators SPL1, SPL2, TRIG and SPDE, and also POL in those cases where it is an appropriate model, are very robust with respect to the choice of the interval and are thus adequate for fully automatic detection of modes and bumps. All of the other estimators except finite mixture models often have very small spurious modes and bumps far out in the tails.

In some cases the MISE of SPDE is smaller than the smallest among the four components POL, SPL1, SPL2 and TRIG as it picks the best component model for each of the 100 realisations. Often it is equal to or slightly larger than the smallest among the components. This is probably due to the different metrics used here: the density models are fitted by maximum likelihood but validated with the MISE. When evaluated in terms of the Kullback–Leibler divergence, the error of SPDE would probably be at least as small as the smallest among the components.
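For reference, with f denoting the true density and f̂ the estimate, the two criteria are

\[
\mathrm{MISE}(\hat f) \;=\; \mathbb{E}\int \bigl(\hat f(x)-f(x)\bigr)^2\,\mathrm{d}x ,
\qquad
D_{\mathrm{KL}}\bigl(f \,\|\, \hat f\bigr) \;=\; \int f(x)\,\log\frac{f(x)}{\hat f(x)}\,\mathrm{d}x .
\]

Maximising the likelihood asymptotically minimises the Kullback–Leibler divergence, whereas the MISE is an L² criterion, so the model with the highest likelihood need not have the smallest MISE.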

Polynomials and splines work complementarily in the semiparametric method. Polynomials perform well for well separated and not very sharp structures (test cases 1, 9 and 10); splines, on the other hand, are much better for overlapping and/or sharp structures (test cases 2, 3, 11 and 15) as well as long tails with little structure (test case 2). This is in line with approximation theory; put in statistical terms, in the latter case the density is not well characterised by (global) polynomial moments, whereas local moments are much more informative. In many test cases polynomials and splines perform about equally well. Both spline knot placement schemes contribute considerably to the excellent overall performance of SPDE. We generally see that asymptotic convergence rates are not necessarily relevant at finite sample sizes: splines have a slower convergence rate than polynomials and trigonometric functions (see Subsection 2.4); yet, in most test cases they show a smaller MISE.

Among the kernel density estimators, KDE3 vastly outperforms KDE1 and KDE2 in terms of the MISE as the latter two estimators almost always oversmooth the data. They are only competitive for densities relatively close to a Gaussian (test cases 1, 4, 5 and 6). The reason is that the bandwidth selection of KDE1 and KDE2 is based on the normal reference rule whereas that of KDE3 is not [7]. On the other hand, KDE1 and KDE2 are often able to robustly detect the modality of the density, whereas KDE3 performs very poorly here as it easily develops spurious modes and bumps, particularly for long tails (test case 2) and densities with features of different scales (test case 8). There appears to be no way out of this dilemma as long as a non-adaptive bandwidth is used. The ability of KDE3 to detect the modality sometimes even deteriorates when the sample size is increased, in contrast to KDE1 and KDE2. We remark that the behaviour of the kernel density estimator with bandwidth given by the classical Sheather–Jones solve-the-equation rule [5] (not shown) is somewhat in between KDE1/2 and KDE3: in line with the findings in [7], it has a larger MISE than KDE3, particularly for bi- and multimodal densities, but the modality detection is better and more robust.
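To make the role of the normal reference rule concrete, the following sketch implements the classical rule-of-thumb bandwidth together with a fixed-bandwidth Gaussian kernel estimate. It illustrates only the normal reference rule and is not the improved plug-in selector of [7]; the function names are illustrative.

import numpy as np

def normal_reference_bandwidth(sample):
    # Rule-of-thumb bandwidth derived under a normality assumption;
    # it tends to oversmooth bi- and multimodal densities.
    y = np.asarray(sample, dtype=float)
    iqr = np.percentile(y, 75) - np.percentile(y, 25)
    sigma = min(y.std(ddof=1), iqr / 1.34)
    return 0.9 * sigma * y.size ** (-0.2)

def gaussian_kde(sample, x, h):
    # Fixed-bandwidth Gaussian kernel density estimate evaluated on the grid x.
    y = np.asarray(sample, dtype=float)
    x = np.asarray(x, dtype=float)
    z = (x[:, None] - y[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (y.size * h * np.sqrt(2.0 * np.pi))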

Among the diffusion estimators, DE1 is found here to consistently outperform DE0, both in terms of the MISE and the detection of the modality of the density. The estimator DE1 improves on KDE3 regarding modality detection by adding some smoothness. Remarkably, at the same time it often also improves the MISE and is never much worse. However, DE1 does not reach the smoothness of the semiparametric estimators.

The model LLDE2 is usually the best among the local likelihood density estimators, with some exceptions. It shows some very good and even top performances for localised structures, though there may be some bias here as the structures are Gaussian, which exactly fits the locally quadratic model. However, LLDE2 does not possess the broad strength of SPDE; in particular, it lacks global smoothness, which shows up in some very poor modality detection performances.

As expected, a Gaussian mixture model (GMM) outperforms all of the other methods in terms of both the MISE and modality detection for almost all test cases (not shown) as these are given as Gaussian mixtures. It is worth noting that this is not the case if the true number of components is too large to be identified at the given sample size (test case 2). Moreover, the likelihood of a GMM with many components is prone to have plenty of secondary maxima, and it therefore sometimes takes a very large number of random initialisations of the EM algorithm to actually find the correct model. It is also interesting that for test case 5, which is a mixture of two Gaussians, the estimator SPDE outperforms a GMM with two components in terms of bump detection.
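As an illustration of the point about random initialisations, a GMM fit with several EM restarts per candidate number of components and BIC-based selection of the number of components could look as follows. This is a sketch using scikit-learn; the function name fit_gmm_bic and the default settings are illustrative and not those used in this study.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_bic(sample, max_components=6, n_init=20, seed=0):
    # Fit mixtures with 1..max_components components, each with several
    # random EM initialisations, and keep the model with the lowest BIC.
    X = np.asarray(sample, dtype=float).reshape(-1, 1)
    best, best_bic = None, np.inf
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, n_init=n_init,
                              random_state=seed).fit(X)
        bic = gmm.bic(X)
        if bic < best_bic:
            best, best_bic = gmm, bic
    return best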

A (rescaled) beta mixture model performs worse than the SPDE for test case 13 (not shown) and it is obviously biased for test case 14. For test case 15, a gamma and a Weibull mixture model perform slightly worse than the SPDE (not shown).

The points made above are illustrated in Figs 1 and 2 with typical density estimates from the various methods for the test cases 2 and 6. The SPDE in both cases is the model SPL2 with 2 and 3 knots, respectively, after knot deletion.

Fig 1. Test case 2: Example of density estimates obtained from the kernel density and diffusion estimators (left) as well as from the semiparametric and local likelihood estimators (right).

The SPDE is the model SPL2 with 2 knots after knot deletion. The sample size is N = 2000.

https://doi.org/10.1371/journal.pone.0259111.g001

Fig 2. Test case 6: Example of density estimates obtained from the kernel density and diffusion estimators (left) as well as from the semiparametric and local likelihood estimators (right).

The SPDE is the model SPL2 with 3 knots after knot deletion. The sample size is N = 2000.

https://doi.org/10.1371/journal.pone.0259111.g002

Fig 3 illustrates an example (corresponding to Fig 2) and the summary statistics of model selection for test case 6. The plot indicates all of the models considered, including every candidate spline model encountered during the knot deletion procedure. Only BIC differences are given, with the best model set to zero. Only spline models are picked, both SPL1 and SPL2. The model complexity increases rather slowly with the sample size.

Fig 3. Test case 6: Left: Example of model selection, corresponding to the example shown in Fig 2.

Sample size is N = 2000. Right: Relative frequency out of 250 realisations of picking a particular model as the SPDE. The spline models encompass models without and with knot deletion.

https://doi.org/10.1371/journal.pone.0259111.g003

Some more examples of density estimates and model selections are displayed in S4 Appendix, including cases involving rather high-dimensional models. Some test cases make use of all of the estimators POL, SPL1 and SPL2, depending on the sample size. Substantial gains from knot deletion are seen for the spline models. The order of the polynomial models tends to increase faster with the sample size than the order of the spline models. For test case 4, there is a slight indication that the polynomial model is picked increasingly often at very large sample sizes, possibly due to its superior asymptotic convergence rate.

The results discussed above hold equally for bounded, infinite and semi-infinite domains. We now look at the boundary effects in test cases 13, 14 and 15. The density of test case 13 is bounded away from zero over the whole domain, including the boundary points x = −2 and x = 2; therefore boundary terms are not useful here. Test case 14 has zero density with finite slope at x = 0 and zero density with zero slope at x = 1; test case 15 has zero density with finite slope at x = 0. For these two test cases we consider logarithmic boundary terms at all of the boundary points. In order to assess the effect of the boundary terms we also model these two cases with all boundary terms switched off.
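To indicate what a logarithmic boundary term amounts to in practice, the following sketch evaluates a density of the form f(y) ∝ exp(U(y)) with U(y) = α1 log y + α2(1/y) + α3 y + α4 y² on a truncated grid; the proportionality convention, the purely numerical normalisation and the parameter values are chosen here for illustration only and are not the inference procedure of the semiparametric estimator.

import numpy as np

def boundary_term_density(alpha, y):
    # Density proportional to exp(U(y)) with
    # U(y) = a1*log(y) + a2/y + a3*y + a4*y**2, normalised numerically on the grid y.
    a1, a2, a3, a4 = alpha
    U = a1 * np.log(y) + a2 / y + a3 * y + a4 * y**2
    g = np.exp(U - U.max())          # shift for numerical stability
    return g / np.trapz(g, y)

# Illustrative example: a1 = 2, a2 = -1, a3 = -0.5, a4 = 0 gives a density
# which vanishes at y = 0 and decays exponentially in the tail.
y = np.linspace(1e-3, 60.0, 6000)
f = boundary_term_density((2.0, -1.0, -0.5, 0.0), y)
print(np.trapz(f, y))   # approximately 1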

Tables 6 and 7 display how often which boundary terms are included in the model for test cases 14 and 15, respectively. Boundary terms are chosen at all of the boundary points in a substantial fraction of realisations; this frequency increases with the sample size. For test case 14, there is slightly more need for a boundary term at x = 0 than at x = 1 as, due to the finite slope, the boundary bias is larger there. We observe some differences among the various semiparametric density estimators regarding the frequency with which certain boundary terms are included.

Table 6. Test case 14: Number of realisations out of 100 in which certain combinations of logarithmic boundary terms are chosen for the various semiparametric estimators and different sample sizes.

https://doi.org/10.1371/journal.pone.0259111.t006

Table 7. Test case 15: Number of realisations out of 100 in which no boundary term or a logarithmic boundary term is chosen for the various semiparametric estimators and different sample sizes.

https://doi.org/10.1371/journal.pone.0259111.t007

Table 8 contrasts the MISE of the different semiparametric methods without and with boundary terms for test cases 14 and 15. The boundary terms yield a substantial improvement for test case 14 and a small but systematic improvement for test case 15. A moderate systematic increase in the probability of correctly detecting the number of modes and bumps is also observed when allowing for boundary terms (not shown).

Table 8. Effect of boundary terms on model accuracy: Mean integrated squared error (scaled by 10⁴) estimated as an average over 100 realisations.

The upper row refers to models without boundary terms, the lower row to models allowing for logarithmic boundary terms at any boundary point.

https://doi.org/10.1371/journal.pone.0259111.t008

Table 9 lists the mean boundary biases of the various estimators. The semiparametric techniques, already without boundary terms, exhibit significantly reduced boundary biases compared to the kernel density and diffusion estimators. The diffusion estimators are actually consistent at any boundary point as N → ∞ [7] but convergence is very slow. The inclusion of boundary terms greatly reduces the boundary biases of the semiparametric estimators; this is particularly pronounced at large sample sizes. The local likelihood density estimators in some cases also have very low boundary biases, though this is achieved by the choice of a rather small bandwidth which leads to very wiggly density estimates and the inability to detect the modality of the densities. For the cases with large boundary biases, a pronounced improvement is seen when increasing the order of the local model. This is in accordance with the theory of local likelihood density estimation [16, 17], which expands the boundary bias into powers of the bandwidth and predicts an improvement of the order of accuracy by one when increasing the order of the local model by one. For small boundary biases this effect is masked by the sampling uncertainty of only 100 realisations. Overall, the semiparametric estimators in all cases of Table 9 either perform best or are close to the best model within the sampling uncertainty.

Table 9. Boundary bias (scaled by 10⁴) estimated as the mean over 100 realisations.

The true density at the boundary is fX(−2) = 0.0427 and fX(2) = 0.1450 for test case 13, fX(0) = fX(1) = 0 for test case 14, and fX(0) = 0 for test case 15. For the semiparametric estimators and test cases 14 and 15, the upper row refers to models without boundary terms and the lower row to models allowing for logarithmic boundary terms at any boundary point. The best method is indicated in bold.

https://doi.org/10.1371/journal.pone.0259111.t009

Fig 4 displays for test case 15 a couple of typical density estimates from selected methods and the corresponding model selection for the SPDE. The SPDE here is the model SPL2 with 4 knots after extensive knot deletion and a logarithmic boundary term. For this test case, knot deletion delivers a particularly large gain in BIC over spline models without knot deletion. The lack of boundary bias and, again, the overall smoothness of the SPDE are clearly visible.

Fig 4. Test case 15: Example of density estimates from selected methods (left) and corresponding model selection for the SPDE (right).

Only the best spline model after each knot deletion step is indicated to increase the readability of the plot. The SPDE is the model SPL2 with 4 knots after knot deletion augmented with a logarithmic boundary term. The sample size is N = 2000.

https://doi.org/10.1371/journal.pone.0259111.g004

The computation time of the various techniques depends in a complicated manner on the sample size, the model complexity and the implementation details. A few very general comments are in order. The kernel density estimators are more than an order of magnitude faster than all of the other methods, with KDE3 being particularly fast. The semiparametric techniques and the diffusion estimators are about equally costly. The full model selection of the semiparametric estimator for a data sample of test case 15 as illustrated in Fig 4, involving quite high-dimensional models, a boundary term and extensive knot deletion, still requires only about a second on a PC. The finite mixture models and the local likelihood estimators are significantly more expensive than the semiparametric methods (by a factor of 3–20, depending on the sample size and the model complexity).

8.3 Old Faithful geyser data

As an example of observational data we investigate the Old Faithful geyser data: the eruption duration and the waiting time between successive eruptions of the Old Faithful geyser in Yellowstone National Park, Wyoming, USA, each consisting of N = 272 observations. Both data sets are supported on the semi-infinite interval [0, ∞) and the option of all subsets of the boundary terms {log y, 1/y} is included. As the true underlying densities are unknown here, the leave-one-out cross-validated log-likelihood (cvlogl) is used for model validation. Table 10 gives an overview of the results for all of the density estimation techniques and Fig 5 displays the density estimates for a couple of selected methods.
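The cross-validated log-likelihood scores each observation by a model fitted to the remaining observations. As a minimal, method-independent illustration, the leave-one-out score for a fixed-bandwidth Gaussian kernel estimate can be computed as follows (a sketch only, not the implementation used in this study).

import numpy as np

def loo_cv_loglik_gaussian_kde(sample, h):
    # Leave-one-out cross-validated log-likelihood of a fixed-bandwidth
    # Gaussian kernel density estimate: each point is scored by the
    # density fitted to all other points.
    y = np.asarray(sample, dtype=float)
    n = y.size
    z = (y[:, None] - y[None, :]) / h
    K = np.exp(-0.5 * z**2) / (h * np.sqrt(2.0 * np.pi))
    np.fill_diagonal(K, 0.0)              # exclude the point itself
    loo_density = K.sum(axis=1) / (n - 1)
    return float(np.log(loo_density).sum())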

Fig 5. Old Faithful Geyser eruption data: Density estimates of the eruption duration (left) and waiting time (right) from various methods.

For the duration data the SPDE is the model with potential function U(y) = α1 log y + α2(1/y) + α3 y + α4 y²; for the waiting time data it is the model SPL2 with 1 knot after knot deletion.

https://doi.org/10.1371/journal.pone.0259111.g005

Table 10. Old Faithful geyser data: Bayesian information criterion, cross-validated log-likelihood, number of modes and number of bumps for various density estimators.

The number of modes and bumps is counted on the interval [1.25, 5.5] for the eruption duration data and on the interval [35, 100] for the waiting time data. The best method is indicated in bold.

https://doi.org/10.1371/journal.pone.0259111.t010

For the eruption duration data the estimators POL, SPL1 and SPL2 perform best and robustly detect 2 modes and bumps; LLDE2 is almost as good. The SPDE is the model POL with boundary terms; it has potential function U(y) = α1 log y + α2(1/y) + α3 y + α4 y², which is a generalised inverse Gaussian distribution augmented with the quadratic term to generate the bimodality. Without boundary terms, a sixth-order polynomial model is only slightly worse in terms of cross-validated likelihood and still better than all of the other density estimators. The spline estimators SPL1 and SPL2 have no boundary terms and a single knot, in the case of SPL2 after knot deletion. The estimators KDE1, KDE2 and KDE3 as well as DE0 and DE1 perform significantly worse as they oversmooth the data; this is particularly true for KDE1 and KDE2. A Gaussian mixture model (GMM) (truncated to [0, ∞) after the inference) also does not perform well as both modes of the density are strongly non-Gaussian; the BIC picks a model with 3 components which detects a spurious third mode in the density. A gamma mixture model (GaMM) and a Weibull mixture model (WMM) are not appropriate here either. The estimators LLDE0 and LLDE1 are quite wiggly and display small spurious modes and bumps.

For the waiting time data none of the semiparametric models have boundary terms. The estimator POL is a fifth-order polynomial model; the estimators SPL1 and SPL2 both have 1 knot after knot deletion. The model selected by the BIC is SPL2. The estimators POL, SPL1/2, LLDE0/1/2 and a GMM with 2 components (truncated to [0, ∞) after the inference) perform well in terms of cross-validated likelihood, with LLDE2 being best; KDE1/2/3 and DE0/1 are worse. The mixture models GaMM and WMM are worse than the GMM. All of the estimators detect two modes, which are both quite close to Gaussian. The models KDE2 and LLDE0 have spurious bumps; all of the others detect two bumps.

9 Conclusions

A methodology for semiparametric maximum likelihood probability density estimation has been presented. The density is represented by sequences of exponential families generated by flexible sets of basis functions, including polynomials, trigonometric functions and splines with two different knot placement schemes and a knot deletion scheme, optionally augmented with logarithmic and rational boundary terms. A BIC model selection over the number as well as the type of basis functions is implemented. The technique is explored on a host of simulation and observation data sets and compared to a number of common density estimators. The test cases include uni-, bi- and multimodal densities, Gaussian and non-Gaussian modes, features of different scales in a density and long tails without structure, on bounded, infinite and semi-infinite intervals. Unlike all of the other methods, the SPDE consistently shows an excellent overall performance taking into account the MISE/cvlogl, the detection of modality and the computation time. It combines a very small MISE or large cvlogl with a high degree of global smoothness which enables the correct detection of the large-scale structure of the density in terms of the number of modes and bumps. All of the types of bulk basis functions as well as the boundary terms contribute significantly to the practical performance of the SPDE.

An extension of the method to the multivariate case is straightforward. One would use multinomials [37, 38], tensor-product splines [60] or multiple Fourier series as basis functions which account for cross-variable dependencies in a natural way. Boundary terms could be added where appropriate and model selection performed as in the univariate case.

A comparison of the ranking of the different density models in terms of the BIC and the cross-validated likelihood for the Old Faithful geyser data sets reveals some limitations of the BIC for model selection. The BIC is reliable for comparing models with the same type of basis functions and is still about adequate for comparing polynomial models with spline or trigonometric models. However, when models with boundary terms and finite mixture models are included in the comparison, there are cases where two models differ by about 6 points in the BIC but have about the same cross-validated likelihood. Ideally, the model selection should be based on some cross-validated score or even an out-of-sample score in data-rich settings. This is often computationally infeasible; if the order of the appropriate model is relatively low, 10-fold cross-validation, for example, may still be practical.
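A k-fold cross-validated log-likelihood of the kind suggested here could, for example, be computed along the following lines; fit_model and loglik are placeholders for fitting a density model to the training folds and evaluating its log-likelihood on the held-out fold, and are not part of the method described in this paper.

import numpy as np

def kfold_cv_loglik(sample, fit_model, loglik, k=10, seed=0):
    # k-fold cross-validated log-likelihood: fit on k-1 folds, score the
    # held-out fold, and sum the held-out log-likelihoods over all folds.
    y = np.asarray(sample, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(y.size)
    folds = np.array_split(idx, k)
    score = 0.0
    for j in range(k):
        test = folds[j]
        train = np.concatenate([folds[i] for i in range(k) if i != j])
        model = fit_model(y[train])      # placeholder: fit any density model
        score += loglik(model, y[test])  # placeholder: held-out log-likelihood
    return score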

Supporting information

S1 Appendix. Existence of the maximum likelihood estimator.

https://doi.org/10.1371/journal.pone.0259111.s001

(PDF)

S2 Appendix. The truncated moment problem and the moment-constrained maximum entropy principle.

https://doi.org/10.1371/journal.pone.0259111.s002

(PDF)

S3 Appendix. Alternative methods for density estimation.

https://doi.org/10.1371/journal.pone.0259111.s003

(PDF)

References

1. Silverman BW. Density Estimation for Statistics and Data Analysis. Chapman and Hall; 1986.
2. Wand MP, Jones MC. Kernel Smoothing. Chapman and Hall, London; 1995.
3. Sheather SJ. Density estimation. Statistical Science. 2004; 19:588–597.
4. Loader CR. Bandwidth selection: classical or plug-in? The Annals of Statistics. 1999; 27:415–438.
5. Sheather SJ, Jones MC. A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society B. 1991; 53:683–690.
6. Jones MC, Marron JS, Sheather SJ. A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association. 1996; 91:401–407.
7. Botev ZI, Grotowski JF, Kroese DP. Kernel density estimation via diffusion. The Annals of Statistics. 2010; 38:2916–2957.
8. Jones MC, Signorini DF. A comparison of higher-order bias kernel density estimators. Journal of the American Statistical Association. 1997; 92:1063–1073.
9. Ho N, Walker SG. Multivariate smoothing via the Fourier integral theorem and Fourier kernel. 2020; arXiv:2012.14482v1.
10. Jones MC, Marron JS, Sheather SJ. Simple boundary correction for kernel density estimation. Statistical Computation. 1993; 3:135–146.
11. Hall P, Park BU. New methods for bias correction at endpoints and boundaries. The Annals of Statistics. 2002; 30:1460–1479.
12. Abramson IS. On bandwidth variation in kernel estimates—a square root law. The Annals of Statistics. 1982; 10:1217–1223.
13. Samiuddin M, El-Sayyad GM. On nonparametric kernel density estimates. Biometrika. 1990; 77:865–874.
14. Jones MC, McKay IJ, Hu T-C. Variable location and scale kernel density estimation. Annals of the Institute of Statistical Mathematics. 1994; 46:521–535.
15. McLachlan G, Peel D. Finite Mixture Models. Wiley; 2000.
16. Loader CR. Local likelihood density estimation. The Annals of Statistics. 1996; 24:1602–1618.
17. Hjort NL, Jones MC. Locally parametric nonparametric density estimation. The Annals of Statistics. 1996; 24:1619–1647.
18. Watson GS. Density estimation by orthogonal series. The Annals of Mathematical Statistics. 1969; 40:1496–1498.
19. Hall P. Comparison of two orthogonal series methods of estimating a density and its derivatives on an interval. Journal of Multivariate Analysis. 1982; 12:432–449.
20. Efromovich S. Orthogonal series density estimation. Wiley Interdisciplinary Reviews Computational Statistics. 2010; 2:467–476.
21. Vitale RA. A Bernstein polynomial approach to density function estimation. Statistical Inference and Related Topics. 1975; 2:87–99.
22. Csiszar I. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability. 1975; 1:146–158.
23. Barron AR, Sheu C-H. Approximation of density functions by sequences of exponential families. The Annals of Statistics. 1991; 19:1347–1369.
24. Good IJ, Gaskins RA. Nonparametric roughness penalties for probability densities. Biometrika. 1971; 58:255–277.
25. Silverman BW. On the estimation of a probability density function by the maximum penalized likelihood method. The Annals of Statistics. 1982; 10:795–810.
26. O’Sullivan F. Fast computation of fully automated log-density and log-hazard estimators. SIAM Journal on Scientific and Statistical Computing. 1988; 9:363–379.
27. Gu C, Qiu C. Smoothing spline density estimation: theory. The Annals of Statistics. 1993; 21:217–234.
28. Crain BR. Estimation of distributions using orthogonal expansions. The Annals of Statistics. 1974; 2:454–463.
29. Stone CJ, Koo C-Y. Logspline density estimation. Contemporary Mathematics. 1986; 59:1–15.
30. Stone CJ. Large sample inference for logspline models. The Annals of Statistics. 1990; 18:717–741.
31. Kooperberg C, Stone CJ. A study of logspline density estimation. Computational Statistics & Data Analysis. 1991; 12:327–347.
32. Koo J-Y, Kim W-C. Wavelet density estimation by approximation of log-densities. Statistics & Probability Letters. 1996; 26:271–278.
33. Kwasniok F. Fluctuations of finite-time Lyapunov exponents in an intermediate-complexity atmospheric model: a multivariate and large-deviation perspective. Nonlinear Processes in Geophysics. 2019; 26:195–209.
34. Kwasniok F. Predicting critical transitions in dynamical systems from time series using nonstationary probability density modeling. Physical Review E. 2013; 88:052917. pmid:24329341
35. Mead LR, Papanicolaou N. Maximum entropy in the problem of moments. Journal of Mathematical Physics. 1984; 25:2404–2417.
36. Wu X. Calculation of maximum entropy densities with application to income distribution. Journal of Econometrics. 2003; 115:347–354.
37. Abramov RV. The multidimensional maximum entropy moment problem: a review on numerical methods. Communications in Mathematical Sciences. 2010; 8:377–392.
38. Hao W, Harlim J. An equation-by-equation algorithm for solving the multidimensional moment constrained maximum entropy problem. Communications in Applied Mathematics and Computational Science. 2018; 13:189–214.
39. Farmer J, Jacobs D. High throughput nonparametric probability density estimation. PLoS ONE. 2018; 13:e0196937. pmid:29750803
40. Barndorff-Nielsen OE. Information and Exponential Families in Statistical Theory. Wiley; 1978.
41. Brown LD. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. Lecture Notes. Volume 9. Institute of Mathematical Statistics; 1986.
42. Sundberg R. Statistical Modelling by Exponential Families. Cambridge University Press; 2019.
43. Efron B, Tibshirani R. Using specially designed exponential families for density estimation. The Annals of Statistics. 1996; 24:2431–2461.
44. Shohat JA, Tamarkin JD. The Problem of Moments. American Mathematical Society, Providence, RI; 1963.
45. Krein MG, Nudelman AA. The Markov Moment Problem and Extremal Problems. American Mathematical Society; 1977.
46. Schmüdgen K. The Moment Problem. Springer; 2017.
47. Jaynes ET. Information theory and statistical mechanics. Physical Review. 1957; 106:620–630.
48. Jaynes ET. On the rationale of maximum-entropy methods. Proceedings of the IEEE. 1982; 70:939–952.
49. Fawcett L, Walshaw D. Improved estimation for temporally clustered extremes. Environmetrics. 2007; 18:173–188.
50. Frontini M, Tagliani A. Entropy-convergence in Stieltjes and Hamburger moment problem. Applied Mathematics and Computation. 1997; 88:39–51.
51. Kullback S. Information Theory and Statistics. Dover, New York; 1962.
52. Schwarz G. Estimating the dimension of a model. The Annals of Statistics. 1978; 6:461–464.
53. Rissanen J. A universal prior for integers and estimation by minimum description length. The Annals of Statistics. 1983; 11:416–431.
54. Barron AR, Cover TM. Minimum complexity density estimation. IEEE Transactions on Information Theory. 1991; 37:1034–1054.
55. Kwasniok F, Lohmann G. Deriving dynamical models from paleoclimatic records: Application to glacial millennial-scale climate variability. Physical Review E. 2009; 80:066104. pmid:20365228
56. Livina VN, Kwasniok F, Lenton TM. Potential analysis reveals changing number of climate states during the last 60 kyr. Climate of the Past. 2010; 6:77–82.
57. Golub GH, Van Loan CF. Matrix Computations. 3rd edition. Johns Hopkins University Press; 1996.
58. Haughton DMA. On the choice of a model to fit data from an exponential family. The Annals of Statistics. 1988; 16:342–355.
59. Marron JS, Wand MP. Exact mean integrated squared error. The Annals of Statistics. 1992; 20:712–736.
60. Stone CJ, Hansen MH, Kooperberg C, Truong YK. Polynomial splines and their tensor products in extended linear modeling. The Annals of Statistics. 1997; 25:1371–1470.