Fixed design local polynomial smoothing and bandwidth selection for right censored data

doi:10.1016/j.csda.2020.107064

Computational Statistics & Data Analysis

Volume 153, January 2021, 107064


Abstract

The local polynomial smoothing of the Kaplan–Meier estimate for fixed designs is explored and analyzed. The first benefit, in comparison to classical convolution kernel smoothing, is the development of boundary aware estimates of the distribution function, its derivatives and integrated derivative products of arbitrary order. The work then develops asymptotic mean integrated square error optimal solve-the-equation plug-in bandwidth selectors for the estimates of the distribution function and its derivatives and, as a byproduct, a mean square error optimal bandwidth rule for integrated derivative products. The asymptotic properties of all methodological contributions are quantified analytically and discussed in detail. Three real data analyses illustrate the benefits of the proposed methodology in practice. Finally, numerical evidence is provided on the finite sample performance of the proposed technique with reference to benchmark estimates.

Introduction

Let $T$ denote a continuous lifetime variable with cumulative distribution function (c.d.f.) $F_{T} (t) = P (T \leq t)$. Frequently the available data are beyond the experimenter’s control and come in the form of scatterplot observations. This is the case, for example, in lifetable analyses in actuarial science and in data analyses in demography; see Müller et al. (1997) and Wang et al. (1998). In such cases, the coordinates of the available data pairs consist of the response, which is usually an empirical estimate of the target curve, and the center of the associated time interval at which the curve is being estimated. Still, continuous estimates are desirable, especially when the analysis additionally depends on the estimate’s derivatives. For this reason, the present research considers the local polynomial smoothing of the well-known Kaplan–Meier estimate (Kaplan and Meier, 1958), with the primary objective of providing continuous, boundary aware estimates of the distribution function, its derivatives of arbitrary order and integrated c.d.f. derivative products for fixed designs under the random right censorship model. The reasoning behind this approach becomes clear on observing that smoothing scatterplot data intrinsically corresponds to formulating a nonparametric regression problem. The asymptotic unbiasedness of the Kaplan–Meier estimate, together with its strong representation as the underlying c.d.f. plus an asymptotically negligible error term, prompts its use as the response. At the same time, the center of each of the equidistant time intervals into which the observed data range is split is used as the design.

Hence, application of the local polynomial technique yields estimates of $F_{T}^{(ν)}, ν = 0, 1, \dots$ by matching the coefficients of polynomials fitted locally (through kernel weighted least squares) with the derivatives of $F_{T}$ in a Taylor expansion of the regression function at a nearby point; the precise formulation and details are provided in Section 2. This approach also enables the development, in Section 2, of local polynomial estimates for integrated c.d.f. derivative products of arbitrary order. These are useful in their own right, since they are necessary for the implementation of automatic bandwidth selectors, for the estimation of population characteristics and statistical distance measures, and in a variety of other settings.
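The kernel weighted least squares step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `epanechnikov` and `local_poly`, the choice of the Epanechnikov kernel, the polynomial degree `p`, and the stand-in responses (exact values of a known c.d.f. on an equidistant grid, used here in place of Kaplan–Meier values) are all assumptions made for illustration. The estimate of the $ν$-th derivative is taken as $ν!$ times the $ν$-th fitted coefficient.

```python
import math
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel (an assumed choice), supported on [-1, 1]."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def local_poly(x0, x, y, h, p=2, nu=0):
    """Local polynomial estimate of the nu-th derivative of the regression
    function at x0, from design points x and responses y, with bandwidth h."""
    d = x - x0
    w = epanechnikov(d / h)                      # kernel weights
    D = np.vander(d, N=p + 1, increasing=True)   # columns 1, d, d^2, ..., d^p
    sw = np.sqrt(w)
    # Weighted least squares solved as ordinary least squares on scaled rows.
    beta, *_ = np.linalg.lstsq(D * sw[:, None], y * sw, rcond=None)
    return math.factorial(nu) * beta[nu]

# Hypothetical fixed design: equidistant grid with stand-in responses
# taken from the exponential c.d.f. F(x) = 1 - exp(-x).
grid = np.linspace(0.0, 3.0, 61)
responses = 1.0 - np.exp(-grid)

cdf_at_1 = local_poly(1.0, grid, responses, h=0.5, p=2, nu=0)
pdf_at_1 = local_poly(1.0, grid, responses, h=0.5, p=2, nu=1)
cdf_at_0 = local_poly(0.0, grid, responses, h=0.5, p=2, nu=0)  # boundary point
print(cdf_at_1, pdf_at_1, cdf_at_0)
```

The same function is evaluated at the boundary point $x_0 = 0$ without any special handling: only design points to the right of the boundary receive weight and the local fit adapts to them, which is the sense in which local polynomial estimates are boundary aware.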

Multiple benefits arise from the local polynomial smoothing of the Kaplan–Meier estimate. First, the Kaplan–Meier estimate is defined without a bandwidth, so its use as the response in the aforementioned nonparametric regression problem greatly simplifies implementation: the resulting estimates depend on just one bandwidth, in contrast to the traditional approach, which needs two. In terms of performance, the Asymptotic Mean Integrated Square Error (AMISE) and central limit theorem for the estimates of $F_{T}^{(ν)}$, quantified analytically in Section 3, are valid throughout the region of estimation and imply the absence of inflated bias at the endpoints. Further, the asymptotic properties of the integrated derivative product estimates, also quantified in Section 3, ensure efficient estimation of the functionals, in contrast to conventional kernel smoothers. A further advantage follows from using them to develop (in Section 4) a solve-the-equation AMISE-optimal plug-in bandwidth rule applicable to all estimates proposed here. The rule is built as a direct extension of the corresponding density estimation bandwidth selector for complete data proposed in Cheng (1997). The gain is its stable performance across the region of estimation; this is also reflected in its asymptotic properties, quantified analytically together with its convergence rate and asymptotic distribution in Section 4. It is worth noting that the literature on AMISE optimal bandwidth rules for convolution smoother estimates for right censored data is rather thin. Since the plug-in rule proposed here is also applicable to the classical kernel approach, it can also be thought of as filling this important gap in the literature.

Section 5 investigates the finite sample performance of the proposed methodology. First, the analysis of three real world data sets illustrates how the proposed technique can help capture data patterns that remain undiscovered by either the conventional kernel smoothing approach or parametric estimates. Finally, distributional data are used to simulate the finite sample MISE performance of the proposed estimates and compare it with that of estimates frequently used in the literature and in practice.

Section snippets

Local polynomial smoothing of the Kaplan–Meier estimate

Let $T_{1}, T_{2}, \dots, T_{n}$ be a sample of i.i.d. survival times censored on the right by i.i.d. random variables $U_{1}, U_{2}, \dots, U_{n}$, which are independent of the $T_{i}$’s. Let $f_{T}$ be the common probability density function (p.d.f.) and $F_{T}$ the c.d.f. of the $T_{i}$’s. Denote by $H$ the c.d.f. of the $U_{i}$’s. Typically the observed right censored data are denoted by the pairs $(X_{i}, δ_{i})$, $i = 1, 2, \dots, n$, with $X_{i} = \min \{T_{i}, U_{i}\}$ and $δ_{i} = 1_{\{T_{i} \leq U_{i}\}}$, where $1_{\{\cdot\}}$ is the indicator random variable of the event $\{\cdot\}$. The distribution function of the $X_{i}$’s
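As an illustration of this data structure, the sketch below simulates pairs $(X_{i}, δ_{i})$ and computes the standard product-limit (Kaplan–Meier) estimate of $F_{T}$. The function names `kaplan_meier` and `km_cdf`, the exponential lifetime and censoring distributions, and the sample size are hypothetical choices for the example and are not taken from the paper.

```python
import numpy as np

def kaplan_meier(x, delta):
    """Product-limit estimate of F_T at the distinct uncensored times.

    x     : observed times X_i = min(T_i, U_i)
    delta : indicators, 1 if T_i <= U_i (uncensored), 0 otherwise
    """
    x = np.asarray(x, dtype=float)
    delta = np.asarray(delta, dtype=int)
    event_times = np.unique(x[delta == 1])
    # Number still at risk just before each event time, and events at it.
    at_risk = np.array([(x >= t).sum() for t in event_times])
    events = np.array([((x == t) & (delta == 1)).sum() for t in event_times])
    survival = np.cumprod(1.0 - events / at_risk)
    return event_times, 1.0 - survival

def km_cdf(t, event_times, cdf_values):
    """Evaluate the right-continuous Kaplan-Meier step function at t."""
    idx = np.searchsorted(event_times, t, side="right")
    return np.concatenate(([0.0], cdf_values))[idx]

# Simulated example (assumed distributions, for illustration only).
rng = np.random.default_rng(0)
n = 200
T = rng.exponential(1.0, n)        # lifetimes
U = rng.exponential(2.0, n)        # censoring times, independent of T
X = np.minimum(T, U)
delta = (T <= U).astype(int)

times, F_km = kaplan_meier(X, delta)
print(km_cdf(1.0, times, F_km))    # estimate of F_T(1)
```

The step function returned by `km_cdf` is the kind of response that would then be evaluated at the fixed design points and passed to a smoother.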

Asymptotic properties

Denote the bias and variance of ${\hat{F}}_{L}^{(ν)} (x)$ respectively by
$$b_{L (, c)} (x) = \begin{cases} b_{L, c} (x), & x \in [0, h) \cup (M - h, M] \\ b_{L} (x), & x \in [h, M - h], \end{cases} \qquad σ_{L (, c)}^{2} (x) = \begin{cases} σ_{L, c}^{2} (x), & x \in [0, h) \cup (M - h, M] \\ σ_{L}^{2} (x), & x \in [h, M - h]. \end{cases}$$
Set $g (x) = f_{T} (x) {(1 - H (x))}^{- 1}$, $G (x) = \int_{0}^{x} g (t) \, d t$, and define the constant $C_{1} = \int_{0}^{M} g (x) \, d x$. Similarly to the definition of $b_{L (, c)} (x)$ and $σ_{L (, c)}^{2} (x)$, let $K_{ν (, c)}^{*}$ and $W_{ν (, c)}^{*}$ stand for $K_{ν, c}^{*}$ and $W_{ν, c}^{*}$ respectively in the boundary and for $K_{ν}^{*}$ and $W_{ν}^{*}$ in the interior. In what follows, the focus is on the left boundary, i.e. $x = c h \in [0, h), 0 < c < 1$, since treatment
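To give a feel for the quantities $g$, $G$ and $C_{1}$ defined above, here is a small numerical sketch for one hypothetical model: exponential lifetimes with rate 1 and exponential censoring with rate 0.5, on $[0, M]$ with $M = 2$. These distributions, the value of $M$ and the helper names are assumptions for the example only. In this case $g (x) = e^{- x} ∕ e^{- x ∕ 2} = e^{- x ∕ 2}$, so that $G (x) = 2 (1 - e^{- x ∕ 2})$ and $C_{1} = G (M)$ in closed form, which the numerical integration should reproduce.

```python
import numpy as np

def g_fn(x, f_T, H):
    """g(x) = f_T(x) / (1 - H(x)), following the definition in the text."""
    return f_T(x) / (1.0 - H(x))

def cumulative_trapezoid(y, x):
    """Cumulative trapezoidal integral of y over x, starting at 0."""
    steps = 0.5 * (y[1:] + y[:-1]) * np.diff(x)
    return np.concatenate(([0.0], np.cumsum(steps)))

# Hypothetical model: T ~ Exp(rate 1), U ~ Exp(rate 0.5).
f_T = lambda x: np.exp(-x)               # lifetime density
H = lambda x: 1.0 - np.exp(-0.5 * x)     # censoring c.d.f.
M = 2.0

grid = np.linspace(0.0, M, 2001)
g_vals = g_fn(grid, f_T, H)
G_vals = cumulative_trapezoid(g_vals, grid)   # G on the grid
C1 = G_vals[-1]                               # C_1 = integral of g over [0, M]

closed_form = 2.0 * (1.0 - np.exp(-0.5 * M))
print(C1, closed_form)
```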

Plug in bandwidth selection

By Theorem 1 and since the Lebesgue measure of $[0, h)$ tends to zero and therefore the corresponding integral is zero, the MISE of ${\hat{F}}_{L}^{(ν)} (x)$ can be decomposed as
$$\begin{aligned} MISE \{{\hat{F}}_{L}^{(ν)} (x)\} &= \int_{0}^{h_{ν}} MSE \{{\hat{F}}_{L}^{(ν)} (x)\} \, d x + \int_{h_{ν}}^{M} MSE \{{\hat{F}}_{L}^{(ν)} (x)\} \, d x \\ &\simeq \frac{h_{ν}^{4}}{4} μ_{ν + 2}^{2} (K_{ν}^{*}) R (F_{T}^{(ν + 2)}) + \frac{{(ν!)}^{2}}{n h_{ν}^{2 ν}} \int_{h_{ν}}^{M} G (x) \, d x - 2 \frac{{(ν!)}^{2}}{n h_{ν}^{2 ν - 1}} C_{1} A_{1, 1} \\ &\quad - h_{ν} \int_{h_{ν}}^{M} {\{F_{T}^{(ν)} (x) + h_{ν}^{2} ν! {((ν + 2)!)}^{- 1} μ_{ν + 2} (K_{ν}^{*}) F_{T}^{(ν + 2)} (x)\}}^{2} \, d x + O (n^{- 1} h_{ν}^{2 ν}) + o (h_{ν}^{4}), \end{aligned}$$
where $a_{n} = O (b_{n})$ if and only if $\limsup_{n \to \infty} | a_{n} / b_{n} | < \infty$. Write
$$MISE \{{\hat{F}}_{L}^{(ν)}\} = AMISE \{{\hat{F}}_{L}^{(ν)}\} + O (n^{- 1} h_{ν}^{- 2 ν}) + o (h_{ν}^{4}),$$
where $AMISE \{{\hat{F}}_{L}^{(ν)}\} = \frac{h_{ν}^{4}}{4} μ_{ν + 2}^{2} (K_{ν}^{*}) R (F_{T}^{(ν + 2)}) - 2 \frac{{(ν!)}^{2}}{n h_{ν}^{2 ν - 1}}$

Numerical examples

Throughout this section, binning of each sample $(X_{i}, δ_{i}), i = 1, \dots, n$, is performed by splitting the observed data range into $g = [(X_{(n)} - X_{(1)}) / b]$ disjoint intervals of equal length. In accordance with assumption A.3, the length is set to $b = x_{2} - x_{1} = n^{- 3 / 4}$. Now, let ${\hat{F}}_{L}^{(0)} \equiv {\hat{F}}_{L}$ be the estimate of $F_{T}$ and let ${\hat{S}}_{L} (x) = 1 - {\hat{F}}_{L} (x)$ be the corresponding survival function estimate. In all examples ${\hat{F}}_{L}$ and ${\hat{S}}_{L}$ are implemented with the MISE optimal bandwidth obtained by (16) for $ν = 0$ after replacing the unknown quantities by
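The binning step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the paper's code: the bracket $[\cdot]$ is read as the integer part, the bins are taken to start at $X_{(1)}$ with common length $b = n^{- 3 ∕ 4}$, and the function name `bin_centres` and the simulated sample are hypothetical.

```python
import math
import numpy as np

def bin_centres(x):
    """Split the observed range of x into g bins of length b = n^(-3/4)
    and return (b, g, centres). Assumes [.] denotes the integer part."""
    x = np.asarray(x, dtype=float)
    n = x.size
    b = n ** (-0.75)                          # bin length from the text
    lo, hi = x.min(), x.max()
    g = int(math.floor((hi - lo) / b))        # number of bins
    edges = lo + b * np.arange(g + 1)         # left-aligned bin edges
    centres = 0.5 * (edges[:-1] + edges[1:])  # design points
    return b, g, centres

# Hypothetical sample of observed times.
rng = np.random.default_rng(1)
X = np.minimum(rng.exponential(1.0, 500), rng.exponential(2.0, 500))
b, g, centres = bin_centres(X)
print(b, g, centres[:3])
```

Under this reading, a remainder shorter than $b$ at the right end of the range is left outside the last bin; whether the paper handles that remainder differently cannot be determined from the text shown here.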

Conclusions and future work

This research investigated the local polynomial smoothing of the Kaplan–Meier estimate and showed that it leads to an effective and reliable way to estimate the c.d.f., its derivatives and auxiliary functionals for right censored data in the fixed design setting. The theoretical properties of all estimates and bandwidth selectors introduced herein suggest robust asymptotic behavior throughout the region of estimation. The MISE simulations indicate that this robust behavior is valid also for finite

Acknowledgments

The authors thank the associate editor and two anonymous reviewers for their helpful comments and suggestions which have greatly improved this research.

References (23)

Hall, P. (1984). Central limit theorem for integrated square error of multivariate nonparametric density estimators. J. Multivariate Anal.
Kus, C. (2007). A new lifetime distribution. Comput. Statist. Data Anal.
Pulcini, G. (2001). Modeling the failure data of a repairable equipment with bathtub type failure intensity. Reliab. Eng. Syst. Saf.
Bagkavos, D. et al. (2020). Local Polynomial Smoothing Based on the Kaplan–Meier Estimate.
Bagkavos, D. et al. (2008). Local polynomial fitting in failure rate estimation. IEEE Trans. Reliab.
Chen, K. et al. (1997). On the rate of uniform convergence of the product-limit estimator: strong and weak laws. Ann. Statist.
Cheng, M.-Y. On boundary effects of smooth curve estimators.
Cheng, M.-Y. (1997). Boundary aware estimators of integrated density derivative products. J. R. Stat. Soc. Ser. B Stat. Methodol.
Fan, J. et al. (1996). Local Polynomial Modeling and Its Applications.
Funke, B. et al. (2020). Bias correction for local linear regression estimation using asymmetric kernels via the skewing method. Econom. Stat.
Gulati, S. et al. (1996). Families of smooth confidence bands for the survival function under the general random censorship model. Lifetime Data Anal.

Cited by (2)

Local polynomial smoothing based on the Kaplan–Meier estimate
2022, Journal of Statistical Planning and Inference
The local polynomial modeling of the Kaplan–Meier estimate for random designs under the right censored data setting is investigated in detail. Two classes of boundary aware estimates are developed: estimates of the distribution function and its derivatives of any arbitrary order and estimates of integrated distribution function derivative products. Their statistical properties are quantified analytically and their implementation is facilitated by the development of corresponding data driven plug-in bandwidth selectors. The asymptotic rate of convergence of the plug-in rule for the estimates of the distribution function and its derivatives is quantified analytically. Numerical evidence is also provided on its finite sample performance. A real life data analysis illustrates how the methodological advances proposed herein help to generate additional insights in comparison to existing methods.
Two bias-corrected Kaplan-Meier estimators
2022, Quality and Reliability Engineering International
