Fixed design local polynomial smoothing and bandwidth selection for right censored data

https://doi.org/10.1016/j.csda.2020.107064

Abstract

The local polynomial smoothing of the Kaplan–Meier estimate for fixed designs is explored and analyzed. The first benefit, in comparison to classical convolution kernel smoothing, is the development of boundary aware estimates of the distribution function, its derivatives and integrated derivative products of any arbitrary order. The advancements proceed by developing asymptotic mean integrated square error optimal solve-the-equation plug-in bandwidth selectors for the estimates of the distribution function and its derivatives, and as a byproduct, a mean square error optimal bandwidth rule for integrated derivative products. The asymptotic properties of all methodological contributions are quantified analytically and discussed in detail. Three real data analyses illustrate the benefits of the proposed methodology in practice. Finally, numerical evidence is provided on the finite sample performance of the proposed technique with reference to benchmark estimates.

Introduction

Let $T$ denote a continuous lifetime variable with cumulative distribution function (c.d.f.) $F_T(t) = P(T \le t)$. Frequently the available data are beyond the experimenter’s control and come in the form of scatterplot observations. This is the case, for example, in life table analyses in actuarial science and in data analyses in demography; see Müller et al. (1997) and Wang et al. (1998). In such cases, the coordinates of the available data pairs consist of the response, which is usually an empirical estimate of the target curve, and the center of the associated time interval at which the curve is being estimated. Still, continuous estimates are desirable, especially when the analysis additionally depends on the estimate’s derivatives. For this reason, the present research considers the local polynomial smoothing of the well-known Kaplan–Meier estimate (Kaplan and Meier, 1958), with the first objective of providing continuous, boundary aware estimates of the distribution function, its derivatives of arbitrary order and integrated c.d.f. derivative products for fixed designs under the random right censorship model. The reasoning for pursuing this approach becomes clear on observing that smoothing scatterplot data intrinsically corresponds to formulating a reasonable nonparametric regression problem. The asymptotic unbiasedness of the Kaplan–Meier estimate, together with its strong representation as the underlying c.d.f. plus an asymptotically negligible error term, prompts its use as the response. At the same time, the center of each of the equidistant time intervals into which the observed data range is split is used as the design.
Hence, application of the local polynomial technique yields estimates of $F_T^{(\nu)}$, $\nu = 0, 1, \dots$, by matching the coefficients of polynomials fitted locally (through kernel weighted least squares) with the derivatives of $F_T$ in a Taylor expansion of the regression function at a nearby point; the precise formulation and details are provided in Section 2. This approach enables the development, also in Section 2, of local polynomial estimates for integrated c.d.f. derivative products of arbitrary order. These are useful in their own right since they are necessary for the implementation of automatic bandwidth selectors, in the estimation of population characteristics and statistical distance measures, and in a variety of other settings.
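
The kernel weighted least squares step described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the Epanechnikov kernel, the function names `epanechnikov` and `local_poly_derivative`, and the noise-free response $1 - e^{-x}$ (standing in for Kaplan–Meier values on an equidistant design) are all hypothetical choices made for the example.

```python
import math
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1] (an assumed kernel choice)."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def local_poly_derivative(x0, design, response, h, p=2, nu=0):
    """Estimate the nu-th derivative of the regression function at x0.

    A polynomial of degree p is fitted around x0 by kernel weighted least
    squares; nu! times the nu-th fitted coefficient is returned, mirroring
    the Taylor-expansion argument in the text.
    """
    d = np.asarray(design, dtype=float) - x0
    w = epanechnikov(d / h)                      # kernel weights with bandwidth h
    X = np.vander(d, N=p + 1, increasing=True)   # columns 1, d, d^2, ..., d^p
    sw = np.sqrt(w)
    # Weighted least squares solved as ordinary least squares on rescaled rows.
    beta, *_ = np.linalg.lstsq(X * sw[:, None], np.asarray(response) * sw, rcond=None)
    return math.factorial(nu) * beta[nu]

# Hypothetical fixed design: equidistant points with a smooth c.d.f. as response.
design = np.linspace(0.0, 3.0, 61)
response = 1.0 - np.exp(-design)   # stand-in for Kaplan-Meier values at the design points

F_hat = local_poly_derivative(1.0, design, response, h=0.5, p=2, nu=0)  # c.d.f. estimate
f_hat = local_poly_derivative(1.0, design, response, h=0.5, p=2, nu=1)  # first derivative
F_left = local_poly_derivative(0.0, design, response, h=0.5, p=2, nu=0) # left endpoint
```

Because the polynomial is refitted at every evaluation point using only the design points inside the kernel window, the same code applies unchanged at the endpoints, which is the mechanism behind the boundary aware behavior discussed in the text.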

Multiple benefits arise from the local polynomial smoothing of the Kaplan–Meier estimate. First, the definition of the Kaplan–Meier estimate does not involve a bandwidth, and thus its use as the response in the aforementioned nonparametric regression problem greatly simplifies implementation of the resulting estimates, which now depend on just one bandwidth; this is in contrast to the traditional approach, which needs two bandwidths. In terms of performance, the Asymptotic Mean Integrated Square Error (AMISE) and the central limit theorem for the estimates of $F_T^{(\nu)}$, quantified analytically in Section 3, are valid throughout the region of estimation and imply the absence of inflated bias at the endpoints. Further, the asymptotic properties of the integrated derivative product estimates, also quantified in Section 3, ensure efficient estimation of the functionals, as opposed to using conventional kernel smoothers. A subsequent advantage thus results from their use in developing (in Section 4) a solve-the-equation AMISE-optimal plug-in bandwidth rule applicable to all estimates proposed here. The rule is built as a direct extension of the corresponding density estimation bandwidth selector for complete data proposed in Cheng (1997). The gain is its stable performance across the region of estimation; this is also reflected in its asymptotic properties, quantified analytically together with its convergence rate and asymptotic distribution in Section 4. It is worth noting here that the literature is rather thin on AMISE-optimal bandwidth rules for convolution smoother estimates for right censored data. Since the plug-in rule proposed here is also applicable to the classical kernel approach, it can also be thought of as filling this important gap in the literature.

Section 5 investigates the finite sample performance of the proposed methodology. First, the analysis of three real world data sets illustrates how the proposed technique can help in capturing data patterns that remain undiscoverable either by the conventional kernel smoothing approach or by parametric estimates. Finally, distributional data are used to simulate the finite sample MISE performance of the proposed estimates and to compare it with that of estimates frequently used in the literature and in practice.

Section snippets

Local polynomial smoothing of the Kaplan–Meier estimate

Let $T_1, T_2, \dots, T_n$ be a sample of i.i.d. survival times censored on the right by i.i.d. random variables $U_1, U_2, \dots, U_n$, which are independent from the $T_i$’s. Let $f_T$ be the common probability density function (p.d.f.) and $F_T$ the c.d.f. of the $T_i$’s. Denote with $H$ the c.d.f. of the $U_i$’s. Typically the observed right censored data are denoted by the pairs $(X_i, \delta_i)$, $i = 1, 2, \dots, n$, with $X_i = \min\{T_i, U_i\}$ and $\delta_i = 1\{T_i \le U_i\}$, where $1\{\cdot\}$ is the indicator random variable of the event $\{\cdot\}$. The distribution function of the $X_i$’s
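
To make the data structure concrete, the following Python sketch computes the Kaplan–Meier c.d.f. estimate from pairs $(X_i, \delta_i)$ using the standard product-limit formula. The function name `kaplan_meier` and the five observations are hypothetical and chosen only for illustration; tied censored and uncensored times are handled by the usual convention that censored subjects are still at risk at the tied time.

```python
import numpy as np

def kaplan_meier(x, delta):
    """Return the distinct uncensored times and the Kaplan-Meier c.d.f. estimate there.

    x     : observed times X_i = min(T_i, U_i)
    delta : indicators, 1 if X_i is an uncensored lifetime, 0 if it is censored
    """
    x = np.asarray(x, dtype=float)
    delta = np.asarray(delta, dtype=int)
    event_times = np.unique(x[delta == 1])   # sorted distinct failure times
    surv = 1.0
    cdf = np.empty(event_times.size)
    for j, t in enumerate(event_times):
        at_risk = np.sum(x >= t)                   # subjects still under observation at t
        deaths = np.sum((x == t) & (delta == 1))   # uncensored failures at t
        surv *= 1.0 - deaths / at_risk             # product-limit update
        cdf[j] = 1.0 - surv
    return event_times, cdf

# Hypothetical sample: the second and fifth observations are censored.
times, F_km = kaplan_meier([2.0, 3.0, 3.0, 5.0, 7.0], [1, 0, 1, 1, 0])
# times -> [2, 3, 5]; F_km -> approximately [0.2, 0.4, 0.7]
```

The resulting step function involves no tuning parameter, which is why using it as the regression response leaves a single bandwidth to be selected.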

Asymptotic properties

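Denote the bias and variance of $\hat F_L^{(\nu)}(x)$ respectively by
$$b_{L(,c)}(x) = \begin{cases} b_{L,c}(x), & x \in [0,h) \cup (M-h, M] \\ b_L(x), & x \in [h, M-h], \end{cases} \qquad \sigma^2_{L(,c)}(x) = \begin{cases} \sigma^2_{L,c}(x), & x \in [0,h) \cup (M-h, M] \\ \sigma^2_L(x), & x \in [h, M-h]. \end{cases}$$
Set
$$g(x) = f_T(x)\,(1 - H(x))^{-1}, \qquad G(x) = \int_0^x g(t)\,dt,$$
and define the constant
$$C_1 = \int_0^M g(x)\,dx.$$
Similarly to the definition of $b_{L(,c)}(x)$ and $\sigma^2_{L(,c)}(x)$, let $K_{\nu(,c)}$ and $W_{\nu(,c)}$ stand for $K_{\nu,c}$ and $W_{\nu,c}$ respectively in the boundary and $K_\nu$ and $W_\nu$ in the interior. In what follows, focus is given on the left boundary, i.e. $x = ch \in [0,h)$, $0 < c < 1$, since treatment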

Plug in bandwidth selection

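By Theorem 1, and since the Lebesgue measure of $[0,h)$ tends to zero so that the corresponding integral vanishes, the MISE of $\hat F_L^{(\nu)}(x)$ can be decomposed as
$$\begin{aligned} \operatorname{MISE}\{\hat F_L^{(\nu)}(x)\} &= \int_0^{h_\nu} \operatorname{MSE}\{\hat F_L^{(\nu)}(x)\}\,dx + \int_{h_\nu}^{M} \operatorname{MSE}\{\hat F_L^{(\nu)}(x)\}\,dx \\ &\simeq \frac{h_\nu^4}{4}\,\mu_{\nu+2}^2(K_\nu)\,R\big(F_T^{(\nu+2)}\big) + \frac{(\nu!)^2}{n h_\nu^{2\nu}} \int_{h_\nu}^{M} G(x)\,dx - \frac{2(\nu!)^2}{n h_\nu^{2\nu-1}}\,C_1 A_{1,1} \\ &\quad - h_\nu \int_{h_\nu}^{M} \Big\{ F_T^{(\nu)}(x) + h_\nu^2\,\nu!\,((\nu+2)!)^{-1}\,\mu_{\nu+2}(K_\nu)\,F_T^{(\nu+2)}(x) \Big\}^2 dx + O(n^{-1} h_\nu^{-2\nu}) + o(h_\nu^4), \end{aligned}$$
where $a_n = O(b_n)$ if and only if $\limsup_{n \to \infty} |a_n / b_n| < \infty$. Write
$$\operatorname{MISE}\{\hat F_L^{(\nu)}\} = \operatorname{AMISE}\{\hat F_L^{(\nu)}\} + O(n^{-1} h_\nu^{-2\nu}) + o(h_\nu^4),$$
where $\operatorname{AMISE}\{\hat F_L^{(\nu)}\} = \frac{h_\nu^4}{4}\,\mu_{\nu+2}^2(K_\nu)\,R\big(F_T^{(\nu+2)}\big) - \frac{2(\nu!)^2}{n h_\nu^{2\nu-1}}$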

Numerical examples

Throughout this section, binning of each sample $(X_i, \delta_i)$, $i = 1, \dots, n$, is performed by splitting the observed data range into $g = [(X_{(n)} - X_{(1)})/b]$ disjoint intervals of equal length. In accordance with assumption A.3, the length is set to $b = x_2 - x_1 = n^{-3/4}$. Now, let $\hat F_L^{(0)} \equiv \hat F_L$ be the estimate of $F_T$ and let $\hat S_L(x) = 1 - \hat F_L(x)$ be the corresponding survival function estimate. In all examples $\hat F_L$ and $\hat S_L$ are implemented with the MISE optimal bandwidth obtained by (16) for $\nu = 0$ after replacing the unknown quantities by
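
The binning rule above can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the function name `equidistant_design` is hypothetical, the bracket $[\cdot]$ is assumed to denote the integer part, and any leftover stretch of the data range shorter than $b$ at the right end is simply ignored in this sketch.

```python
import numpy as np

def equidistant_design(x, b=None):
    """Split the observed range into g intervals of length b and return their centers.

    x : observed times X_i
    b : interval length; defaults to n**(-3/4) as stated in the text
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if b is None:
        b = n ** (-0.75)
    lo, hi = x.min(), x.max()
    g = max(1, int((hi - lo) / b))             # assumed reading: integer part of (X_(n) - X_(1)) / b
    edges = lo + b * np.arange(g + 1)          # g + 1 equally spaced interval edges
    centers = 0.5 * (edges[:-1] + edges[1:])   # interval midpoints, used as the fixed design
    return g, b, centers

# Hypothetical sample of n = 16 observed times.
x_obs = np.linspace(0.0, 2.05, 16)
g, b, centers = equidistant_design(x_obs)
```

The interval centers returned here play the role of the fixed design points, with the response at each center supplied by the Kaplan–Meier estimate.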

Conclusions and future work

This research investigated the local polynomial smoothing of the Kaplan–Meier estimate and showed that it leads to an effective and reliable way to estimate the c.d.f., its derivatives and auxiliary functionals for right censored data in the fixed design setting. The theoretical properties of all estimates and bandwidth selectors introduced herein suggest a robust asymptotic behavior throughout the region of estimation. The MISE simulations indicate that this robust behavior is valid also for finite

Acknowledgments

The authors thank the associate editor and two anonymous reviewers for their helpful comments and suggestions which have greatly improved this research.
