Fixed design local polynomial smoothing and bandwidth selection for right censored data
Introduction
Consider a continuous lifetime variable and its cumulative distribution function (c.d.f.). Frequently the available data are beyond the experimenter’s control and come in the form of scatterplot observations. For example, this is the case in lifetable analyses in actuarial science and in data analyses in demography; see Müller et al. (1997) and Wang et al. (1998). In such cases, the coordinates of each available data pair consist of the response, which is usually an empirical estimate of the target curve, and the center of the associated time interval at which the curve is being estimated. Still, continuous estimates are desirable, especially when the analysis additionally depends on the estimate’s derivatives. For this reason, the present research considers local polynomial smoothing of the well-known Kaplan–Meier estimate (Kaplan and Meier, 1958), with the primary objective of providing continuous, boundary aware estimates of the distribution function, its derivatives of arbitrary order, and integrated c.d.f. derivative products for fixed designs under the random right censorship model. The rationale for this approach is that smoothing scatterplot data intrinsically corresponds to formulating a nonparametric regression problem. The asymptotic unbiasedness of the Kaplan–Meier estimate, together with its strong representation as the underlying c.d.f. plus an asymptotically negligible error term, prompts its use as the response. At the same time, the centers of the equidistant time intervals into which the observed data range is split are used as the design. Hence, application of the local polynomial technique yields estimates of the c.d.f. and its derivatives by matching the coefficients of polynomials fitted locally, through kernel weighted least squares, with the derivatives of the c.d.f. in a Taylor expansion of the regression function at a nearby point; precise formulation and details are provided in Section 2.
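To make the kernel weighted least squares step concrete, the following is a minimal sketch of a local linear fit (polynomial degree one) on generic design/response pairs. It is not the estimator of Section 2: the function names, the Epanechnikov kernel and the restriction to degree one are assumptions made for illustration only. The fitted intercept plays the role of the curve estimate at the evaluation point and the fitted slope that of its first derivative.

```python
def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1] (an illustrative choice)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def local_linear(x0, design, response, h, kernel=epanechnikov):
    """Kernel weighted least squares fit of a straight line around x0.

    Minimises sum_i K((x_i - x0) / h) * (y_i - b0 - b1 * (x_i - x0))**2
    and returns (b0, b1). By the Taylor expansion argument, b0 estimates
    the regression function at x0 and b1 its first derivative.
    """
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in zip(design, response):
        d = x - x0
        w = kernel(d / h)
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * y
        t1 += w * d * y
    # Solve the 2x2 normal equations [[s0, s1], [s1, s2]] (b0, b1) = (t0, t1).
    det = s0 * s2 - s1 * s1
    if det <= 0.0:
        raise ValueError("too few design points inside the kernel window")
    b0 = (s2 * t0 - s1 * t1) / det
    b1 = (s0 * t1 - s1 * t0) / det
    return b0, b1


# Hypothetical equidistant design (interval centers) with a linear response.
design = [0.05 + 0.1 * k for k in range(10)]
response = [0.2 + 0.5 * x for x in design]
level, slope = local_linear(0.0, design, response, h=0.3)
```

Because a local linear fit reproduces straight lines exactly, evaluating at the left endpoint in this toy example recovers the intercept and slope of the line up to rounding, which is one way to see why such fits do not suffer from inflated bias at the boundary.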
This approach enables the development, also in Section 2, of local polynomial estimates for integrated c.d.f. derivative products of arbitrary order. These are useful in their own right since they are needed in the implementation of automatic bandwidth selectors, in the estimation of population characteristics and statistical distance measures, and in a variety of other settings.
Multiple benefits arise from the local polynomial smoothing of the Kaplan–Meier estimate. First, the Kaplan–Meier estimate itself does not involve a bandwidth, so using it as the response in the aforementioned nonparametric regression problem greatly simplifies implementation: the resulting estimates depend on just one bandwidth, in contrast to the traditional approach, which needs two. In terms of performance, the Asymptotic Mean Integrated Square Error (AMISE) and the central limit theorem for the c.d.f. derivative estimates, quantified analytically in Section 3, are valid throughout the region of estimation and imply the absence of inflated bias at the endpoints. Further, the asymptotic properties of the integrated derivative product estimates, also quantified in Section 3, ensure efficient estimation of the functionals, in contrast to conventional kernel smoothers. A further advantage results from using these estimates to develop (in Section 4) a solve-the-equation AMISE-optimal plug-in bandwidth rule applicable to all estimates proposed here. The rule is built as a direct extension of the corresponding density estimation bandwidth selector for complete data proposed in Cheng (1997). The gain is its stable performance across the region of estimation; this is also reflected in its asymptotic properties, quantified analytically together with its convergence rate and asymptotic distribution in Section 4. It is worth noting that the literature is rather thin on AMISE-optimal bandwidth rules for convolution smoother estimates for right censored data. Since the plug-in rule proposed here is also applicable to the classical kernel approach, it can also be thought of as filling this important gap in the literature.
Section 5 investigates the finite sample performance of the proposed methodology. First, the analysis of three real world data sets illustrates how the proposed technique can help capture data patterns that remain undiscoverable by either the conventional kernel smoothing approach or parametric estimates. Second, simulated distributional data are used to compare the finite sample MISE performance of the proposed estimates with that of estimates frequently used in the literature and in practice.
Section snippets
Local polynomial smoothing of the Kaplan–Meier estimate
Let $T_1, \dots, T_n$ be a sample of i.i.d. survival times censored on the right by i.i.d. random variables $C_1, \dots, C_n$, which are independent of the $T_i$’s. Let $f$ be the common probability density function (p.d.f.) and $F$ the c.d.f. of the $T_i$’s. Denote by $H$ the c.d.f. of the $C_i$’s. Typically the observed right censored data are denoted by the pairs $(X_i, \delta_i)$, $i = 1, \dots, n$, with $X_i = \min(T_i, C_i)$ and $\delta_i = 1\{T_i \le C_i\}$, where $1\{\cdot\}$ is the indicator random variable of the event $\{T_i \le C_i\}$. The distribution function of the $X_i$’s
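As an illustration of the data structure described in this snippet, the sketch below computes the standard Kaplan–Meier (product-limit) estimate of the c.d.f. from pairs consisting of an observed time (the smaller of the lifetime and the censoring time) and an indicator that the lifetime was actually observed. The function name and the return format are assumptions made for this example and are not taken from the paper.

```python
def kaplan_meier_cdf(times, events):
    """Kaplan-Meier estimate of the c.d.f. from right censored data.

    times  -- observed times, each the minimum of lifetime and censoring time
    events -- 1 if the lifetime was observed, 0 if it was censored
    Returns (points, cdf): the distinct uncensored times and the value of
    the estimated c.d.f. at and immediately after each of them.
    """
    data = sorted(zip(times, events))
    n = len(data)
    surv = 1.0
    points, cdf = [], []
    i = 0
    while i < n:
        t = data[i][0]
        at_risk = n - i          # subjects with observed time >= t
        deaths = 0
        while i < n and data[i][0] == t:   # group ties at time t
            deaths += data[i][1]
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk  # product-limit update
            points.append(t)
            cdf.append(1.0 - surv)
    return points, cdf


# Hypothetical toy sample: the second and fifth observations are censored.
points, cdf = kaplan_meier_cdf([1, 2, 3, 4, 5], [1, 0, 1, 1, 0])
```

Censored observations do not create jumps; they only reduce the number at risk for later jumps, which is why the estimate need not reach one when the largest observation is censored.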
Asymptotic properties
Denote the bias and variance of respectively by Set and define the constant Similarly to the definition of and , let and stand for and respectively in the boundary and and in the interior. In what follows, focus is given on the left boundary, i.e. , since treatment
Plug in bandwidth selection
By Theorem 1 and since the Lebesgue measure of tends to zero and therefore the corresponding integral is zero, the MISE of can be decomposed as where if and only if . Write where
Numerical examples
Throughout this section, binning of each sample is performed by splitting the observed data range into disjoint intervals of equal length. In accordance with assumption A.3 the length is set to . Now, let be the estimate of and let be the corresponding survival function estimate. In all examples and are implemented with the MISE optimal bandwidth obtained by (16) for after replacing the unknown quantities by
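The binning step can be sketched as follows: the data range is split into equal-length intervals, and a step function (such as a Kaplan–Meier c.d.f. estimate) is evaluated at each interval center to form the design/response pairs. The function names are hypothetical, and the number of intervals is left as a free parameter here because the interval length prescribed by assumption A.3 is not reproduced in this excerpt.

```python
import bisect


def bin_centers(lo, hi, n_bins):
    """Centers of n_bins disjoint equal-length intervals covering [lo, hi]."""
    width = (hi - lo) / n_bins
    return [lo + (k + 0.5) * width for k in range(n_bins)]


def step_values(centers, jump_points, jump_values):
    """Evaluate a right-continuous step function at each center.

    jump_points must be sorted; the function equals 0 before the first
    jump and jump_values[k] from jump_points[k] up to the next jump.
    """
    out = []
    for c in centers:
        k = bisect.bisect_right(jump_points, c)
        out.append(jump_values[k - 1] if k > 0 else 0.0)
    return out


# Hypothetical step c.d.f. with jumps at 1, 3 and 4 on the range [0, 5].
centers = bin_centers(0.0, 5.0, 5)
responses = step_values(centers, [1, 3, 4], [0.2, 0.5, 0.8])
```

The resulting `centers` and `responses` lists are the kind of fixed design and response that a kernel weighted least squares smoother would then take as input.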
Conclusions and future work
This research investigated the local polynomial smoothing of the Kaplan–Meier estimate and showed that it leads to an effective and reliable way to estimate the c.d.f., its derivatives and auxiliary functionals for right censored data in the fixed design setting. The theoretical properties of all estimates and bandwidth selectors introduced herein suggest a robust asymptotic behavior throughout the region of estimation. The MISE simulations indicate that this robust behavior is valid also for finite
Acknowledgments
The authors thank the associate editor and two anonymous reviewers for their helpful comments and suggestions which have greatly improved this research.
References (23)
Central limit theorem for integrated square error of multivariate nonparametric density estimators. J. Multivariate Anal. (1984)
A new lifetime distribution. Comput. Statist. Data Anal. (2007)
Modeling the failure data of a repairable equipment with bathtub type failure intensity. Reliab. Eng. Syst. Saf. (2001)
et al. Local Polynomial Smoothing Based on the Kaplan–Meier Estimate. (2020)
et al. Local polynomial fitting in failure rate estimation. IEEE Trans. Reliab. (2008)
et al. On the rate of uniform convergence of the product-limit estimator: strong and weak laws. Ann. Statist. (1997)
On boundary effects of smooth curve estimators.
Boundary aware estimators of integrated density derivative products. J. R. Stat. Soc. Ser. B Stat. Methodol. (1997)
et al. Local Polynomial Modeling and Its Applications. (1996)
et al. Bias correction for local linear regression estimation using asymmetric kernels via the skewing method. Econom. Stat. (2020)
Families of smooth confidence bands for the survival function under the general random censorship model. Lifetime Data Anal.