Nonparametric estimation for survival data with censoring indicators missing at random☆
Introduction
We consider the problem of estimation from right-censored data in the presence of covariates, when the censoring indicator is missing. Let T be a random variable representing the time to death from the cause of interest. Let C denote a right-censoring random time. Under usual random censorship, the observation is $Y = T \wedge C$ and $\delta = \mathbf{1}_{\{T \le C\}}$. Let X denote a real covariate. In what follows, it is assumed that T, C and X admit densities denoted respectively by $f_T$, $g$ and $f_X$. In addition, C is assumed to be independent of T conditionally on X; see e.g. Comte et al. (2011) for comments on this assumption.
When the cause of death is not recorded, the censoring indicator is missing: this is the missing censoring indicator (MCI) model, see Dinse (1986) and Subramanian (2006), which is defined as follows. Let $\xi$ be the missingness indicator, that is, $\xi = 1$ if the censoring indicator $\delta$ is observed and $\xi = 0$ otherwise. The observed data for individual $i \in \{1, \dots, n\}$ are then $(X_i, Y_i, \xi_i, \xi_i \delta_i)$. We shall say that the model is:
- MCAR under the assumption that the indicator is Missing Completely At Random, i.e. the missingness indicator $\xi$ is independent of T, C and X (see e.g. McKeague and Subramanian, 1998).
- MAR under the assumption that the indicator is Missing At Random, i.e. the missingness indicator $\xi$ and the censoring indicator $\delta$ are independent conditionally on Y, X.
In this paper, we mainly concentrate on the MAR model. The MCAR model will be considered in Section 2.2.
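To make the two missingness schemes concrete, here is a minimal simulation sketch. Everything specific in it is an assumption chosen for illustration and does not come from the paper: the function name `simulate_mci`, the uniform law of X, the exponential laws of T and C, the constant observation probability 0.7 under MCAR, the logistic form of the observation probability under MAR, and the use of `-1` as a sentinel for a missing indicator.

```python
import numpy as np


def simulate_mci(n, scheme="MAR", seed=0):
    """Draw a toy sample (X, Y, xi, delta_obs) from a missing censoring indicator model.

    All distributions below are illustrative assumptions, not the paper's design.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n)            # covariate X
    t = rng.exponential(scale=1.0 / (1.0 + x))   # T given X: exponential with rate 1 + X
    c = rng.exponential(scale=2.0, size=n)       # C drawn independently of T (hence also given X)
    y = np.minimum(t, c)                         # observed time Y = min(T, C)
    delta = (t <= c).astype(int)                 # censoring indicator, 1 if T <= C

    if scheme == "MCAR":
        # Probability of observing delta is a constant: xi independent of (T, C, X).
        p_obs = np.full(n, 0.7)
    elif scheme == "MAR":
        # Probability of observing delta depends on (X, Y) only, so that
        # xi and delta are independent conditionally on (X, Y).
        p_obs = 1.0 / (1.0 + np.exp(-(0.5 + x - y)))
    else:
        raise ValueError("scheme must be 'MCAR' or 'MAR'")

    xi = rng.binomial(1, p_obs)                  # missingness indicator, 1 if delta is observed
    delta_obs = np.where(xi == 1, delta, -1)     # -1 marks a missing censoring indicator
    return x, y, xi, delta_obs
```

A call such as `x, y, xi, delta_obs = simulate_mci(1000, scheme="MAR")` returns four arrays of length 1000; only the entries of `delta_obs` with `xi == 1` carry information about censoring.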
This model has been studied by several authors in the last decade. Most papers focus on the estimation of the survival function and of the cumulative hazard rate. In particular, van der Laan and McKeague (1998) improve Lo's (1991) paper and build a sieved nonparametric maximum likelihood estimator of the survival function in the MAR case. Their estimator is a generalization of the Kaplan and Meier (1958) estimator to this context and is the first proposal reaching the efficiency bound. Subramanian (2004) also proposes an estimator of the survival function in the MAR case and proves it to be efficient. Gijbels et al. (2007) study semi-parametric and nonparametric Cox regression analysis in several contexts.
Kernel methods have also been used to build different estimators in the MAR context. Subramanian (2006) estimates the cumulative hazard rate with a ratio of kernel estimators. He provides an almost sure representation and a Central Limit Theorem (CLT), and deduces results of the same type for the survival function. A study in a similar context is also provided by Wang and Ng (2008). Recently, Wang et al. (2009) proposed a density estimator based on kernels and Kaplan–Meier-type corrections for censoring. They prove a CLT and suggest a bandwidth selection strategy. Extensions of these works to conditional functions (both cumulative hazard and survival functions) in the presence of covariates are developed in Wang and Shen (2008).
Note that several authors (Dikta, 1998, and more recently Subramanian, 2009, Subramanian, 2011) study semiparametric models for the missing process and imputation methods, but for the Kaplan–Meier estimator, while we remain in a purely nonparametric setting.
Both our method and our aim are rather different. We indeed consider the estimation of the conditional hazard rate given a covariate. Moreover, we provide a nonparametric mean square strategy by considering approximations of the target function on finite dimensional linear spaces spanned by convenient and simple orthonormal (functional) bases. A collection of estimators is thus defined, indexed by the dimension of the multidimensional projection space, and a penalization device allows us to select a “good” space among all the proposals.
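As a schematic illustration of the selection step just described, model selection by penalized contrast generally takes the form below. The notation is assumed here for illustration and is not necessarily that of the paper: $\gamma_n$ stands for the empirical contrast, $S_m$ for a projection space, $\mathcal{M}_n$ for the collection of candidate indices and $\mathrm{pen}$ for the penalty function.

```latex
\hat h_m = \arg\min_{h \in S_m} \gamma_n(h),
\qquad
\hat m = \arg\min_{m \in \mathcal{M}_n} \bigl\{ \gamma_n(\hat h_m) + \mathrm{pen}(m) \bigr\},
\qquad
\tilde h = \hat h_{\hat m}.
```

In schemes of this kind the penalty typically grows with the dimension of $S_m$, so that the selected index balances the fit measured by the contrast against the complexity of the space.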
Contrary to standard kernel methodology, our estimator has the advantage of being defined as a contrast minimizer and not as a ratio of two estimators, see Wang and Shen (2008) and Subramanian (2006). Its definition involves an unknown function, which has to be replaced by an estimator; this step is shared by the kernel approach. However, our precise study of the plug-in estimator allows us to control the mean square risk non-asymptotically. From an asymptotic point of view, we provide anisotropic rates corresponding to the regularity of the function under estimation, plus the rate of the intermediate plug-in estimator.
The plan of the paper is the following. We first explain in Section 2 how the contrast is built, and how it allows us to compute a collection of estimators. We conclude the section by giving the penalization device that completes the definition of the data driven estimator, up to an estimator to be plugged in the procedure. In Section 3, we state the theoretical results that ensure that the quadratic risk of our estimator behaves well provided that the intermediate estimator has small risk. Then, we show how similar tools can be used to build, compute and control the second estimator. The procedure is tested in a simulation (Section 4) for both hazard and conditional hazard rates (i.e. with or without covariate) and under different missing schemes. Technical proofs are gathered in.
Choice of the contrast
We consider the general MAR case as described in the Introduction. The global assumption is denoted by (A0) and has several parts, which we specify below.
(A0-1) The random vectors $(T_i, C_i, X_i, \xi_i)$ are independent copies, for $i = 1, \dots, n$, of $(T, C, X, \xi)$.
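(A0-2) For $i = 1, \dots, n$, we observe $X_i$, $Y_i = T_i \wedge C_i$, $\xi_i$, and $\delta_i = \mathbf{1}_{\{T_i \le C_i\}}$ if $\xi_i = 1$; otherwise $\delta_i$ is not observed.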
(A0-3) C is independent of T given X.
(A0-4) The missingness indicator $\xi$ and the censoring indicator $\delta$ are independent given $(X, Y)$.
The unknown function to be estimated is the conditional hazard rate of the random variable T
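For reference, the conditional hazard rate of T given X = x is commonly defined as the ratio of the conditional density of T to its conditional survival function. In the display below, the symbols $\lambda$, $f_{T|X}$ and $F_{T|X}$ are notation assumed for illustration; the paper's own notation may differ.

```latex
\lambda(x, y) \;=\; \frac{f_{T \mid X}(y \mid x)}{1 - F_{T \mid X}(y \mid x)},
\qquad \text{for } y \text{ such that } F_{T \mid X}(y \mid x) < 1,
```

where $f_{T|X}(\cdot \mid x)$ and $F_{T|X}(\cdot \mid x)$ denote the conditional density and the conditional distribution function of T given X = x.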
Main theorem
In order to state our Theorem 1, we have to define the integral norm with respect to , where is the density of the bivariate vector , that is and the associated empirical norm
Theorem 1 Let be the estimator defined by (6), (7), (8), (9). Under Assumptions (A1)–(A4), and if for basis (1) and for basis (2), there exists a constant such that, for n large enough
Simulations
To evaluate the finite sample performance of our different proposals for hazard rate estimation, we carried out Monte Carlo studies in different settings. We study the (possibly conditional) hazard rate estimators with or without covariate, and under both missingness schemes (MAR and MCAR) for the censoring indicators.
Proof of Theorem 1
Note that the two bases we use satisfy the following property. For all h in where . Moreover, we recall that all Sm's are subsets of a nesting space belonging to the collection denoted by with dimension .
Let denote the restriction of the unknown function to A. For any , we have where is the centered empirical process defined by
References (20)
- Estimating a survival function with incomplete cause-of-death data. Journal of Multivariate Analysis (1991)
- Survival analysis for the missing censoring indicator model using kernel density estimation techniques. Statistical Methodology (2006)
- The multiple imputations based Kaplan–Meier estimator. Statistics and Probability Letters (2009)
- Multiple imputations and the missing censoring indicator model. Journal of Multivariate Analysis (2011)
- et al., Estimation and confidence bands of a conditional survival function with censoring indicators missing at random. Journal of Multivariate Analysis (2008)
- et al., Probability density estimation for survival data with censoring indicators missing at random. Journal of Multivariate Analysis (2009)
- et al., Adaptive estimation in an autoregression and a geometrical regression framework. The Annals of Statistics (2001)
- et al., Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli (1998)
- et al., Adaptive estimation of the conditional intensity of marker-dependent counting processes. Annales de l'Institut Henri Poincaré. Probabilités et Statistique (2011)
- On semiparametric random censorship models. Journal of Statistical Planning and Inference (1998)
☆ This work is supported by the French Agence Nationale de la Recherche, ANR Grant “Prognostic” ANR-09-JCJC-0101-01.