Nonparametric estimation for survival data with censoring indicators missing at random

https://doi.org/10.1016/j.jspi.2013.04.010

Highlights

  • We deal with the problem of survival data with censoring indicators missing at random (MAR or MCAR settings).

  • We propose nonparametric adaptive strategies based on model selection methods for hazard rate estimation.

  • Theoretical risk bounds are provided in the Mean Square Integrated Error (MISE) sense.

  • Simulation experiments illustrate the statistical procedure in both MAR and MCAR settings, and with or without a covariate.

Abstract

In this paper, we consider the problem of hazard rate estimation in the presence of covariates, for survival data with censoring indicators missing at random. We propose, in the context usually denoted by MAR (missing at random, in opposition to MCAR, missing completely at random, which requires an additional independence assumption), nonparametric adaptive strategies based on model selection methods for estimators admitting finite dimensional developments in functional orthonormal bases. Theoretical risk bounds are provided; they prove that the estimators behave well in terms of mean square integrated error (MISE). Simulation experiments illustrate the statistical procedure.

Introduction

We consider the problem of estimation from right-censored data in the presence of covariates, when the censoring indicator is missing. Let $T$ be a random variable representing the time to death from the cause of interest. Let $C$ denote a right-censoring random time. Under usual random censorship, the observation is $Y = T \wedge C$ and $\delta = I(T \le C)$. Let $X$ denote a real covariate. In what follows, it is assumed that $T$, $C$ and $X$ admit densities respectively denoted by $f_T$, $g$ and $f_X$. In addition, $C$ is assumed to be independent of $T$ conditionally on $X$, see e.g. Comte et al. (2011) for comments on this assumption.

When the cause of death is not recorded, the censoring indicator is missing: this is the missing censoring indicator (MCI) model, see Dinse (1986) and Subramanian (2006), which is defined as follows. Let ξ be the missingness indicator, that is ξ=1 if δ is observed and ξ=0 otherwise. The observed data are then given for individual $i \in \{1, \dots, n\}$: $(Y_i, X_i, \delta_i, \xi_i = 1)$ or $(Y_i, X_i, \xi_i = 0)$. We shall say that the model is:

  • MCAR under the assumption that the indicator is Missing Completely At Random, i.e. ξ is independent of T, C and X (see e.g. McKeague and Subramanian, 1998).

  • MAR under the assumption that the indicator is Missing At Random, i.e. ξ and δ are independent conditionally on (Y, X).
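The two missingness schemes can be made concrete with a small simulation. The sketch below is only an illustration, not the design used in the paper: the function name `simulate_mci`, the exponential laws for T and C, the uniform covariate, and the logistic form of the observation probability are all assumptions chosen for simplicity. Under MCAR the probability of observing δ is a constant; under MAR it depends on the observed pair (X, Y) but never on δ itself.

```python
import math
import random


def simulate_mci(n, scheme="MAR", seed=0):
    """Simulate n observations (Y, X, delta, xi) from a toy MCI model.

    Illustrative assumptions (not taken from the paper):
    X ~ Uniform(0, 1); given X = x, T is exponential with rate 1 + x;
    C is exponential with rate 0.5 and independent of T given X.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        t = rng.expovariate(1.0 + x)   # survival time
        c = rng.expovariate(0.5)       # right-censoring time
        y = min(t, c)                  # observed time Y = min(T, C)
        delta = 1 if t <= c else 0     # censoring indicator I(T <= C)
        if scheme == "MCAR":
            # xi is independent of (T, C, X): constant observation probability
            p_obs = 0.7
        else:
            # MAR: the probability depends on the observed (X, Y) only,
            # so xi and delta are independent given (X, Y)
            p_obs = 1.0 / (1.0 + math.exp(-(0.5 + x - y)))
        xi = 1 if rng.random() < p_obs else 0
        # delta is recorded only when xi = 1; otherwise it is missing (None)
        data.append((y, x, delta if xi == 1 else None, xi))
    return data
```

Each returned tuple mirrors the observation scheme described above: the indicator δ appears only for individuals with ξ=1.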

In this paper, we mainly concentrate on the MAR model. The MCAR model will be considered in Section 2.2.

This model has been studied by several authors in the last decade. Most papers are interested in survival function and cumulative hazard rate estimation. In particular, van der Laan and McKeague (1998) improve Lo's (1991) paper and build a sieved nonparametric maximum likelihood estimator of the survival function in the MAR case. Their estimator is a generalization of the Kaplan and Meier (1958) estimator to this context and is the first proposal reaching the efficiency bound. Subramanian (2004) also proposes an efficient estimator of the survival function in the MAR case; he proves his estimate to be efficient as well. Gijbels et al. (2007) study semi-parametric and nonparametric Cox regression analysis in several contexts.

Kernel methods have also been used to build different estimators in the MAR context. Subramanian (2006) estimates the cumulative hazard rate with a ratio of kernel estimators. He provides an almost sure representation, and a Central Limit Theorem (CLT). He deduces results of the same type for the survival function. A study in a similar context is also provided by Wang and Ng (2008). Recently, Wang et al. (2009) proposed a density estimator based on kernels and Kaplan–Meier-type corrections of censoring. They prove a CLT and suggest a bandwidth selection strategy. Extensions of these works to conditional functions (both cumulative hazard and survival functions) in the presence of covariates are developed in Wang and Shen (2008).

Note that several authors (Dikta, 1998, and more recently Subramanian, 2009, Subramanian, 2011) study semiparametric models for the missing process and imputation methods, but only for the Kaplan–Meier estimator, while we remain in a purely nonparametric setting.

Both our method and our aim are rather different. We indeed consider the estimation of the conditional hazard rate given a covariate. Moreover, we provide a nonparametric mean square strategy by considering approximations of the target function on finite dimensional linear spaces spanned by convenient and simple orthonormal (functional) bases. A collection of estimators is thus defined, indexed by the dimension of the multidimensional projection space, and a penalization device allows us to select a “good” space among all the proposals.
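The mechanism of "one estimator per projection space, then a penalty to choose the space" can be illustrated on a much simpler problem than the one treated here. The sketch below is not the authors' contrast or penalty: it applies the classical penalized projection idea to plain density estimation on [0, 1) with a histogram basis. The contrast is γ_n(t) = ‖t‖² − (2/n) Σ t(X_i), its minimizer over the D-dimensional space has coefficients â_j = (1/n) Σ φ_j(X_i), and the selected dimension minimizes −Σ â_j² + κ D / n. The function names and the value κ = 2 are hypothetical choices made for the example.

```python
import math


def hist_coefficients(xs, dim):
    """Projection coefficients on the orthonormal histogram basis
    phi_j = sqrt(dim) * 1{[j/dim, (j+1)/dim)}, j = 0, ..., dim - 1."""
    n = len(xs)
    counts = [0] * dim
    for x in xs:
        # clamp so that x == 1.0 falls in the last bin
        counts[min(int(x * dim), dim - 1)] += 1
    return [math.sqrt(dim) * c / n for c in counts]


def select_dimension(xs, dims, kappa=2.0):
    """Return the dimension minimizing the penalized contrast
    -sum_j a_j^2 + kappa * dim / n (kappa is an illustrative constant)."""
    n = len(xs)
    best_dim, best_crit = None, float("inf")
    for dim in dims:
        coefs = hist_coefficients(xs, dim)
        crit = -sum(a * a for a in coefs) + kappa * dim / n
        if crit < best_crit:
            best_dim, best_crit = dim, crit
    return best_dim


def hist_density(xs, dim):
    """Heights of the projection estimator on each of the dim bins."""
    return [math.sqrt(dim) * a for a in hist_coefficients(xs, dim)]
```

A larger dimension always lowers the contrast term, so without the penalty the largest space would always win; the term κ D / n is what balances fit against variance.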

Contrary to standard kernel methodology, our estimator has the advantage of being defined as a contrast minimizer and not a ratio of two estimators, see Wang and Shen (2008) and Subramanian (2006). It depends on an unknown function, in its definition, which has to be replaced by an estimator; this step is shared by the kernel approach. However, our precise study of the plug-in estimator allows us to non-asymptotically control the mean square risk. From an asymptotic point of view, we provide anisotropic rates corresponding to the regularity of the function under estimation, plus the rate of the intermediate plug-in estimator.

The plan of the paper is the following. We first explain in Section 2 how the contrast is built, and how it allows us to compute a collection of estimators. We conclude the section by giving the penalization device that completes the definition of the data driven estimator, up to an estimator to be plugged in the procedure. In Section 3, we state the theoretical results that ensure that the quadratic risk of our estimator behaves well provided that the intermediate estimator has small risk. Then, we show how similar tools can be used to build, compute and control the second estimator. The procedure is tested in a simulation (Section 4) for both hazard and conditional hazard rates (i.e. with or without covariate) and under different missing schemes. Technical proofs are gathered in.

Section snippets

Choice of the contrast

We consider the general MAR case as described in the Introduction; the global assumption is denoted by (A0) and has several parts that we specify below.

(A0-1) The random vectors $(X_i, T_i, C_i)$ are independent copies, for $i = 1, \dots, n$, of $(X, T, C)$.

(A0-2) For $i = 1, \dots, n$, we observe $X_i$, $Y_i = T_i \wedge C_i$, $\xi_i$, and $\delta_i = I(T_i \le C_i)$ if $\xi_i = 1$, otherwise $\xi_i = 0$.

(A0-3) C is independent of T given X.

(A0-4) ξ and δ are independent given X,Y.

The unknown function λ to be estimated is the conditional hazard rate of the random variable T

Main theorem

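In order to state our Theorem 1, we have to define the integral norm with respect to $d\varrho(x,y) = f_{(X,Y)}(x,y)\,dx\,dy$, where $f_{(X,Y)}$ is the density of the bivariate vector $(X,Y)$, that is
$$\|\psi\|_{\varrho}^2 = \int \psi^2(x,y)\,d\varrho(x,y) = \int \psi^2(x,y)\, f_{(X,Y)}(x,y)\,dx\,dy,$$
and the associated empirical norm
$$\|\psi\|_{\varrho,n}^2 = \frac{1}{n}\sum_{i=1}^{n} \psi^2(X_i, Y_i).$$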
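As a small numerical illustration, the empirical norm is simply the average of ψ² over the observed pairs (X_i, Y_i). The helper below is a minimal sketch; the name `empirical_sq_norm` and the toy inputs are hypothetical and not part of the paper.

```python
def empirical_sq_norm(psi, xs, ys):
    """Squared empirical norm: (1/n) * sum_i psi(X_i, Y_i)^2."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    return sum(psi(x, y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy usage: psi(x, y) = x + y on three observations gives (1 + 4 + 9) / 3
value = empirical_sq_norm(lambda x, y: x + y, [0.0, 1.0, 2.0], [1.0, 1.0, 1.0])
```

By the law of large numbers, this average approaches the integral norm with respect to the density of (X, Y) as n grows, which is why the two norms are used side by side.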

Theorem 1

Let $\hat\lambda_{\hat m}$ be the estimator defined by (6), (7), (8), (9). Under Assumptions (A1)–(A4), and if $D_n^2 \le n/\log^2(n)$ for basis (1) and $D_n \le n/\log^2(n)$ for basis (2), there exists a constant $\kappa$ such that, for $n$ large enough, $\mathbb{E}(\|\lambda I_A - \hat\lambda_{\hat m}\|_n^2 \cdots$

Simulations

To evaluate the finite sample performance of our different proposals for hazard rate estimation, we carried out Monte Carlo studies in different settings. We study the (possibly conditional) hazard rate estimators with or without a covariate, and under both dependence settings (MAR and MCAR) for the missingness of the censoring indicators.

Proof of Theorem 1

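Note that the two bases we use satisfy the following property. For all $h$ in $S_{m_1}^{(1)} \otimes S_{m_2}^{(2)}$,
$$\|h\|_\infty = \sup_{(x,y) \in A_1 \times [0,\tau]} |h(x,y)| \le \sqrt{D_{m_1}^{(1)} D_{m_2}^{(2)}}\, \|h\|,$$
where $\|h\|^2 = \int_A h^2$. Moreover, we recall that all $S_m$'s are subsets of a nesting space belonging to the collection denoted by $\mathcal{S}_n$ with dimension $D_n$.

Let $\lambda_A$ denote the restriction of the unknown function $\lambda$ to $A$. For any $h, h_2 \in (L^2 \cap L^\infty)(A)$, we have
$$\Gamma_n(h) - \Gamma_n(h_2) = \|h - \lambda_A\|_n^2 - \|h_2 - \lambda_A\|_n^2 - 2\nu_n(h - h_2) - 2R_n(h - h_2),$$
where $\nu_n$ is the centered empirical process defined by $\nu_n(h) = \frac{1}{n}\sum_{i=1}^{n} ((\delta_i \xi_i + (1 \cdots$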

This work is supported by the French Agence Nationale de la Recherche, ANR Grant “Prognostic” ANR-09-JCJC-0101-01.