Abstract
We propose a two-stage outcome-dependent sampling design and inference procedure for studies that concern interval-censored failure time outcomes. This design enhances the study efficiency by allowing the selection probabilities of the second-stage sample, for which the expensive exposure variable is ascertained, to depend on the first-stage observed interval-censored failure time outcomes. In particular, the second-stage sample is enriched by selectively including subjects who are known or observed to experience the failure at an early or late time. We develop a sieve semiparametric maximum pseudo likelihood procedure that makes use of all available data from the proposed two-stage design. The resulting regression parameter estimator is shown to be consistent and asymptotically normal, and a consistent estimator for its asymptotic variance is derived. Simulation results demonstrate that the proposed design and inference procedure perform well in practical situations and are more efficient than existing designs and methods. An application to a phase 3 HIV vaccine trial is provided.
References
Bickel PJ, Klaassen CA, Ritov Y, Wellner JA (1993) Efficient and adaptive estimation for semiparametric models. Johns Hopkins University Press, Baltimore
Chatterjee N, Chen Y-H, Breslow NE (2003) A pseudoscore estimator for regression problems with two-phase sampling. J Am Stat Assoc 98(461):158–168
Chen D-G, Sun J, Peace KE (2012) Interval-censored time-to-event data: methods and applications. CRC Press, Boca Raton
Chen K, Lo S-H (1999) Case-cohort and case-control analysis with Cox’s model. Biometrika 86(4):755–764
Cornfield J (1951) A method of estimating comparative rates from clinical data. Applications to cancer of the lung, breast, and cervix. J Nat Cancer Inst 11(6):1269–1275
Ding J, Zhou H, Liu Y, Cai J, Longnecker MP (2014) Estimating effect of environmental contaminants on women’s subfecundity for the MoBa study data with an outcome-dependent sampling scheme. Biostatistics 15(4):636–650
Ding J, Lu T-S, Cai J, Zhou H (2017) Recent progresses in outcome-dependent sampling with failure time data. Lifetime Data Anal 23(1):57–82
Gilbert PB, Peterson ML, Follmann D, Hudgens MG, Francis DP, Gurwith M, Heyward WL, Jobes DV, Popovic V, Self SG, Sinangil F, Burke D, Berman PW (2005) Correlation between immunologic responses to a recombinant glycoprotein 120 vaccine and incidence of HIV-1 infection in a phase 3 HIV-1 preventive vaccine trial. J Infect Dis 191(5):666–677
Harro CD, Judson FN, Gorse GJ, Mayer KH, Kostman JR, Brown SJ, Koblin B, Marmor M, Bartholow BN, Popovic V et al (2004) Recruitment and baseline epidemiologic profile of participants in the first phase 3 HIV vaccine efficacy trial. J Acquir Immune Defic Syndr 37(3):1385–1392
Huang J, Rossini A (1997) Sieve estimation for the proportional-odds failure-time regression model with interval censoring. J Am Stat Assoc 92(439):960–967
Huang J, Wellner JA (1997) Interval censored survival data: a review of recent progress. In: Proceedings of the first Seattle symposium in biostatistics, pp 123–169. Springer
Huang J, Zhang Y, Hua L (2012) Consistent variance estimation in semiparametric models with application to interval-censored data. In: Chen DG, Sun J, Peace KE (eds) Interval-censored time-to-event data: methods and applications, pp 233–268
Kang S, Cai J (2009) Marginal hazards model for case-cohort studies with multiple disease outcomes. Biometrika 96(4):887–901
Kulich M, Lin D (2004) Improving the efficiency of relative-risk estimation in case-cohort studies. J Am Stat Assoc 99(467):832–844
Li Z, Nan B (2011) Relative risk regression for current status data in case-cohort studies. Canad J Stat 39(4):557–577
Li Z, Gilbert P, Nan B (2008) Weighted likelihood method for grouped survival data in case-cohort studies with application to HIV vaccine trials. Biometrics 64(4):1247–1255
Lorentz GG (1986) Bernstein polynomials. Chelsea Publishing Co, New York
Prentice RL (1986) A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika 73(1):1–11
Self SG, Prentice RL (1988) Asymptotic distribution theory and efficiency results for case-cohort studies. Ann Stat 16(1):64–81
Shen X, Wong WH (1994) Convergence rate of sieve estimates. Ann Stat 22(2):580–615
Song R, Zhou H, Kosorok MR (2009) A note on semiparametric efficient inference for two-stage outcome-dependent sampling with a continuous outcome. Biometrika 96(1):221–228
Sun J (2006) The statistical analysis of interval-censored failure time data. Springer, New York
Sun Y, Qian X, Shou Q, Gilbert PB (2017) Analysis of two-phase sampling data with semiparametric additive hazards models. Lifetime Data Anal 23(3):377–399
van der Vaart AW, Wellner JA (1996) Weak convergence and empirical processes: with applications to statistics. Springer, New York
Weaver MA, Zhou H (2005) An estimated likelihood method for continuous outcome regression models with outcome-dependent sampling. J Am Stat Assoc 100(470):459–469
Whittemore AS (1997) Multistage sampling designs and estimating equations. J R Stat Soc B 59(3):589–602
Xue H, Lam K, Li G (2004) Sieve maximum likelihood estimator for semiparametric regression models with current status data. J Am Stat Assoc 99(466):346–356
Yu J, Liu Y, Sandler DP, Zhou H (2015) Statistical inference for the additive hazards model under outcome-dependent sampling. Canad J Stat 43(3):436–453
Zeng D, Lin DY (2014) Efficient estimation of semiparametric transformation models for two-phase cohort studies. J Am Stat Assoc 109(505):371–383
Zhang Y, Hua L, Huang J (2010) A spline-based semiparametric maximum likelihood estimation method for the Cox model with interval-censored data. Scand J Stat 37(2):338–354
Zhou H, Weaver M, Qin J, Longnecker M, Wang M (2002) A semiparametric empirical likelihood method for data from an outcome-dependent sampling scheme with a continuous outcome. Biometrics 58(2):413–421
Zhou H, Song R, Wu Y, Qin J (2011) Statistical inference for a two-stage outcome-dependent sampling design with a continuous outcome. Biometrics 67(1):194–202
Zhou Q, Zhou H, Cai J (2017a) Case-cohort studies with interval-censored failure time data. Biometrika 104(1):17–29
Zhou Q, Hu T, Sun J (2017b) A sieve semiparametric maximum likelihood approach for regression analysis of bivariate interval-censored failure time data. J Am Stat Assoc 112(518):664–672
Zhou Q, Cai J, Zhou H (2018) Outcome-dependent sampling with interval-censored failure time data. Biometrics 74(1):58–67
Acknowledgements
The authors thank the Editor, Associate Editor and reviewers for their helpful comments and suggestions that have improved the paper. The authors also thank the Global Solutions in Infectious Diseases (GSID) and Dr. Peter Gilbert for providing data from the phase 3 HIV vaccine trial VAX004. This research was partially supported by grants from the National Institutes of Health (R01ES021900, P01CA142538 and P30ES010126). Qingning Zhou’s work was supported, in part, by funds provided by the University of North Carolina at Charlotte.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
The supplementary materials include the two lemmas and their proofs as well as some additional simulation results for a smaller cohort size N=1000.
Appendix: Proofs of Theorems 1 and 2
In the appendix, we sketch the proofs of Theorems 1 and 2. Let \(O\,=\,\left\{ Y=\{U,\,V,\,\Delta _1=I(T\le U),\,\Delta _2=I(U<T\le V)\},\, RZ,\, R\right\} \) denote a single observation, where U and V are two random examination times, Z is the p-dimensional covariate vector and R is the indicator of an observation being in the validation sample. The following regularity conditions are needed for proving the theorems:
- (C1) There exists \(\eta >0\) such that \(P(V-U\ge \eta )=1\). The union of the supports of U and V is contained in the interval \([\sigma ,\tau ]\), where \(0<\sigma<\tau <+\infty \).
- (C2) The distribution of Z, denoted by \(G_Z(z)\), has a bounded support and is not concentrated on any proper subspace of \(R^p\).
- (C3) For \(r=1\) or 2, the function \(\Lambda _0(t)\in \mathcal {M}\) is continuously differentiable up to order r in \([\sigma ,\tau ]\) with the first derivative being strictly positive, and satisfies \(\alpha ^{-1}<\Lambda _0(\sigma )<\Lambda _0(\tau )<\alpha \) for some positive constant \(\alpha \). Also \(\beta _0\) is an interior point of \(\mathcal {B}\), a compact subset of \(R^p\). \(\mathcal {M}\) and \(\mathcal {B}\) are defined in Sect. 3.
- (C4) The conditional density g(u, v|z) of (u, v) given z has bounded partial derivatives with respect to u and v, and the bounds of these partial derivatives do not depend on (u, v, z).
- (C5) \(E\{\mathrm{var}(Z|U)\}\) and \(E\{\mathrm{var}(Z|V)\}\) are positive definite.
These conditions are commonly used in studies of interval-censored data (e.g. Huang and Wellner 1997; Huang and Rossini 1997; Zhang et al. 2010). In addition, as in Zeng and Lin (2014), one can show that the conclusions of Theorems 1 and 2 hold under the proposed sampling scheme if they hold under independent sampling. Specifically, using Le Cam’s third lemma, Zeng and Lin (2014) established in their Appendix the following result for general two-phase cohort studies: the consistency and asymptotic normality of the MLE hold under any sampling mechanism satisfying their condition (C.6) if they hold under independent sampling. It is easy to verify that our two-stage ODS scheme satisfies condition (C.6) in Zeng and Lin (2014) with
where \(y=\{u,v,\delta _1,\delta _2\}\) is the outcome, \(\{A_1,A_2,A_3,A_4\}\) are the four strata defined based on the outcome given in (1), \(p_0\) is the sampling fraction of the SRS component, and \(p_1\) and \(p_2\) are the sampling fractions of the supplemental components from the two tail strata \(A_1\) and \(A_2\), respectively. Thus, we assume in the following that the observations \(\{O_i,\,i=1,\ldots ,N\}\) are independent and identically distributed.
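To make the roles of \(p_0\), \(p_1\) and \(p_2\) concrete, the following Python sketch draws the validation indicator R for a cohort. It is only an illustration under stated assumptions: each subject is selected independently (Bernoulli sampling), the stratum labels 1 to 4 have already been assigned according to (1), and the supplemental draw is applied to tail-stratum subjects not already in the SRS component. The function name `draw_validation_indicator` and these details are hypothetical and are not taken from the paper.

```python
import numpy as np

def draw_validation_indicator(stratum, p0, p1, p2, rng):
    """Draw R_i for each subject (hypothetical helper, Bernoulli sampling).

    stratum : integer array with values 1..4; 1 and 2 mark the tail strata.
    p0      : sampling fraction of the SRS component.
    p1, p2  : supplemental sampling fractions for strata A_1 and A_2.
    """
    stratum = np.asarray(stratum)
    n = stratum.shape[0]
    # SRS component: every subject enters with probability p0.
    in_srs = rng.random(n) < p0
    # Supplemental component (assumption): tail-stratum subjects missed by
    # the SRS are selected with probability p1 or p2; others are never added.
    p_supp = np.where(stratum == 1, p1, np.where(stratum == 2, p2, 0.0))
    in_supp = (~in_srs) & (rng.random(n) < p_supp)
    return (in_srs | in_supp).astype(int)

rng = np.random.default_rng(0)
stratum = rng.integers(1, 5, size=200_000)  # toy labels, equally likely
R = draw_validation_indicator(stratum, p0=0.1, p1=0.3, p2=0.2, rng=rng)
rates = {k: R[stratum == k].mean() for k in (1, 2, 3, 4)}
```

Under these assumptions the inclusion probability is \(p_0+(1-p_0)p_k\) in tail stratum \(A_k\) and \(p_0\) elsewhere, so the empirical `rates` should be close to 0.37, 0.28, 0.10 and 0.10.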
Note that the proofs of our theorems differ from those of Zhou et al. (2017b, 2018) in several respects. First, our likelihood function is not exact, since an estimate of the covariate distribution rather than the true one is used. Thus, we have to deal with the difference between our approximate likelihood and the exact likelihood that assumes the covariate distribution to be known. For establishing consistency and rate of convergence, the proofs of our theorems follow ideas similar to those in Zhou et al. (2017b, 2018), except that we need to additionally establish the closeness of the approximate and exact likelihoods. For establishing asymptotic normality and deriving the asymptotic covariance matrix, our approach is quite different from those in Zhou et al. (2017b, 2018), since we have to account for the additional variability induced by the estimated covariate distribution.
Before proving Theorems 1 and 2, we first define the class of functions \(\mathcal {L}_N=\{l(\theta ,O): \theta \in \Theta _N\}\), where \(l(\theta ,O)\) is the log-likelihood function based on a single observation O given by
where the covariate distribution \(G_Z(z)\) is assumed to be known and
with \(S(t|Z)=\exp (-\Lambda (t)e^{\beta 'Z})\). Let \(P_N\) denote the empirical measure. For any \(\epsilon >0\), we define the covering number \(N(\epsilon ,\mathcal {L}_N,L_1(P_N))\) as the smallest value of \(\kappa \) for which there exists \(\{\theta ^{(1)},\ldots ,\theta ^{(\kappa )}\}\subset \Theta _N\) such that
for all \(\theta \in \Theta _N\). If no such \(\kappa \) exists, define \(N(\epsilon ,{\mathcal {L}}_N,L_1(P_N))=\infty \).
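As a rough illustration of what \(l(\theta ,O)\) computes, the Python sketch below evaluates the contribution of one observation under \(S(t|Z)=\exp (-\Lambda (t)e^{\beta 'Z})\). It rests on assumptions that are not spelled out in this appendix: a validation subject (R=1) is taken to contribute the conditional likelihood of Y given Z, a non-validation subject (R=0) is taken to contribute that likelihood averaged over \(G_Z\), and \(G_Z\) is represented by a finite set of support points with weights. All function names are hypothetical.

```python
import numpy as np

def surv(cumhaz, beta, z):
    """S(t|z) = exp(-Lambda(t) * exp(beta'z)); cumhaz is the value Lambda(t)."""
    return np.exp(-cumhaz * np.exp(np.dot(beta, z)))

def cond_lik(lam_u, lam_v, d1, d2, beta, z):
    """Likelihood of one interval-censored outcome given covariate z.

    lam_u, lam_v : Lambda(U) and Lambda(V), with U < V.
    d1 = I(T <= U), d2 = I(U < T <= V); d1 = d2 = 0 means T > V.
    """
    su, sv = surv(lam_u, beta, z), surv(lam_v, beta, z)
    return (1.0 - su) ** d1 * (su - sv) ** d2 * sv ** (1 - d1 - d2)

def loglik_one(lam_u, lam_v, d1, d2, r, z, beta, z_support, z_weights):
    """Assumed form: R*log f(Y|Z) + (1-R)*log of the G_Z-average of f(Y|z)."""
    if r == 1:
        return np.log(cond_lik(lam_u, lam_v, d1, d2, beta, z))
    # Z is unobserved: average over the (assumed discrete) covariate law.
    mix = sum(w * cond_lik(lam_u, lam_v, d1, d2, beta, zk)
              for zk, w in zip(z_support, z_weights))
    return np.log(mix)

beta = np.array([0.5, -0.2])
z_support = [np.array([0.0, 0.0]), np.array([1.0, 0.3])]
z_weights = [0.6, 0.4]
# A validation subject with U < T <= V, then a non-validation subject with T > V.
l_val = loglik_one(0.4, 0.9, 0, 1, 1, z_support[1], beta, z_support, z_weights)
l_non = loglik_one(0.4, 0.9, 0, 0, 0, None, beta, z_support, z_weights)
```

In the pseudo likelihood discussed in this appendix, the known \(G_Z\) is replaced by an estimate; in this sketch that would amount to passing estimated support points and weights.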
Proof of Theorem 1
We now prove the strong consistency of \(\hat{\theta }_N\). Based on Lemma 1 in the Supplementary Materials, the covering number of \(\mathcal {L}_N\) satisfies
Then by Lemma 2 in the Supplementary Materials, we have
Furthermore, define
Then it is easy to show that
Let \(M(\theta , O)=-l(\theta , O)\) and \(P_N\hat{M}(\theta , O)=-P_N\hat{l}(\theta , O)\). Define \(K_\epsilon =\{\theta : d(\theta , \theta _0) \ge \epsilon , \theta \in \Theta _N\}\) for \(\epsilon > 0\) and
Then we obtain
If \(\hat{\theta }_N\in K_\epsilon ,\) we have
Define \(\delta _\epsilon =\inf _{K_\epsilon }P M(\theta , O)-PM(\theta _0,O)\). Then under Conditions (C1)–(C5), using the same arguments as those in Zhang et al. (2010, p. 352), we can prove \(\delta _\epsilon >0\). It follows from (A.3) and (A.4) that
with \(\zeta _N=\zeta _{1N}+\zeta _{2N}+\zeta _{3N},\) and hence \(\zeta _N \ge \delta _\epsilon .\) This gives \(\{\hat{\theta }_N \in K_{\epsilon } \}\subseteq \{\zeta _N \ge \delta _{\epsilon }\}\), and by (A.1), (A.2) and the strong law of large numbers, we have \(\zeta _{N}\rightarrow 0\) almost surely. Therefore, \(\cup _{k=1}^{\infty }\cap _{N=k}^{\infty }\{\hat{\theta }_N \in K_{\epsilon } \} \subseteq \cup _{k=1}^{\infty }\cap _{N=k}^{\infty }\{\zeta _N \ge \delta _{\epsilon }\}\), which proves that \(d(\hat{\theta }_N,\theta _0)\rightarrow 0\) almost surely.
We now derive the rate of convergence by using Theorem 3.4.1 of van der Vaart and Wellner (1996). Below let \(\tilde{K}\) denote a universal positive constant that may differ from place to place. First note from Theorem 1.6.2 of Lorentz (1986) that there exists a Bernstein polynomial \(\Lambda _{N0}\) such that \(\Vert \Lambda _{N0}-\Lambda _{0}\Vert _{\infty } = O(m^{-r/2})=O(N^{-r\nu /2}).\) Define \(\theta _{N0}=(\beta _0,\Lambda _{N0})\); then \(d(\theta _{N0},\theta _0)=O(N^{-r\nu /2})\). For any \(\eta >0,\) define the class of functions \({\mathcal {F}}_{\eta }=\{l(\theta ,O)-l(\theta _{N0},O): \theta \in \Theta _N,\, \eta /2 < d(\theta ,\theta _{N0})\le \eta \}.\) One can easily show that \(P(l(\theta _0,O)-l(\theta _{N0},O))\le \tilde{K}d(\theta _0,\theta _{N0})\le \tilde{K}N^{-r\nu /2}.\) Also under Conditions (C1)–(C5), using the same arguments as those in Zhang et al. (2010, p. 352), we obtain \(P(l(\theta _0,O)-l(\theta ,O))\ge \tilde{K} d^2(\theta _0,\theta )\). Therefore, for large N, we have \(P(l(\theta ,O)-l(\theta _{N0},O))=P(l(\theta ,O)-l(\theta _0,O))+P(l(\theta _0,O)-l(\theta _{N0},O))\le -\tilde{K}\eta ^2+\tilde{K}N^{-r \nu /2}=-\tilde{K}\eta ^2,\) for any \(l(\theta ,O)-l(\theta _{N0},O)\in {\mathcal {F}}_{\eta }.\)
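The approximation step can be illustrated numerically. The Python sketch below evaluates a Bernstein polynomial of degree m on \([\sigma ,\tau ]\) and measures its sup-norm distance from a toy cumulative hazard. The coefficient choice (values of \(\Lambda _0\) at equally spaced points), the toy function \(\Lambda _0(t)=t^2\), and the function names are assumptions made for illustration; the sieve space \(\Theta _N\) itself is defined in Sect. 3 rather than in this appendix.

```python
import numpy as np
from math import comb

def bernstein_cumhaz(t, coef, sigma, tau):
    """Evaluate sum_k coef[k] * B_k(t; m, sigma, tau), with m = len(coef) - 1."""
    coef = np.asarray(coef, dtype=float)
    m = coef.size - 1
    x = (np.asarray(t, dtype=float) - sigma) / (tau - sigma)  # rescale to [0, 1]
    basis = np.stack(
        [comb(m, k) * x**k * (1.0 - x) ** (m - k) for k in range(m + 1)],
        axis=-1,
    )
    return basis @ coef

def sup_error(m, sigma=1.0, tau=3.0):
    """Sup-norm error on a grid for the toy choice Lambda_0(t) = t^2."""
    knots = sigma + (tau - sigma) * np.arange(m + 1) / m
    coef = knots**2  # classical Bernstein coefficients: Lambda_0 at the knots
    grid = np.linspace(sigma, tau, 201)
    return np.max(np.abs(bernstein_cumhaz(grid, coef, sigma, tau) - grid**2))

err10, err40 = sup_error(10), sup_error(40)
vals = bernstein_cumhaz(np.linspace(1.0, 3.0, 50), np.linspace(1.0, 3.0, 11) ** 2, 1.0, 3.0)
```

For this toy \(\Lambda _0\), the error equals \(4x(1-x)/m\) with \(x=(t-\sigma )/(\tau -\sigma )\), so the sup-norm error is \(1/m\), in line with the \(O(m^{-r/2})\) bound for \(r=2\). Nondecreasing coefficients also give a nondecreasing polynomial, which is why `vals` is monotone.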
Following the calculations in Shen and Wong (1994, p. 597), we have that for \(0<\varepsilon <\eta \), \(\log N_{[]}(\varepsilon ,{\mathcal {F}}_{\eta },L_2(P))\le \tilde{K} (m+1)\log (\eta /\varepsilon )\). Some algebraic manipulations give \(P(l(\theta ,O)-l(\theta _{N0},O))^2\le \tilde{K} \eta ^2\) for any \(l(\theta ,O)-l(\theta _{N0},O)\in {\mathcal {F}}_{\eta }.\) Also under Conditions (C1)–(C4), \({\mathcal {F}}_{\eta }\) is uniformly bounded. Then by Lemma 3.4.2 of van der Vaart and Wellner (1996), we obtain
with
Then we have
Define \(\phi _N(\eta )=(m+1)^{1/2}\eta +(m+1)N^{-1/2}\). It is obvious that \(\phi _N(\eta )/\eta \) is decreasing in \(\eta \). Let \(r_N=N^{\min \{(1-\nu )/2,\,r\nu /2\}}\); then \(r_N^2\phi _N(1/r_N)=r_N(N^\nu +1)^{1/2}+r_N^2(N^\nu +1)N^{-1/2}\le \tilde{K}N^{1/2}\).
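The bound on \(r_N^2\phi _N(1/r_N)\) can be checked term by term. The worked calculation below treats m as being of order \(N^\nu \), as in the expression above, and uses only \(r_N\le N^{(1-\nu )/2}\). The balancing choice of \(\nu \) at the end is added here for illustration and is not a statement from the theorem.

```latex
\begin{align*}
r_N (N^{\nu}+1)^{1/2} &\le \tilde{K}\, N^{(1-\nu)/2}\, N^{\nu/2} = \tilde{K} N^{1/2},\\
r_N^2 (N^{\nu}+1) N^{-1/2} &\le \tilde{K}\, N^{1-\nu}\, N^{\nu}\, N^{-1/2} = \tilde{K} N^{1/2}.
\end{align*}
The two exponents in $r_N$ balance when $(1-\nu)/2 = r\nu/2$, that is, $\nu = 1/(1+r)$, giving
\[
r_N = N^{r/\{2(1+r)\}}: \qquad r=1 \;\Rightarrow\; r_N = N^{1/4}, \qquad r=2 \;\Rightarrow\; r_N = N^{1/3}.
\]
```

For other values of \(\nu \), the slower of the two exponents determines the rate: the approximation error term \(r\nu /2\) for small \(\nu \) and the estimation error term \((1-\nu )/2\) for large \(\nu \).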
Note that \(d(\hat{\theta }_N,\theta _{N0})\le d(\hat{\theta }_N,\theta _0)+d(\theta _0,\theta _{N0})\rightarrow 0\) in probability. It then follows from Theorem 3.4.1 of van der Vaart and Wellner (1996) that \(r_N d(\hat{\theta }_N,\theta _{N0})=O_p(1)\). Furthermore, since \(d(\theta _{N0},\theta _0)=O(N^{-r\nu /2})\), we have \(r_N d(\hat{\theta }_N,\theta _0)\le r_N d(\hat{\theta }_N,\theta _{N0}) + r_N d(\theta _{N0},\theta _0) = O_p(1)\), which completes the proof.\(\square \)
Proof of Theorem 2
We now sketch the proof of the asymptotic normality of \(\hat{\beta }_N\). Let \(l_\beta (\theta ,O)\) denote the score for \(\beta \) given by
where
and
Consider a parametric smooth submodel with parameter \(\theta _{(s)}=(\beta ,\Lambda _{(s)})\), where \(\Lambda _{(0)}=\Lambda \) and
Let \(\mathcal {H}\subseteq L_2(P)\) denote the class of functions h defined by this equation. The score operator for \(\Lambda \) is
where
and
According to Bickel et al. (1993), the efficient score for \(\beta \) is
\[l^*(\theta ,O)=l_\beta (\theta ,O)-l_\Lambda (\theta ,O)[h_0],\]
where \(h_0\in \mathcal {H}^p\) minimizes \(P\Vert l_\beta (\theta ,O)-l_\Lambda (\theta ,O)[h]\Vert ^2\) and is called the least favorable direction. Then the information for \(\beta \) is
\[I(\beta )=P\{l^*(\theta ,O)\}^{\otimes 2},\]
where \(v^{\otimes 2} = v v' \) for a column vector \(v\in R^p\). Under Conditions (C1)–(C5), the existence of the least favorable direction and the positive definiteness of the information can be established in the same way as Theorem 4.1 in Huang and Wellner (1997).
Since \(\hat{\theta }_N\) maximizes \(P_N\hat{l}(\theta ,O)\), which is obtained by replacing \(G_Z\) in \(P_Nl(\theta ,O)\) with its consistent estimator \(\hat{G}_Z\), \(\hat{\theta }_N\) is the solution to the functional equation \(P_N\hat{l}^*(\theta ,O)=0\), where \(P_N\hat{l}^*(\theta ,O)\) is obtained by replacing \(G_Z\) in \(P_Nl^*(\theta ,O)\) with \(\hat{G}_Z\). First note that \(\hat{G}_Z\) is a \(\sqrt{N}\)-consistent estimator of \(G_Z\). Similarly to the proof of Theorem 2 in Zhang et al. (2010), one can establish that
Following the proof of Theorem 2 in Weaver and Zhou (2005), one can further establish that
in distribution with the asymptotic covariance matrix given by
Here \(I(\beta )\) is the information for \(\beta \) defined above, and
where
with
\(\square \)
Cite this article
Zhou, Q., Cai, J. & Zhou, H. Semiparametric inference for a two-stage outcome-dependent sampling design with interval-censored failure time data. Lifetime Data Anal 26, 85–108 (2020). https://doi.org/10.1007/s10985-019-09461-5