Flexible modeling of survival data with covariates subject to detection limits via multiple imputation

https://doi.org/10.1016/j.csda.2013.07.027

Abstract

Models for survival data generally assume that covariates are fully observed. However, in medical studies it is not uncommon for biomarkers to be censored at known detection limits. A computationally-efficient multiple imputation procedure for modeling survival data with covariates subject to detection limits is proposed. This procedure is developed in the context of an accelerated failure time model with a flexible seminonparametric error distribution. The consistency and asymptotic normality of the multiple imputation estimator are established and a consistent variance estimator is provided. An iterative version of the proposed multiple imputation algorithm that approximates the EM algorithm for maximum likelihood is also suggested. Simulation studies demonstrate that the proposed multiple imputation methods work well while alternative methods lead to estimates that are either biased or more variable. The proposed methods are applied to analyze the dataset from a recently-conducted GenIMS study.

Introduction

Biomedical datasets frequently contain variables that are subject to censoring. Censoring due to detection limits (DLs) often occurs in practice when medical instruments are unable to measure a biological factor below a certain known value. In the motivating Genetic and Inflammatory Markers of Sepsis (GenIMS) study, it is of interest to model the survival times of patients with community acquired pneumonia using several biomarkers and demographic covariates. In this study, the survival times are subject to right censoring while three important biomarkers are left-censored below known DLs. Our main objective in this paper is to develop a computationally-efficient procedure for conducting inference in the context of survival models with covariates subject to DLs.

Traditionally, a few common approaches have been taken to handle censoring due to DLs. Most simply, a complete case approach can be applied, where data from individuals with censored covariates are completely discarded. While we prove in Section 3.1 that a complete case analysis leads to consistent estimates in accelerated failure time models when covariates are censored due to DLs, efficiency is lost due to the deletion of data. In an attempt to use all of the data, a second approach has become increasingly common where censored observations are replaced with a fixed value such as the DL, DL/2, or DL/√2 (Hornung and Reed, 1990) or the conditional mean of the censored covariate (Austin and Hoch, 2004, Giovanini, 2008, Arunajadai and Rauh, 2012). Using these naive substitution methods has been demonstrated to produce biased parameter estimates in generalized linear models (Lynn, 2001, Austin and Brunner, 2003, Lubin et al., 2004, Rigobon and Stoker, 2007, Rigobon and Stoker, 2009, Helsel, 2012, Bernhardt et al., in press) and survival models (D’Angelo and Weissfeld, 2008, Sattar et al., 2012).
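To make the two traditional approaches concrete, the following is a minimal Python sketch of a complete case filter and of fixed-value substitution at the DL, DL/2, or DL/√2. The function names, the `rule` labels, and the toy concentrations are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def complete_case_mask(observed):
    """Return a row mask that is True only where every covariate was observed.

    `observed` is an (n, q) boolean array; False marks a value below its DL.
    """
    return np.asarray(observed, dtype=bool).all(axis=1)

def substitute_below_dl(x_obs, observed, dl, rule="dl_sqrt2"):
    """Replace censored entries with a fixed value derived from the DL.

    The rule names are illustrative labels for DL, DL/2, and DL/sqrt(2).
    """
    fill_values = {
        "dl": dl,
        "dl_half": dl / 2.0,
        "dl_sqrt2": dl / np.sqrt(2.0),
    }
    return np.where(observed, x_obs, fill_values[rule])

# Toy data: one biomarker on its original scale with a hypothetical DL of 4.0.
x_obs = np.array([7.2, 4.0, 12.5, 4.0])
observed = np.array([True, False, True, False])

filled = substitute_below_dl(x_obs, observed, dl=4.0, rule="dl_half")
keep = complete_case_mask(observed.reshape(-1, 1))
```

Here `filled` holds the substituted covariate and `keep` selects the rows a complete case analysis would retain. Both are shown only as baselines; the paragraph above explains why substitution tends to bias estimates and why complete case analysis sacrifices efficiency.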

A few researchers have considered alternative methods for handling censored covariates in survival models. Langohr et al. (2004) and Sattar et al. (2012) proposed fully-parametric survival models with a single interval-censored predictor. Lee et al. (2003) also considered survival models with a single covariate subject to DLs, though they proposed using semiparametric Cox proportional hazards models in which the relative risk function for the censored covariates is replaced by a nonparametric estimate of its expected value. More recently, D’Angelo and Weissfeld (2008) developed an indexing approach where censored covariate values are directly replaced by their conditional expectation given a linear combination of the fully observed covariates. While their method performs reasonably well, it is somewhat ad-hoc and limited to cases when no more than two covariates are subject to DLs.

In this paper, we develop a straightforward, computationally-efficient multiple imputation method for handling multiple covariates subject to DLs in the context of accelerated failure time (AFT) models for censored survival data. To increase flexibility in the AFT survival model, we recommend using the seminonparametric (SNP) distribution to model the error term. We establish the asymptotic consistency and normality of the multiple imputation estimator and propose a convenient variance estimation method. We additionally suggest an iterative version of this estimator which improves efficiency with only a few updates. Though multiple imputation has been studied in many missing-data problems, our development for censored covariates in flexible survival models is nonstandard. Through numerical studies, we demonstrate that our proposed estimator leads to unbiased estimates and is potentially more efficient than several competing methods. We additionally show that using the flexible SNP distribution is more robust than typical parametric methods.

The remainder of the paper is organized as follows. In Section 2, we review AFT models and the SNP distribution. We also briefly explain how to fit AFT models with an SNP error term. In Section 3, we develop the proposed multiple imputation methods and establish their asymptotic properties. In Section 4, we carry out extensive simulations to compare the performance of the proposed methods with several simpler approaches. In Section 5, we apply the proposed methods to the dataset from the GenIMS study. Finally, in Section 6, we discuss the limitations of the proposed multiple imputation methods and some avenues for further research. The technical details for the proposition and theorems appearing in this paper are provided in the online Supplementary material.

Section snippets

Seminonparametric accelerated failure time model

We first present the proposed seminonparametric accelerated failure time (SNP-AFT) model and discuss the algorithm for fitting the model when covariates are fully observed.

Problem set-up

For the remainder of this paper, we assume that some covariates in $W_i$ are subject to censoring due to lower DLs. Thus, for the $i$th individual, we let $W_i = (Z_i^T, X_i^T)^T$, where $Z_i = (Z_{i1}, \ldots, Z_{i(p-q)})^T$ is the $(p-q)$-dimensional vector of covariates fully observed for each individual and $X_i = (X_{i1}, \ldots, X_{iq})^T$ is the $q$-dimensional vector of covariates subject to censoring below $d = (d_1, \ldots, d_q)^T$, the vector of DLs. For $X_{ij}$, $j = 1, \ldots, q$, we only observe $X_{ij}^* = \max(X_{ij}, d_j)$ and $\rho_{ij} = I(X_{ij} \geq d_j)$, so that the complete set of observed

Simulation

We conducted numerical studies to assess the performance of the proposed multiple imputation and iterated multiple imputation estimators. We set up the simulations to represent a situation similar to that in the application described in Section 5. Specifically, we generated the covariates $Z \sim \mathrm{beta}(3, 2) \times 84 + 18$, to represent an “age” variable, and $(X_1, X_2)^T \sim N\left\{(1.4 + 0.0087Z,\ 2.4 - 0.0091Z)^T,\ \begin{pmatrix} 2.5 & 1 \\ 1 & 5 \end{pmatrix}\right\}$, to represent the log-cytokines TNF and IL-10. Observations of $X_1$ and $X_2$ were censored at the DLs $d_1 = \log(4)$
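The covariate-generating step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: it reads the mean vector as (1.4 + 0.0087Z, 2.4 − 0.0091Z) and the covariance matrix as [[2.5, 1], [1, 5]], it uses the convention that a value is recorded as max(X, d) with an indicator that equals 1 when X is at or above its DL, and it leaves X2 uncensored by default because the second DL is not given in the text available here. The function name, arguments, and seed are hypothetical.

```python
import numpy as np

def simulate_covariates(n, d1=np.log(4.0), d2=-np.inf, seed=0):
    """Generate (Z, X*, rho) for n subjects.

    d2 defaults to -inf (no censoring of X2) because its value is not
    stated in the available text; pass a value to censor X2 as well.
    """
    rng = np.random.default_rng(seed)

    # "Age" covariate: a beta(3, 2) draw rescaled to the range [18, 102].
    z = rng.beta(3.0, 2.0, size=n) * 84.0 + 18.0

    # Log-cytokines: bivariate normal whose mean depends linearly on Z.
    mean = np.column_stack([1.4 + 0.0087 * z, 2.4 - 0.0091 * z])
    cov = np.array([[2.5, 1.0], [1.0, 5.0]])
    x = mean + rng.multivariate_normal(np.zeros(2), cov, size=n)

    # Left-censoring at the DLs: record max(X, d) and an observed indicator.
    d = np.array([d1, d2])
    x_star = np.maximum(x, d)
    rho = x >= d
    return z, x_star, rho

z, x_star, rho = simulate_covariates(500)
censored_fraction_x1 = 1.0 - rho[:, 0].mean()
```

Censored entries of `x_star` sit exactly at the corresponding DL, and `rho` is the only record of which values were truly measured, which is the information an imputation procedure has to work from.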

Application to GenIMS data

We demonstrate the performance of our proposed multiple imputation and iterated multiple imputation methods by applying them to the data from the Genetic and Inflammatory Markers of Sepsis (GenIMS) study. One of the purposes of the GenIMS study was to identify the relationship between the survival time of patients with community acquired pneumonia (CAP) and several biomarkers for inflammatory responses in the body (Kellum et al., 2007). The data for the GenIMS study were obtained from

Discussion

We have proposed a multiple imputation method for handling covariates censored due to DLs in AFT survival models. We have proven that the proposed estimator is consistent and asymptotically normal, with standard errors that are relatively easy to estimate. We also suggested an iterated version of the multiple imputation procedure which provides potentially significant efficiency improvements. Through numerical studies, we demonstrated that the multiple imputation procedure and iterated multiple

Acknowledgments

The authors would like to express their appreciation to the editor, an associate editor, and two anonymous referees for their valuable comments. The authors would also like to thank Dr. Lan Kong of Penn State College of Medicine and the CRISMA (Clinical Research, Investigation, and Systems Modeling of Acute Illness) Center at the University of Pittsburgh for providing us with the GenIMS dataset. The research of Wang is supported by the NSF Award DMS-1007420 and NSF Career Award DMS-1149355. The

References (29)

  • S. Lee et al. The proportional hazards regression with a censored covariate. Statistics & Probability Letters (2003)
  • S.G. Arunajadai et al. Handling covariates subject to limits of detection in regression. Environmental and Ecological Statistics (2012)
  • P.C. Austin et al. Type I error inflation in the presence of a ceiling effect. The American Statistician (2003)
  • P.C. Austin et al. Estimating linear regression models in the presence of a censored independent variable. Statistics in Medicine (2004)
  • G.D. D’Angelo et al. An index approach for the Cox model with left censored covariates. Statistics in Medicine (2008)
  • P.W. Bernhardt et al. Statistical methods for generalized linear models with covariates subject to detection limits. Statistics in Biosciences (2013)
  • K. Doehler et al. ‘Smooth’ inference for survival functions with arbitrarily censored data. Statistics in Medicine (2008)
  • A.R. Gallant et al. Semi-nonparametric maximum likelihood estimation. Econometrica (1987)
  • Giovanini, J., 2008. Generalized linear mixed models with censored covariates. Ph.D. Thesis. Oregon State...
  • D.R. Helsel. Statistics for Censored Environmental Data Using Minitab and R (2012)
  • R.W. Hornung et al. Estimation of average concentration in the presence of nondetectable values. Applied Occupational and Environmental Hygiene (1990)
  • J.A. Kellum et al. Understanding the inflammatory cytokine response in pneumonia and sepsis. Archives of Internal Medicine (2007)
  • K. Langohr et al. A parametric survival model with an interval-censored covariate. Statistics in Medicine (2004)
  • T.A. Louis. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, Series B (1982)