Flexible modeling of survival data with covariates subject to detection limits via multiple imputation

doi:10.1016/j.csda.2013.07.027

Computational Statistics & Data Analysis

Volume 69, January 2014, Pages 81-91

https://doi.org/10.1016/j.csda.2013.07.027

Abstract

Models for survival data generally assume that covariates are fully observed. However, in medical studies it is not uncommon for biomarkers to be censored at known detection limits. A computationally-efficient multiple imputation procedure for modeling survival data with covariates subject to detection limits is proposed. This procedure is developed in the context of an accelerated failure time model with a flexible seminonparametric error distribution. The consistency and asymptotic normality of the multiple imputation estimator are established and a consistent variance estimator is provided. An iterative version of the proposed multiple imputation algorithm that approximates the EM algorithm for maximum likelihood is also suggested. Simulation studies demonstrate that the proposed multiple imputation methods work well while alternative methods lead to estimates that are either biased or more variable. The proposed methods are applied to analyze the dataset from a recently-conducted GenIMS study.

Introduction

Biomedical datasets frequently contain variables that are subject to censoring. Censoring due to detection limits (DLs) often occurs in practice when medical instruments are unable to measure a biological factor below a certain known value. In the motivating Genetic and Inflammatory Markers of Sepsis (GenIMS) study, it is of interest to model the survival times of patients with community acquired pneumonia using several biomarkers and demographic covariates. In this study, the survival times are subject to right censoring while three important biomarkers are left-censored below known DLs. Our main objective in this paper is to develop a computationally-efficient procedure for conducting inference in the context of survival models with covariates subject to DLs.

Traditionally, a few common approaches have been taken to handle censoring due to DLs. Most simply, a complete case approach can be applied, where data from individuals with censored covariates are completely discarded. While we prove in Section 3.1 that a complete case analysis leads to consistent estimates in accelerated failure time models when covariates are censored due to DLs, efficiency is lost due to the deletion of data. In an attempt to use all of the data, a second approach has become increasingly common where censored observations are replaced with a fixed value such as the DL, $DL / 2$, or $DL / \sqrt{2}$ (Hornung and Reed, 1990) or the conditional mean of the censored covariate (Austin and Hoch, 2004; Giovanini, 2008; Arunajadai and Rauh, 2012). These naive substitution methods have been demonstrated to produce biased parameter estimates in generalized linear models (Lynn, 2001; Austin and Brunner, 2003; Lubin et al., 2004; Rigobon and Stoker, 2007, 2009; Helsel, 2012; Bernhardt et al., in press) and survival models (D’Angelo and Weissfeld, 2008; Sattar et al., 2012).
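To make the fixed-value substitutions concrete, the following minimal sketch replaces censored readings with the DL, $DL / 2$, or $DL / \sqrt{2}$. The function name, argument names, and example readings are hypothetical and chosen only for illustration; this is the naive approach that the literature cited above reports as biased, not a recommended method.

```python
import math

def substitute_below_dl(x_obs, detected, dl, rule="half"):
    """Replace censored readings with a fixed fraction of the detection limit.

    x_obs    : recorded values (censored entries hold any placeholder)
    detected : True where the value was measured at or above the DL
    rule     : "dl" -> DL, "half" -> DL/2, "root2" -> DL/sqrt(2)
    """
    fill = {"dl": dl, "half": dl / 2.0, "root2": dl / math.sqrt(2.0)}[rule]
    return [x if seen else fill for x, seen in zip(x_obs, detected)]

# Hypothetical readings with a detection limit of 4.0
readings = [4.0, 7.5, 4.0, 12.1]
detected = [False, True, False, True]
filled_half = substitute_below_dl(readings, detected, dl=4.0, rule="half")
```

Every censored entry receives the same value, so the variability of the covariate below the DL is ignored entirely.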

A few researchers have considered alternative methods for handling censored covariates in survival models. Langohr et al. (2004) and Sattar et al. (2012) proposed fully-parametric survival models with a single interval-censored predictor. Lee et al. (2003) also considered survival models with a single covariate subject to DLs, though they proposed using semiparametric Cox proportional hazards models in which the relative risk function for the censored covariates is replaced by a nonparametric estimate of its expected value. More recently, D’Angelo and Weissfeld (2008) developed an indexing approach where censored covariate values are directly replaced by their conditional expectation given a linear combination of the fully observed covariates. While their method performs reasonably well, it is somewhat ad hoc and limited to cases in which no more than two covariates are subject to DLs.

In this paper, we develop a straightforward, computationally-efficient multiple imputation method for handling multiple covariates subject to DLs in the context of accelerated failure time (AFT) models for censored survival data. To increase flexibility in the AFT survival model, we recommend using the seminonparametric (SNP) distribution to model the error term. We establish the asymptotic consistency and normality of the multiple imputation estimator and propose a convenient variance estimation method. We additionally suggest an iterative version of this estimator which improves efficiency with only a few updates. Though multiple imputation has been studied in many missing-data problems, our development for censored covariates in flexible survival models is nonstandard. Through numerical studies, we demonstrate that our proposed estimator leads to unbiased estimates and is potentially more efficient than several competing methods. We additionally show that using the flexible SNP distribution is more robust than typical parametric methods.
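As a rough illustration of what imputing a value below a DL can look like, the sketch below draws from a normal distribution truncated above at the DL using inverse-CDF sampling. This is a simplified stand-in and not the authors' procedure: the imputation model of the paper is not reproduced in this excerpt, and the function name, the normality assumption, and the values of `mu` and `sigma` are all hypothetical.

```python
import math
import random
from statistics import NormalDist

def draw_below_dl(mu, sigma, dl, m, seed=0):
    """Draw m values from N(mu, sigma^2) restricted to (-inf, dl]."""
    rng = random.Random(seed)
    dist = NormalDist(mu, sigma)
    p_dl = dist.cdf(dl)  # probability mass lying below the detection limit
    if not 0.0 < p_dl < 1.0:
        raise ValueError("dl must leave probability strictly between 0 and 1 below it")
    draws = []
    for _ in range(m):
        u = (1.0 - rng.random()) * p_dl  # uniform on (0, p_dl]
        # min() guards against floating-point overshoot at the boundary
        draws.append(min(dist.inv_cdf(u), dl))
    return draws

# Five hypothetical imputations for one censored log-biomarker value
imputations = draw_below_dl(mu=2.0, sigma=1.5, dl=math.log(4), m=5)
```

Each of the `m` draws would fill the censored entry in one completed dataset; how the resulting estimates are combined and how their variance is estimated is specific to the paper and is not sketched here.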

The remainder of the paper is organized as follows. In Section 2, we review AFT models and the SNP distribution. We also briefly explain how to fit AFT models with an SNP error term. In Section 3, we develop the proposed multiple imputation methods and establish their asymptotic properties. In Section 4, we carry out extensive simulations to compare the performance of the proposed methods with several simpler approaches. In Section 5, we apply the proposed methods to the dataset from the GenIMS study. Finally, in Section 6, we discuss the limitations of the proposed multiple imputation methods and some avenues for further research. The technical details for the proposition and theorems appearing in this paper are provided in the online Supplementary material.

Section snippets

Seminonparametric accelerated failure time model

We first present the proposed seminonparametric accelerated failure time (SNP-AFT) model and discuss the algorithm for fitting the model when covariates are fully observed.

Problem set-up

For the remainder of this paper, we assume that some covariates in $W_{i}$ are subject to censoring due to lower DLs. Thus, for the $i$th individual, we let $W_{i} = (Z_{i}^{T}, X_{i}^{T})^{T}$, where $Z_{i} = (Z_{i1}, \dots, Z_{i(p-q)})^{T}$ is the $(p-q)$-dimensional vector of covariates fully observed for each individual and $X_{i} = (X_{i1}, \dots, X_{iq})^{T}$ is the $q$-dimensional vector of covariates subject to censoring below $d = (d_{1}, \dots, d_{q})^{T}$, the vector of DLs. For $X_{ij}$, $j = 1, \dots, q$, we only observe $X_{ij}^{*} = \max(X_{ij}, d_{j})$ and $\rho_{ij} = I(X_{ij} \geq d_{j})$, so that the complete set of observed
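The observation scheme defined above, $X_{ij}^{*} = \max(X_{ij}, d_{j})$ with indicator $\rho_{ij} = I(X_{ij} \geq d_{j})$, can be sketched for a single individual as follows. The function name and the example numbers are hypothetical.

```python
def observe_with_dl(x, d):
    """Apply lower detection limits d to the true covariate vector x.

    Returns (x_star, rho), where x_star[j] = max(x[j], d[j]) and
    rho[j] = 1 if x[j] >= d[j] (value observed), otherwise 0 (censored).
    """
    x_star = [max(xj, dj) for xj, dj in zip(x, d)]
    rho = [1 if xj >= dj else 0 for xj, dj in zip(x, d)]
    return x_star, rho

# One individual with q = 2 covariates subject to DLs (made-up values)
x_star, rho = observe_with_dl(x=[0.5, 3.2], d=[1.0, 2.0])
```

Here the first covariate falls below its DL, so only the DL itself is recorded together with a censoring indicator of 0; the second is observed exactly.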

Simulation

We conducted numerical studies to assess the performance of the proposed multiple imputation and iterated multiple imputation estimators. We set up the simulations to represent a situation similar to that in the application described in Section 5. Specifically, we generated the covariates $Z \sim \mathrm{Beta}(3, 2) \cdot 84 + 18$, to represent an “age” variable, and $\begin{pmatrix} X_{1} \\ X_{2} \end{pmatrix} \sim N\left\{ \begin{pmatrix} 1.4 + 0.0087 Z \\ 2.4 - 0.0091 Z \end{pmatrix}, \begin{pmatrix} 2.5 & 1 \\ 1 & 5 \end{pmatrix} \right\}$, to represent the log-cytokines TNF and IL-10. Observations of $X_{1}$ and $X_{2}$ were censored at the DLs $d_{1} = \log(4)$
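The covariate-generating step described above can be sketched with the Python standard library. The function name, dictionary keys, and seed are hypothetical. Because the DL for $X_{2}$ is cut off in this excerpt, the sketch censors only $X_{1}$ at $d_{1} = \log(4)$ and leaves $X_{2}$ uncensored; survival times are not generated.

```python
import math
import random

def simulate_covariates(n, d1=math.log(4), seed=0):
    """Generate Z, X1, X2 as in the simulation set-up; censor X1 below d1."""
    rng = random.Random(seed)
    # Cholesky factor of the covariance matrix [[2.5, 1], [1, 5]]
    l11 = math.sqrt(2.5)
    l21 = 1.0 / l11
    l22 = math.sqrt(5.0 - l21 ** 2)
    rows = []
    for _ in range(n):
        z = rng.betavariate(3, 2) * 84 + 18  # "age" variable on [18, 102]
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x1 = 1.4 + 0.0087 * z + l11 * e1
        x2 = 2.4 - 0.0091 * z + l21 * e1 + l22 * e2
        rows.append({
            "z": z,
            "x1_star": max(x1, d1),  # recorded value after applying the DL
            "rho1": int(x1 >= d1),   # 1 if X1 was observed, 0 if censored
            "x2": x2,                # left uncensored in this sketch
        })
    return rows

sample = simulate_covariates(500)
```

Correlating the two normal draws through the Cholesky factor reproduces the stated covariance matrix without requiring a numerical library.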

Application to GenIMS data

We demonstrate the performance of our proposed multiple imputation and iterated multiple imputation methods by applying them to the data from the Genetic and Inflammatory Markers of Sepsis (GenIMS) study. One of the purposes of the GenIMS study was to identify the relationship between the survival time of patients with community acquired pneumonia (CAP) and several biomarkers for inflammatory responses in the body (Kellum et al., 2007). The data for the GenIMS study were obtained from

Discussion

We have proposed a multiple imputation method for handling covariates censored due to DLs in AFT survival models. We have proven that the proposed estimator is consistent and asymptotically normal, with standard errors that are relatively easy to estimate. We also suggested an iterated version of the multiple imputation procedure which provides potentially significant efficiency improvements. Through numerical studies, we demonstrated that the multiple imputation procedure and iterated multiple

Acknowledgments

The authors would like to express their appreciation to the editor, an associate editor, and two anonymous referees for their valuable comments. The authors would also like to thank Dr. Lan Kong of Penn State College of Medicine and the CRISMA (Clinical Research, Investigation, and Systems Modeling of Acute Illness) Center at the University of Pittsburgh for providing us with the GenIMS dataset. The research of Wang is supported by the NSF Award DMS-1007420 and NSF Career Award DMS-1149355. The

References (29)

S. Lee et al.
The proportional hazards regression with a censored covariate
Statistics & Probability Letters
(2003)
S.G. Arunajadai et al.
Handling covariates subject to limits of detection in regression
Environmental and Ecological Statistics
(2012)
P.C. Austin et al.
Type I error inflation in the presence of a ceiling effect
The American Statistician
(2003)
P.C. Austin et al.
Estimating linear regression models in the presence of a censored independent variable
Statistics in Medicine
(2004)
G.D. D’Angelo et al.
An index approach for the Cox model with left censored covariates
Statistics in Medicine
(2008)
P.W. Bernhardt et al.
Statistical methods for generalized linear models with covariates subject to detection limits
Statistics in Biosciences
(2013)
K. Doehler et al.
‘Smooth’ inference for survival functions with arbitrarily censored data
Statistics in Medicine
(2008)
A.R. Gallant et al.
Semi-nonparametric maximum likelihood estimation
Econometrica
(1987)
Giovanini, J., 2008. Generalized linear mixed models with censored covariates. Ph.D. Thesis. Oregon State...
D.R. Helsel
Statistics for Censored Environmental Data Using Minitab and R
(2012)
R.W. Hornung et al.
Estimation of average concentration in the presence of nondetectable values
Applied Occupational and Environmental Hygiene
(1990)
J.A. Kellum et al.
Understanding the inflammatory cytokine response in pneumonia and sepsis
Archives of Internal Medicine
(2007)
K. Langohr et al.
A parametric survival model with an interval-censored covariate
Statistics in Medicine
(2004)
T.A. Louis
Finding the observed information matrix when using the EM algorithm
Journal of the Royal Statistical Society, Series B
(1982)

