Published by De Gruyter, May 26, 2016

Data-Adaptive Bias-Reduced Doubly Robust Estimation

  • Karel Vermeulen and Stijn Vansteelandt

Abstract

Doubly robust estimators have now been proposed for a variety of target parameters in the causal inference and missing data literature. These consistently estimate the parameter of interest under a semiparametric model when one of two nuisance working models is correctly specified, regardless of which. The recently proposed bias-reduced doubly robust estimation procedure aims to partially retain this robustness in more realistic settings where both working models are misspecified. These so-called bias-reduced doubly robust estimators make use of special (finite-dimensional) nuisance parameter estimators that are designed to locally minimize the squared asymptotic bias of the doubly robust estimator in certain directions of these finite-dimensional nuisance parameters under misspecification of both parametric working models. In this article, we extend this idea to incorporate the use of data-adaptive estimators (infinite-dimensional nuisance parameters), by exploiting the bias reduction estimation principle in the direction of only one nuisance parameter. We additionally provide an asymptotic linearity theorem which gives the influence function of the proposed doubly robust estimator under correct specification of a parametric nuisance working model for the missingness mechanism/propensity score but a possibly misspecified (finite- or infinite-dimensional) outcome working model. Simulation studies confirm the desirable finite-sample performance of the proposed estimators relative to a variety of other doubly robust estimators.

1 Introduction

The estimation of most statistical parameters demands the postulation of so-called nuisance working models: models that are not of primary interest but needed to obtain a well-behaved estimator. A typical concern is then whether misspecification of these models may induce bias in the resulting estimator of the target parameter. Doubly robust estimators alleviate this concern: they consistently estimate the target parameter when at least one of two nuisance working models is correctly specified, regardless of which [1].

Doubly robust estimators are now available for many statistical parameters and have become very popular over recent years. This is due on the one hand to their robustness against model misspecification, and on the other to their desirable efficiency properties in many settings (see e.g., Tsiatis et al. [2], Moore and van der Laan [3], Vermeulen et al. [4]) and to the fact that they form a compromise between competing estimators that each rely on a single but different working model. In spite of this, doubly robust estimators have also been the subject of recent debate [5]. The reason is that model misspecification is likely to affect all working models in practice, thus making the double-protection property appear of mainly academic interest. Initial reports suggested that modest misspecification of both working models might still deliver doubly robust estimators with reasonably small bias [1, 6]. However, later simulation studies signaled that their (finite-sample) bias and variance can sometimes get severely amplified under misspecification of at least one working model and may become especially severe under misspecification of both working models [5, 7].

Vermeulen and Vansteelandt [8] investigated the usefulness of doubly robust estimators from the perspective that all models are wrong. They discovered that, interestingly, some doubly robust estimators partially retain their robustness properties even under misspecification of both working models. In particular, they studied the asymptotic bias of doubly robust estimators for a scalar parameter. They found that without knowledge of the true data-generating law, the asymptotic bias of these estimators can typically be locally minimized in certain directions of the finite-dimensional nuisance parameters. This is possible by making use of specific estimators of the nuisance parameters, which target bias reduction. Vermeulen and Vansteelandt [8] referred to their procedure as bias-reduced doubly robust estimation.

Vermeulen and Vansteelandt [8] contrasted their bias-reduced doubly robust estimators with a variety of other doubly robust estimators that are primarily aimed at variance reduction under misspecification of one working model. They found their estimators to be highly competitive, although sometimes slightly more biased than so-called targeted maximum likelihood estimators (TMLE, van der Laan and Rubin [9]). TMLE, an estimation principle that may be used for pathwise differentiable target parameters in semiparametric models, reduces bias by making clever use of data-adaptive learning algorithms (infinite-dimensional nuisance working models) such as ensemble learning, and specifically super-learning [10]. In this article, we will investigate how such data-adaptive learning algorithms can be integrated in the theory of bias-reduced doubly robust estimation to aim at further bias reduction. This will also overcome one of the limitations of the proposal by Vermeulen and Vansteelandt [8], that both nuisance working models must be indexed by nuisance parameters of the same dimension.

The article is organized as follows. In Section 2, we introduce some notation and the inferential problem at stake: estimation of a population mean outcome in the presence of outcome missingness that is explainable by measured auxiliary covariates; equivalently, the estimation of a mean counterfactual outcome in the presence of data on all relevant confounders. In Section 3, we briefly review the theory on bias-reduced doubly robust estimation in the context of the example given in Section 2. Next, in Section 4, we outline the proposed extension of the bias-reduced doubly robust estimator to incorporate data-adaptive learning algorithms. The motivation behind this proposal is to fluctuate an initial data-adaptive estimator for the outcome-covariate associations in a way that aims to reduce the bias in the direction of the finite-dimensional parameter indexing the propensity score model. We therefore call this procedure data-adaptive bias-reduced doubly robust estimation. We also show how to perform inference based on the asymptotic linearity of the estimator (under certain regularity conditions to allow for data-adaptive estimation of the outcome working model and the condition that the parametric model for the propensity score is correct). In Section 5, we illustrate the performance of our proposal relative to the original bias-reduced doubly robust estimator and the TMLE procedure [9]. In Section 6, we illustrate the generic nature of the proposal and show how to implement the proposed strategy in a linear instrumental variable analysis. We end with a discussion in Section 7. Finally, these methods are implemented in an R-function, given in Appendix B.

1.1 Some relevant literature overview

The discussions on the Kang and Schafer article [5, 11–15] revealed that many different doubly robust estimators may be obtained for a given target parameter by varying the choice of nuisance working model estimators. Different choices may lead to doubly robust estimators which all have potentially very different behavior and properties under misspecification of at least one working model but are asymptotically equivalent under correct specification of both working models. As a result, many alternatives to maximum likelihood estimation (MLE) have now been proposed in the literature for the estimation of the nuisance parameters/nuisance working models that index certain doubly robust estimators. Rubin and van der Laan [16], Cao et al. [17] and Tsiatis et al. [18] developed doubly robust estimators in specific missing data models which target minimal asymptotic variance when the missingness model is correctly specified. The TMLE (targeted maximum likelihood estimation) procedure [9, 19] and the procedures of Tan [20] and Rotnitzky et al. [21] guarantee doubly robust estimators that fall within the parameter range, with the latter procedures also having desirable efficiency properties. With the exception of TMLE, all these proposals focus on improving the efficiency of doubly robust estimators under misspecification of the working model for the dependence of the observed outcome on covariates/confounders in missing data/causal inference models. The collaborative TMLE (C-TMLE) procedure of van der Laan and Gruber [22] additionally focuses on the estimation of the missingness/exposure probabilities in a way that aims to attain an optimal bias-variance trade-off. In contrast, Vermeulen and Vansteelandt [8] focus on bias reduction under the constraints of parametric working models. For a review of many of the aforementioned procedures, we refer to Rotnitzky and Vansteelandt [23]; for a comparison of the performance of many of these alternatives, we refer to Tan [20], Porter et al. [24] and Vermeulen and Vansteelandt [8].

2 Doubly robust estimation of a population mean with incomplete data

Consider the i.i.d. sample o=(O1,…,On) of size n, where the observed data vector O=(RY,R,X) is distributed according to a true underlying but unknown probability distribution P0. We let Y denote the outcome of interest which is susceptible to missingness, formalized using the missingness indicator R where R=1 if Y is observed and R=0 if Y is missing. We assume that missingness is explainable by X, a collection of auxiliary covariates, so that the missing at random (MAR, Rubin [25]) assumption holds: Y⟂R|X, where we use ⟂ to denote statistical independence. Throughout, for some parameter λ, we use λ0 to denote its (unknown) population value; in particular, E(Y)=ψ0. Furthermore, for a function f of the observed data and a probability distribution P, we let Pf denote the integral ∫f dP.

Doubly robust estimation of ψ0 requires specification of two nuisance working models [26]. First, we need a working model q={Qˉ(X)|Qˉ in some class of functions} for the true conditional mean outcome Qˉ0(X)=E0(Y|X) (which equals E0(Y|R=1,X) because of MAR). Let m(q) denote the statistical model for the joint distribution of O implied by the restrictions imposed by q. Let Q¯^n(X) denote an estimator of Qˉ0(X) with probability limit Qˉ*(X); that is, Q¯^n(X) →p Qˉ*(X). Under m(q), Qˉ*(X)=Qˉ0(X), but not otherwise. Second, we need a working model g={g(X)|g in some class of functions} for the true missingness mechanism g0(X)=P0(R=1|X), for which we assume positivity: g0(X)≥δ>0 with probability one (van der Laan and Rose [19, chap. 10]). Let m(g) denote the statistical model for the joint distribution of O implied by the restrictions imposed by the working model g. Let gˆn(X) denote an estimator of g0(X) with probability limit g*(X); that is, gˆn(X) →p g*(X). Under m(g), g*(X)=g0(X). A doubly robust estimator of ψ0 is then obtained via

(1) ψ^n(Q¯^n,g^n) = (1/n) Σ_{i=1}^n [ Ri/g^n(Xi) {Yi − Q¯^n(Xi)} + Q¯^n(Xi) ].
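For concreteness, a minimal R sketch of estimator (1), assuming vectors r and y holding the missingness indicator and the (partially missing) outcome, and vectors Qhat and ghat holding fitted values of Qˉ0(X) and g0(X) for all n subjects:

    # Doubly robust (AIPW) estimate of E(Y), following eq. (1)
    dr_estimate <- function(r, y, Qhat, ghat) {
      y0 <- ifelse(r == 1, y, 0)   # missing outcomes never contribute: r/ghat = 0 when r = 0
      mean(r / ghat * (y0 - Qhat) + Qhat)
    }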

It can easily be verified from the above expression that this estimator is consistent for ψ0 under the union model m(q)∪m(g): as soon as at least one of the two working models (not necessarily both) is correctly specified. Interestingly (provided sufficient regularity for the working models), it is also locally efficient [27] at the intersection model m(q)∩m(g) in the following sense: it has smallest asymptotic variance within the class of all estimators that are consistent and asymptotically normal under m(g), provided that also m(q) is correctly specified. At the intersection model, it has the following simple expansion

(2) ψ^n(Q¯^n,g^n) − ψ0 = (Pn − P0){D*(Q¯0,g0;ψ0)} + Rn,

with (under sufficient regularity on the working models q and g, see for instance Appendix A.1.1 of van der Laan and Rose [19]) the remainder Rn=op(1/√n), Pn the empirical distribution which puts mass 1/n on each observation Oi, i=1,…,n and D*(Qˉ0,g0;ψ0) the efficient influence function, given by

(3) D*(Qˉ0,g0;ψ0)(O) = R/g0(X) {Y − Qˉ0(X)} + Qˉ0(X) − ψ0.

The expansion (2) is attractive because it (and thus the asymptotic distribution of the doubly robust estimator) is the same when both nuisance working models are correctly specified, no matter how the nuisance parameters are estimated and no matter whether they are estimated or known. Finally, note that by construction

(4)Pn[D*{Q¯^n,g^n;ψ^n(Q¯^n,g^n)}]=0.

3 Bias-reduced doubly robust estimation

Bias-reduced doubly robust estimation is a generic estimation strategy for the nuisance parameters indexing parametric nuisance working models g and q, aimed at bias reduction under misspecification of both these models. In Section 3.1, we will introduce such parametric models in the context of the missing data example. In Section 3.2, we will then review the bias-reduced estimation principle.

3.1 Parametric nuisance working models and MLE

Suppose we parameterize the working model for the conditional mean outcome by a q-dimensional parameter γ: qγ={Qˉ(γ)(X)|γ∈ℝq} with Qˉ(γ)(X)=Q{γTk(X)}, Q an appropriate inverse link function and k=(1,k1,…,kq−1); e.g., a linear regression model Qˉ(γ)(X)=γ1+γ2TX for a continuous outcome Y. If the model m(qγ) is correctly specified and thus includes P0, we let γ0 be such that Qˉ(γ0)=Qˉ0. Further, let the p-dimensional parameter α parameterize the working model for the missingness mechanism: gα={g(α)(X)|α∈ℝp} where g(α)(X)=G{αTl(X)}, G is an appropriate inverse link function and l=(1,l1,…,lp−1); e.g., a logistic regression model g(α)(X)=expit(α1+α2TX), expit(x)=1/(1+e−x). If the model m(gα) is correctly specified and thus includes P0, we let α0 be such that g(α0)=g0.

Root-n consistent and asymptotically normal estimators γˆn and αˆn for the nuisance parameters γ and α can be obtained as solutions to estimating equations PnDQˉ(γˆn)=0 and PnDg(αˆn)=0 with the estimating functions DQˉ and Dg such that P0{DQˉ(γ0)}=0 if P0∈m(qγ) and P0{Dg(α0)}=0 if P0∈m(gα).

Throughout, we will assume that γˆn →p γ* for some γ* (with γ*=γ0 under model m(qγ)) and αˆn →p α* for some α* (with α*=α0 under model m(gα)) and, moreover, that these estimators are asymptotically linear with influence functions −[P0{Dγ,Qˉ(γ*)}]−1 DQˉ(γ*) and −[P0{Dα,g(α*)}]−1 Dg(α*), where Dγ,Qˉ(γ*)=∂DQˉ(γ)/∂γ|γ=γ* and Dα,g(α*)=∂Dg(α)/∂α|α=α*.

In practice, maximum likelihood estimation (MLE) (or least squares) is routinely employed to estimate the nuisance parameters. The MLEs ˆγnMLE and αˆnMLE solve the estimating equations PnDQˉMLE(ˆγnMLE)=0 and PnDgMLE(αˆnMLE)=0 with

(5) DQ¯MLE(γ)(O) = R{Y − Q¯(γ)(X)} Q'{γTk(X)} k(X),
(6) DgMLE(α)(O) = [{R − g(α)(X)}/(g(α)(X){1 − g(α)(X)})] G'{αTl(X)} l(X),
with Q'(x)=∂Q(x)/∂x and G'(x)=∂G(x)/∂x. A doubly robust estimator is then given by ψ^nDR,MLE ≡ ψ^n(Q¯^nMLE,g^nMLE), with Q¯^nMLE ≡ Q¯(γ^nMLE) and gˆnMLE ≡ g(αˆnMLE). Although MLE is asymptotically efficient and optimal for these nuisance parameters with respect to the corresponding working model, it need not be optimal with respect to the target parameter ψ0 under misspecification of one of these models. Under such misspecification, the influence function of the doubly robust estimator (and thus its asymptotic distribution) becomes indeed dependent on the choice of root-n consistent estimators of the nuisance parameters (see e.g., Vermeulen and Vansteelandt [8], Proposition 1). This raises the question how to best fit the nuisance working models.
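As an illustration, the standard MLE-based fits and the resulting ψˆnDR,MLE could be computed along the following lines in R (a sketch assuming a data frame dat with columns y, the outcome with NA when missing, r, the missingness indicator, and a single covariate x, with identity link Q and logistic link G):

    fit_g <- glm(r ~ x, family = binomial, data = dat)   # working model for g0(X)
    fit_Q <- lm(y ~ x, data = dat, subset = r == 1)       # working model for Qbar0(X), complete cases
    ghat  <- predict(fit_g, type = "response")
    Qhat  <- predict(fit_Q, newdata = dat)
    psi_dr_mle <- mean(dat$r / ghat * (ifelse(dat$r == 1, dat$y, 0) - Qhat) + Qhat)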

3.2 Bias-reduced doubly robust estimation

3.2.1 Estimation principle

Vermeulen and Vansteelandt [8] aim to reduce the squared asymptotic bias of the doubly robust estimator under misspecification of both working models by making use of particular nuisance parameter estimators. Their starting point is that under possible misspecification of qγ and gα, for arbitrary but fixed nuisance parameter values γ and α, the asymptotic bias of the doubly robust estimator, as a function of P0 and these nuisance parameters γ and α, is given by bias(γ,α;ψ0)=P0D*{Qˉ(γ),g(α);ψ0}. This bias is zero under the union model m(qγ)∪m(gα), so when we either take γ=γ0 or α=α0. Away from this model, locally minimizing bias2(γ,α;ψ0) – if a local minimum exists – requires finding the values (γBR,T,αBR,T)T that solve the equations P0Dγ*{Qˉ(γBR),g(αBR)}=0 and P0Dα*{Qˉ(γBR),g(αBR)}=0 (Vermeulen and Vansteelandt [8], Theorem 1); here, Dγ*{Qˉ(γ),g(α)} denotes the gradient ∂D*{Qˉ(γ),g(α);ψ0}/∂γ and Dα*{Qˉ(γ),g(α)} denotes the gradient ∂D*{Qˉ(γ),g(α);ψ0}/∂α. This is so because these values γBR and αBR ensure that ∂bias2(γ,αBR;ψ0)/∂γ|γ=γBR=0 and ∂bias2(γBR,α;ψ0)/∂α|α=αBR=0.

The unknown values γBR and αBR can be estimated as the solutions ˆγnBR and αˆnBR to the estimating equations

(7) Pn Dα*{Qˉ(ˆγnBR),g(αˆnBR)} = 0,
(8) Pn Dγ*{Qˉ(ˆγnBR),g(αˆnBR)} = 0.

Here, the function Dα*{Qˉ(γ),g(α)} is used as an estimating function for γ and the function Dγ*{Qˉ(γ),g(α)} as an estimating function for α. Vermeulen and Vansteelandt [8] (Theorem 2) show that γˆnBR →p γBR and αˆnBR →p αBR with γBR=γ0 under m(qγ) and αBR=α0 under m(gα). This can be understood upon noting that the double robustness implies that P0Dγ*{Qˉ(γ),g(α0)}=0 for all γ under m(gα) and P0Dα*{Qˉ(γ0),g(α)}=0 for all α under m(qγ). Clearly, computation of these estimators γˆnBR and αˆnBR of γBR and αBR does not demand knowledge of the full data law. The bias-reduced doubly robust estimator is now given by ψ^nBR ≡ ψ^n(Q¯^nBR,g^nBR), with Q¯^nBR=Q¯(γ^nBR) and gˆnBR=g(αˆnBR).
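In practice, the system (7)–(8) can be solved numerically. The following base-R sketch finds a joint root by minimizing the squared norm of the stacked empirical estimating equations; the functions Dalpha_bar(gamma, alpha) and Dgamma_bar(gamma, alpha), which should return Pn Dα*{Qˉ(γ),g(α)} and Pn Dγ*{Qˉ(γ),g(α)} for the chosen working models, are hypothetical user-supplied functions:

    solve_br <- function(Dalpha_bar, Dgamma_bar, gamma_init, alpha_init) {
      q <- length(gamma_init)
      obj <- function(theta) {                 # squared norm of the stacked equations (7)-(8)
        gamma <- theta[seq_len(q)]
        alpha <- theta[-seq_len(q)]
        sum(c(Dalpha_bar(gamma, alpha), Dgamma_bar(gamma, alpha))^2)
      }
      fit <- optim(c(gamma_init, alpha_init), obj, method = "BFGS")
      list(gamma_br = fit$par[seq_len(q)],
           alpha_br = fit$par[-seq_len(q)],
           criterion = fit$value)              # should be (numerically) zero at a joint root
    }

A value of criterion far from zero signals convergence to a local minimum that is not a root, so the solution should be checked or the solver restarted from different starting values.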

3.2.2 Illustration: estimation of a population mean with incomplete data

We illustrate the bias-reduced doubly robust estimation principle for the missing data problem discussed in Section 2. Consider the working models qγ and gα from Section 3.1 but with k and l of the same dimension (i.e., such that q=p). Estimators for the nuisance parameters (ˆγnBR,T,αˆnBR,T)T are then obtained by solving (7) and (8):

(9) (1/n) Σ_{i=1}^n Ri{Yi − Q¯(γ^nBR)(Xi)} [G'{α^nMLE,Tl(Xi)}/G2{α^nMLE,Tl(Xi)}] l(Xi) = 0,
(10) (1/n) Σ_{i=1}^n {1 − Ri/g(α^nBR)(Xi)} Q'{γ^nBR,Tk(Xi)} k(Xi) = 0.

Vermeulen and Vansteelandt [8] show how (10) can be solved by optimizing an integrated estimating equation; (9) can be solved via weighted least squares based on the complete cases. They moreover discuss various desirable properties of the doubly robust estimator based on these nuisance parameter estimators. Here, we develop visual insight using two simple examples.

Example 1

Consider a sample o of size n=10^5, which we take to be large so that we can ignore sampling variability. In this first example, for each individual i, we let Xi ∼ N(0,1), Ri|Xi ∼ Ber{π0(Xi)} with π0(Xi)=expit(1+Xi^3) and Yi|Xi ∼ N(Xi^2,1). The true mean outcome equals ψ0=1. One-dimensional working models of the form g(α)(X)=expit(αX) and Qˉ(γ)(X)=γX are misspecified. We obtain αˆnMLE=1.115 and γˆnMLE=1.606, leading to ψˆnDR,MLE=0.237 and bias(αˆnMLE,γˆnMLE;1)=−0.763. The bias-reduced doubly robust estimation strategy yields αˆnBR=2.254 and γˆnBR=−0.959, leading to ψˆnBR=0.663 and a smaller bias (in absolute value) of bias(αˆnBR,γˆnBR;1)=−0.337, as theoretically expected.

Figure 1 shows a contour plot of the log of the squared asymptotic bias as a function of the nuisance parameters α and γ. It thus indicates for each combination of fixed nuisance parameter values γ and α the magnitude of the asymptotic bias of the doubly robust estimator for these choices of γ and α. Darker values indicate bias closer to zero. The cross "×" indicates the point (αˆnMLE,γˆnMLE)=(1.115,1.606), which approximates (α*MLE,γ*MLE) and the bullet "•" indicates the point (αˆnBR,γˆnBR)=(2.254,−0.959), which approximates (α*BR,γ*BR). Figure 1 illustrates the defining property of the bias-reduced estimation principle: as one varies the nuisance parameter values allowed by the chosen working models, the squared asymptotic bias of the doubly robust estimator is locally minimized at (2.254,−0.959)≈(α*BR,γ*BR) in certain directions of γ and α. Bias-reduction is indeed local, as we observe that there are two other regions where even smaller bias near zero is attained. The location of these regions depends strongly on the underlying data generating mechanism P0, so that they cannot be obtained by solving an estimating equation without further knowledge of the true data-generating mechanism.□
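A self-contained R sketch of example 1 is given below; it simulates the data-generating mechanism, computes the MLE fits of the misspecified working models and then the bias-reduced fits. Because the outcome working model is linear, eq. (10) does not involve γ and can be solved first; eq. (9) then amounts to a weighted least-squares fit among the complete cases, where the weight G'/G2 = exp(−αX) is evaluated at the bias-reduced α, an implementation choice of this sketch consistent with jointly solving eqs (7) and (8):

    set.seed(1)
    n <- 1e5
    x <- rnorm(n)
    r <- rbinom(n, 1, plogis(1 + x^3))          # true missingness mechanism
    y <- ifelse(r == 1, rnorm(n, x^2, 1), NA)   # true outcome model; psi0 = E(Y) = 1

    # MLE fits of the misspecified working models expit(alpha*x) and gamma*x
    alpha_mle <- coef(glm(r ~ 0 + x, family = binomial))
    gamma_mle <- coef(lm(y ~ 0 + x, subset = r == 1))

    # Bias-reduced fits: eq. (10) solved for alpha, then eq. (9) as weighted least squares
    alpha_br <- uniroot(function(a) mean((1 - r / plogis(a * x)) * x), c(-10, 10))$root
    gamma_br <- coef(lm(y ~ 0 + x, weights = exp(-alpha_br * x), subset = r == 1))

    dr <- function(alpha, gamma)                # doubly robust estimator (1)
      mean(r / plogis(alpha * x) * (ifelse(r == 1, y, 0) - gamma * x) + gamma * x)
    dr(alpha_mle, gamma_mle)                    # psi-hat DR,MLE
    dr(alpha_br, gamma_br)                      # psi-hat BR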

Figure 1: Contour plot of the log of the squared asymptotic bias log{bias2(α,γ;1)} as a function of the nuisance parameters α and γ for example 1 with × = (1.115, 1.606) ≈ (α*MLE, γ*MLE), • = (2.254, −0.959) ≈ (α*BR, γ*BR) and ▲ = (1.115, 0.169) ≈ (α*MLE, γ*MLE−BR).

Example 2

To illustrate that the location of the aforementioned regions depends on the underlying data generating mechanism, we show a similar plot for a second example. With n=10^5, for each individual i, we let Xi ∼ N(1,1), Ri|Xi ∼ Ber{π0(Xi)} with π0(Xi)=expit(1+Xi^2) and Yi|Xi ∼ N(Xi^2,1). Under this data generating mechanism, the true mean outcome equals ψ0=2.

In Figure 2, the cross "×" now indicates the point (αˆnMLE,γˆnMLE)=(0.737,2.157), which approximates (α*MLE,γ*MLE), leading to ψˆnDR,MLE=2.393 and bias(αˆnMLE,γˆnMLE;2)=0.393. The bullet "•" indicates the point (αˆnBR,γˆnBR)=(0.609,1.220), which approximates the probability limits (α*BR,γ*BR), leading to ψˆnBR=2.316 and a smaller bias given by bias(αˆnBR,γˆnBR;2)=0.316. The behavior in this second example (see Figure 2) is similar to the behavior in the first example (see Figure 1). However, do note that the regions where smaller bias near zero is attained are located in a different position, which cannot be tracked down by solving an estimating equation without further knowledge of the true data-generating mechanism.□

Figure 2: Contour plot of the log of the squared asymptotic bias log{bias2(α,γ;2)} as a function of the nuisance parameters α and γ for example 2 with × = (0.737, 2.157) ≈ (α*MLE, γ*MLE), • = (0.609, 1.220) ≈ (α*BR, γ*BR) and ▲ = (0.737, 0.859) ≈ (α*MLE, γ*MLE−BR).

As previously noted, it follows from eq. (2) that small perturbations (of the order one over root-n) in the nuisance parameters γ and α do not affect the first-order asymptotic behavior of the doubly robust estimator when the intersection model m(qγ)∩m(gα) holds. Besides targeting bias reduction, the bias-reduced estimation procedure extends this property to working model misspecification. This is visually suggested by both Figures 1 and 2 which show that the estimators αˆnBR and γˆnBR are situated in a region where small perturbations of the nuisance parameters do not heavily affect the doubly robust estimator (in contrast to the regions that deliver near-zero bias, which are very unstable). This is unlike the MLE of the nuisance parameters. More formally, the doubly robust estimator ψˆnBR is first-order ancillary [28] under misspecification of the working models (Vermeulen and Vansteelandt [8], Corollary 1). This means that even under possible model misspecification, ψˆnBR still admits the expansion ψˆnBR − ψ0 = (Pn − P0)[D*{Qˉ(γBR),g(αBR);ψ0}] + op(1/√n) and thus ψˆnBR and ψˆn{Qˉ(γBR),g(αBR)} are asymptotically equivalent. This moreover implies that the asymptotic variance of the doubly robust estimator ψˆnBR can be straightforwardly estimated as one over n times the sample variance of the values D*{Qˉ(ˆγnBR),g(αˆnBR);ψˆnBR}, without having to adjust for the estimation of the nuisance parameters.

A limitation of the bias-reduced estimation procedure is that it is confined to parametric working models of the same dimension. Equal dimensions are indeed required because the gradient of D* with respect to α is used as an estimating function for γ and vice versa. This can be overcome by targeting asymptotic bias reduction in the direction of a single nuisance parameter, for instance in the direction of α. The effect of this is best understood from Figures 1 or 2. For a given value α˜ of α, e.g., the MLE αˆnMLE, and for each value of γ, we evaluate how the bias of the doubly robust estimator changes as we move away from the chosen value α˜. We then choose the outcome regression parameter γ at which no change in the squared asymptotic bias is seen. For example 1, with MLE αˆnMLE=1.115, this leads to γˆnMLEBR=0.169, ψˆnMLEBR=0.518 and bias(αˆnMLE,γˆnMLEBR;1)=−0.482. In Figure 1, this point (αˆnMLE,γˆnMLEBR)=(1.115,0.169) is marked by means of the triangle "▲". This is the point where no change of bias is seen in the direction of α. Note that this bias lies in absolute value between that of the MLE-based estimator and that of the bias-reduced doubly robust estimator. A qualitatively different result is seen for example 2. With MLE αˆnMLE=0.737, we find γˆnMLEBR=0.859, ψˆnMLEBR=2.302 and bias(αˆnMLE,γˆnMLEBR;2)=0.302, which is slightly smaller than the bias of the bias-reduced doubly robust estimator. This is an artefact of the true underlying data-generating mechanism P0; we are not aware of theoretical reasons for this bias to be smaller (see example 1, where it is larger). In the next section, we will explain how we can use this idea of targeting asymptotic bias reduction in one direction to be able to make use of data-adaptive learning algorithms.

4 Extending bias-reduced doubly robust estimation beyond parametric working models

In this section, we will relax the restriction to parametric working models by making use of data-adaptive learning algorithms to estimate the conditional mean outcome, as in TMLE. We will continue to work with a parametric working model for the missingness mechanism because of the possible concern that a too flexible, data-adaptive model specification may result in near-positivity violations and thereby distort the performance of the doubly robust estimator; as noted by a referee, this could alternatively be prevented via cross-validation [29]. With concern for bias under such parametric working model, we will apply the bias-reduction principle of Section 3.2 in the direction of the nuisance parameters indexing that working model. A positive side effect of our proposed procedure will be that it no longer constrains the dimensions of both working models to be the same.

4.1 Main idea

Our proposal is to start with a parametric model for the missingness mechanism g0(X), indexed by α, and an estimator αˆn of α, e.g., the MLE. Next, an estimator for the conditional mean outcome Qˉ0(X) is obtained, either by maximum likelihood estimation under a parametric model or by using data-adaptive learning algorithms such as super-learning [10]. With concern for misspecification of the missingness model, we next fluctuate this initial estimator of Qˉ0(X) through a parametric fluctuation model, indexed by a finite-dimensional parameter ε of at least the dimension of α. This model includes covariates that are chosen in such a way that the score of the fluctuation parameter ε equals the gradient Dα{Qˉ(γ),g(α)}. By doing so, we aim to reduce the asymptotic bias of the doubly robust estimator in the direction of α, as explained in Section 3.2. In Section 4.2, we will detail how this can be done. Note that we can thus allow for a missingness model of arbitrary dimension by exploiting the bias-reduced estimation principle in only a single dimension.

The idea of extending an initial fit of the conditional mean outcome is not new: this idea was already proposed in Bang and Robins [6] and it also underlies the TMLE procedure of van der Laan and Rubin [9], as well as other improved doubly robust estimation procedures [20, 21]. Our proposal is nonetheless different in that it explicitly targets bias reduction (in the direction of α). In contrast, the TMLE procedure aims at obtaining a doubly robust substitution estimator and the procedures by Tan [20] and Rotnitzky et al. [21] target a bounded doubly robust estimator with desirable efficiency properties.

4.2 Practical implementation of the procedure

The extension of the bias-reduced doubly robust estimation procedure can be implemented using the following four steps.

Step 1: Estimator g^nMLE for the missingness mechanism g0(X).

Postulate a parametric working model for g0(X): gα={g(α)(X)|α∈ℝp} where g(α)(X)=G{αTl(X)} and G is an appropriate inverse link function and l=(1,l1,…,lp−1), e.g., the logistic regression model G(·)=expit(·). Obtain the MLE αˆnMLE solving

PnDgMLE(αˆnMLE)=0,

with DgMLE(α) given by eq. (6). Let gˆnMLE=g(αˆnMLE).

Step 2: Initial estimator Q¯^n0 for the conditional mean outcome Q¯0.

The second step of the procedure is to obtain an initial estimator for Qˉ0. We describe two possibilities: (1) a parametric model and (2) a super-learner.

  1. Parametric Working Model. The first option is to postulate a parametric working model for Qˉ0(X): qγ={Qˉ(γ)(X)|γ∈ℝq} with Qˉ(γ)(X)=Q{γTk(X)}, Q an appropriate inverse link function and k=(1,k1,…,kq−1); e.g., a linear regression model with Q the identity link. Obtain the MLE ˆγnMLE solving

PnDQˉMLE(ˆγnMLE)=0,

with DQˉMLE(γ) given by eq. (5). Let Q¯^nMLE=Q¯(γ^nMLE).

  2. Super-Learner. Another option is to obtain an initial estimator based on data-adaptive learning algorithms, such as super-learning [10]. The super-learner is a machine learning algorithm which starts from a library of estimators {Q¯^j|j=1,…,J}, which may consist of both parametric and nonparametric estimators. It then considers the family of all weighted averages of these estimators: Q¯^ω = Σ_{j=1}^J ωj Q¯^j, ω=(ω1,…,ωJ)T, Σ_{j=1}^J ωj=1 and ωj≥0 for j=1,…,J. Next, the optimal weight vector ωˆn is defined to be the choice of ω that minimizes the cross-validated risk with respect to some loss function lSL: (O,Qˉ) ↦ lSL(Qˉ)(O) ∈ ℝ which satisfies Qˉ0=argminQˉ E{lSL(Qˉ)}, e.g., the squared error loss function l2(Qˉ)(O)=R{Y − Qˉ(X)}2. The super-learner of Qˉ0 is defined as the estimator Q¯^nSL=Q¯^ωˆn.

    For further reference, we let the initial estimator Q¯^n0 denote either Q¯^nMLE or Q¯^nSL.
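As an illustration of the second option, a super-learner fit could be obtained along the following lines (a sketch assuming the SuperLearner R package, a data frame dat with outcome y, missingness indicator r and covariates in the remaining columns, and a small illustrative library of learners, each of whose wrappers requires its underlying package):

    library(SuperLearner)
    covars <- setdiff(names(dat), c("y", "r"))
    sl_fit <- SuperLearner(Y = dat$y[dat$r == 1],
                           X = dat[dat$r == 1, covars, drop = FALSE],
                           newX = dat[, covars, drop = FALSE],
                           family = gaussian(),
                           SL.library = c("SL.glm", "SL.gam", "SL.randomForest", "SL.polymars"))
    Qhat_sl <- as.numeric(sl_fit$SL.predict)   # super-learner predictions of Qbar0(X) for all subjects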

Step 3: Fluctuation Q¯^n(c) of the initial estimator Q¯^n0.

To construct an appropriate fluctuation model, we need to choose an appropriate loss-function. In this article, we consider the quasi-log-likelihood loss function with corresponding logistic fluctuation model. This choice is favored over the squared error loss function with linear fluctuation model [30] because it ensures predictions within the admissible range for the outcome. In particular, suppose Y is known to fall in the interval [a,b] with a<b. To be able to use the quasi-log-likelihood loss functions and the logistic fluctuation model, we rescale the outcome between zero and one as follows:

(11) Y˜ = (Y − a)/(b − a).

The procedure we describe below is based on the transformed outcome Y˜ and results in an estimator ψ˜n of ψ˜0=E(Y˜). Because E(Y)=(b−a)E(Y˜)+a, the final estimator is given by ψˆn=(b−a)ψ˜n+a. For notational convenience, without loss of generality, we will assume that a=0 and b=1 so that Y=Y˜∈[0,1] and we can drop the ˜-notation.

Having thus obtained an estimator for the missingness mechanism, such as gˆnMLE, and an initial estimator Q¯^n0 for the conditional mean outcome taking values in the interval [0,1], we can now fluctuate Q¯^n0 to target bias reduction in the direction of the finite-dimensional parameter α. Below, we consider three fluctuation models that will accomplish this goal with the second also guaranteeing the final estimator to be a substitution estimator, like the TMLE procedure.

  1. Fluctuation model 1. Consider the fluctuation model {Q¯^n0(ε(1)):ε(1)∈ℝp} through the initial estimator (Q¯^n0(0)=Q¯^n0):

    (12)logitQ¯^n0(ε(1))(X)=logitQ¯^n0(X)+ε(1),TH(1)(g^nMLE)(X),

    with H(1)(g^nMLE)(X)=[G'{α^nMLE,Tl(X)}/G2{α^nMLE,Tl(X)}]l(X) and with logit(x)=log{x/(1−x)}. Then define εˆn(1) as the MLE under this parametric submodel, i.e., such that

    ε^n(1) = argmin_{ε(1)} Pn[l(1){Q¯^n0(ε(1))}]

    where l(1) is the quasi-log-likelihood loss function;

    (13) l(1)(Qˉ)(O) = −R[Y log{Qˉ(X)} + (1 − Y) log{1 − Qˉ(X)}].

    It is easily verified that εˆn(1) solves the estimating equation

    (14) Σ_{i=1}^n Ri{Yi − Q¯^n0(ε^n(1))(Xi)} H(1)(g^nMLE)(Xi) = 0,

    which can be solved via standard logistic regression of the observed outcomes on the covariates H(1) using offset logitQ¯^n0. Because the score eq. (14) is like the estimation eq. (9), it targets bias-reduction (in the direction of α) in the sense that was previously described. Define the updated estimator Q¯^n(1)=Q¯^n0(ε^n(1)). Because the quasi-log-likelihood loss function is a valid loss-function for the conditional mean outcome Qˉ0 in the sense that Qˉ0=argminQˉE{l(1)(Qˉ)} (see Gruber and van der Laan [30], Lemma 1), Q¯^n(1) is an improved fit of Qˉ0 as compared to Q¯^n0 with respect to the loss function (13).

  2. Fluctuation model 2. We extend eq. (12) so that the final doubly robust estimator of the target parameter will also be a substitution estimator (also known as a regression doubly robust estimator [12]). For this purpose, define the fluctuation model {Q¯^n0(ε(2)):ε(2)∈ℝp+1} through the initial estimator:

    logitQ¯^n0(ε(2))(X)
    (15)=logitQ¯^n0(X)+ε(2,1),TH(1)(g^nMLE)(X)+ε(2,2)H(2)(g^nMLE)(X),

    ε(2)=(ε(2,1),T,ε(2,2))T and H(2)(gˆnMLE)=1/gˆnMLE. Then define εˆn(2) as the MLE under this parametric submodel, i.e., such that

    ε^n(2) = argmin_{ε(2)} Pn[l(1){Q¯^n0(ε(2))}]

    where l(1) is the quasi-log-likelihood loss function (13). It follows that εˆn(2) solves the estimating equation

    (16) Σ_{i=1}^n Ri{Yi − Q¯^n0(ε^n(2))(Xi)} × (H(1),T(g^nMLE)(Xi), H(2)(g^nMLE)(Xi))T = 0,

    which can be easily solved via standard logistic regression of the observed outcomes on the covariates H(1) and H(2) using offset logitQ¯^n0. By adding H(2) to the fluctuation model, it follows from the second component of eq. (16) that the resulting doubly robust estimator is a substitution estimator (as in the TMLE procedure) because eq. (16) implies that Σ_{i=1}^n Ri/g^nMLE(Xi){Yi − Q¯^n0(ε^n(2))(Xi)} = 0 so that ψ^n{Q¯^n0(ε^n(2)),g^nMLE}=Pn{Q¯^n0(ε^n(2))}. Define the updated estimator Q¯^n(2)=Q¯^n0(ε^n(2)).

  3. Fluctuation model 3. There is evidence in the literature that the use of a weighted loss function in the construction of a fluctuation model can improve the stability of the doubly robust estimator of interest; see for example Robins et al. [12], Díaz and Rosenblum [31] for a comparison of an unweighted versus a weighted loss function in the context of TMLE. We therefore propose the following alternative fluctuation model to eq. (12): {Q¯^n0(ε(3)):ε(3)∈ℝp} through the initial estimator:

    (17)logitQ¯^n0(ε(3))(X)=logitQ¯^n0(X)+ε(3),Tl(X).

    Then define εˆn(3) as the weighted MLE under this parametric submodel

    ε^n(3) = argmin_{ε(3)} Pn[ [G'{α^nMLE,Tl(X)}/G2{α^nMLE,Tl(X)}] l(1){Q¯^n0(ε(3))} ],

    with l(1) the quasi-log-likelihood loss function (13); thus εˆn(3) solves the estimating equations

    (18) 0 = Σ_{i=1}^n Ri{Yi − Q¯^n0(ε^n(3))(Xi)} [G'{α^nMLE,Tl(Xi)}/G2{α^nMLE,Tl(Xi)}] l(Xi),

    which can be easily solved via standard weighted logistic regression of the observed outcomes on the covariates l using offset logitQ¯^n0 (see the R sketch at the end of this step). As before, this targets bias reduction because eq. (18) is like the estimating eq. (9). Define the updated estimator Q¯^n(3)=Q¯^n0(ε^n(3)).

We have thus obtained three different updated estimators Q¯^n(c), c=1,2,3. These all aim to reduce bias in the direction of α. As noted, this is because each of the scores for the fluctuation parameters ε(c) (c=1,2,3) ensures that

Pn{Dα*(Q¯^n(c),g^nMLE)}=0.
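A minimal R sketch of the fluctuation step (and the resulting updated fits), assuming a logistic missingness model fit_g from Step 1 with linear predictor αˆnMLE,Tl(X) for l(X)=(1,x), a data frame dat with columns y (already rescaled to [0,1]), r and x, and an initial estimator Q0 of Qˉ0(X) bounded away from 0 and 1; for logistic G, G'/G2 = exp(−αTl(X)):

    lp   <- predict(fit_g)                 # estimated linear predictor alpha' l(X)
    lmat <- cbind(1, dat$x)                # l(X)
    H1   <- exp(-lp) * lmat                # clever covariate H(1) = [G'/G^2] l(X)

    # Fluctuation model 1: logistic regression with offset, complete cases only
    fluc1 <- glm(y ~ 0 + H1, offset = qlogis(Q0), family = quasibinomial,
                 data = dat, subset = r == 1)
    Q1 <- plogis(qlogis(Q0) + as.vector(H1 %*% coef(fluc1)))

    # Fluctuation model 3: weighted logistic regression on l(X) with offset
    fluc3 <- glm(y ~ 0 + lmat, offset = qlogis(Q0), family = quasibinomial,
                 weights = exp(-lp), data = dat, subset = r == 1)
    Q3 <- plogis(qlogis(Q0) + as.vector(lmat %*% coef(fluc3)))

    # Step 4: one-step doubly robust estimate based on, e.g., the weighted update
    ghat  <- plogis(lp)
    psi_3 <- mean(dat$r / ghat * (ifelse(dat$r == 1, dat$y, 0) - Q3) + Q3)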
Step 4: Estimating the Target Parameter ψˆn(c).

Given the estimators gˆnMLE and Q¯^n(c), we obtain the doubly robust estimator ψ^n(c) ≡ ψ^n(Q¯^n(c),g^nMLE). This is a one-step procedure and does not require iterations. Finally note that by construction (implied by the estimating eq. (16) as a consequence of adding H(2) to the fluctuation model), ψ^n(2)=ψ^n(Q¯^n(2),g^nMLE)=Pn(Q¯^n(2)).

Remark

If in the construction of the missingness model, we take G(x)=expit(x), then G'(x)=G(x){1−G(x)}. For every c=1,2,3, it then follows from the estimating equations for the fluctuation parameter ε(c) that

(19) Σ_{i=1}^n Ri/g^nMLE(Xi) {Yi − Q¯^n(c)(Xi)} = Σ_{i=1}^n Ri{Yi − Q¯^n(c)(Xi)}.

This implies that the doubly robust estimator can be written as

(20) ψ^n(c) = (1/n) Σ_{i=1}^n {RiYi + (1 − Ri)Q¯^n(c)(Xi)},

which averages the observed outcome for the responders and a predicted outcome for the non-responders, making the doubly robust estimator a special type of imputation estimator. This is desirable as it ensures that the doubly robust estimator is sample bounded in the sense that ψˆn(c) lies in the observed data range [12, 20].

4.3 Inference

In Appendix A, we present an asymptotic linearity theorem with corresponding influence function for the doubly robust estimators ψˆn(c) (c=1,2,3) under the assumption of a correctly specified working model for the missingness mechanism but a potentially misspecified working model for the conditional mean outcome, where the latter is estimated via either a parametric model or a data-adaptive learning algorithm. This asymptotic linearity of ψˆn(c) and the corresponding influence function

D*(Qˉ(c)*,g0;ψ0),

with Qˉ(c)* the probability limit of Q¯^n(c), provides us with a strategy for inference about the unknown ψ0 under a correctly specified missingness model. When the outcome model is correctly specified so that Qˉ(c)*=Qˉ0, the influence function equals the efficient influence function D*(Qˉ0,g0;ψ0). When the outcome model is misspecified so that Qˉ(c)*≠Qˉ0, the influence function of the doubly robust estimator equals D*(Qˉ(c)*,g0;ψ0), which may differ from the efficient influence function (indeed, efficiency of ψˆn(c) is only attained locally at Qˉ(c)*=Qˉ0). However, inferences based on this influence function are valid, even though they ignore the uncertainty in the working model for the missingness mechanism. This is because bias-reduced estimation is used in the direction of α implying that

P0Dα*(Qˉ(c)*,g0)=0.

This first-order ancillarity property with respect to α ensures that the resulting doubly robust estimator is insensitive to local changes (of the order one over root-n) in the nuisance parameter α, even under misspecification of the outcome model. Consequently, no correction for the estimation of the missingness mechanism is needed under inconsistency of Q¯^n(c) [8].

Remark

Remember that, when one of the nuisance parameter estimators is not consistent, then the asymptotic distribution of the double-robust estimator changes because of the addition of first-order terms to its expansion. These first-order terms reflect the extent to which the double-robust estimator is affected by the uncertainty in the nuisance parameter estimator(s). A recent proposal by van der Laan [32] targets the estimation of the nuisance parameters so as to make these additional first-order terms asymptotically linear, so that asymptotic 95% confidence intervals can be constructed; in doing so, he assumes that the missingness model is correctly specified (to the extent that it delivers a consistent double-robust estimator of the target parameter in collaboration with the possibly inconsistent estimator of the outcome mean). Our proposal is different in that it makes these additional first-order terms zero, does not assume correct specification of the missingness model regarding the bias reduction property, but also does not allow data-adaptive estimation of the missingness model.

4.3.1 Standard errors

When the missingness model is correctly specified, a standard error for ψˆn(c) can be easily calculated as the square root of the sample variance of the estimated influence function divided by n:

(21) SE^(ψ^n(c)) = √[ var^{D*(Q¯^n(c),g^nMLE;ψ^n(c))} / n ]

with

var^{D*(Q¯^n(c),g^nMLE;ψ^n(c))} = (1/n) Σ_{i=1}^n {D*(Q¯^n(c),g^nMLE;ψ^n(c))(Oi)}2 = (1/n) Σ_{i=1}^n [Ri/g^nMLE(Xi){Yi − Q¯^n(c)(Xi)} + Q¯^n(c)(Xi) − ψ^n(c)]2.
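In R, given the fitted values and the point estimate, the sandwich standard error (21) is immediate (a sketch assuming vectors r, y and ghat, the updated fit Qc and the point estimate psi_c):

    IF <- r / ghat * (ifelse(r == 1, y, 0) - Qc) + Qc - psi_c   # estimated influence function values
    se <- sqrt(mean(IF^2) / length(IF))                          # eq. (21)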

An alternative for standard error calculation, which does not demand the assumption that the missingness model is correctly specified, is to use the non-parametric bootstrap. However, there is no theory supporting that the non-parametric bootstrap would produce valid results when the estimators rely on data-adaptive estimation such as super-learning [32].

4.3.2 Confidence intervals and p-values

Given the estimator SEˆ(ψˆn(c)) for the standard error of ψˆn(c), a confidence interval and p-value can be calculated based on the asymptotic normality of the estimator. A (1−α)100% CI is given by

ψˆn(c)±zα/2SEˆ(ψˆn(c))

where zα/2 is such that Φ(zα/2)=1−α/2 and Φ(·) the cumulative distribution function of a standard normal random variable. A p-value for the hypothesis test H0:ψ0=ψ˜ versus H1:ψ0≠ψ˜ can be calculated as

p = 2[1 − Φ(|ψˆn(c) − ψ˜| / SEˆ(ψˆn(c)))].
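For instance, with psi_c and se as in the sketch above and a hypothesized value psi_tilde (an assumed input), the Wald confidence interval and p-value are obtained in R as:

    level <- 0.95
    ci    <- psi_c + c(-1, 1) * qnorm(1 - (1 - level) / 2) * se
    pval  <- 2 * (1 - pnorm(abs(psi_c - psi_tilde) / se))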

4.3.3 Inference on the original scale

Previously, with Y˜=(Y−a)/(b−a) the transformed outcome taking values in the interval [0,1], we let ψ˜n(c) denote the final estimator for ψ˜0=E(Y˜). Because E(Y)=(b−a)E(Y˜)+a, the final estimator for ψ0=E(Y) can be obtained as ψ^n(c)=(b−a)ψ˜n(c)+a. When √n(ψ˜n(c)−ψ˜0) converges in distribution to N(0,σ˜2), then √n(ψˆn(c)−ψ0) will converge in distribution to N(0,σ2) with σ2=(b−a)2σ˜2. Consequently, a standard error for ψˆn(c) on the original scale of the outcome can be obtained as SE^(ψ^n(c))=(b−a)SE^(ψ˜n(c)).

In Appendix B, we provide an R-function that implements these novel methods.

5 Simulation studies

We carried out different simulation studies to compare the performance of the new bias-reduced doubly robust estimators with several alternatives for the estimation of a mean outcome in the presence of incomplete data.

5.1 Estimators

All estimators are based on a parametric working model gα for the missingness mechanism g0(X). Let αˆnMLE denote the MLE and let gˆnMLE=g(αˆnMLE). For the conditional mean outcome Qˉ0(X), we consider both estimators based on a parametric working model and based on data-adaptive learning algorithms. First, consider a parametric model qγ with ˆγnMLE the MLE and Q¯^nMLE=Q¯(γ^nMLE). Given the parametric working models gα and qγ, let αˆnBR and ˆγnBR denote the nuisance parameter estimators for α and γ obtained via the bias-reduced estimation principle with gˆnBR=g(αˆnBR) and Q¯^nBR=Q¯(γ^nBR). Second, we consider the super-learner Q¯^nSL based on a library consisting of generalized additive and linear models, random forests and adaptive polynomial splines. This can be fitted using the SuperLearner R package [33]. Let Q¯^nMLE,(c)=Q¯^nMLE(ε^n(c)) denote the updated estimators for the conditional mean outcome based on the initial estimator Q¯^nMLE, c=1,2,3 and let Q¯^nSL,(c)=Q¯^nSL(ε^n(c)) denote the updated estimators for the conditional mean outcome based on the initial estimator Q¯^nSL, c=1,2,3. The estimators under consideration are then the doubly robust estimator based on the MLE for the parametric working models ψ^nDR,MLE=ψ^n(Q¯^nMLE,g^nMLE), the bias-reduced doubly robust estimator ψ^nBR=ψ^n(Q¯^nBR,g^nBR), the substitution estimator ψ^nSL=Pn(Q¯^nSL) based on standardizing super-learning predictions (which is not doubly robust), the proposed estimators based on both a parametric working model for the missingness mechanism and a parametric conditional mean model for the outcome, ψ^nDR,MLE,(c)=ψ^n(Q¯^nMLE,(c),g^nMLE), and based on a parametric model for the missingness model but a super-learner for the conditional mean outcome, ψ^nSL,(c)=ψ^n(Q¯^nSL,(c),g^nMLE), for c=1,2,3. Finally, we consider two versions of TMLE, described in van der Laan and Rubin [9] and Gruber and van der Laan [30]. Both are based on the parametric missingness model gα fitted via MLE but the first is based on Q¯^nMLE and the second is based on Q¯^nSL, leading to the TMLEs ψˆnTMLE and ψˆnSL,TMLE, respectively.

Many alternative estimation strategies for these nuisance working models have been considered in the literature (see Section 1.1 for an overview). For a simulation-based comparison of many of these alternatives, we refer to Tan [20], Porter et al. [24] and Vermeulen and Vansteelandt [8].

Remark

In the two simulation settings of the subsequent sections, the outcome Y is not a priori bounded. For those estimators involving a logistic fluctuation model for the initial estimator of the conditional mean outcome, we follow the default setting in the tmle R package [34] by taking the bounds to be the observed range of Y but widened by 10% of both the minimum and maximum value. This differs slightly from the approach in Gruber and van der Laan [30]. Specifically, we let a = min_{i=1,…,n} Yi − 0.1|min_{i=1,…,n} Yi| and b = max_{i=1,…,n} Yi + 0.1|max_{i=1,…,n} Yi| and Y˜=(Y−a)/(b−a). Next, since the initial estimator needs to be represented as a logistic function of its logit transformation, the initial estimator of the conditional mean outcome Qˉ0(X) needs to be bounded away from 0 and 1 because logit(x) is not defined for x=0 or 1. Therefore, this initial estimator is truncated at (ζ,1−ζ) for some small ζ>0. In the simulations, we take ζ=0.005 as in Gruber and van der Laan [30].
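A sketch of this rescaling and truncation in R, assuming the raw outcome vector y (NA when missing) and an initial fit Q0_raw of the conditional mean outcome on the original scale:

    a    <- min(y, na.rm = TRUE) - 0.1 * abs(min(y, na.rm = TRUE))
    b    <- max(y, na.rm = TRUE) + 0.1 * abs(max(y, na.rm = TRUE))
    y_sc <- (y - a) / (b - a)                                    # rescaled outcome in [0, 1]
    zeta <- 0.005
    Q0   <- pmin(pmax((Q0_raw - a) / (b - a), zeta), 1 - zeta)   # bounded initial estimator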

For each of the scenarios considered below, we perform 1,000 Monte Carlo runs at sample sizes of n=200 and 1,000. For each estimator, we calculated the Monte Carlo bias (BIAS), the root mean square error (RMSE) and the Monte Carlo standard deviation (MCSD). For the doubly robust estimators (all but ψˆnSL), we also show the average sandwich standard error (ASSE) calculated via eq. (21), and the Monte Carlo coverage of the corresponding 95% Wald confidence intervals (COV) where standard errors are calculated via formula (21).

5.2 Scenario 1: single-covariate setting

5.2.1 Data-generating mechanism

This simulation scenario considers a simple data-generating mechanism where for each i (i=1,…,n), Xi ∼ N(0,1), Ri|Xi ∼ Ber{π0(Xi)} and Yi|Xi ∼ N{m0(Xi),1}. For each setting, the following parametric working models are used: π(X;α)=expit(α1+α2X) and m(X;γ)=γ1+γ2X. Simulation experiments with correctly specified parametric working models used m0(X)=1+X and π0(X)=expit(ξX) for ξ=1,2. To allow for misspecification in the outcome model, we additionally generated data using m0(X)=X^2 and π0(X)=expit(ξX) for ξ=1,2. To allow for gross misspecification of the missingness mechanism model, we generated data using m0(X)=1+X and π0(X)=expit(−4+1.5|X|^0.5+0.75X+0.5|X|^1.5), as in Vansteelandt et al. [7]. Finally, we also generated data with m0(X)=X^2 and π0(X)=expit(−4+1.5|X|^0.5+0.75X+0.5|X|^1.5) to allow for misspecification of both models. In each of the settings, the target parameter E(Y)=ψ0 equals one. Results for Scenario 1 are given in Table 1 (n=200) and Table 2 (n=1000).

Table 1:

Simulation results based on 1,000 Monte Carlo replications for Scenario 1, n=200.

n = 200

OR correct, MM correct (ξ=1) [left] vs. OR incorrect, MM correct (ξ=1) [right]
EST | BIAS RMSE MCSD ASSE COV | BIAS RMSE MCSD ASSE COV
ψˆnDR,MLE | –0.001 0.13 0.13 0.13 0.94 | –0.018 0.33 0.33 0.29 0.89
ψˆnBR | –0.001 0.13 0.13 0.13 0.94 | –0.030 0.21 0.20 0.17 0.88
ψˆnSL | 0.007 0.13 0.13 – – | –0.085 0.19 0.17 – –
ψˆnTMLE | –0.000 0.13 0.13 0.13 0.94 | –0.036 0.27 0.26 0.24 0.89
ψˆnSL,TMLE | 0.001 0.13 0.13 0.13 0.94 | –0.029 0.17 0.17 0.14 0.89
ψˆnDR,MLE,(1) | 0.005 0.14 0.14 0.13 0.93 | 0.028 0.19 0.19 0.16 0.92
ψˆnDR,MLE,(2) | 0.010 0.14 0.14 0.13 0.93 | 0.037 0.21 0.21 0.18 0.91
ψˆnDR,MLE,(3) | –0.000 0.13 0.13 0.13 0.94 | –0.055 0.22 0.22 0.18 0.85
ψˆnSL,(1) | 0.004 0.14 0.14 0.13 0.93 | –0.000 0.16 0.16 0.14 0.92
ψˆnSL,(2) | 0.010 0.14 0.14 0.13 0.92 | 0.011 0.17 0.17 0.14 0.92
ψˆnSL,(3) | 0.001 0.13 0.13 0.13 0.94 | –0.028 0.17 0.17 0.14 0.89

OR correct, MM correct (ξ=2) [left] vs. OR incorrect, MM correct (ξ=2) [right]
EST | BIAS RMSE MCSD ASSE COV | BIAS RMSE MCSD ASSE COV
ψˆnDR,MLE | –0.002 0.20 0.20 0.17 0.94 | –0.060 1.19 1.19 0.58 0.68
ψˆnBR | –0.001 0.19 0.19 0.15 0.88 | –0.110 0.26 0.24 0.17 0.77
ψˆnSL | 0.018 0.15 0.15 – – | –0.249 0.33 0.22 – –
ψˆnTMLE | 0.017 0.19 0.19 0.15 0.86 | 0.013 0.33 0.33 0.24 0.82
ψˆnSL,TMLE | 0.018 0.19 0.19 0.14 0.85 | –0.020 0.23 0.23 0.16 0.82
ψˆnDR,MLE,(1) | 0.097 0.30 0.28 0.14 0.70 | 0.304 0.53 0.43 0.22 0.65
ψˆnDR,MLE,(2) | 0.110 0.31 0.29 0.14 0.67 | 0.282 0.51 0.43 0.21 0.64
ψˆnDR,MLE,(3) | 0.003 0.17 0.17 0.14 0.88 | –0.187 0.38 0.33 0.19 0.61
ψˆnSL,(1) | 0.090 0.28 0.27 0.14 0.70 | 0.128 0.34 0.32 0.17 0.72
ψˆnSL,(2) | 0.111 0.31 0.29 0.14 0.66 | 0.100 0.35 0.34 0.17 0.69
ψˆnSL,(3) | 0.010 0.18 0.18 0.14 0.86 | –0.093 0.25 0.23 0.15 0.75

OR correct, MM incorrect [left] vs. OR incorrect, MM incorrect [right]
EST | BIAS RMSE MCSD ASSE COV | BIAS RMSE MCSD ASSE COV
ψˆnDR,MLE | –0.008 1.04 1.04 0.76 0.96 | 5.496 11.46 10.06 4.78 0.90
ψˆnBR | –0.003 0.29 0.29 0.22 0.85 | 1.030 1.23 0.68 0.43 0.34
ψˆnSL | 0.030 0.29 0.29 – – | 0.155 0.42 0.39 – –
ψˆnTMLE | 0.010 0.30 0.30 0.49 0.94 | 1.199 1.34 0.60 0.94 0.79
ψˆnSL,TMLE | –0.002 0.29 0.30 0.46 0.94 | 0.312 0.50 0.39 0.55 0.91
ψˆnDR,MLE,(1) | 0.029 0.31 0.31 0.40 0.91 | 0.411 0.65 0.50 0.70 0.85
ψˆnDR,MLE,(2) | 0.035 0.33 0.33 0.38 0.89 | 0.277 0.58 0.51 0.77 0.93
ψˆnDR,MLE,(3) | 0.014 0.30 0.30 0.49 0.95 | 0.595 0.77 0.49 0.86 0.87
ψˆnSL,(1) | 0.032 0.32 0.32 0.38 0.90 | 0.131 0.41 0.39 0.40 0.87
ψˆnSL,(2) | 0.037 0.33 0.33 0.37 0.88 | 0.102 0.41 0.40 0.40 0.87
ψˆnSL,(3) | 0.001 0.30 0.30 0.46 0.94 | 0.155 0.39 0.36 0.53 0.92
Table 2:

Simulation results based on 1,000 Monte Carlo replications for Scenario 1, n=1000.

n = 1000

OR correct, MM correct (ξ=1) [left] vs. OR incorrect, MM correct (ξ=1) [right]
EST | BIAS RMSE MCSD ASSE COV | BIAS RMSE MCSD ASSE COV
ψˆnDR,MLE | 0.003 0.06 0.06 0.06 0.96 | 0.003 0.15 0.15 0.16 0.95
ψˆnBR | 0.003 0.06 0.06 0.06 0.96 | –0.003 0.09 0.09 0.09 0.94
ψˆnSL | 0.005 0.06 0.06 – – | –0.024 0.07 0.07 – –
ψˆnTMLE | 0.003 0.06 0.06 0.06 0.96 | –0.004 0.12 0.12 0.13 0.95
ψˆnSL,TMLE | 0.003 0.06 0.06 0.06 0.96 | –0.003 0.07 0.07 0.07 0.93
ψˆnDR,MLE,(1) | 0.004 0.06 0.06 0.06 0.95 | 0.013 0.09 0.09 0.08 0.93
ψˆnDR,MLE,(2) | 0.005 0.06 0.06 0.06 0.95 | 0.014 0.10 0.10 0.09 0.93
ψˆnDR,MLE,(3) | 0.003 0.06 0.06 0.06 0.96 | –0.011 0.10 0.10 0.09 0.92
ψˆnSL,(1) | 0.004 0.06 0.06 0.06 0.95 | 0.000 0.07 0.07 0.07 0.94
ψˆnSL,(2) | 0.005 0.06 0.06 0.06 0.94 | 0.001 0.07 0.07 0.07 0.94
ψˆnSL,(3) | 0.003 0.06 0.06 0.06 0.95 | –0.003 0.07 0.07 0.07 0.93

OR correct, MM correct (ξ=2) [left] vs. OR incorrect, MM correct (ξ=2) [right]
EST | BIAS RMSE MCSD ASSE COV | BIAS RMSE MCSD ASSE COV
ψˆnDR,MLE | –0.000 0.09 0.09 0.08 0.94 | –0.041 0.50 0.49 0.35 0.78
ψˆnBR | 0.000 0.08 0.08 0.08 0.91 | –0.052 0.14 0.13 0.10 0.80
ψˆnSL | 0.006 0.07 0.07 – – | –0.113 0.15 0.10 – –
ψˆnTMLE | 0.004 0.09 0.09 0.08 0.90 | 0.044 0.19 0.19 0.13 0.85
ψˆnSL,TMLE | 0.004 0.09 0.09 0.07 0.90 | –0.012 0.10 0.10 0.08 0.87
ψˆnDR,MLE,(1) | 0.027 0.12 0.12 0.07 0.78 | 0.168 0.30 0.24 0.13 0.66
ψˆnDR,MLE,(2) | 0.030 0.13 0.12 0.07 0.78 | 0.152 0.27 0.23 0.11 0.64
ψˆnDR,MLE,(3) | 0.001 0.08 0.08 0.08 0.91 | –0.107 0.22 0.19 0.13 0.64
ψˆnSL,(1) | 0.025 0.12 0.12 0.07 0.78 | 0.038 0.13 0.12 0.08 0.79
ψˆnSL,(2) | 0.029 0.13 0.12 0.07 0.77 | 0.027 0.14 0.13 0.08 0.76
ψˆnSL,(3) | 0.001 0.08 0.08 0.07 0.91 | –0.038 0.11 0.10 0.08 0.82

OR correct, MM incorrect [left] vs. OR incorrect, MM incorrect [right]
EST | BIAS RMSE MCSD ASSE COV | BIAS RMSE MCSD ASSE COV
ψˆnDR,MLE | –0.026 0.52 0.52 0.49 0.99 | 7.239 9.64 6.37 4.10 0.39
ψˆnBR | –0.006 0.11 0.11 0.11 0.94 | 1.237 1.28 0.33 0.28 0.01
ψˆnSL | –0.000 0.11 0.12 – – | 0.059 0.16 0.15 – –
ψˆnTMLE | –0.004 0.12 0.12 0.41 1.00 | 1.222 1.26 0.29 0.66 0.37
ψˆnSL,TMLE | –0.005 0.12 0.12 0.39 1.00 | 0.100 0.18 0.15 0.42 1.00
ψˆnDR,MLE,(1) | –0.001 0.12 0.12 0.37 1.00 | 0.601 0.66 0.26 0.83 0.97
ψˆnDR,MLE,(2) | –0.001 0.12 0.12 0.36 1.00 | 0.573 0.63 0.25 0.80 1.00
ψˆnDR,MLE,(3) | 0.003 0.16 0.16 0.42 1.00 | 0.289 0.46 0.36 0.85 0.97
ψˆnSL,(1) | 0.000 0.13 0.13 0.36 1.00 | 0.033 0.14 0.14 0.36 1.00
ψˆnSL,(2) | 0.001 0.12 0.12 0.36 1.00 | 0.035 0.14 0.14 0.35 1.00
ψˆnSL,(3) | 0.003 0.15 0.15 0.40 1.00 | 0.003 0.16 0.16 0.44 1.00

5.2.2 Results

When both working models are correctly specified and weights are not highly variable (ξ=1), all estimators tend to perform similarly in terms of bias and precision. However, when weights become highly variable (ξ=2), estimators ψˆnDR,MLE,(1), ψˆnDR,MLE,(2), ψˆnSL,(1) and ψˆnSL,(2) tend to show some finite-sample bias (n=200), which is resolved when the sample size is increased (n=1000). With highly variable weights, these estimators are also relatively less efficient. When the outcome model is misspecified but the working model for the missingness mechanism is correct, we observe adequate performance for all estimators, but smaller bias and larger precision for the estimators that are based on the super-learner. Furthermore, it appears desirable to use a weighted loss function (as ψˆnDR,MLE,(3) and ψˆnSL,(3) do) as compared to fluctuation models using covariates with the estimated missingness probabilities in the denominator (as for ψˆnDR,MLE,(1), ψˆnDR,MLE,(2), ψˆnSL,(1) and ψˆnSL,(2)). In the case where the missingness model is wrong but the outcome model is correct, all estimators show very similar behavior, except for the standard doubly robust estimator ψˆnDR,MLE which performs poorly. Finally, when both working models are misspecified, then ψˆnBR drastically outperforms ψˆnDR,MLE in terms of bias and precision. The proposed estimator ψˆnSL,(3) performs best.

In conclusion, ψˆnSL,(3) performs best in Scenario 1, both in terms of bias and precision, especially when the inverse probability weights become highly variable. Adding the additional covariate H(2) to the fluctuation model for ψˆnDR,MLE,(2) and ψˆnSL,(2) guarantees a substitution estimator, but does not lead to enhanced performance as compared to ψˆnDR,MLE,(1) and ψˆnSL,(1). This can partly be understood upon noting that ψˆnDR,MLE,(1) and ψˆnSL,(1) already have the form of an imputation estimator (see eq. (20)).

Tables 1 and 2 also show results on the performance of the sandwich standard error calculated using formula (21). The proposed estimator of the sandwich standard error performs well under correct specification of both working models, especially for ψˆnSL,(3). As predicted by the theory, this is also the case when the outcome model is misspecified but the missingness mechanism is correct, but the weights are not highly variable (ξ=1). When the weights become highly variable (ξ=2), the performance is worse. This is not a surprise because convergence to the normal limit distribution then happens more slowly. When the missingness model is misspecified (for both a correctly specified and misspecified outcome model), then the sandwich standard errors overestimate the finite-sample variability of the estimator for all different estimators (except for ψˆnBR).

5.3 Scenario 2: Kang and Schafer setting

5.3.1 Data-generating mechanism

This simulation study is taken from Kang and Schafer [5] and is often used as a benchmark to evaluate doubly robust estimators for the population mean outcome explainable by measured auxiliary covariates. For each individual i=1,…,n, Zi=(Zi1,Zi2,Zi3,Zi4)T ∼ N(0,I), I is the 4×4 identity matrix, Ri|Zi ∼ Ber{π0(Zi)} with π0(Z)=expit(−Z1+0.5Z2−0.25Z3−0.1Z4) and Yi|Zi ∼ N{m0(Zi),1} with m0(Z)=210+27.4Z1+13.7Z2+13.7Z3+13.7Z4. Misspecified working models are linear for the outcome model and logistic for the missingness mechanism model, with covariates Xi=(Xi1,Xi2,Xi3,Xi4)T with X1=exp(Z1/2), X2=Z2/{1+exp(Z1)}+10, X3=(Z1Z3/25+0.6)^3 and X4=(Z2+Z4+20)^2. The target parameter E(Y)=ψ0 equals 210. We limit ourselves to the realistic settings where the working models both use either the covariates Zk or the covariates Xk (k=1,…,4) and thus both working models are correctly specified or both working models are incorrectly specified. We will show simulation results for two scenarios where either R=1 or R=0 denotes the data that are observed and results are shown in Table 3 (n=200) and Table 4 (n=1000).
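For reference, the Kang and Schafer data-generating mechanism and the transformed covariates handed to the misspecified working models can be simulated as follows (a minimal R sketch for a sample of size n):

    n   <- 200
    Z   <- matrix(rnorm(n * 4), n, 4)
    pi0 <- plogis(-Z[, 1] + 0.5 * Z[, 2] - 0.25 * Z[, 3] - 0.1 * Z[, 4])
    r   <- rbinom(n, 1, pi0)
    m0  <- 210 + 27.4 * Z[, 1] + 13.7 * Z[, 2] + 13.7 * Z[, 3] + 13.7 * Z[, 4]
    y   <- ifelse(r == 1, rnorm(n, m0, 1), NA)
    # Covariates used by the misspecified working models
    X1 <- exp(Z[, 1] / 2)
    X2 <- Z[, 2] / (1 + exp(Z[, 1])) + 10
    X3 <- (Z[, 1] * Z[, 3] / 25 + 0.6)^3
    X4 <- (Z[, 2] + Z[, 4] + 20)^2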

Table 3:

Simulation results based on 1,000 Monte Carlo replications for Scenario 2, n=200.

n = 200; left panel: observed outcome RY, right panel: observed outcome (1−R)Y

OR correct, MM correct (both panels)
EST | BIAS RMSE MCSD ASSE COV | BIAS RMSE MCSD ASSE COV
ψˆnDR,MLE | 0.099 2.53 2.53 2.57 0.96 | 0.085 2.53 2.53 2.57 0.96
ψˆnBR | 0.090 2.54 2.54 2.57 0.96 | 0.095 2.54 2.55 2.57 0.95
ψˆnSL | 0.013 2.52 2.52 – – | 0.254 2.55 2.54 – –
ψˆnTMLE | 0.028 2.53 2.53 2.56 0.96 | 0.243 2.55 2.54 2.54 0.95
ψˆnSL,TMLE | 0.028 2.53 2.53 2.56 0.96 | 0.243 2.55 2.54 2.54 0.95
ψˆnDR,MLE,(1) | –0.112 2.73 2.73 2.56 0.94 | 0.326 2.72 2.70 2.55 0.94
ψˆnDR,MLE,(2) | –0.153 2.80 2.80 2.57 0.94 | 0.348 2.77 2.75 2.56 0.94
ψˆnDR,MLE,(3) | 0.029 2.53 2.53 2.56 0.96 | 0.241 2.55 2.54 2.54 0.95
ψˆnSL,(1) | –0.114 2.73 2.72 2.56 0.94 | 0.324 2.71 2.69 2.55 0.94
ψˆnSL,(2) | –0.186 2.80 2.80 2.57 0.94 | 0.365 2.77 2.75 2.56 0.94
ψˆnSL,(3) | 0.025 2.53 2.53 2.56 0.96 | 0.244 2.55 2.54 2.54 0.95

OR incorrect, MM incorrect (both panels)
EST | BIAS RMSE MCSD ASSE COV | BIAS RMSE MCSD ASSE COV
ψˆnDR,MLE | –15.150 88.60 87.34 15.92 0.93 | 4.759 6.05 3.74 3.31 0.60
ψˆnBR | –2.239 4.45 3.85 2.95 0.82 | 3.440 4.63 3.10 2.73 0.73
ψˆnSL | –1.912 3.68 3.15 – – | 4.742 5.83 3.40 – –
ψˆnTMLE | –6.840 8.44 4.94 4.01 0.57 | 4.195 5.40 3.40 2.95 0.65
ψˆnSL,TMLE | –3.221 4.81 3.57 2.89 0.75 | 3.706 5.05 3.43 2.68 0.66
ψˆnDR,MLE,(1) | –5.962 7.83 5.09 3.05 0.50 | 1.447 4.02 3.75 2.90 0.84
ψˆnDR,MLE,(2) | –6.478 8.41 5.37 3.07 0.46 | 1.614 4.23 3.91 2.91 0.83
ψˆnDR,MLE,(3) | –3.263 5.23 4.09 3.28 0.80 | 3.656 4.77 3.07 2.66 0.70
ψˆnSL,(1) | –3.722 6.08 4.81 2.88 0.65 | 2.157 4.16 3.56 2.76 0.80
ψˆnSL,(2) | –3.944 6.41 5.05 2.89 0.64 | 2.293 4.38 3.73 2.77 0.79
ψˆnSL,(3) | –1.970 3.61 3.02 2.67 0.84 | 3.350 4.61 3.17 2.57 0.71
Table 4:

Simulation results based on 1,000 Monte Carlo replications for Scenario 2, n=1000.

n = 1000; left panel: observed outcome RY, right panel: observed outcome (1−R)Y

OR correct, MM correct (both panels)
EST | BIAS RMSE MCSD ASSE COV | BIAS RMSE MCSD ASSE COV
ψˆnDR,MLE | 0.023 1.12 1.12 1.15 0.96 | 0.022 1.13 1.13 1.15 0.96
ψˆnBR | 0.022 1.12 1.12 1.15 0.96 | 0.024 1.13 1.13 1.15 0.96
ψˆnSL | –0.006 1.13 1.13 – – | 0.070 1.13 1.12 – –
ψˆnTMLE | 0.013 1.12 1.12 1.15 0.96 | 0.062 1.13 1.13 1.14 0.96
ψˆnSL,TMLE | 0.011 1.12 1.12 1.15 0.96 | 0.063 1.13 1.13 1.14 0.96
ψˆnDR,MLE,(1) | 0.008 1.12 1.12 1.15 0.96 | 0.070 1.14 1.13 1.14 0.95
ψˆnDR,MLE,(2) | 0.005 1.12 1.12 1.15 0.96 | 0.072 1.14 1.14 1.14 0.95
ψˆnDR,MLE,(3) | 0.013 1.12 1.12 1.15 0.96 | 0.063 1.13 1.13 1.14 0.95
ψˆnSL,(1) | 0.001 1.12 1.12 1.15 0.96 | 0.072 1.13 1.14 1.14 0.95
ψˆnSL,(2) | –0.001 1.12 1.12 1.15 0.96 | 0.078 1.14 1.14 1.14 0.95
ψˆnSL,(3) | 0.011 1.12 1.12 1.15 0.96 | 0.064 1.13 1.13 1.14 0.95

OR incorrect, MM incorrect (both panels)
EST | BIAS RMSE MCSD ASSE COV | BIAS RMSE MCSD ASSE COV
ψˆnDR,MLE | –53.715 469.04 466.19 49.69 0.78 | 4.509 4.83 1.73 1.60 0.19
ψˆnBR | –3.208 3.63 1.69 1.34 0.38 | 2.970 3.25 1.34 1.25 0.35
ψˆnSL | –2.879 3.18 1.52 – – | 3.356 3.74 1.65 – –
ψˆnTMLE | –6.242 6.81 2.73 3.72 0.39 | 4.075 4.35 1.52 1.40 0.19
ψˆnSL,TMLE | –2.281 2.75 1.53 1.53 0.62 | 2.833 3.20 1.48 1.15 0.35
ψˆnDR,MLE,(1) | –3.878 4.55 2.39 1.98 0.51 | 1.442 2.35 1.86 1.35 0.73
ψˆnDR,MLE,(2) | –3.738 4.47 2.45 1.92 0.52 | 1.570 2.46 1.90 1.34 0.70
ψˆnDR,MLE,(3) | –3.351 4.86 1.92 2.82 0.74 | 3.110 3.39 1.34 1.23 0.30
ψˆnSL,(1) | –2.771 3.38 1.94 1.28 0.46 | 1.919 2.38 1.42 1.17 0.60
ψˆnSL,(2) | –3.001 3.60 1.99 1.28 0.41 | 1.981 2.46 1.46 1.17 0.58
ψˆnSL,(3) | –2.188 2.52 1.24 1.21 0.55 | 2.571 2.91 1.37 1.13 0.41

5.3.2 Results

When both working models are correctly specified, all estimators have comparable performance, especially for n=1000. However, ψˆnDR,MLE,(1), ψˆnDR,MLE,(2), ψˆnSL,(1) and ψˆnSL,(2) tend to be slightly more variable than ψˆnDR,MLE,(3) and ψˆnSL,(3) at n=200, again illustrating the enhanced performance obtained by using a weighted loss function for the estimators ψˆnDR,MLE,(3) and ψˆnSL,(3). When both working models are misspecified, as in Kang and Schafer [5], ψˆnDR,MLE shows severe erratic behavior when the observed outcome RY is used; this behavior is partially eliminated when (1−R)Y is used as the observed outcome [12]. The bias-reduced doubly robust estimator from Section 3, ψˆnBR, does not show this severe erratic behavior for either observed outcome, RY or (1−R)Y. Interestingly, the new estimator ψˆnSL,(3) outperforms all existing estimators ψˆnDR,MLE, ψˆnBR, ψˆnTMLE and ψˆnSL,TMLE for both observed outcomes RY and (1−R)Y. When (1−R)Y is used as the observed outcome, ψˆnDR,MLE,(1) and ψˆnSL,(1) show the best performance under double model misspecification.

Tables 3 and 4 also show the performance of the sandwich standard errors (21). Excellent finite-sample behavior is seen when both working models are correctly specified. When both working models are misspecified, the approximation of the sandwich standard errors is better for n=1000 than for n=200, but remains inadequate, as is also the case for TMLE. Unlike in Scenario 1, we observe that they underestimate the true finite-sample variability of the estimators. Overall, the best performance is seen for ψˆnSL,(3).

6 Linear instrumental variable analyses

In this section, we show that the proposed procedure is not restricted to the doubly robust estimator from Section 2 but can be easily extended to other doubly robust estimators. We will do this by illustrating how the principle can be implemented for linear instrumental variable analyses.

6.1 Doubly robust estimation of linear instrumental variable models

Suppose that we are interested in the causal effect of an exposure A on an outcome Y in the presence of unmeasured confounding U, and suppose in particular that for measured covariates X, the vector (XT,UT)T would be sufficient to control for confounding of the effect of A on Y. Suppose that data are available on an instrumental variable (IV) Z [35–37], which is such that (a) Z is associated with A conditional on X, (b) Z is conditionally independent of Y, given (A,X,U), and (c) Z is conditionally independent of U, given X. The observed data are given by the i.i.d. sample o=(O1,…,On) of size n, with P0 the unknown underlying data-generating mechanism of the observables O=(Y,A,Z,X), where we consider both A and Z to be dichotomous. We consider inference for the causal effect ψ0 indexing the linear instrumental variable model

(22)E(Y|A,Z,X,U)=E(Y|A=0,Z,X,U)+ψ0A.

Let m denote the statistical model for P0 implied by assumptions (a)-(b)-(c) and (22). Okui et al. [38] show that a doubly robust estimator of ψ0 can be obtained by solving

(23) 0 = ∑_{i=1}^{n} {Zi − ĝn(Xi)}{Yi − ψAi − Q̄̂n(Xi)}.

Here, ĝn(X) is an estimator of the conditional mean of the instrument given the measured confounders X, g0(X)=E(Z|X), based on a nuisance working model g={g(X) : g in some class of functions}. Let g*(X) denote the probability limit of ĝn(X) (ĝn(X) →p g*(X)), such that g*(X)=g0(X) when m(g) holds, where m(g) represents the statistical model for the joint distribution of O implied by the working model g. Next, Q̄̂n(X) is an estimator of the conditional mean of the outcome given the measured confounders X among the non-exposed, Q̄0(X)=E(Y|A=0,X), based on a nuisance working model q={Q̄(X) : Q̄ in some class of functions}. Let Q̄*(X) denote the probability limit of Q̄̂n(X) (Q̄̂n(X) →p Q̄*(X)), such that Q̄*(X)=Q̄0(X) when m(q) holds, where m(q) represents the statistical model for the joint distribution of O implied by the working model q. Consistency of the doubly robust estimator ψ̂n(Q̄̂n,ĝn) is then attained under the model m ∩ {m(q) ∪ m(g)}. At the intersection model m ∩ m(q) ∩ m(g), it has the following expansion

(24) ψ̂n(Q̄̂n,ĝn) − ψ0 = (Pn − P0){D*IV(Q̄0,g0;ψ0)} + op(1/√n),

with influence function D*IV(Q̄0,g0;ψ0) given by

(25) D*IV(Q̄0,g0;ψ0)(O) = {Z − g0(X)}{Y − Q̄0(X)} / E[A{Z − g0(X)}] − ψ0.

Note that by construction,

Pn[DIV*{Q¯^n,g^n;ψ^n(Q¯^n,g^n)}]=0.
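Because eq. (23) is linear in ψ, it can be solved in closed form once fitted values for g0(X) and Q̄0(X) are available. The following R sketch (a minimal illustration, not the paper's implementation; the vectors Y, A, Z and the fitted values g.hat and Qbar.hat, as well as the helper name dr.iv, are assumptions) computes the resulting estimator together with a simple plug-in standard error based on the empirical variance of the influence function (25).

# Illustrative sketch only: solves eq. (23) for psi and computes a plug-in
# standard error from the empirical analogue of the influence function (25).
dr.iv <- function(Y, A, Z, g.hat, Qbar.hat){
  num <- sum((Z - g.hat) * (Y - Qbar.hat))   # sum of {Z_i - g.hat(X_i)}{Y_i - Qbar.hat(X_i)}
  den <- sum((Z - g.hat) * A)                # sum of {Z_i - g.hat(X_i)} A_i
  psi <- num/den
  # empirical influence function values based on eq. (25)
  D <- (Z - g.hat) * (Y - Qbar.hat)/mean((Z - g.hat) * A) - psi
  se <- sd(D)/sqrt(length(Y))
  list(est = psi, se = se)
}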

6.2 Practical implementation of the proposed procedure

The proposed data-adaptive bias-reduced doubly robust procedure works as follows:

Step 1: Estimator ĝnMLE for the conditional mean g0(X) of the instrument.

Postulate a parametric working model for g0(X): gα={g(α)(X) | α ∈ Rp} where g(α)(X)=G{αTl(X)}, G an appropriate inverse link function, e.g., G(·)=expit(·), and l=(1,l1,…,lp−1)T. Obtain the MLE α̂nMLE using the estimating function DgMLE(α)(O) given in eq. (6) but with R replaced by Z. Let ĝnMLE=g(α̂nMLE).

Step 2: Initial estimator Q̄̂n0 for the conditional mean outcome among the non-exposed, Q̄0.

We again consider two options: (1) Postulate a parametric working model for Q̄0(X): qγ={Q̄(γ)(X) | γ ∈ Rq} where Q̄(γ)(X)=Q{γTk(X)}, Q an appropriate inverse link function and k=(1,k1,…,kq−1)T. Obtain the MLE γ̂nMLE using the estimating function DQMLE(γ)(O) given in eq. (5) but with R replaced by 1−A. Let Q̄̂nMLE=Q̄(γ̂nMLE); (2) Use a super-learner based on a library {Q̄̂j | j=1,…,J}, fitted among those with A=0, to obtain a fit Q̄̂nSL. Let the initial estimator Q̄̂n0 denote either Q̄̂nMLE or Q̄̂nSL. A minimal R sketch of Steps 1 and 2 is given below.
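The sketch below is an assumption-laden illustration (the data frame dat with columns Y, A, Z and covariates X1-X4 is hypothetical, and the super-learner library is arbitrary); it mirrors the missing-data code of Appendix B, with R replaced by Z in Step 1 and by 1−A in Step 2.

library(SuperLearner)
# Step 1: parametric MLE for g0(X) = E(Z | X)
gfit <- glm(Z ~ X1 + X2 + X3 + X4, family = binomial, data = dat)
ghat <- fitted(gfit)
# Step 2, option (1): parametric working model for Qbar0(X) = E(Y | A = 0, X)
# (as in Appendix B, Y would first be rescaled to [0,1] before Step 3)
qfit <- lm(Y ~ X1 + X2 + X3 + X4, data = dat, subset = (A == 0))
Qbar0.init <- predict(qfit, newdata = dat)
# Step 2, option (2): super-learner fitted among the non-exposed, predicted for all
sl <- SuperLearner(Y = dat$Y[dat$A == 0],
                   X = dat[dat$A == 0, c("X1","X2","X3","X4")],
                   newX = dat[, c("X1","X2","X3","X4")],
                   SL.library = c("SL.glm", "SL.gam", "SL.mean"))
Qbar0.init <- as.vector(sl$SL.predict)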

Step 3: Fluctuation Q̄̂n1 of the initial estimator Q̄̂n0.

Following the same argument as in Step 3 of Section 4.2, we assume that Y ∈ [0,1]. In this step, we fluctuate the initial estimator Q̄̂n0 by means of a parametric fluctuation model Q̄̂n0(ε) such that the score with respect to ε targets bias reduction in the direction of α. For this purpose, define the fluctuation model {Q̄̂n0(ε) : ε ∈ Rp} through the initial estimator (Q̄̂n0(0)=Q̄̂n0):

(26) logit Q̄̂n0(ε)(X) = logit Q̄̂n0(X) + εT Ĥ(ĝnMLE)(Z,X),

with

Ĥ(ĝnMLE)(Z,X) = −G′{α̂nMLE,T l(X)} l(X) + [Z − G{α̂nMLE,T l(X)}] Ŵ(ĝnMLE),
Ŵ(ĝnMLE) = [n^{−1} ∑_{j=1}^{n} Aj G′{α̂nMLE,T l(Xj)} l(Xj)] / [n^{−1} ∑_{j=1}^{n} Aj {Zj − G(α̂nMLE,T l(Xj))}],

where Ĥ(ĝnMLE)(Z,X) is chosen in such a way that the score for the fluctuation parameter ε under this fluctuation model equals the derivative of the influence function D*IV{Q̄̂n0(ε),g(α)} with respect to the parameter α, and thus targets bias reduction in the direction of α. Define ε̂n = argminε Pn[−ℓ(1){Q̄̂n0(ε)}], the MLE under this parametric submodel, with ℓ(1) the quasi-log-likelihood function (13). This can be obtained via standard logistic regression of the outcome on the covariates Ĥ using offset logit Q̄̂n0, after which the updated estimator is defined as Q̄̂n1 = Q̄̂n0(ε̂n). As mentioned before, by construction of the fluctuation model, it follows that ε̂n solves

Pn{Dα,IV*(Q¯^n1,g^nMLE)}=0,

with D*α,IV{Q̄,g(α)} = ∂D*IV{Q̄,g(α)}/∂α, thereby targeting bias reduction in the direction of α. A minimal R sketch of this fluctuation step is given below.
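The following sketch continues the assumed names from the previous sketch (dat, ghat, Qbar0.init), with Y.star denoting the outcome rescaled to [0,1] and lX, Gprime, W and H being illustrative helper names. It builds the covariate Ĥ for a logistic working model g(α)(X)=expit{αTl(X)} and fits the fluctuation regression with an offset; because D*α,IV involves Y − Q̄(X) for every subject, the regression here uses all observations rather than only the non-exposed.

# Sketch under assumed names: for the expit link, G'(u) = expit(u){1 - expit(u)},
# which evaluates to ghat*(1 - ghat) at alpha-hat.
n <- nrow(dat)
A <- dat$A; Z <- dat$Z
lX <- cbind(1, dat$X1, dat$X2, dat$X3, dat$X4)        # design l(X) used for g in Step 1
Gprime <- ghat * (1 - ghat)                           # G'{alpha-hat^T l(X)}
W <- colSums(A * Gprime * lX) / sum(A * (Z - ghat))   # ratio of empirical means in W-hat
H <- -Gprime * lX + (Z - ghat) * matrix(W, nrow = n, ncol = ncol(lX), byrow = TRUE)
# Qbar0.init is assumed here to be the initial fit of the rescaled outcome Y.star,
# truncated away from 0 and 1 as in Appendix B
fluc <- glm(Y.star ~ -1 + H, family = binomial, offset = qlogis(Qbar0.init))
Qbar1 <- plogis(qlogis(Qbar0.init) + as.vector(H %*% coef(fluc)))   # updated estimator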

Step 4: Estimating the target parameter ψ̂n.

Given the estimators ĝnMLE and Q̄̂n1, we obtain the doubly robust estimator ψ̂n ≡ ψ̂n(Q̄̂n1,ĝnMLE) by solving eq. (23), as sketched below.
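As a hedged illustration that ties the previous sketches together (a and b are the rescaling bounds assumed for Y.star, dr.iv is the illustrative helper from the end of Section 6.1, and ghat and Qbar1 come from the sketches above; none of these names are from the paper), the updated fit is mapped back to the original outcome scale and plugged into the closed-form solution of eq. (23):

Qbar1.orig <- (b - a) * Qbar1 + a                    # undo the [0,1] rescaling of Step 3
fit <- dr.iv(Y = dat$Y, A = dat$A, Z = dat$Z, g.hat = ghat, Qbar.hat = Qbar1.orig)
fit$est    # data-adaptive bias-reduced doubly robust estimate of psi0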

7 Discussion

In this article, we have proposed an extension of the bias-reduced doubly robust estimation principle of Vermeulen and Vansteelandt [8]. In particular, we relaxed the restriction to parametric nuisance working models by making use of data-adaptive learning algorithms to estimate the conditional mean outcome. We continue to work with a parametric working model for the missingness mechanism for the inferential problem of Section 2, partly in order to prevent potential positivity violations. However, because the concern of bias is greater for such parametric models, we applied the bias-reduction principle of Section 3.2 in the direction of the nuisance parameters indexing these parametric working models. We furthermore illustrated that the proposed extension is not restricted to doubly robust estimation of the mean outcome susceptible to missingness explainable by measured covariates, but also carries over to other doubly robust procedures, as illustrated in Section 6.

This new procedure follows the spirit of the TMLE procedure of van der Laan and Rubin [9] in the sense that it also updates an initial data-adaptive estimator of the relevant part Q̄0(X) of the conditional outcome distribution, so as to enhance the performance of the estimator of the target parameter. Fluctuation of initial parametric estimators is also encountered in other contexts, e.g., Bang and Robins [6], Tan [20], Rotnitzky et al. [21].

In the implementation of Section 4.2, we could additionally have considered a fluctuation gˆnMLE(εg) of the initial estimator gˆnMLE for the missingness mechanism so as to target additional bias reduction in the direction of the fluctuation parameters ε used in the construction of the fluctuation model Q¯^n0(ε). In preliminary simulation studies, we observed this to be unstable and sometimes non-convergent.

A limitation of our proposal is that the current asymptotic linearity theorem, presented in Appendix A, only guarantees valid inference under a correct working model for the missingness mechanism (but a possibly misspecified finite- or infinite-dimensional outcome working model). This is because misspecification of the working model for the missingness mechanism (with correct specification of the outcome model) would demand acknowledging the uncertainty of the estimator for the outcome working model, so as to make the remainder term Rn in the expansion (2) second-order and thereby obtain valid inference. It is not clear how to accomplish this when using data-adaptive learning algorithms. In Appendix A, we show how this can be done when a parametric working model for the conditional mean outcome is used. An alternative for standard error calculation, which does not demand the assumption of a correctly specified missingness model, is the non-parametric bootstrap. This, however, lacks supporting theory when the estimators rely on data-adaptive estimation [32] and will thus only provide valid results if one can claim asymptotic linearity of the data-adaptive estimator for the conditional mean outcome model. van der Laan [32] proposes a strategy to obtain valid inference under misspecification of one of the two working models in the context of the TMLE procedure. This is accomplished by additionally fluctuating initial estimators of the working models such that the scores with respect to the corresponding fluctuation parameters guarantee that the remainder term in the expansion (2) is asymptotically linear. In the context of the current paper, this would demand adding an additional covariate to the fluctuation models considered in Section 4.2. Further research is needed to see how this can be integrated within the proposed approach.


Article note

Karel Vermeulen is a Research Fellow of the Fund for Scientific Research (FWO).


Appendix

A Asymptotic Linearity Theorem

We present an asymptotic linearity theorem with corresponding influence function for the doubly robust estimators ψˆn(c) (c=1,2,3) of Section 4 under the assumption of a correctly specified working model for the missingness mechanism but a potentially misspecified working model for the conditional mean outcome, which can be either finite- or infinite-dimensional.

Theorem 1 (Asymptotic linearity of the proposed doubly robust estimator)

With the notation given in the previous section, let Q̄(c)* denote the probability limit of Q̄̂n(c) and let gMLE* denote the probability limit of ĝnMLE; thus Q̄̂n(c) →p Q̄(c)* and ĝnMLE →p gMLE*. Because we assume a correctly specified model for the missingness mechanism, gMLE*=g0. This implies that P0{D*(Q̄(c)*,gMLE*;ψ0)}=0. By definition of the proposed estimation strategy, we have that

Pn{D*(Q¯^n(c),g^nMLE;ψ^n(c))}=0,
PnDgMLE(αˆnMLE)=0,
Pn{Dα*(Q¯^n(c),g^nMLE)}=0.

We impose the following regularity conditions:

Donsker class condition

Suppose that the set D1 = {D*(Q̄,g;ψ0) : Q̄, g} is a P0-Donsker class, where (Q̄,g) varies over a set containing the sequences (Q̄̂n(c),ĝnMLE) and (Q̄(c)*,ĝnMLE) with probability tending to one.

Consistency condition of D

Assume that

P0[{D*(Q̄̂n(c),ĝnMLE;ψ0) − D*(Q̄(c)*,gMLE*;ψ0)}²] → 0,
P0[{D*(Q̄(c)*,ĝnMLE;ψ0) − D*(Q̄(c)*,gMLE*;ψ0)}²] → 0,

in probability as n → ∞.

Glivenko-Cantelli class condition

Suppose that D2 = {Dα*(Q̄,g) : Q̄, g} is a P0-Glivenko-Cantelli class, where (Q̄,g) varies over a set containing the sequence (Q̄̂n(c),ĝnMLE) with probability tending to one.

Consistency condition of Dα

With Dα* = (D*α,1,…,D*α,p)T, assume that

P0[{D*α,i(Q̄̂n(c),ĝnMLE) − D*α,i(Q̄(c)*,gMLE*)}²] → 0

in probability as n → ∞, for i=1,…,p.

Second-order term condition

Define the second-order term

Rn* = P0[R{Q̄̂n(c)(X) − Q̄(c)*(X)}{ĝnMLE(X) − gMLE*(X)}/{ĝnMLE(X) gMLE*(X)}]

and assume Rn* = op(1/√n). Note that this second-order term involves the product of the differences Q̄̂n(c)(X) − Q̄(c)*(X) and ĝnMLE(X) − gMLE*(X).

Then, ψ̂n(c) is an asymptotically linear estimator of ψ0 at P0 with influence function D*(Q̄(c)*,g0;ψ0). That is,

ψ̂n(c) − ψ0 = (Pn − P0)D*(Q̄(c)*,g0;ψ0) + op(1/√n).

In particular, √n(ψ̂n(c) − ψ0) converges in distribution to a normal distribution with mean zero and variance σ0² = P0[{D*(Q̄(c)*,g0;ψ0)}²].

Proof. By definition of the proposed estimator, we know that

Pn{D*(Q¯^n(c),g^nMLE;ψn(c))}=0.

Furthermore, because we assume the model for the missingness mechanism to be correctly specified, we have that P0{D(Qˉ(c),gMLE;ψ0)}=0. From this, it follows that

ψ̂n − ψ0 = (1/n) ∑_{i=1}^{n} [Ri/ĝnMLE(Xi) {Yi − Q̄̂n(c)(Xi)} + Q̄̂n(c)(Xi) − ψ0]
= Pn{D*(Q̄̂n(c),ĝnMLE;ψ0)}
= Pn{D*(Q̄̂n(c),ĝnMLE;ψ0)} − P0{D*(Q̄(c)*,gMLE*;ψ0)}
= (Pn − P0)D*(Q̄(c)*,gMLE*;ψ0)
+ Pn{D*(Q̄̂n(c),ĝnMLE;ψ0) − D*(Q̄(c)*,gMLE*;ψ0)}
= (Pn − P0)D*(Q̄(c)*,gMLE*;ψ0)
(27) + Pn{D*(Q̄̂n(c),ĝnMLE;ψ0) − D*(Q̄(c)*,ĝnMLE;ψ0)}
(28) + Pn{D*(Q̄(c)*,ĝnMLE;ψ0) − D*(Q̄(c)*,gMLE*;ψ0)}.

Let us first consider the term (27). We have

Pn{D*(Q̄̂n(c),ĝnMLE;ψ0) − D*(Q̄(c)*,ĝnMLE;ψ0)} = (Pn − P0){D*(Q̄̂n(c),ĝnMLE;ψ0) − D*(Q̄(c)*,ĝnMLE;ψ0)} + P0{D*(Q̄̂n(c),ĝnMLE;ψ0) − D*(Q̄(c)*,ĝnMLE;ψ0)}

and from the Donsker class condition and the consistency condition of D*, it follows that

(Pn − P0){D*(Q̄̂n(c),ĝnMLE;ψ0) − D*(Q̄(c)*,ĝnMLE;ψ0)} = op(1/√n).

Next consider the term P0{D*(Q̄̂n(c),ĝnMLE;ψ0) − D*(Q̄(c)*,ĝnMLE;ψ0)}. We have that

P0{D*(Q̄̂n(c),ĝnMLE;ψ0) − D*(Q̄(c)*,ĝnMLE;ψ0)}
= P0{D*(Q̄̂n(c),gMLE*;ψ0) − D*(Q̄(c)*,gMLE*;ψ0)}
+ P0{D*(Q̄̂n(c),ĝnMLE;ψ0) − D*(Q̄(c)*,ĝnMLE;ψ0)
− D*(Q̄̂n(c),gMLE*;ψ0) + D*(Q̄(c)*,gMLE*;ψ0)}.

We have that

P0{D*(Q̄̂n(c),ĝnMLE;ψ0) − D*(Q̄(c)*,ĝnMLE;ψ0) − D*(Q̄̂n(c),gMLE*;ψ0) + D*(Q̄(c)*,gMLE*;ψ0)} = P0(R/gMLE*(X) [{Y − Q̄(c)*(X)} − {Y − Q̄̂n(c)(X)}] − R/ĝnMLE(X) [{Y − Q̄(c)*(X)} − {Y − Q̄̂n(c)(X)}]) = P0[R{1/gMLE*(X) − 1/ĝnMLE(X)}{Q̄̂n(c)(X) − Q̄(c)*(X)}] = Rn*,

so that

P0{D*(Q̄̂n(c),ĝnMLE;ψ0) − D*(Q̄(c)*,ĝnMLE;ψ0)}
= P0{D*(Q̄̂n(c),gMLE*;ψ0) − D*(Q̄(c)*,gMLE*;ψ0)} + Rn*

and by assumption, the second-order term Rn* = op(1/√n). Finally, note that

P0{D*(Q̄̂n(c),gMLE*;ψ0) − D*(Q̄(c)*,gMLE*;ψ0)}
= −P0[{Q̄̂n(c)(X) − Q̄(c)*(X)}{R − gMLE*(X)}/gMLE*(X)],

which is op(1/√n) under the assumption that gMLE*(X) equals the true missingness mechanism g0(X). We conclude that eq. (27) is op(1/√n).

We next consider the second term (28), which is the contribution of the estimation of the missingness mechanism. From the way we estimated the fluctuation parameter ε, it will follow that we can ignore this term; that is, (28) is op(1/√n), even under misspecification of the estimator of Q̄0. Define the function Υ(c)(α) = D*{Q̄(c)*,g(α);ψ0}. It follows that (28) equals Pn{Υ(c)(α̂nMLE) − Υ(c)(α0)}. We know that α̂nMLE is an asymptotically linear estimator for α0 with influence function −P0{Dα,gMLE(α0)}⁻¹ DgMLE(α0), that is,

α̂nMLE − α0 = −(Pn − P0) P0{Dα,gMLE(α0)}⁻¹ DgMLE(α0) + op(1/√n),

where Dα,gMLE(α0) = ∂DgMLE(α)/∂αT |α=α0. Consider the Taylor expansion

Υ(c)(α̂nMLE) − Υ(c)(α0) = Υ(c),αT(α0)(α̂nMLE − α0) + op(‖α̂nMLE − α0‖)

with Υ(c),α(α0) = ∂Υ(c)(α)/∂α |α=α0. Note that by the asymptotic linearity of α̂nMLE, op(‖α̂nMLE − α0‖) = op(1/√n). Consequently,

Pn{Υ(c)(α̂nMLE) − Υ(c)(α0)} = Pn{Υ(c),αT(α0)}(α̂nMLE − α0) + op(1/√n) = −(Pn − P0) P0{Υ(c),αT(α0)} P0{Dα,gMLE(α0)}⁻¹ DgMLE(α0) + op(1/√n),

where the last equality follows from the asymptotic linearity of α̂nMLE and the fact that Pn{Υ(c),αT(α0)} = P0{Υ(c),αT(α0)} + op(1). In principle, when the probability limit Q̄(c)* is different from the truth Q̄0, this term would contribute to the influence function of ψ̂n(c). However, by definition of ε̂n(c) and since Q̄̂n(c) = Q̄̂n0(ε̂n(c)), we have that Pn{Dα*(Q̄̂n(c),ĝnMLE)} = 0 for every n. From the Glivenko-Cantelli class condition and the consistency condition of Dα*, it then follows that Pn{Dα*(Q̄̂n(c),ĝnMLE)} →p P0{Dα*(Q̄(c)*,gMLE*)} = P0{Υ(c),α(α0)}. Hence, it follows that P0{Υ(c),α(α0)} = 0. We conclude that eq. (28) is op(1/√n).

Putting everything together, we find that

ψ̂n(c) − ψ0 = (Pn − P0)D*(Q̄(c)*,g0;ψ0) + op(1/√n),

showing asymptotic linearity of ψˆn(c).□

Remark

When the updated estimator Q̄̂n(c) (c=1,2,3) is based on a parametric model for the initial estimator, the Donsker class condition will be satisfied. Furthermore, when the updated estimator Q̄̂n(c) (c=1,2,3) is based on the super-learner Q̄̂nSL as an initial estimator and each of the estimators in the library falls in a Donsker class, the Donsker class condition will also be satisfied for Q̄̂nSL, because a convex combination of such estimators also falls in that class [39]. In short, the Donsker class condition is satisfied if it holds for each of the estimators in the library of the super-learner. Examples of such estimators are given in van der Laan [32].

Theorem 1 establishes asymptotic linearity of the doubly robust estimators ψ̂n(c) (c=1,2,3) under model m(g). To obtain asymptotic linearity under model m(q) (so that Q̄(c)*=Q̄0), one should assume asymptotic linearity of Q̄̂n(c) in the sense that P0{D*(Q̄̂n(c),gMLE*;ψ0) − D*(Q̄(c)*,gMLE*;ψ0)} = (Pn − P0)D*Q̄(c)* + op(1/√n) (see van der Laan and Rose [19], p. 572 for a discussion concerning this assumption). In this case, the influence function of ψ̂n(c) under model m(q) becomes D*(Q̄(c)*,gMLE*;ψ0) + D*Q̄(c)*. When the initial estimator Q̄̂n0 for the conditional mean outcome Q̄0 is parametric, such as Q̄(γ)(X)=Q{γTl(X)}, and γ is estimated via a root-n consistent estimator such as the MLE, asymptotic linearity of ψ̂n(c) (c=1,2,3) under model m(q) can be shown in a straightforward manner, as we argue next. In this case, the updated estimator Q̄̂n(c) is based on a parametric model Q̄(θ(c)) for the conditional mean outcome, with dimension equal to the dimension of θ(c)=(γT,ε(c),T)T. Let θ̂n(c) denote the corresponding estimator of θ(c) with probability limit θ(c)*. When the parametric model for the conditional mean outcome is correctly specified, θ(c)*=(γ0T,0T)T. Under standard regularity conditions, θ̂n(c) is asymptotically linear with influence function DQ̄(θ(c)*); that is, θ̂n(c) − θ(c)* = (Pn − P0)DQ̄(θ(c)*) + op(1/√n). We now show that under m(q), equation (27) is not op(1/√n), because then gMLE* ≠ g0. For this purpose, define the function ξ(c)(θ(c),α) = D*{Q̄(θ(c)),g(α);ψ0}. It follows from a Taylor expansion and the asymptotic linearity of θ̂n(c) that eq. (27) equals

Pn{ξ(c)(θ̂n(c),α̂nMLE) − ξ(c)(θ(c)*,α̂nMLE)}
= Pn{ξ(c),θ(c)T(θ(c)*,α̂nMLE)}(θ̂n(c) − θ(c)*) + op(‖θ̂n(c) − θ(c)*‖)
= Pn{ξ(c),θ(c)T(θ(c)*,α̂nMLE)}(Pn − P0)DQ̄(θ(c)*) + op(1/√n),

with ξ(c),θ(c)T(θ(c)*,α̂nMLE) = ∂ξ(c)(θ(c),α̂nMLE)/∂θ(c) |θ(c)=θ(c)*. Under suitable regularity conditions (Robins et al. [40], app. B), it follows from the uniform weak law of large numbers that Pn{ξ(c),θ(c)T(θ(c)*,α̂nMLE)} converges in probability to P0{ξ(c),θ(c)T(θ(c)*,αMLE*)}. This shows that eq. (27) can be written as

(Pn − P0) P0{ξ(c),θ(c)T(θ(c)*,αMLE*)} DQ̄(θ(c)*) + op(1/√n).

Upon replacing α0 by αMLE* in the remainder of the proof of Theorem 1, it follows that the influence function of ψ̂n(c) under model m(q) equals D*(Q̄(c)*,gMLE*;ψ0) + P0{ξ(c),θ(c)T(θ(c)*,αMLE*)}DQ̄(θ(c)*). For this specific setting we thus have that D*Q̄(c)* = P0{ξ(c),θ(c)T(θ(c)*,αMLE*)}DQ̄(θ(c)*). Local ancillarity with respect to γ can, however, still be obtained by implementing bias reduction in the direction of γ as well, enforcing this term to be zero (see Section 3).

Remark

Under model m(g), αMLE*=α0 and hence P0{ξ(c),θ(c)(θ(c)*,α0)}=0. We may hence conclude from the derivation above (but now with the probability limit θ(c)* not necessarily equal to (γ0T,0T)T) that eq. (27) is op(1/√n) for a potentially misspecified parametric initial estimator of Q̄0, provided the parametric model for the missingness mechanism is correctly specified. This argument is more insightful than the Donsker class condition and the second-order term condition.

B R-function

Below, we provide an R-function to obtain the data-adaptive bias-reduced doubly robust estimators ψˆn(c) to estimate the mean outcome E(Y)=ψ0 in the presence of incomplete data for both a linear regression working model and a super-learner working model for the initial estimator of the conditional mean outcome and with a logistic regression working model for the propensity score.

As input, the function uses the missingness indicator R, the outcome Y, the auxiliary covariates cov, the estimation method for the initial estimator of the conditional mean outcome type.initQ=c("par","SL") (either "par" for a parametric linear regression model or "SL" for a super-learner), the level zeta at which the initial estimator is truncated, the type of loss function for the logistic fluctuation model fluc=c("unweighted","weighted"), the level of significance alpha, and the null value psi.tilde of a hypothesis test for the mean. As output, the function delivers the estimate est, which equals ψ̂n(c), the standard error se, which equals SE^(ψ̂n(c)) based on eq. (21), a (1−α)100% Wald confidence interval ci, the Wald statistic Wald.statistic and the corresponding p-value p.value for a test of the null hypothesis H0: ψ0=ψ̃. We also provide an example in which the procedure is applied to a random dataset, obtained via the Kang and Schafer data-generating mechanism.

data.adaptive.biasreduced.DR <-
  function(R, Y, cov, type.initQ = c("par","SL"),
           zeta = 0.005, fluc = c("unweighted","weighted"),
           alpha = 0.05, psi.tilde = 0){
    expit <- function(x){1/(1+exp(-x))}
    logit <- function(x){log(x/(1-x))}
    n <- length(R)
    dat.cov <- data.frame(cov)
    int.cov <- cbind(rep(1,n), cov)
    colnames(dat.cov) <- paste("cov.", 1:dim(cov)[2], sep="")
    # propensity score (logistic regression working model)
    mler <- glm(R ~ cov, family="binomial")
    ps.par <- predict(mler, type="response")
    # initial conditional mean outcome, after mapping Y to [0,1]
    a <- min(Y[R==1]) - 0.1*abs(min(Y[R==1]))
    b <- max(Y[R==1]) + 0.1*abs(max(Y[R==1]))
    Y.star <- (Y-a)/(b-a)
    {
      if(type.initQ=="par"){
        mley <- lm(Y.star ~ cov, subset=(R==1))
        initQ <- predict(mley, newdata=dat.cov)
      }
      else if(type.initQ=="SL"){
        Ym.star <- Y.star[R==1]
        dat.cov.m <- dat.cov[R==1, , drop=FALSE]
        SL.library <- c("SL.glm", "SL.randomForest",
                        "SL.gam", "SL.polymars", "SL.mean")
        initQ <- SuperLearner(Y=Ym.star, X=dat.cov.m,
                              newX=dat.cov, verbose=FALSE,
                              SL.library=SL.library,
                              method="method.NNLS")$SL.predict
      }
    }
    initQ.trunc <- ifelse(initQ < zeta, zeta,
                          ifelse(initQ > 1-zeta, 1-zeta, initQ))
    # fluctuation of the initial estimator
    {
      if(fluc=="unweighted"){
        w.cov <- (1-ps.par)/ps.par*int.cov
        fluctuationQ <- glm(Y.star ~ -1 + w.cov,
                            family=binomial, offset=logit(initQ.trunc),
                            subset=(R==1))
        flucQ <- expit(logit(initQ.trunc) +
                         as.vector(coef(fluctuationQ) %*% t(w.cov)))
      }
      else if(fluc=="weighted"){
        fluctuationQ <- glm(Y.star ~ cov,
                            family=binomial, offset=logit(initQ.trunc),
                            subset=(R==1), weights=(1-ps.par)/ps.par)
        flucQ <- expit(logit(initQ.trunc) +
                         as.vector(coef(fluctuationQ) %*% t(int.cov)))
      }
    }
    # doubly robust estimator
    U <- function(R, Y, outcome, ps){
      outcome + R/ps*(Y-outcome)
    }
    est.trunc <- mean(U(R=R, Y=Y.star, outcome=flucQ, ps=ps.par))
    psi <- (b-a)*est.trunc + a
    # sandwich standard error, eq. (21)
    se.psi <- (b-a)*sd(U(R=R, Y=Y.star, outcome=flucQ, ps=ps.par))/sqrt(n)
    # (1-alpha)100% Wald confidence interval
    ci.psi <- psi + c(-1,1)*qnorm(1-alpha/2)*se.psi
    # Wald test statistic
    W <- (psi-psi.tilde)/se.psi
    # p-value Wald test
    p.value <- 2*pnorm(abs(W), lower.tail=FALSE)
    return(list(est=psi, se=se.psi, ci=ci.psi,
                Wald.statistic=W, p.value=p.value))
  }

Example

library(SuperLearner)

# expit is defined here as well, since the version inside
# data.adaptive.biasreduced.DR is local to that function
expit <- function(x){1/(1+exp(-x))}

gen.dataKS <- function(k, n, mech=c("normal","reverse"), spec=c("C","I")){
  set.seed(k)
  z1 <- rnorm(n); z2 <- rnorm(n)
  z3 <- rnorm(n); z4 <- rnorm(n)
  z <- cbind(z1, z2, z3, z4)
  colnames(z) <- paste("Z.", 1:4, sep="")
  x1 <- exp(0.5*z1)
  x2 <- z2/(1+exp(z1)) + 10
  x3 <- (0.04*z1*z3 + 0.6)^3
  x4 <- (z2 + z4 + 20)^2
  y <- rnorm(n, 210 + 27.4*z1 + 13.7*z2 + 13.7*z3 + 13.7*z4, sd=1)
  r <- rbinom(n, 1, expit(-1.5*z1 + 0.75*z2 - 0.375*z3 - 0.15*z4))
  if(spec=="C"){
    x1 <- z1; x2 <- z2
    x3 <- z3; x4 <- z4
    x <- z
  }
  if(mech=="reverse"){
    r <- 1-r
  }
  data <- data.frame(r, y, x1, x2, x3, x4)
  names(data) <- c("r", "y", "x.1", "x.2", "x.3", "x.4")
  data
}

data <- gen.dataKS(1, n=1000, mech="reverse", spec="I")
R <- data$r
Y <- data$y
Cov <- cbind(data$x.1, data$x.2, data$x.3, data$x.4)
data.adaptive.biasreduced.DR(R=R, Y=Y, cov=Cov,
                             type.initQ="SL", zeta=0.005,
                             fluc="weighted", alpha=0.05, psi.tilde=210)

References

1. Robins JM, Rotnitzky A. Comment on the Bickel and Kwon article, "Inference for semiparametric models: some questions and an answer". Stat Sinica 2001;11:920–36.
2. Tsiatis AA, Davidian M, Zhang M, Lu X. Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: a principled yet flexible approach. Stat Med 2008;27:4658–77. doi:10.1002/sim.3113.
3. Moore KL, van der Laan MJ. Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation. Stat Med 2009;28:39–64. doi:10.1002/sim.3445.
4. Vermeulen K, Thas O, Vansteelandt S. Increasing the power of the Mann-Whitney test in randomized experiments through flexible covariate adjustment. Stat Med 2015;35:1012–30. doi:10.1002/sim.6386.
5. Kang JD, Schafer JL. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Stat Sci 2007a;22:523–39. doi:10.1214/07-STS227.
6. Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics 2005;61:962–72. doi:10.1111/j.1541-0420.2005.00377.x.
7. Vansteelandt S, Bekaert M, Claeskens G. On model selection and model misspecification in causal inference. Stat Methods Med Res 2012;21:7–30. doi:10.1177/0962280210387717.
8. Vermeulen K, Vansteelandt S. Bias-reduced doubly robust estimation. J Am Stat Assoc 2015;110:1024–36. doi:10.1080/01621459.2014.958155.
9. van der Laan MJ, Rubin DB. Targeted maximum likelihood learning. Int J Biostat 2006;2(1), Article 11. doi:10.2202/1557-4679.1043.
10. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol 2007;6. doi:10.2202/1544-6115.1309.
11. Ridgeway G, McCaffrey D. Comment: demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Stat Sci 2007;22:540–3. doi:10.1214/07-STS227C.
12. Robins JM, Sued M, Lei-Gomez Q, Rotnitzky A. Comment: performance of double-robust estimators when "inverse probability" weights are highly variable. Stat Sci 2007;22:544–59. doi:10.1214/07-STS227D.
13. Tan Z. Comment: understanding OR, PS and DR. Stat Sci 2007;22:560–8. doi:10.1214/07-STS227A.
14. Tsiatis AA, Davidian M. Comment: demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Stat Sci 2007;22:569–73. doi:10.1214/07-STS227B.
15. Kang JD, Schafer JL. Rejoinder: demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Stat Sci 2007b;22:574–80. doi:10.1214/07-STS227REJ.
16. Rubin DB, van der Laan MJ. Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis. Int J Biostat 2008;4:1–40. doi:10.2202/1557-4679.1084.
17. Cao WH, Tsiatis AA, Davidian M. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika 2009;96:723–34. doi:10.1093/biomet/asp033.
18. Tsiatis AA, Davidian M, Cao W. Improved doubly robust estimation when data are monotonely coarsened, with application to longitudinal studies with dropout. Biometrics 2011;67:536–45. doi:10.1111/j.1541-0420.2010.01476.x.
19. van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data. Springer Series in Statistics, 2011. doi:10.1007/978-1-4419-9782-1.
20. Tan Z. Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika 2010;97(3):661–82. doi:10.1093/biomet/asq035.
21. Rotnitzky A, Lei QH, Sued M, Robins JM. Improved double-robust estimation in missing data and causal inference models. Biometrika 2012;99:439–56. doi:10.1093/biomet/ass013.
22. van der Laan MJ, Gruber S. Collaborative double robust targeted maximum likelihood estimation. Int J Biostat 2010;6. doi:10.2202/1557-4679.1181.
23. Rotnitzky A, Vansteelandt S. Double-robust methods. In: Molenberghs G, Fitzmaurice G, Kenward MG, Tsiatis AA, Verbeke G, editors. Handbook of missing data methodology. Boca Raton: CRC Press, 2015.
24. Porter KE, Gruber S, van der Laan MJ, Sekhon JS. The relative performance of targeted maximum likelihood estimators. Int J Biostat 2011;7. doi:10.2202/1557-4679.1308.
25. Rubin DB. Inference and missing data. Biometrika 1976;63:581–90. doi:10.1093/biomet/63.3.581.
26. Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semiparametric nonresponse models. J Am Stat Assoc 1999;94:1096–120. doi:10.1080/01621459.1999.10473862.
27. Bickel PJ, Klaassen CA, Ritov Y, Wellner JA. Efficient and adaptive estimation for semiparametric models. Baltimore: The Johns Hopkins University Press, 1993.
28. Cox DR. Local ancillarity. Biometrika 1980;67:279–86. doi:10.1093/biomet/67.2.279.
29. Petersen ML, Porter KE, Gruber S, Wang Y, van der Laan MJ. Diagnosing and responding to violations in the positivity assumption. Stat Methods Med Res 2010;21:31–54. doi:10.1177/0962280210386207.
30. Gruber S, van der Laan M. A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. Int J Biostat 2010;6(1), Article 26. doi:10.2202/1557-4679.1260.
31. Díaz I, Rosenblum M. Targeted maximum likelihood estimation using exponential families. Int J Biostat 2014;11(2):233–51. doi:10.1515/ijb-2014-0039.
32. van der Laan MJ. Targeted estimation of nuisance parameters to obtain valid statistical inference. Int J Biostat 2014;10:29–57. doi:10.1515/ijb-2012-0038.
33. Polley E, van der Laan M. Super Learner prediction: the SuperLearner package, 2014.
34. Gruber S, van der Laan M. Targeted maximum likelihood estimation: the TMLE package, 2014.
35. Robins J. Correcting for noncompliance in randomized trials using structural nested mean models. Commun Stat Theor Methods 1994;23:2379–412. doi:10.1080/03610929408831393.
36. Wooldridge J. Econometric analysis of cross section and panel data. Cambridge, MA: The MIT Press, 2002.
37. Hernán MA, Robins J. Instruments for causal inference: an epidemiologist's dream? Epidemiology 2006;17:360–72. doi:10.1097/01.ede.0000222409.00878.37.
38. Okui R, Small DS, Tan Z, Robins JM. Doubly robust instrumental variable regression. Stat Sinica 2012;22:173–205. doi:10.5705/ss.2009.265.
39. van der Vaart AW, Wellner JA. Weak convergence and empirical processes. New York: Springer-Verlag, 1996. doi:10.1007/978-1-4757-2545-2.
40. Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc 1994;89:846–66. doi:10.1080/01621459.1994.10476818.

Published Online: 2016-5-26
Published in Print: 2016-5-1

©2016 by De Gruyter
