Skip to main content
Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Third-variable effect analysis with multilevel additive models

  • Qingzhao Yu ,

    Contributed equally to this work with: Qingzhao Yu, Bin Li

    Roles Formal analysis, Funding acquisition, Methodology, Software, Writing – original draft, Writing – review & editing

    qyu@lsuhsc.edu

    Affiliation Biostatistics Program, School of Public Health, Louisiana State University Health Science Center, New Orleans, LA, United States of America

  • Bin Li

    Contributed equally to this work with: Qingzhao Yu, Bin Li

    Roles Formal analysis, Methodology, Software, Writing – review & editing

    Affiliation Department of Experimental Statistics, Louisiana State University, Baton Rouge, LA, United States of America

Abstract

Third-variable effect refers to the effect transmitted by third-variables that intervene in the relationship between an exposure and a response variable. Third-variable effect analysis has been broadly studied in many fields. However, it remains a challenge for researchers to differentiate indirect effect of individual factor from multiple third-variables, especially when the involving variables are of hierarchical structure. Yu et al. (2014) defined third-variable effects that were consistent for all different types of response (categorical or continuous), exposure, or third-variables. With these definitions, multiple third-variables can be considered simultaneously, and the indirect effects carried by individual third-variables can be separated from the total effect. In this paper, we extend the definitions of third-variable effects to multilevel data structures, where multilevel additive models are adapted to model the variable relationships. And then third-variable effects can be estimated at different levels. Moreover, transformations on variables are allowed to present nonlinear relationships among variables. We compile an R package mlma, to carry out the proposed multilevel third-variable analysis. Simulations show that the proposed method can effectively differentiate and estimate third-variable effects from different levels. Further, we implement the method to explore the racial disparity in body mass index accounting for both environmental and individual level risk factors.

Introduction

A third-variable effect (TVE) refers to the effect conveyed by a third-variable to an observed relationship between an exposure and a response variable of interest. Depend on whether there is a causal relationship from the exposure variable to the third-third variable and then to the response, the third-variable (denoted as M) is often called a mediator (when there are causal relationships) or a confounder (no causal relationship is involved). Research purposes for the third-variable analysis are generally, 1) identify significant third-variables that can partially or completely explain the relationship between the exposure variable (X) and the outcome (Y); and 2) differentiate the TVE from different paths that connect between X and Y. Except for the difference in explaining the effects from a mediator and from a confounder, the TVE analysis method can be used to make inferences on both mediation and confounding effects [1]. Third-variable analysis has been widely used in psychology, social sciences, behavioral research, health prevention, epidemiological studies, and genetic analysis. Within the context of generalized linear models, general predictive models, and counterfactual framework, a number of methods have been proposed for inferences on TVEs, to name a few, [27].

In research practices, the experiment data are often collected hierarchically. For example, to identify variables that are related with childhood obesity, we consider both environmental (e.g. walkability of the neighborhood) and individual factors (e.g. snacking habit). When hierarchical databases are considered, third-variable analysis method based on generalized linear models are usually not readily adaptable since the independence assumption among observations is violated. In such cases, hierarchical models, also known as multilevel or mixed-effect models, are more appropriate to fit relationships among variables since these models can catch dependencies among observations and allow for predictors from different levels of the data [8]. In this paper, we propose a third-variable analysis method based on multilevel models. For the hierarchical model, we assume there are two levels of data and refer the individual level as level 1, and the group level as level 2. Although more than two levels of hierarchy is possible, this paper focuses on two-level databases only.

In third-variable analysis, besides the pathway that directly connect the exposure variable with the outcome, we explore the exposurethird-variableresponse or XMY pathways. When doing third-variable analysis with multilevel models, the level of the variables at the left of the arrow should be higher than or equal to that of the right, since a group level variable may affect an individual level variable but not vice versa [9]. In this setting, only the 2 → 2 → 2, 2 → 2 → 1, 2 → 1 → 1, and 1 → 1 → 1 relationships are legible. Multilevel models are necessary to deal with hierarchical data base even for the 1 → 1 → 1 relationship. [10] studied the bias brought by using single level models when data are hierarchical. [9, 1113] proposed third-variable analysis methods for all three types of multilevel models. Moreover, [1416] proposed alternative methods to test the indirect effects in 2 → 2 → 1, 2 → 1 → 1 models. In addition, [17] proposed to use Bayesian third-variable analysis to deal with hierarchical databases.

In this paper, we propose to use generalized additive multilevel models for third-variable analysis with hierarchical databases. The major contributions of this proposed method are that: 1) we extend the general definitions of TVEs proposed by [7] to multilevel data structure; 2) the method allows any types (categorical or continuous) of the exposure(s), third-variables and response variable(s) in exploring TVEs; 3) TVEs from individual or group of third-variables and from different levels are differentiable; 4) nonlinear associations among variables are allowed in calculating the TVEs; 5) multiple exposure(s) from different levels and multivariate outcomes are allowed in the third-variable analysis, and 6) an R package mlma was developed for the method proposed in this paper. Practitioners can easily implement the method in real data analysis.

The rest of the paper is organized as follows. In Section 2, we review the general definitions of TVEs and then extend these concepts to the multilevel model situation. We also review the generalized additive multilevel models (GAMM) that are used to model relationships among variables. Based on that, we propose the multilevel third-variable analysis with GAMM. In Section 3, we illustrate the use of the proposed method in different multilevel data structures and the usage of the mlma R package. We then use the method in a real data example in Section 4: to explore the ethnic disparity in obesity considering both individual and environmental risk factors. Section 5 gives a summary of the proposed method and points out future research directions.

2 Third-variable analysis with multilevel additive models

[7] proposed general definitions for TVEs. First we recapitulate the definitions and then propose the multilevel additive models in third-variable analysis with hierarchical databases.

2.1 Definitions of third-variable effects with single-level data

The conceptual model for TVE of one level is shown in Fig 1. In the Figure, X is the exposure variable, Y is the outcome, M = {M1, …, Mp} is the vector of p third-variables, and Z is the vector of other variables which relate with Y but do not intermediate the XY relationship. As usual, the TVE includes total effect—the overall effect from the exposure variable X to the outcome Y, direct effect—the remaining effect from X to Y after accounting for effects form third-variables, and indirect effect from Mi—the effect from the path XMiY.

thumbnail
Fig 1. Conceptual model for one-level third-variable effects.

https://doi.org/10.1371/journal.pone.0241072.g001

Basically, [7] defines the total effect as the changing rate in the outcome when the exposure variable changes. Compared with the traditional definition of TVEs, which are defined as the amount of change of the outcome when the exposure variable changes from one status to another, the Yu et al.’s definition has the following benefits:

  1. The total effect is scale invariant to the unit of the exposure variable. This is because the effect is defined as the changing rate but not as the changing amount of Y with X.
  2. There is no need to define the two status of the exposure variable. Therefore the total effects can be calculated not only for binary but also for continuous exposures.
  3. The total effect can be calculated with nonlinear models when the relationship among variables are changing at different values of the exposure variable. The total effect can be expressed as a function of the exposure variable by the Yu definition [7].

The direct effect not from Mi is similarly defined as the total effect except that the relationship between the exposure variable and Mi is manipulatively broken. In such case, Mi are controlled not to change with X: Mi follows its marginal distribution from the observations. The indirect effect from Mi is then defined as the total effect minus the direct effect not from Mi. All the benefits of the definition of total effect are also fit for the direct and indirect effects. In additon, multiple third-variables can be considered simultaneously, and the indirect effects carried by individual third-variables can be separated from the total effect. For the formal definitions of TVEs, readers are referred to [7] and [18]. [7] showed that the proposed definitions of TVEs are equivalent to the conventional TVEs under certain situations (e.g., linear regression models with continuous third-variables and outcome). They also established the relationship between the proposed definitions of direct or indirect effect and the natural direct or indirect effect for binary exposure variables. Later, the definitions have been implemented to explore racial and ethnic health disparities by [19, 20] and extended to deal with time-to-event outcomes ([21]) and for multiple exposures and multivariate outcomes ([22]). In this paper, we extend the method to the context of multilevel models.

2.2 Definitions of third-variable effects with data of two levels

The unique data structure in multilevel models raises the potential problem of confounding TVEs from different levels. As pointed out in [13], the relationship between two level-one variables can be decomposed into between-group and within-group components. In particular, the aggregated variables at the second level can be highly related while the relationship may be very weak or even at an opposite direction when considered at the individual level [23]. For example, [24] points out that the proportion of black residents may be an important variable for the census tract, while it is different from the meaning of ethnicity as an individual-level variable. [25] discussed the difference of the two components extensively. It is important to differentiate the between-group and within-group components in third-variable analysis, where the TVEs can be decomposed to level 1 and level 2 effects. To identify the level 1 and level 2 TVEs separately, [13] proposed the group-mean centering method (CWC), where they subtracted the group means from individual level variables and added group means as level 2 covariates. In their paper, [13] showed that the CWC method efficiently separated level 1 and level 2 TVEs and resulted in less bias and more power compared with non-centering methods. In this paper, we use a different way to estimate the level 1 and level 2 TVEs: extend the definitions of TVEs with single level models by [7] to multilevel models.

With the generalized definition of TVEs, [22] has shown that a third-varaible analysis can involve multiple exposure variables and multivariate outcomes. The purpose of third-variable analysis is to differentiate the direct effect and indirect effect from each third-variable for each pair of the exposure-outcome relationship. If the outcome is at level 2, all exposure and mediators have to be level 2 as discussed in Section 1. Therefore, a single-level third-variable analysis works. If the outcome is at level 1, the exposure variable can be a level 1 or level 2 variable. The third-variables can be level 1 or 2 for a level 2 exposure, but have to be level 1 for a level 1 exposure variable. In this paper, we focus on level 1 outcomes.

Denote Mij = (Mij1, …, MijK) as the vector of K potential level 1 third-variables for the ith object at the jth group. Mij,−k is the vector Mij excluding the kth element. Denote M.j = (M.j1, …, M.jL) as the vector of the L potential level 2 third-variables or level 1 third-variables aggregated at level 2 within group j. Let Mijk(xij) be a random variable that has a conditional distribution given Xij = xij. For an exposure variable X at any level, let u* be the minimum unit of X, such that if xdomain(X), then x + u* ∈ domain(X). For now, we ignore other covariates Z. Assume effects of exposures and third-variables on the outcome are additive, we have the general definitions of TVEs, following [7], for level 1 (Definition 1) and level 2 exposure variables (Definition 2). Note that a level 1 exposure can have only level 1 mediators while a level 2 exposure can have both level 1 and level 2 mediators.

Definition 1. For a level 1 exposure variable X, the level 1 total effect (TE1) of X on Y, the level 1 direct effect () of X on Y not from level 1 third-variable Mk and the level 1 indirect effect of X on Y through Mk at X = xij () are defined as: (1) (2) (3)

The average level one TVEs are the mean value of the TVEs defined by Definition 1: ATE1 = Eij[TE1(xij)], ADE1,\k = Eij[DE1,\k(xij)] and AIE1,k = ATE1ADE1,\k.

Definition 2. For a level 2 exposure variable X, the level 2 total effect (TE2) of X on Y, the level 2 direct effect () of X on Y not from the level 1 third variable Mk and level 2 third variable Ml, and the level 2 indirect effect of X on Y through Mk and Ml at X = x.j () are defined as: (4) (5) (6) (7) (8)

The average level 2 TVEs are the TVEs defined by Definition 2 averaged at the group level: ATE2 = Ej[TE2,j(x.j)], AIE21,k = Ej[IE21,jk(x.j)] and AIE22,l = Ej[IE22,jl(x.j)].

2.3 Multilevel additive models

We use multilevel additive models to build relationships among variables of hierarchy. The additive model is a nonlinear regression method that was first proposed by [26]. A multilevel additive model can deal with both nonlinear covariate effects and cluster-specific heterogeneity [27]. It is now gaining rapid popularity in psychological and social research [28]. Using the notations in Section 2.2, assume that we have L level 2 and K level 1 third-variables. In addition, assume that there are E1 level 1 and E2 level 2 exposure variables, denoted as and respectively. We propose the following linear additive multilevel models for multilevel third-variable analysis. The boldfaced letter indicates a potential vector of functions or numbers.

For level 2 third-variables, M.jl, l = 1, …, L:

For level 1 third-variables, Mijk, k = 1, …, K:

The full model:

In the models, r0jk and r0j are second-level random errors with mean zero and finite variances. f(⋅) is a function/transformation vector of ⋅. The transformation enables modeling nonlinear relationships among variables. We assume that all the transformation functions are first-order differentiable. α and β are coefficient vectors for transformed variables in predicting the response variable on the left side of each equation. In addition, g(⋅)s are the link functions that link the expected response variable with the right-hand-side of each equation, the systematic component of a generalized linear model. For example, a binary Mijk may have a link function g1k = logit(P(Mijk = 1)). Similarly, a link function can be used on the outcome. With the link function, we can deal with different types of third-variables and outcomes.

2.4 Third-variable effects with multilevel additive model

Based on the definitions of TVEs, we derive the TVEs in Theorems 1 and 2. In the theorems, f′(x) denotes the first derivative of function f on the random variable X and realized at X = x. In addition, g−1 denotes the inverse function of g. We further denote μijk = E(Mijk) and μ.jk = E(M.jk). The proofs of theorems are included in the S1 File.

Theorem 1 With the relationships among variables built by Section 2.3, the TVEs for level 1 exposure variable Xije, e = 1, …, E1 on the outcome variable Y are:

Theorem 2. With the relationships among variables built by Section 2.3, the TVEs for level 2 exposure variable X.je, e = 1, …, E2 on the outcome variable Y are:

2.5 Bootstrap method for third-variable effect inferences

Finally, we use the bootstrap method to calculate the variances of the TVEs. In particular, at the group level, a bootstrap sample of the same size for each group is drawn with replacement from the original data set. Then multilevel additive models are fitted based on the bootstrap sample to get the estimates of βs, based on which the TVEs can be calculated by Theorems 1 and 2. The above process repeats many times and inferences can be made based on the repeated estimates. To draw the bootstrap sample, we keep all groups at the second level and then at the individual level, we draw observations with replacement of size nj from the jth group, where nj is the number of observations in the jth group. This bootstrap method is adopted in the R package mlma to estimate the variances of estimates and to build up confidence intervals. The mlma package is available from the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/web/packages/mlma/index.html and illustrated by the simulations in Section 3.

2.6 The mlma R package

The authors developed a R package, mlma, for multilevel third-variable analysis. The analysis is based on multilevel additive models where nonlinear transformations of variables are allowed. The package mlma contains three groups of functions: function data.org is used to prepare data sets for analysis—transforming variables, dichotomizing categorical third-variables, and getting the derivatives of the transformation functions. The functions mlma and boot.mlma are used for statistical inferences on the TVEs. The former estimates the TVEs and the latter generates bootstrap samples from the original data sets and does the third-variable analysis based on the bootstrap samples. The estimates of TVEs from the bootstrap samples are then used for statistical inferences. The third group of functions are generic functions—print, summary and plot results from the mlma and boot.mlma functions. The details of how to use the functions are fully documented within the package and the mlma vignette. The analysis in Section 3 are all performed using the package.

3 Simulations

In this section, we show the performance of the proposed method and illustrate the use of the R package mlma. The first simulation is based on a 2 → 1 → 1 true model adapted from [13] but it is extended to include exposure and third-variables at both levels. There is one exposure variable at each level and one third-variable at each level. The level 1 exposure and level 2 third-variable are binary. No transformation is performed on the original variables. The first simulation is to exam how the sensitivity and specificity of identifying important third-variables are influenced by different sets of parameters.

The second simulation is based on nonlinear multilevel models with both level exposures and one level 1 continuous third-variable. The second simulation is to demonstrate how to include transformation terms of variables in the multilevel mediation analysis and how to use the graph tools in the R package mlma to illustrate relationships among variables.

The simulations are performed using the R package mlma. The R codes are included in the S1 File.

3.1 Simulation 1

The first simulation includes one level 1 binary exposure, Xij, and one level 2 continuous exposure, X.j, where j denotes the jth group and i denotes the ith observation in the jth group. There are 600 observations in total and n observations in each group. Therefore, there are 600/n groups. To check how group sizes and number of groups can influence the sensitivity of identifying important TVEs, we set n at 5 and 20 respectively. Xij are generated through binomial distribution with the probability of success 0.5. X.j are generated through a standard normal distribution. There are a level 2 binary third variable, M.j, and one level 1 continuous third-variable, Mij. In detail, the simulation data are generated from the following models for i = 1, …, n and j = 1, …, 600/n: where u0jN(0, 0.5), ϵ0ijN(0, 1), u1jN(0, v2) and ϵ1ijN(0, v1) are independently generated random errors at both levels. The level two random error for the response variable is set as one-fifth of the level one variance—v2 = v1/5, which makes the intraclass correlation (ICC) to be.17. As pointed out by [29], this medium ICC value facilitates model convergence. v2 is chosen to be 5 or 1, β1 and β2 ranges within the set {−.59, −.14, 0, 0.14, 0.39} and β3 and β4 from {0, 0.14, 0.59}. We then check the influence of these parameters on the sensitivity and specificity of identifying important TVEs. The data with each combination of parameters are generated 20 times. The sensitivity of a TVE is the proportion of times that the estimated confidence interval of the effect does not include 0 when the actual effect is not 0. Relatively, the specificity is defined as the proportion of times that the estimated confidence interval of the effect includes 0 when the true effect is 0.

For the level 2 exposure, the average direct effect for X.jYij (denoted as de2), the indirect effect from M.j for X.jM.jYij (denoted as ie2.2), the indirect effect from Mij for X.jMijYij (denoted as ie2.1), and the total effect (te2) are estimated.

For the level 1 exposure, the average direct effect for XijYij (denoted as de1), the indirect effect from Mij for XijMijYij (denoted as ie2.1), and the total effect (te1) are estimated.

Fig 2 shows the sensitivity (or 1-specificity when β2 = 0) of identifying significant level 2 direct effect, when βs change but n is fixed at 20 and v1 at 1. The x-axis is β2, the true level 2 direct effect. Different color and line type in each plot represents different values of β1. Each row of the 3 × 3 panel in Fig 2 is a different value for β4, each column is for a different value of β3. We found that the sensitivity of correctly identifying important level 2 direct effect depends on the value of other TVEs. This is mainly due to the increased correlations among variables that results in raised variances in the estimation.

thumbnail
Fig 2. Estimates of level 2 direct effect from simulation 1.

The x-axis is the true direct effect, and y-axis is the sensitivity (or 1-specificity when true de is 0). Different color and line type represents different value for β1. Rows are for β4 = 0, 0.14 and 0.59, and columns are for β3 = 0, 0.14 and 0.59 respectively.

https://doi.org/10.1371/journal.pone.0241072.g002

Fig 3 shows the sensitivity or 1-specificity (when β1 = 0) of identifying important level 1 direct effect. The sensitivity is averaged over β3 and β4 for different values of β1, v1 and n. We see that the sensitivities reduce a lot when the variances are bigger. However, values of n have minimum influence on the sensitivity.

thumbnail
Fig 3. Estimates of level 1 direct effect from Simulation 1.

The x-axis is the true direct effect, and y-axis is the sensitivity (or 1-specificity when the true de is 0). Different color and line type represents different value for β2. Each plot is for a different combination of n and v1.

https://doi.org/10.1371/journal.pone.0241072.g003

Finally, Figs 4 and 5 show the sensitivity and specificity of identifying indirect effects. For both Figures, we fix v1 at 1. Fig 4 shows how the sensitivity of level 2 indirect effects of Mij and M.j changes with β3 and β4. The left panel is the sensitivity and 1-specificity (when β3 = 0) of identifying Mij. The level 2 indirect effect of Mij is proportional to β3, so the sensitivity increases with β3. The sensitivity is minimally influenced by the value of β4. The right panel is the sensitivity and 1-specificity (when β4 = 0) of identifying M.j. Again, the sensitivity of identifying M.j is proportional to β4, but is not systematically influenced by the value of β3.

thumbnail
Fig 4. Estimates of level 2 indirect effect from simulation 1.

The left panel is the sensitivity and 1-specificity of identifying Mij. The right panel is the sensitivity and 1-specificity of identifying M.j.

https://doi.org/10.1371/journal.pone.0241072.g004

thumbnail
Fig 5. Estimates of indirect effects of Mij from simulation 1.

The left panel is for the level 1 indirect effect and the right panel is for the level 2 indirect effect of Mij.

https://doi.org/10.1371/journal.pone.0241072.g005

Fig 5 shows how the actual direct effect influences the estimation of the indirect effect of Mij. The left panel is for the level 1 indirect effect and the right panel is for the level 2 indirect effect. For both panels, the black solid line is the 1-specificity and the other two lines are the sensitivity. The sensitivity for level 1 indirect effect increases with β3 and that for level 2 indirect effect increases with β4. The x-axis is the true level 1 or 2 direct effect for the left and right panel respectively. We see very small influence of the true direct effect on the estimation of the indirect effect.

3.2 Simulation 2

The second simulation is to illustrate the use of the mlma package with transformations in the variables. In the simulation, there is only one level 2 third-variable, Mij and the variable is causally proceeded by two exposure variables, XijN(0, 0.5) at the first level and Xjχ2(df = 2) at the second level. In the simulation, there are 30 groups, and 20 observations generated for each group. That is i = 1, …, 20 and j = 1, …, 30. The data are generated from the following model: where ϵ1ijN(0, 1), ϵijN(0, 2), u0jN(0, 0.4) and uoj1N(0, 0.5) are independently generated as the level 1 and level 2 random errors respectively. For the data generation mechanism, the true level 1 direct effect is 0.98 and the true level 2 direct effect is de2(xj) = 0.98/xj. In addition, the true level 1 indirect effect for M at xij is 1.2 × 0.98 × xij and the true level 2 indirect effect is 1.2 × 0.98 = 1.176, where the former depends on the value of xij representing a nonlinear relationship between M and Xij and the latter represents a strong (1.176) constant indirect effect.

Fig 6 shows the estimated direct effects at both levels from the simulated data. We see that the estimated direct effect correctly delineates the nonlinear relationship among variables.

thumbnail
Fig 6. Estimates of direct effects of at both levels for simulation 2.

https://doi.org/10.1371/journal.pone.0241072.g006

The mlma package provides a plot function to show relationships among variables. Fig 7 shows the results of the plot function without specifying a specific third-variable. The Figure shows the relative effect (defined as the TVE divided by the total effect) at each exposure variable.

thumbnail
Fig 7. Relative effect at different exposure variables for simulation 2.

https://doi.org/10.1371/journal.pone.0241072.g007

Fig 8 shows results of the plot function with the third-variable M. The left panel is the effect of M with the level 1 exposure denoted as x1 and the right panel is for the level 2 exposure which is denoted as x2. The first row of Fig 8 gives the estimated indirect effect with confidence intervals. The second row shows the changing rate of M with the exposure variables and the third row gives the changing rate of the outcome with the third-variable M. Again, the figures correctly describe relationships among variables.

thumbnail
Fig 8. Indirect effect of M at different exposure variables for simulation 2.

https://doi.org/10.1371/journal.pone.0241072.g008

The codes for simulating data set and transforming variables in the third-variable analysis are available in the S1 File.

4 Explore the racial disparity in obesity

Finally, we implement the method in a real data example: to explore the racial disparity in body mass index (BMI). In a previous research, we found that on average, non-Hispanic blacks have a higher BMI and higher rate of obesity compared with non-Hispanic whites [19, 22] using the 2003-2006 National Health and Nutrition Examination Survey (NHANES). We are interested to see how the disparities can be explained by both individual and environmental factors. Environmental risk factors are generated at the census tract level, which include both food environments and physical activity environments. Individual level variables include age, gender, smoking status, etc. Readers are referred to [19, 22] for more details about the variables.

In this demonstration, we try to use risk factors to explain the racial disparity in BMI. We first use multiple additive regression trees to describe the relationships between BMI and all risk factors, and then we performed a data transformation on the risk factors so that the transformed variables have a roughly linear relationship with BMI. For the individual level risk factors, we did the following transformations:

  1. The natural cubic spline bases for age with degrees of freedom of 4.
  2. Truncate the physical activity measurement to two parts: smaller than 2.1 and larger than 2.1 since we see a change point at 2.1 when depicting the relationship between physical activity and BMI.

We also use the natural cubic spline basis with different degrees of freedom on some of the environmental factors. Please refer to the S1 File to see how we make the transformations. The individual level exposure variable is the race (black (0) and white (1)). The census tract level exposure is the proportion of whites in the census tract. Table 1 shows the estimated third variable effects at both the individual and census tract levels. Note that since high level variable (e.g. census tract level) can influence the lower level (individual level) variable but not the reverse, all level 1 third-variables have both individual and census tract level effects, while all level 2 third-variables have only census tract level effects.

thumbnail
Table 1. Inferences on third-variable effects of risk factors in explaining racial disparities in BMI.

https://doi.org/10.1371/journal.pone.0241072.t001

We found that at the individual level, on average, the BMI among Whites is 1.94 (TE) lower than that for Blacks. Individual level factors can partially explain the racial difference. For example, age explains about 15%(−0.30 dividedby −1.94) of the difference. Fig 9 shows that a larger proportion of Blacks (x = 0) are at the middle age (late 30s to early 60s), compared with Whits. In addition, middle age is related with the highest BMI. The right panel of Fig 9 shows the relationship between age and BMI, which is not linear. BMI increases with age, peaked at the middle age and then declines. By the transformation in third-variable analysis, we can catch the nonlinear relationship between age and BMI.

On the other hand, the racial difference in BMI at the census tract level is not significant. The total effect is −0.9 with a 95% confidence interval that includes 0. In the study, no environmental risk factors that significantly influences the racial disparity in BMI is found. However, in the analysis, multilevel model is required to control the dependencies of subjects who come from the same residential environments.

5 Discussion and future work

In this paper, we extend the definitions of TVEs proposed by [7] to multilevel data settings, based on which the direct and indirect effects can be differentiated from different levels. In addition, indirect effects from different third-variables can be separated. Multilevel additive models are adapted to model the relationships among variables to account for the hierarchical data structure. Data transformation is allowed to represent potential non-linear relationship. We also create an R package, mlma, that implements the proposed method for multilevel third-variable analysis. Using the package, TVEs on average as well as at different values of the exposure(s) can be estimated. Simulations showed that the proposed method and package can accurately estimate TVEs at different levels. We implement the method to explore the racial disparity in BMI accounting for both environmental and individual-level risk factors.

There are some limitations in the current method. One is that we have to know how to transform the variables so that the multilevel additive linear models can accurately catch the relationships among variables. Also, the transformation functions have to be differentiable to calculate the indirect effects. Alternative, we can use the multilevel smoothing splines [30], to analyze the multilevel relationship. In such case, there is no need to know the transformation forms beforehand and the bases for smoothing splines are differentiable. Another limitation of the package of the current version is that the research is confined to deal with data of at most two levels. Extension to more levels is more complicated due to the conditions of TVEs such that the predictor should have equal or higher level than the third-variables, which in turn should have higher or equal levels than the outcome. Furthermore, when the dataset has more levels, TVEs from different levels should be estimated separately. A future work is to develop a more flexible algorithm for general models of potentially more levels.

Supporting information

References

  1. 1. MacKinnon D. P., Krull J., and Lockwood C. Equivalence of the mediation, confounding and suppression effect. Prevention Science, 2000, 1(4):173–181.
  2. 2. Robins J. M. and Greenland S. Identifiability and exchangeability for direct and indirect effects. Epidemiology, 1992, 3:143–155.
  3. 3. MacKinnon D. P., Warsi G., and Dwyer J. A simulation study of mediated effect measures. Multivariate Behavioral Research, 1995, 30:41–62. pmid:20157641
  4. 4. Albert J. M. Mediation analysis via potential outcomes models. Statistics in Medicine. 2007; 63:926–934.
  5. 5. VanderWeele T. J. Marginal structural models for the estimation of direct and indirect effects. Epidemiology, 2009, 20(1):18–26. pmid:19234398
  6. 6. Imai K., Keele L., and Yamamoto T. Identification, inference, and sensitivity analysis for causal mediation effects. Statistical Science, 2010, 25(1):51–71.
  7. 7. Yu Q., Fan Y., and Wu X. General multiple mediation analysis with an application to explore racial disparity in breast cancer survival. Journal of Biometrics Biostatistics, 5(2):189, 2014.
  8. 8. Gelman A. Multilevel (hierarchical) modeling: What it can and cannot do. Technometrics, 2006, 48(3):432–435.
  9. 9. Krull J. L. and MacKinnon D. P. Multilevel modeling of individual and group level mediated effects. Multivariate Behavioral Research, 2001, 36(2):249–277.
  10. 10. Tofighi D., West S. G., and MacKinnon D. P. Multilevel mediation analysis: the effects of omitted variables in the 1-1-1 model. British Journal of Mathematical and Statistical Psychology, 2012, 66(2):290–307.
  11. 11. Bauer D. J. Estimating multilevel linear models as structureal equatin models. Journal of Educational and Behavioral Statistics. 2003; 28(2):135–167.
  12. 12. Bauer D. J., Preacher K. J., and Gil K. M. Conceptualizing and testing random indirect effects and moderated mediation in multilevel models: New procedures and recommendations. Psychological Methods. 2006; 11(2):142–163.
  13. 13. Zhang Z., Zyphur M. J., and Preacher K. J. Testing multilevel mediation using hierarchical linear modesl, problems and solutions. Organizational Research Methods, 12(4):695–719, 2009.
  14. 14. Kenny D. A., Bolger N., and Korchmaros J. D. Lower level mediation in multilevel models. Psychological Methods, 2003, 8(2):115–128.
  15. 15. Gitelman A. I. Estimating causal effects from multilevel group-allocation data. Journal of Educational and Behavioral Statistics, 2005, 30(4):397–412.
  16. 16. Pituch K. A., Whittaker T. A., and Stapleton L. M. A comparison of methods to test for mediation in multisite experiments. Multivariate Behavioral Research, 2005, 40(1):1–23.
  17. 17. Yuan Y. and YMacKinnon D. P. Bayesian mediation analysis. Psychological Methods, 14(4):301–322, 2009. pmid:19968395
  18. 18. Yu Q. and Li B. mma: An r package for mediation analysis with multiple mediators. Journal of Open Research Software, 5, apr 2017.
  19. 19. Yu Q., Scribner R., Leonardi C., Zhang L., Park C., Chen L., et al. Exploring racial disparity in obesity: a mediation analysis considering geo-coded environmental factors. Spatial and Spatio-temporal Epidemiology, 21:13–23, 2017. pmid:28552184
  20. 20. Yu Q., Medeiros K., Wu X., and Jensen R. Nonlinear predictive models for multiple mediation analysis: With an application to explore ethnic disparities in anxiety and depression among cancer survivors. Psychometrika, Apr 2018, 83(4):991–1006. pmid:29611093
  21. 21. Yu Q., Wu X., Li B., and Scribner R. Multiple mediation analysis with survival outcomes: With an application to explore racial disparity in breast cancer survival. Statistics in Medicine, 38(3):398–412, sep 2018. pmid:30255567
  22. 22. Yu Q. and Li B. A multivariate multiple third-variable effect analysis with an application to explore racial and ethnic disparities in obesity. Journal of Applied Statistics, 2020.
  23. 23. Raudenbush S. W. and Bryk A. S. Hierarchical Linear Models: Applications and Data Analysis Methods (2nd edition). Sage, Thousand Oaks, CA, 2002.
  24. 24. Snijders TAB and Bosker R. J. Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling. Sage, Thousand Oaks, CA, 1999.
  25. 25. Alker H. R. A Typology of Ecological Fallacies, In Quantitative Ecological Analysis in the Social Sciences. The MIT Press, Cambridge, MA, 1969.
  26. 26. Friedman J. H. and Stuetzle W. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical, 1981, 76:817–823.
  27. 27. Lang S., Umlauf N., Wechselberger P., Harttgen K., and Kneib T. Multilevel structured additive regression. Statistics and Computing, 2012, 24(2):223–238.
  28. 28. Brunauer W., Lang S., and Umlauf N. Modelling house prices using multilevel structured additive regression. Statistical Modelling, 2013, 13(2):95–123.
  29. 29. Busing F. Distribution characteristics of variance estimates in two-level models. Unpublish manuscript, 1993.
  30. 30. Gu C. and Ma P. Optimal smoothing in nonparametric mixed-effect models. The Annals of Statistics, 2005, 33(3):1357–1379.