Third-variable effect analysis with multilevel additive models

Qingzhao Yu; Bin Li

doi:10.1371/journal.pone.0241072

Abstract

Third-variable effect refers to the effect transmitted by third-variables that intervene in the relationship between an exposure and a response variable. Third-variable effect analysis has been broadly studied in many fields. However, it remains a challenge for researchers to differentiate indirect effect of individual factor from multiple third-variables, especially when the involving variables are of hierarchical structure. Yu et al. (2014) defined third-variable effects that were consistent for all different types of response (categorical or continuous), exposure, or third-variables. With these definitions, multiple third-variables can be considered simultaneously, and the indirect effects carried by individual third-variables can be separated from the total effect. In this paper, we extend the definitions of third-variable effects to multilevel data structures, where multilevel additive models are adapted to model the variable relationships. And then third-variable effects can be estimated at different levels. Moreover, transformations on variables are allowed to present nonlinear relationships among variables. We compile an R package mlma, to carry out the proposed multilevel third-variable analysis. Simulations show that the proposed method can effectively differentiate and estimate third-variable effects from different levels. Further, we implement the method to explore the racial disparity in body mass index accounting for both environmental and individual level risk factors.

Citation: Yu Q, Li B (2020) Third-variable effect analysis with multilevel additive models. PLoS ONE 15(10): e0241072. https://doi.org/10.1371/journal.pone.0241072

Editor: Gang Han, Texas A&M University, UNITED STATES

Received: August 29, 2020; Accepted: October 8, 2020; Published: October 23, 2020

Copyright: © 2020 Yu, Li. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the manuscript and its Supporting Information files.

Funding: This study is funded by the National Institute on Minority Health and Health Disparities (https://www.nimhd.nih.gov/). The award number is #1R15MD012387. The PI is Dr. Qingzhao Yu. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

A third-variable effect (TVE) refers to the effect conveyed by a third-variable to an observed relationship between an exposure and a response variable of interest. Depend on whether there is a causal relationship from the exposure variable to the third-third variable and then to the response, the third-variable (denoted as M) is often called a mediator (when there are causal relationships) or a confounder (no causal relationship is involved). Research purposes for the third-variable analysis are generally, 1) identify significant third-variables that can partially or completely explain the relationship between the exposure variable (X) and the outcome (Y); and 2) differentiate the TVE from different paths that connect between X and Y. Except for the difference in explaining the effects from a mediator and from a confounder, the TVE analysis method can be used to make inferences on both mediation and confounding effects [1]. Third-variable analysis has been widely used in psychology, social sciences, behavioral research, health prevention, epidemiological studies, and genetic analysis. Within the context of generalized linear models, general predictive models, and counterfactual framework, a number of methods have been proposed for inferences on TVEs, to name a few, [2–7].

In research practices, the experiment data are often collected hierarchically. For example, to identify variables that are related with childhood obesity, we consider both environmental (e.g. walkability of the neighborhood) and individual factors (e.g. snacking habit). When hierarchical databases are considered, third-variable analysis method based on generalized linear models are usually not readily adaptable since the independence assumption among observations is violated. In such cases, hierarchical models, also known as multilevel or mixed-effect models, are more appropriate to fit relationships among variables since these models can catch dependencies among observations and allow for predictors from different levels of the data [8]. In this paper, we propose a third-variable analysis method based on multilevel models. For the hierarchical model, we assume there are two levels of data and refer the individual level as level 1, and the group level as level 2. Although more than two levels of hierarchy is possible, this paper focuses on two-level databases only.

In third-variable analysis, besides the pathway that directly connect the exposure variable with the outcome, we explore the exposure → third-variable → response or X → M → Y pathways. When doing third-variable analysis with multilevel models, the level of the variables at the left of the arrow should be higher than or equal to that of the right, since a group level variable may affect an individual level variable but not vice versa [9]. In this setting, only the 2 → 2 → 2, 2 → 2 → 1, 2 → 1 → 1, and 1 → 1 → 1 relationships are legible. Multilevel models are necessary to deal with hierarchical data base even for the 1 → 1 → 1 relationship. [10] studied the bias brought by using single level models when data are hierarchical. [9, 11–13] proposed third-variable analysis methods for all three types of multilevel models. Moreover, [14–16] proposed alternative methods to test the indirect effects in 2 → 2 → 1, 2 → 1 → 1 models. In addition, [17] proposed to use Bayesian third-variable analysis to deal with hierarchical databases.

In this paper, we propose to use generalized additive multilevel models for third-variable analysis with hierarchical databases. The major contributions of this proposed method are that: 1) we extend the general definitions of TVEs proposed by [7] to multilevel data structure; 2) the method allows any types (categorical or continuous) of the exposure(s), third-variables and response variable(s) in exploring TVEs; 3) TVEs from individual or group of third-variables and from different levels are differentiable; 4) nonlinear associations among variables are allowed in calculating the TVEs; 5) multiple exposure(s) from different levels and multivariate outcomes are allowed in the third-variable analysis, and 6) an R package mlma was developed for the method proposed in this paper. Practitioners can easily implement the method in real data analysis.

The rest of the paper is organized as follows. In Section 2, we review the general definitions of TVEs and then extend these concepts to the multilevel model situation. We also review the generalized additive multilevel models (GAMM) that are used to model relationships among variables. Based on that, we propose the multilevel third-variable analysis with GAMM. In Section 3, we illustrate the use of the proposed method in different multilevel data structures and the usage of the mlma R package. We then use the method in a real data example in Section 4: to explore the ethnic disparity in obesity considering both individual and environmental risk factors. Section 5 gives a summary of the proposed method and points out future research directions.

2 Third-variable analysis with multilevel additive models

[7] proposed general definitions for TVEs. First we recapitulate the definitions and then propose the multilevel additive models in third-variable analysis with hierarchical databases.

2.1 Definitions of third-variable effects with single-level data

The conceptual model for TVE of one level is shown in Fig 1. In the Figure, X is the exposure variable, Y is the outcome, M = {M₁, …, M_p} is the vector of p third-variables, and Z is the vector of other variables which relate with Y but do not intermediate the X−Y relationship. As usual, the TVE includes total effect—the overall effect from the exposure variable X to the outcome Y, direct effect—the remaining effect from X to Y after accounting for effects form third-variables, and indirect effect from M_i—the effect from the path X−M_i−Y.

Download:

Fig 1. Conceptual model for one-level third-variable effects.

https://doi.org/10.1371/journal.pone.0241072.g001

Basically, [7] defines the total effect as the changing rate in the outcome when the exposure variable changes. Compared with the traditional definition of TVEs, which are defined as the amount of change of the outcome when the exposure variable changes from one status to another, the Yu et al.’s definition has the following benefits:

The total effect is scale invariant to the unit of the exposure variable. This is because the effect is defined as the changing rate but not as the changing amount of Y with X.
There is no need to define the two status of the exposure variable. Therefore the total effects can be calculated not only for binary but also for continuous exposures.
The total effect can be calculated with nonlinear models when the relationship among variables are changing at different values of the exposure variable. The total effect can be expressed as a function of the exposure variable by the Yu definition [7].

The direct effect not from M_i is similarly defined as the total effect except that the relationship between the exposure variable and M_i is manipulatively broken. In such case, M_i are controlled not to change with X: M_i follows its marginal distribution from the observations. The indirect effect from M_i is then defined as the total effect minus the direct effect not from M_i. All the benefits of the definition of total effect are also fit for the direct and indirect effects. In additon, multiple third-variables can be considered simultaneously, and the indirect effects carried by individual third-variables can be separated from the total effect. For the formal definitions of TVEs, readers are referred to [7] and [18]. [7] showed that the proposed definitions of TVEs are equivalent to the conventional TVEs under certain situations (e.g., linear regression models with continuous third-variables and outcome). They also established the relationship between the proposed definitions of direct or indirect effect and the natural direct or indirect effect for binary exposure variables. Later, the definitions have been implemented to explore racial and ethnic health disparities by [19, 20] and extended to deal with time-to-event outcomes ([21]) and for multiple exposures and multivariate outcomes ([22]). In this paper, we extend the method to the context of multilevel models.

2.2 Definitions of third-variable effects with data of two levels

The unique data structure in multilevel models raises the potential problem of confounding TVEs from different levels. As pointed out in [13], the relationship between two level-one variables can be decomposed into between-group and within-group components. In particular, the aggregated variables at the second level can be highly related while the relationship may be very weak or even at an opposite direction when considered at the individual level [23]. For example, [24] points out that the proportion of black residents may be an important variable for the census tract, while it is different from the meaning of ethnicity as an individual-level variable. [25] discussed the difference of the two components extensively. It is important to differentiate the between-group and within-group components in third-variable analysis, where the TVEs can be decomposed to level 1 and level 2 effects. To identify the level 1 and level 2 TVEs separately, [13] proposed the group-mean centering method (CWC), where they subtracted the group means from individual level variables and added group means as level 2 covariates. In their paper, [13] showed that the CWC method efficiently separated level 1 and level 2 TVEs and resulted in less bias and more power compared with non-centering methods. In this paper, we use a different way to estimate the level 1 and level 2 TVEs: extend the definitions of TVEs with single level models by [7] to multilevel models.

With the generalized definition of TVEs, [22] has shown that a third-varaible analysis can involve multiple exposure variables and multivariate outcomes. The purpose of third-variable analysis is to differentiate the direct effect and indirect effect from each third-variable for each pair of the exposure-outcome relationship. If the outcome is at level 2, all exposure and mediators have to be level 2 as discussed in Section 1. Therefore, a single-level third-variable analysis works. If the outcome is at level 1, the exposure variable can be a level 1 or level 2 variable. The third-variables can be level 1 or 2 for a level 2 exposure, but have to be level 1 for a level 1 exposure variable. In this paper, we focus on level 1 outcomes.

Denote M_ij = (M_ij1, …, M_ijK) as the vector of K potential level 1 third-variables for the ith object at the jth group. M_ij,−k is the vector M_ij excluding the kth element. Denote M_.j = (M_.j1, …, M_.jL) as the vector of the L potential level 2 third-variables or level 1 third-variables aggregated at level 2 within group j. Let M_ijk(x_ij) be a random variable that has a conditional distribution given X_ij = x_ij. For an exposure variable X at any level, let u* be the minimum unit of X, such that if x ∈ domain(X), then x + u* ∈ domain(X). For now, we ignore other covariates Z. Assume effects of exposures and third-variables on the outcome are additive, we have the general definitions of TVEs, following [7], for level 1 (Definition 1) and level 2 exposure variables (Definition 2). Note that a level 1 exposure can have only level 1 mediators while a level 2 exposure can have both level 1 and level 2 mediators.

Definition 1. For a level 1 exposure variable X, the level 1 total effect (TE₁) of X on Y, the level 1 direct effect () of X on Y not from level 1 third-variable M_k and the level 1 indirect effect of X on Y through M_k at X = x_ij () are defined as: (1) (2) (3)

The average level one TVEs are the mean value of the TVEs defined by Definition 1: ATE₁ = E_ij[TE₁(x_ij)], ADE_1,\k = E_ij[DE_1,\k(x_ij)] and AIE_1,k = ATE₁ − ADE_1,\k.

Definition 2. For a level 2 exposure variable X, the level 2 total effect (TE₂) of X on Y, the level 2 direct effect () of X on Y not from the level 1 third variable M_k and level 2 third variable M_l, and the level 2 indirect effect of X on Y through M_k and M_l at X = x_.j () are defined as: (4) (5) (6) (7) (8)

The average level 2 TVEs are the TVEs defined by Definition 2 averaged at the group level: ATE₂ = E_j[TE_2,j(x_.j)], AIE_21,k = E_j[IE_21,jk(x_.j)] and AIE_22,l = E_j[IE_22,jl(x_.j)].

2.3 Multilevel additive models

We use multilevel additive models to build relationships among variables of hierarchy. The additive model is a nonlinear regression method that was first proposed by [26]. A multilevel additive model can deal with both nonlinear covariate effects and cluster-specific heterogeneity [27]. It is now gaining rapid popularity in psychological and social research [28]. Using the notations in Section 2.2, assume that we have L level 2 and K level 1 third-variables. In addition, assume that there are E₁ level 1 and E₂ level 2 exposure variables, denoted as and respectively. We propose the following linear additive multilevel models for multilevel third-variable analysis. The boldfaced letter indicates a potential vector of functions or numbers.

For level 2 third-variables, M_.jl, l = 1, …, L:

For level 1 third-variables, M_ijk, k = 1, …, K:

The full model:

In the models, r_0jk and r_0j are second-level random errors with mean zero and finite variances. f(⋅) is a function/transformation vector of ⋅. The transformation enables modeling nonlinear relationships among variables. We assume that all the transformation functions are first-order differentiable. α and β are coefficient vectors for transformed variables in predicting the response variable on the left side of each equation. In addition, g(⋅)s are the link functions that link the expected response variable with the right-hand-side of each equation, the systematic component of a generalized linear model. For example, a binary M_ijk may have a link function g_1k = logit(P(M_ijk = 1)). Similarly, a link function can be used on the outcome. With the link function, we can deal with different types of third-variables and outcomes.

2.4 Third-variable effects with multilevel additive model

Based on the definitions of TVEs, we derive the TVEs in Theorems 1 and 2. In the theorems, f′(x) denotes the first derivative of function f on the random variable X and realized at X = x. In addition, g⁻¹ denotes the inverse function of g. We further denote μ_ijk = E(M_ijk) and μ_.jk = E(M_.jk). The proofs of theorems are included in the S1 File.

Theorem 1 With the relationships among variables built by Section 2.3, the TVEs for level 1 exposure variable X_ije, e = 1, …, E1 on the outcome variable Y are:

Theorem 2. With the relationships among variables built by Section 2.3, the TVEs for level 2 exposure variable X_.je, e = 1, …, E2 on the outcome variable Y are:

2.5 Bootstrap method for third-variable effect inferences

Finally, we use the bootstrap method to calculate the variances of the TVEs. In particular, at the group level, a bootstrap sample of the same size for each group is drawn with replacement from the original data set. Then multilevel additive models are fitted based on the bootstrap sample to get the estimates of βs, based on which the TVEs can be calculated by Theorems 1 and 2. The above process repeats many times and inferences can be made based on the repeated estimates. To draw the bootstrap sample, we keep all groups at the second level and then at the individual level, we draw observations with replacement of size n_j from the jth group, where n_j is the number of observations in the jth group. This bootstrap method is adopted in the R package mlma to estimate the variances of estimates and to build up confidence intervals. The mlma package is available from the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/web/packages/mlma/index.html and illustrated by the simulations in Section 3.

2.6 The mlma R package

The authors developed a R package, mlma, for multilevel third-variable analysis. The analysis is based on multilevel additive models where nonlinear transformations of variables are allowed. The package mlma contains three groups of functions: function data.org is used to prepare data sets for analysis—transforming variables, dichotomizing categorical third-variables, and getting the derivatives of the transformation functions. The functions mlma and boot.mlma are used for statistical inferences on the TVEs. The former estimates the TVEs and the latter generates bootstrap samples from the original data sets and does the third-variable analysis based on the bootstrap samples. The estimates of TVEs from the bootstrap samples are then used for statistical inferences. The third group of functions are generic functions—print, summary and plot results from the mlma and boot.mlma functions. The details of how to use the functions are fully documented within the package and the mlma vignette. The analysis in Section 3 are all performed using the package.

3 Simulations

In this section, we show the performance of the proposed method and illustrate the use of the R package mlma. The first simulation is based on a 2 → 1 → 1 true model adapted from [13] but it is extended to include exposure and third-variables at both levels. There is one exposure variable at each level and one third-variable at each level. The level 1 exposure and level 2 third-variable are binary. No transformation is performed on the original variables. The first simulation is to exam how the sensitivity and specificity of identifying important third-variables are influenced by different sets of parameters.

The second simulation is based on nonlinear multilevel models with both level exposures and one level 1 continuous third-variable. The second simulation is to demonstrate how to include transformation terms of variables in the multilevel mediation analysis and how to use the graph tools in the R package mlma to illustrate relationships among variables.

The simulations are performed using the R package mlma. The R codes are included in the S1 File.

3.1 Simulation 1

The first simulation includes one level 1 binary exposure, X_ij, and one level 2 continuous exposure, X_.j, where j denotes the jth group and i denotes the ith observation in the jth group. There are 600 observations in total and n observations in each group. Therefore, there are 600/n groups. To check how group sizes and number of groups can influence the sensitivity of identifying important TVEs, we set n at 5 and 20 respectively. X_ij are generated through binomial distribution with the probability of success 0.5. X_.j are generated through a standard normal distribution. There are a level 2 binary third variable, M_.j, and one level 1 continuous third-variable, M_ij. In detail, the simulation data are generated from the following models for i = 1, …, n and j = 1, …, 600/n: where u_0j ∼ N(0, 0.5), ϵ_0ij ∼ N(0, 1), u_1j ∼ N(0, v₂) and ϵ_1ij ∼ N(0, v₁) are independently generated random errors at both levels. The level two random error for the response variable is set as one-fifth of the level one variance—v₂ = v₁/5, which makes the intraclass correlation (ICC) to be.17. As pointed out by [29], this medium ICC value facilitates model convergence. v₂ is chosen to be 5 or 1, β₁ and β₂ ranges within the set {−.59, −.14, 0, 0.14, 0.39} and β₃ and β₄ from {0, 0.14, 0.59}. We then check the influence of these parameters on the sensitivity and specificity of identifying important TVEs. The data with each combination of parameters are generated 20 times. The sensitivity of a TVE is the proportion of times that the estimated confidence interval of the effect does not include 0 when the actual effect is not 0. Relatively, the specificity is defined as the proportion of times that the estimated confidence interval of the effect includes 0 when the true effect is 0.

For the level 2 exposure, the average direct effect for X_.j → Y_ij (denoted as de2), the indirect effect from M_.j for X_.j → M_.j → Y_ij (denoted as ie2.2), the indirect effect from M_ij for X_.j → M_ij → Y_ij (denoted as ie2.1), and the total effect (te2) are estimated.

For the level 1 exposure, the average direct effect for X_ij → Y_ij (denoted as de1), the indirect effect from M_ij for X_ij → M_ij → Y_ij (denoted as ie2.1), and the total effect (te1) are estimated.

Fig 2 shows the sensitivity (or 1-specificity when β₂ = 0) of identifying significant level 2 direct effect, when βs change but n is fixed at 20 and v₁ at 1. The x-axis is β₂, the true level 2 direct effect. Different color and line type in each plot represents different values of β₁. Each row of the 3 × 3 panel in Fig 2 is a different value for β₄, each column is for a different value of β₃. We found that the sensitivity of correctly identifying important level 2 direct effect depends on the value of other TVEs. This is mainly due to the increased correlations among variables that results in raised variances in the estimation.

Download:

Fig 2. Estimates of level 2 direct effect from simulation 1.

The x-axis is the true direct effect, and y-axis is the sensitivity (or 1-specificity when true de is 0). Different color and line type represents different value for β₁. Rows are for β₄ = 0, 0.14 and 0.59, and columns are for β₃ = 0, 0.14 and 0.59 respectively.

https://doi.org/10.1371/journal.pone.0241072.g002

Fig 3 shows the sensitivity or 1-specificity (when β₁ = 0) of identifying important level 1 direct effect. The sensitivity is averaged over β₃ and β₄ for different values of β₁, v₁ and n. We see that the sensitivities reduce a lot when the variances are bigger. However, values of n have minimum influence on the sensitivity.

Download:

Fig 3. Estimates of level 1 direct effect from Simulation 1.

The x-axis is the true direct effect, and y-axis is the sensitivity (or 1-specificity when the true de is 0). Different color and line type represents different value for β₂. Each plot is for a different combination of n and v₁.

https://doi.org/10.1371/journal.pone.0241072.g003

Finally, Figs 4 and 5 show the sensitivity and specificity of identifying indirect effects. For both Figures, we fix v₁ at 1. Fig 4 shows how the sensitivity of level 2 indirect effects of M_ij and M_.j changes with β₃ and β₄. The left panel is the sensitivity and 1-specificity (when β₃ = 0) of identifying M_ij. The level 2 indirect effect of M_ij is proportional to β₃, so the sensitivity increases with β₃. The sensitivity is minimally influenced by the value of β₄. The right panel is the sensitivity and 1-specificity (when β₄ = 0) of identifying M_.j. Again, the sensitivity of identifying M_.j is proportional to β₄, but is not systematically influenced by the value of β₃.

Download:

Fig 4. Estimates of level 2 indirect effect from simulation 1.

The left panel is the sensitivity and 1-specificity of identifying M_ij. The right panel is the sensitivity and 1-specificity of identifying M_.j.

https://doi.org/10.1371/journal.pone.0241072.g004

Download:

Fig 5. Estimates of indirect effects of M_ij from simulation 1.

The left panel is for the level 1 indirect effect and the right panel is for the level 2 indirect effect of M_ij.

https://doi.org/10.1371/journal.pone.0241072.g005

Fig 5 shows how the actual direct effect influences the estimation of the indirect effect of M_ij. The left panel is for the level 1 indirect effect and the right panel is for the level 2 indirect effect. For both panels, the black solid line is the 1-specificity and the other two lines are the sensitivity. The sensitivity for level 1 indirect effect increases with β₃ and that for level 2 indirect effect increases with β₄. The x-axis is the true level 1 or 2 direct effect for the left and right panel respectively. We see very small influence of the true direct effect on the estimation of the indirect effect.

3.2 Simulation 2

The second simulation is to illustrate the use of the mlma package with transformations in the variables. In the simulation, there is only one level 2 third-variable, M_ij and the variable is causally proceeded by two exposure variables, X_ij ∼ N(0, 0.5) at the first level and X_j ∼ χ²(df = 2) at the second level. In the simulation, there are 30 groups, and 20 observations generated for each group. That is i = 1, …, 20 and j = 1, …, 30. The data are generated from the following model: where ϵ_1ij ∼ N(0, 1), ϵ_ij ∼ N(0, 2), u_0j ∼ N(0, 0.4) and u_oj1 ∼ N(0, 0.5) are independently generated as the level 1 and level 2 random errors respectively. For the data generation mechanism, the true level 1 direct effect is 0.98 and the true level 2 direct effect is de2(x_j) = 0.98/x_j. In addition, the true level 1 indirect effect for M at x_ij is 1.2 × 0.98 × x_ij and the true level 2 indirect effect is 1.2 × 0.98 = 1.176, where the former depends on the value of x_ij representing a nonlinear relationship between M and X_ij and the latter represents a strong (1.176) constant indirect effect.

Fig 6 shows the estimated direct effects at both levels from the simulated data. We see that the estimated direct effect correctly delineates the nonlinear relationship among variables.

Download:

Fig 6. Estimates of direct effects of at both levels for simulation 2.

https://doi.org/10.1371/journal.pone.0241072.g006

The mlma package provides a plot function to show relationships among variables. Fig 7 shows the results of the plot function without specifying a specific third-variable. The Figure shows the relative effect (defined as the TVE divided by the total effect) at each exposure variable.

Download:

Fig 7. Relative effect at different exposure variables for simulation 2.

https://doi.org/10.1371/journal.pone.0241072.g007

Fig 8 shows results of the plot function with the third-variable M. The left panel is the effect of M with the level 1 exposure denoted as x1 and the right panel is for the level 2 exposure which is denoted as x2. The first row of Fig 8 gives the estimated indirect effect with confidence intervals. The second row shows the changing rate of M with the exposure variables and the third row gives the changing rate of the outcome with the third-variable M. Again, the figures correctly describe relationships among variables.

Download:

Fig 8. Indirect effect of M at different exposure variables for simulation 2.

https://doi.org/10.1371/journal.pone.0241072.g008

The codes for simulating data set and transforming variables in the third-variable analysis are available in the S1 File.

4 Explore the racial disparity in obesity

Finally, we implement the method in a real data example: to explore the racial disparity in body mass index (BMI). In a previous research, we found that on average, non-Hispanic blacks have a higher BMI and higher rate of obesity compared with non-Hispanic whites [19, 22] using the 2003-2006 National Health and Nutrition Examination Survey (NHANES). We are interested to see how the disparities can be explained by both individual and environmental factors. Environmental risk factors are generated at the census tract level, which include both food environments and physical activity environments. Individual level variables include age, gender, smoking status, etc. Readers are referred to [19, 22] for more details about the variables.

In this demonstration, we try to use risk factors to explain the racial disparity in BMI. We first use multiple additive regression trees to describe the relationships between BMI and all risk factors, and then we performed a data transformation on the risk factors so that the transformed variables have a roughly linear relationship with BMI. For the individual level risk factors, we did the following transformations:

The natural cubic spline bases for age with degrees of freedom of 4.
Truncate the physical activity measurement to two parts: smaller than 2.1 and larger than 2.1 since we see a change point at 2.1 when depicting the relationship between physical activity and BMI.

We also use the natural cubic spline basis with different degrees of freedom on some of the environmental factors. Please refer to the S1 File to see how we make the transformations. The individual level exposure variable is the race (black (0) and white (1)). The census tract level exposure is the proportion of whites in the census tract. Table 1 shows the estimated third variable effects at both the individual and census tract levels. Note that since high level variable (e.g. census tract level) can influence the lower level (individual level) variable but not the reverse, all level 1 third-variables have both individual and census tract level effects, while all level 2 third-variables have only census tract level effects.

Download:

Table 1. Inferences on third-variable effects of risk factors in explaining racial disparities in BMI.

https://doi.org/10.1371/journal.pone.0241072.t001

We found that at the individual level, on average, the BMI among Whites is 1.94 (TE) lower than that for Blacks. Individual level factors can partially explain the racial difference. For example, age explains about 15%(−0.30 dividedby −1.94) of the difference. Fig 9 shows that a larger proportion of Blacks (x = 0) are at the middle age (late 30s to early 60s), compared with Whits. In addition, middle age is related with the highest BMI. The right panel of Fig 9 shows the relationship between age and BMI, which is not linear. BMI increases with age, peaked at the middle age and then declines. By the transformation in third-variable analysis, we can catch the nonlinear relationship between age and BMI.

Download:

Fig 9. Indirect effect of age at the individual level.

https://doi.org/10.1371/journal.pone.0241072.g009

On the other hand, the racial difference in BMI at the census tract level is not significant. The total effect is −0.9 with a 95% confidence interval that includes 0. In the study, no environmental risk factors that significantly influences the racial disparity in BMI is found. However, in the analysis, multilevel model is required to control the dependencies of subjects who come from the same residential environments.

5 Discussion and future work

In this paper, we extend the definitions of TVEs proposed by [7] to multilevel data settings, based on which the direct and indirect effects can be differentiated from different levels. In addition, indirect effects from different third-variables can be separated. Multilevel additive models are adapted to model the relationships among variables to account for the hierarchical data structure. Data transformation is allowed to represent potential non-linear relationship. We also create an R package, mlma, that implements the proposed method for multilevel third-variable analysis. Using the package, TVEs on average as well as at different values of the exposure(s) can be estimated. Simulations showed that the proposed method and package can accurately estimate TVEs at different levels. We implement the method to explore the racial disparity in BMI accounting for both environmental and individual-level risk factors.

There are some limitations in the current method. One is that we have to know how to transform the variables so that the multilevel additive linear models can accurately catch the relationships among variables. Also, the transformation functions have to be differentiable to calculate the indirect effects. Alternative, we can use the multilevel smoothing splines [30], to analyze the multilevel relationship. In such case, there is no need to know the transformation forms beforehand and the bases for smoothing splines are differentiable. Another limitation of the package of the current version is that the research is confined to deal with data of at most two levels. Extension to more levels is more complicated due to the conditions of TVEs such that the predictor should have equal or higher level than the third-variables, which in turn should have higher or equal levels than the outcome. Furthermore, when the dataset has more levels, TVEs from different levels should be estimated separately. A future work is to develop a more flexible algorithm for general models of potentially more levels.

Supporting information

S1 File. Proofs and R codes.

https://doi.org/10.1371/journal.pone.0241072.s001

(PDF)

References

1. MacKinnon D. P., Krull J., and Lockwood C. Equivalence of the mediation, confounding and suppression effect. Prevention Science, 2000, 1(4):173–181.
- View Article
- Google Scholar
2. Robins J. M. and Greenland S. Identifiability and exchangeability for direct and indirect effects. Epidemiology, 1992, 3:143–155.
- View Article
- Google Scholar
3. MacKinnon D. P., Warsi G., and Dwyer J. A simulation study of mediated effect measures. Multivariate Behavioral Research, 1995, 30:41–62. pmid:20157641
- View Article
- PubMed/NCBI
- Google Scholar
4. Albert J. M. Mediation analysis via potential outcomes models. Statistics in Medicine. 2007; 63:926–934.
- View Article
- Google Scholar
5. VanderWeele T. J. Marginal structural models for the estimation of direct and indirect effects. Epidemiology, 2009, 20(1):18–26. pmid:19234398
- View Article
- PubMed/NCBI
- Google Scholar
6. Imai K., Keele L., and Yamamoto T. Identification, inference, and sensitivity analysis for causal mediation effects. Statistical Science, 2010, 25(1):51–71.
- View Article
- Google Scholar
7. Yu Q., Fan Y., and Wu X. General multiple mediation analysis with an application to explore racial disparity in breast cancer survival. Journal of Biometrics Biostatistics, 5(2):189, 2014.
- View Article
- Google Scholar
8. Gelman A. Multilevel (hierarchical) modeling: What it can and cannot do. Technometrics, 2006, 48(3):432–435.
- View Article
- Google Scholar
9. Krull J. L. and MacKinnon D. P. Multilevel modeling of individual and group level mediated effects. Multivariate Behavioral Research, 2001, 36(2):249–277.
- View Article
- Google Scholar
10. Tofighi D., West S. G., and MacKinnon D. P. Multilevel mediation analysis: the effects of omitted variables in the 1-1-1 model. British Journal of Mathematical and Statistical Psychology, 2012, 66(2):290–307.
- View Article
- Google Scholar
11. Bauer D. J. Estimating multilevel linear models as structureal equatin models. Journal of Educational and Behavioral Statistics. 2003; 28(2):135–167.
- View Article
- Google Scholar
12. Bauer D. J., Preacher K. J., and Gil K. M. Conceptualizing and testing random indirect effects and moderated mediation in multilevel models: New procedures and recommendations. Psychological Methods. 2006; 11(2):142–163.
- View Article
- Google Scholar
13. Zhang Z., Zyphur M. J., and Preacher K. J. Testing multilevel mediation using hierarchical linear modesl, problems and solutions. Organizational Research Methods, 12(4):695–719, 2009.
- View Article
- Google Scholar
14. Kenny D. A., Bolger N., and Korchmaros J. D. Lower level mediation in multilevel models. Psychological Methods, 2003, 8(2):115–128.
- View Article
- Google Scholar
15. Gitelman A. I. Estimating causal effects from multilevel group-allocation data. Journal of Educational and Behavioral Statistics, 2005, 30(4):397–412.
- View Article
- Google Scholar
16. Pituch K. A., Whittaker T. A., and Stapleton L. M. A comparison of methods to test for mediation in multisite experiments. Multivariate Behavioral Research, 2005, 40(1):1–23.
- View Article
- Google Scholar
17. Yuan Y. and YMacKinnon D. P. Bayesian mediation analysis. Psychological Methods, 14(4):301–322, 2009. pmid:19968395
- View Article
- PubMed/NCBI
- Google Scholar
18. Yu Q. and Li B. mma: An r package for mediation analysis with multiple mediators. Journal of Open Research Software, 5, apr 2017.
- View Article
- Google Scholar
19. Yu Q., Scribner R., Leonardi C., Zhang L., Park C., Chen L., et al. Exploring racial disparity in obesity: a mediation analysis considering geo-coded environmental factors. Spatial and Spatio-temporal Epidemiology, 21:13–23, 2017. pmid:28552184
- View Article
- PubMed/NCBI
- Google Scholar
20. Yu Q., Medeiros K., Wu X., and Jensen R. Nonlinear predictive models for multiple mediation analysis: With an application to explore ethnic disparities in anxiety and depression among cancer survivors. Psychometrika, Apr 2018, 83(4):991–1006. pmid:29611093
- View Article
- PubMed/NCBI
- Google Scholar
21. Yu Q., Wu X., Li B., and Scribner R. Multiple mediation analysis with survival outcomes: With an application to explore racial disparity in breast cancer survival. Statistics in Medicine, 38(3):398–412, sep 2018. pmid:30255567
- View Article
- PubMed/NCBI
- Google Scholar
22. Yu Q. and Li B. A multivariate multiple third-variable effect analysis with an application to explore racial and ethnic disparities in obesity. Journal of Applied Statistics, 2020.
- View Article
- Google Scholar
23. Raudenbush S. W. and Bryk A. S. Hierarchical Linear Models: Applications and Data Analysis Methods (2nd edition). Sage, Thousand Oaks, CA, 2002.
24. Snijders TAB and Bosker R. J. Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling. Sage, Thousand Oaks, CA, 1999.
25. Alker H. R. A Typology of Ecological Fallacies, In Quantitative Ecological Analysis in the Social Sciences. The MIT Press, Cambridge, MA, 1969.
26. Friedman J. H. and Stuetzle W. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical, 1981, 76:817–823.
- View Article
- Google Scholar
27. Lang S., Umlauf N., Wechselberger P., Harttgen K., and Kneib T. Multilevel structured additive regression. Statistics and Computing, 2012, 24(2):223–238.
- View Article
- Google Scholar
28. Brunauer W., Lang S., and Umlauf N. Modelling house prices using multilevel structured additive regression. Statistical Modelling, 2013, 13(2):95–123.
- View Article
- Google Scholar
29. Busing F. Distribution characteristics of variance estimates in two-level models. Unpublish manuscript, 1993.
30. Gu C. and Ma P. Optimal smoothing in nonparametric mixed-effect models. The Annals of Statistics, 2005, 33(3):1357–1379.
- View Article
- Google Scholar

[ref1] 1. MacKinnon D. P., Krull J., and Lockwood C. Equivalence of the mediation, confounding and suppression effect. Prevention Science, 2000, 1(4):173–181.
View Article
Google Scholar

[2] View Article

[3] Google Scholar

[ref2] 2. Robins J. M. and Greenland S. Identifiability and exchangeability for direct and indirect effects. Epidemiology, 1992, 3:143–155.
View Article
Google Scholar

[5] View Article

[6] Google Scholar

[ref3] 3. MacKinnon D. P., Warsi G., and Dwyer J. A simulation study of mediated effect measures. Multivariate Behavioral Research, 1995, 30:41–62. pmid:20157641
View Article
PubMed/NCBI
Google Scholar

[8] View Article

[9] PubMed/NCBI

[10] Google Scholar

[ref4] 4. Albert J. M. Mediation analysis via potential outcomes models. Statistics in Medicine. 2007; 63:926–934.
View Article
Google Scholar

[12] View Article

[13] Google Scholar

[ref5] 5. VanderWeele T. J. Marginal structural models for the estimation of direct and indirect effects. Epidemiology, 2009, 20(1):18–26. pmid:19234398
View Article
PubMed/NCBI
Google Scholar

[15] View Article

[16] PubMed/NCBI

[17] Google Scholar

[ref6] 6. Imai K., Keele L., and Yamamoto T. Identification, inference, and sensitivity analysis for causal mediation effects. Statistical Science, 2010, 25(1):51–71.
View Article
Google Scholar

[19] View Article

[20] Google Scholar

[ref7] 7. Yu Q., Fan Y., and Wu X. General multiple mediation analysis with an application to explore racial disparity in breast cancer survival. Journal of Biometrics Biostatistics, 5(2):189, 2014.
View Article
Google Scholar

[22] View Article

[23] Google Scholar

[ref8] 8. Gelman A. Multilevel (hierarchical) modeling: What it can and cannot do. Technometrics, 2006, 48(3):432–435.
View Article
Google Scholar

[25] View Article

[26] Google Scholar

[ref9] 9. Krull J. L. and MacKinnon D. P. Multilevel modeling of individual and group level mediated effects. Multivariate Behavioral Research, 2001, 36(2):249–277.
View Article
Google Scholar

[28] View Article

[29] Google Scholar

[ref10] 10. Tofighi D., West S. G., and MacKinnon D. P. Multilevel mediation analysis: the effects of omitted variables in the 1-1-1 model. British Journal of Mathematical and Statistical Psychology, 2012, 66(2):290–307.
View Article
Google Scholar

[31] View Article

[32] Google Scholar

[ref11] 11. Bauer D. J. Estimating multilevel linear models as structureal equatin models. Journal of Educational and Behavioral Statistics. 2003; 28(2):135–167.
View Article
Google Scholar

[34] View Article

[35] Google Scholar

[ref12] 12. Bauer D. J., Preacher K. J., and Gil K. M. Conceptualizing and testing random indirect effects and moderated mediation in multilevel models: New procedures and recommendations. Psychological Methods. 2006; 11(2):142–163.
View Article
Google Scholar

[37] View Article

[38] Google Scholar

[ref13] 13. Zhang Z., Zyphur M. J., and Preacher K. J. Testing multilevel mediation using hierarchical linear modesl, problems and solutions. Organizational Research Methods, 12(4):695–719, 2009.
View Article
Google Scholar

[40] View Article

[41] Google Scholar

[ref14] 14. Kenny D. A., Bolger N., and Korchmaros J. D. Lower level mediation in multilevel models. Psychological Methods, 2003, 8(2):115–128.
View Article
Google Scholar

[43] View Article

[44] Google Scholar

[ref15] 15. Gitelman A. I. Estimating causal effects from multilevel group-allocation data. Journal of Educational and Behavioral Statistics, 2005, 30(4):397–412.
View Article
Google Scholar

[46] View Article

[47] Google Scholar

[ref16] 16. Pituch K. A., Whittaker T. A., and Stapleton L. M. A comparison of methods to test for mediation in multisite experiments. Multivariate Behavioral Research, 2005, 40(1):1–23.
View Article
Google Scholar

[49] View Article

[50] Google Scholar

[ref17] 17. Yuan Y. and YMacKinnon D. P. Bayesian mediation analysis. Psychological Methods, 14(4):301–322, 2009. pmid:19968395
View Article
PubMed/NCBI
Google Scholar

[52] View Article

[53] PubMed/NCBI

[54] Google Scholar

[ref18] 18. Yu Q. and Li B. mma: An r package for mediation analysis with multiple mediators. Journal of Open Research Software, 5, apr 2017.
View Article
Google Scholar

[56] View Article

[57] Google Scholar

[ref19] 19. Yu Q., Scribner R., Leonardi C., Zhang L., Park C., Chen L., et al. Exploring racial disparity in obesity: a mediation analysis considering geo-coded environmental factors. Spatial and Spatio-temporal Epidemiology, 21:13–23, 2017. pmid:28552184
View Article
PubMed/NCBI
Google Scholar

[59] View Article

[60] PubMed/NCBI

[61] Google Scholar

[ref20] 20. Yu Q., Medeiros K., Wu X., and Jensen R. Nonlinear predictive models for multiple mediation analysis: With an application to explore ethnic disparities in anxiety and depression among cancer survivors. Psychometrika, Apr 2018, 83(4):991–1006. pmid:29611093
View Article
PubMed/NCBI
Google Scholar

[63] View Article

[64] PubMed/NCBI

[65] Google Scholar

[ref21] 21. Yu Q., Wu X., Li B., and Scribner R. Multiple mediation analysis with survival outcomes: With an application to explore racial disparity in breast cancer survival. Statistics in Medicine, 38(3):398–412, sep 2018. pmid:30255567
View Article
PubMed/NCBI
Google Scholar

[67] View Article

[68] PubMed/NCBI

[69] Google Scholar

[ref22] 22. Yu Q. and Li B. A multivariate multiple third-variable effect analysis with an application to explore racial and ethnic disparities in obesity. Journal of Applied Statistics, 2020.
View Article
Google Scholar

[71] View Article

[72] Google Scholar

[ref23] 23. Raudenbush S. W. and Bryk A. S. Hierarchical Linear Models: Applications and Data Analysis Methods (2nd edition). Sage, Thousand Oaks, CA, 2002.

[ref24] 24. Snijders TAB and Bosker R. J. Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling. Sage, Thousand Oaks, CA, 1999.

[ref25] 25. Alker H. R. A Typology of Ecological Fallacies, In Quantitative Ecological Analysis in the Social Sciences. The MIT Press, Cambridge, MA, 1969.

[ref26] 26. Friedman J. H. and Stuetzle W. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical, 1981, 76:817–823.
View Article
Google Scholar

[77] View Article

[78] Google Scholar

[ref27] 27. Lang S., Umlauf N., Wechselberger P., Harttgen K., and Kneib T. Multilevel structured additive regression. Statistics and Computing, 2012, 24(2):223–238.
View Article
Google Scholar

[80] View Article

[81] Google Scholar

[ref28] 28. Brunauer W., Lang S., and Umlauf N. Modelling house prices using multilevel structured additive regression. Statistical Modelling, 2013, 13(2):95–123.
View Article
Google Scholar

[83] View Article

[84] Google Scholar

[ref29] 29. Busing F. Distribution characteristics of variance estimates in two-level models. Unpublish manuscript, 1993.

[ref30] 30. Gu C. and Ma P. Optimal smoothing in nonparametric mixed-effect models. The Annals of Statistics, 2005, 33(3):1357–1379.
View Article
Google Scholar

[87] View Article

[88] Google Scholar