Skip to main content
Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Random intercept and linear mixed models including heteroscedasticity in a logarithmic scale: Correction terms and prediction in the original scale

  • Ricardo Ramírez-Aldana ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Supervision, Visualization, Writing – original draft, Writing – review & editing

    ricardoramirezaldana@gmail.com

    Affiliation Instituto Nacional de Geriatría, Ciudad de México, Mexico

  • Lizbeth Naranjo

    Roles Data curation, Formal analysis, Investigation, Software, Visualization, Writing – review & editing

    Affiliation Departamento de Matemáticas, Facultad de Ciencias, Universidad Nacional Autónoma de México, Ciudad de México, Mexico

Abstract

Random intercept models are linear mixed models (LMM) including error and intercept random effects. Sometimes heteroscedasticity is included and the response variable is transformed into a logarithmic scale, while inference is required in the original scale; thus, the response variable has a log-normal distribution. Hence, correction terms should be included to predict the response in the original scale. These terms multiply the exponentiated predicted response variable, which subestimates the real values. We derive the correction terms, simulations and real data about the income of elderly are presented to show the importance of using them to obtain more accurate predictions. Generalizations for any LMM are also presented.

Introduction

In economics and other scientific areas such as medicine, geology, and genetics; it is common to study linear models with a dependent variable defined in a logarithmic scale; for instance in studies related to income [1], health insurance [2], medical expenditures [3], health care utilization and earnings [4], or sediment discharge [5]. In the logarithmic scale, the variable can have an associated normal distribution, whereas in the original scale this is not true. In other words, dependent variables correspond to random variables with a log-normal distribution, a skewed distribution associated with variables taking only positive values, which has been extensively used in analyses for real data corresponding to stock prices, income (without higher-income individuals), time from infection to first symptoms, distribution of particles, number of words per sentence, age of marriage, size of living tissue, etc. There are instances in which presence of heteroscedasticidity can be solved considering such logarithmic scale; however, sometimes this issue is not solved even after the transformation, for instance when the variability is not proportional to the squared conditional mean response given values of the explanatory variables. Additionally, there are data in which nesting between observations is present, for instance, when observations belong to the same spatial cluster. In this case, independence between observations is not satisfied, since the values in a same cluster are correlated, and a random intercept model is preferred.

A random intercept model (RIM) in a logarithmic scale is a special type of linear mixed model (LMM) [6, 7], in which: (1) where i = 1, …, m, j = 1, …, ni, m is the number of clusters, ni is the number of observations in the ith cluster, log(Yij) is the response associated with the jth observation in the ith cluster, xij = (xij1, …, xijp)′ is a vector of dimension p associated with the jth observation in the ith cluster corresponding to the p fixed effects given in the regression parameters vector β = (β1, …, βp)′. Variable γi represents an intercept random effect associated with cluster i, which allows to model the relationship among observations for each cluster, it has a normal distribution additionally, γi, for i = 1, …, m, are independent and identically distributed (i.i.d.). The random error is ϵij, and since heteroscedasticidity is assumed in (1), i.i.d., where is a known number that allows different variability between observations and clusters. The terms are assumed as known; and for instance, they could be obtained using the unit size or the ELL method [8, 9]. Under this method, two linear models are fitted, a first model (beta) is the corresponding marginal model and a second one (alpha) is a model associated with transformed residuals obtained from the beta model (residuals obtained after deleting the effects associated with the random effects); then, an approximation of the terms can be used in the random intercept model. Finally, the random effects γi and the errors ϵij are independent.

In many cases, it is necessary to return to the original scale of Yij. Traditionally, this is done by simply applying an exponential function to the predicted values obtained from the model. However, this approach does not consider that the random terms involved in the model are transformed as well, and predictions are subestimated. In some cases, a generalized linear mixed model (GLMM) [10] with an associated distribution according to the data type could be used (e.g., gamma, Poisson, etc.). However, sometimes it is preferred to use the normal distribution in the logarithmic scale, when we know the dependent variable has a log-normal distribution (as far as we know, it is not one distribution included in programs that fit GLMM; and, transforming the dependent variable in a normal GLMM would be similar as what we are doing), or when other processes depend on such normal RIM. For instance, in small area estimation there are methods based on the RIM, e.g. the empirical best predictor (EBP) method [11], in which parameters are estimated from a RIM using the sample information; after that, the conditional distribution of the out-of-sample data given the sample data can be derived from the normal distribution assumption; the predicted values, simulations, and Monte Carlo approximations are used to estimate poverty measures at a small area level (for elements in or outside of the sample), and finally, a parametric bootstrap mean squared error (MSE) estimator is obtained based on the same RIM. Additionally, not all possible distributions are implemented in GLMM and a logarithmic transformation must be used in LMM. In this sense; recently, [12] proposed a model for data showing skewness at the log scale based on an extension of a distribution called generalized beta of the second kind, which can be seen as a random effects model designed for skewed response variables extending the usual log-normal-nested error model, they also found empirical best predictors for poverty measures in small areas.

Statistical models to correct the logarithmic transformation in linear regression models have been proposed by different authors, e.g. [5, 13, 14]; other authors have used Bayesian methods to deal with it, e.g. [15]. Some extensions to address heteroscedasticity in linear regression models with a logarithmic scale have also been proposed, e.g. [2, 16]. Other authors have compared the logarithmic transformation in linear models with other type of models in different applications, e.g. [4, 1721]. Moreover, others have studied the Box-Cox transformation in linear models, e.g. [3, 2224], being the logarithmic transformation a particular case. Finally, a Box-Cox transformation in LMM has been studied by [23].

In this paper are derived the correction terms that should be used to obtain more accurate predictions in a RIM with heteroscedasticity, when predictions are desired in the original scale. In economics, for instance, this allows a more accurate prediction of income or to improve predictions of measures depending on it, for instance poverty measures. These correction terms multiply the exponentiated predicted values obtained from the RIM, calculating the latter values (without correction) being the usual procedure. These terms are important since they allow to obtain more precise predictions, with a smaller MSE. Since the RIM contains two random terms, the random effect and the error term, two correction terms are obtained. When these correction terms are not included, subestimation occurs. Similar terms have been obtained for linear regression models, but, as far as the authors know, they have not been derived for RIM. These results are relevant, not only when a RIM is fitted in some data, but also when methodology is based on such models, for instance in small area estimation.

The motivation example considers a sample of aging individuals over 60 years old in Mexico, in which some household and socio-demographic measures and income are known. The information is available by state i = 1, …, m; for m = 32, the number of individuals by state is ni, Yij is the income by individual j = 1, …, ni and state i, and there are a total of observations. Assume we want to estimate the expected income for these individuals, E[Yij], according to the available explanatory variables. This process could be useful to estimate income for another set of similar individuals, in which income is not available but the other variables are, for imputation, or in a simulation process, as in small-area estimation which depends on simulating income in out-of-sample individuals. In the framework of linear models, there are several options for this estimation; in some of them, the distributional assumptions are better satisfied in a logarithmic scale, using log(Yij) as response, but the estimations are required on the original scale. A first option is to estimate the expected income, without a common random effect for state, simply using a linear regression in a logarithmic scale and using a correction term to estimate in the original scale. A second option is fitting a linear model on the log-transformed scale with random effects for state obtaining the exponentiated predicted values to estimate the income. A third option is to associate a gamma distribution to the income, a commonly used distribution for positive skewed data such as the income or costs [25], and fit a generalized linear mixed model including random effects for state. The fourth option we propose is to apply correction terms on the second option. We show here that this improves the precision of the estimated values. We apply the corrections terms to both the real data set and in simulated data to evaluate which of the different described options including random effects has a better performance, for the simulations we variate the number of clusters, observations by cluster, parameters associated with the variance of the random effects and error terms, and consider models under different distributions.

This paper is organized as follows. In the second section, we briefly present the correction term used in a linear regression model in a logarithmic scale, including the so-called smearing estimate. In the third section, we derive correction terms for a RIM with heteroscedasticity, and the corresponding correction terms for a RIM with homoscedasticity are obtained as a particular case. In the fourth section, we obtain simulations using a log-normal distribution to show that the MSE is minimized when the correction terms are used and in certain cases when gamma distribution simulations are used, comparing the estimations using our method with those derived from a generalized linear mixed model. Additionally, the real data corresponding to income in elderly people is analyzed to show the use of the correction terms and to compare predictions using different options to calculate them. In a fifth section, we propose a generalization for LMM and for transformations different from the logarithm. Finally, the conclusion is presented in the last section, and some of the linear algebra used for the calculations is presented as Supplementary Material.

Correction term associated with a linear regression model in a logarithmic scale

In our motivation example, assume that we estimate income in a logarithmic scale without considering a random effect for state. In this case, we are in the framework of a linear regression and the second index j is unnecessary, and thus, for simplicity we eliminate it in this section.

A linear regression model in a logarithmic scale, also called log-normal linear model, is defined as: (2) where Yi is the response variable for the ith observation, xi = (xi1, …, xip)′ is a vector of the p explanatory variables for the ith observation, β = (β1, …, βp)′ is a vector of dimension p of regression parameters, and ui is an error term, where uiN(0, σ2) i.i.d., for i = 1, …, n. In matrix notation, log(Y) = Xβ + u, where Y = (Y1, …, Yn)′, X = (x1, …, xn)′, and u = (u1, …, un)′.

From (2), the expected value of the response is However, since in the original scale, we have that the expected value is Omitting subindex i, we have that: Noticing that the integral in the last equality is equal to one, hence, (3) and so Therefore, the estimator of the predicted response is given by

Using the same reasoning and considering heteroscedasticity in (2), allowing to have different variability among subjects, i.e. , .

The last term can be estimated by replacing β with the least squares estimator , and σ2 with the biased (maximum likelihood, ML) or unbiased estimator, or respectively, where is the residual sum of squares. Observe that is the estimator of the predicted response associated with a log-normal distribution.

A modified non-parametric estimator can also be associated with model (2) by using the smearing estimate [3]. We assume uiF i.i.d., i = 1, …, n, where E[ui] = 0 and Var(ui) = σ2. Since F is not completely known, the empirical distribution, is used, where from (2) is the estimated value of ui, and the indicator function is equal to 1 if and 0 otherwise.

Assuming Y0 corresponds to an observed response with associated explanatory variables values x0, the predicted response is: Furthermore, substituting the regression parameter β by the estimates , the estimated predicted response is given by:

Correction terms in a RIM with heteroscedasticity and a logarithmic scale

In this section we derive the correction terms for a RIM with heteroscedasticity in a logarithmic scale. In terms of our motivation example, the process corresponds to estimate the expected income by fitting a model in a logarithmic scale adding a common random effect for state and using correction terms that allow more precise estimations in the original scale. First, a preliminar estimator is introduced. Second, an estimator based on the random effect best linear predictor is presented. Third, an estimator based on a conditional expectation is proposed. Finally, a correction term based on the smearing estimate is given.

A preliminar estimator

From the RIM model given in (1), equivalent to then, by using independence between γi and ϵij, the expectation of the response in the original scale is: (4) As in (3), the expectations of the exponentials of γi and ϵij are and , and, as a consequence, Therefore, using the corresponding estimators, (5) where and are variance estimators corresponding to the error and random effects terms, respectively, and is the fixed effects estimator, estimated by using ML or restricted ML estimator (REML) methods.

An estimator based on the random effect best linear predictor

The predicted values in (5) do not consider that the random effect γi can be estimated through the best linear predictor, (6) thus having a predictor for each ith observation, for i = 1, …, m. The vector of estimated random effects corresponds to .

Similarly, to obtain a better predictor associated with Yij, it is more adequate to use E[exp(γi)|log(Y)], instead of E[exp(γi)], in (4). Hence, the predictor is: (7)

A first approach to estimate E[exp(γi)|log(Y)] could be simply by using so the estimator (7) would be Note that, (8) is the predicted value corresponding to log(Yij) exponentiated to return to the original scale (naive estimator). This term is multiplied by a term associated with the error. Assuming heteroscedasticity, and the estimator (7) would be: (9)

Note that, according to the Jensen inequality, thus, subestimates E[exp(γi)|log(Y)]. Hence, a better prediction can be derived by directly obtaining E[exp(γi)|log(Y)].

An estimator based on E[exp(γi)|log(Y)]

In this subsection, we obtain a better predictor by computing directly the conditional expectation E[exp(γi)|log(Y)]. For this purpose, first we derived the conditional distribution of the random effect γi conditional to the transformed response for the sample, log(Y) = (log(Y1), …, log(Ym))′, which is a vector of dimension n, where is the sample size, with for i = 1, …, m. The random effect has an univariate distribution γiN(0, σ2), whereas log(Y) has a multivariate distribution log(Y) ∼ Nn(Xβ, V), where X is the design matrix of dimension n × p of fixed effects associated with the response, β is a vector of dimension p of regression parameters, and V is the variance and covariance matrix Var[log(Y)] with dimension n × n. The expected value of this conditional distribution corresponds to the predictor given in (6), whereas using properties concerning the distribution of conditioned multivariate normal random variables, it can be shown (see Proposition 1 in Supplementary Material) that the variance associated with the conditional distribution corresponds to (10) Thus, Using the result given in (3), corresponding to the expected value associated with a lognormal random variable, As a consequence, the predictor of the response in the original scale, is estimated considering heteroscedasticity and a predictor E[exp(γi)|log(Y)], for each i = 1, …, m, and corresponds to (11) From (11) and assuming (or wij = 1), which is a model with homoscedasticity in the error term, and the predictor corresponds to (12)

Observe how the predicted values given in (11) or (12) include the term , which is the naive estimator associated with Yij. This value is corrected according to two factors, one corresponding to the error and another to the random effect. In contrast, the predictor in (9) only considered the term associated with the error term.

Under heteroscedasticity, the predictor given in (9) subestimates the real value since the term E[exp(γi)|log(Y)], i = 1, …, m, is not used. However, it can be easier to calculate since the sum is not included. Once, E[exp(γi)|log(Y)] is calculated, the predictor is given in (11). As far as we know, this expected value had not been obtained before.

Observe that all estimators given in (9), (11), and (12) consider that a normal distribution is associated with the transformed data.

A correction term based on the smearing estimate

We saw in the second section that a smearing estimator [3] is a nonparametric statistic used to estimate the expected response on the untransformed scale after fitting a linear model on the transformed scale, thus being useful when the normality assumption is not satisfied. In this subsection, we used this type of estimator to obtain correction terms for the RIM. One variant of the estimators in a model considering homoscedasticity, Eq (12), is obtained by using a smearing estimate for the error term: (13) One variant, considering different variance in each ith cluster, and the corresponding smearing estimate, is:

Experimental results

In this section, the proposed correction terms for RIM with heteroscedasticity in a logarithmic scale are applied to simulation-based scenarios and to an income for elderly people real dataset.

Simulation-based experiment

A simulation-based experiment is conducted to analyse the correction terms proposed in this paper.

The goal of this simulation experiment is to demonstrate that the proposed approach implementation properly works, and, therefore, the real values are adequately recovered by the estimated ones. We generated one hundred datasets for different scenarios, the generated covariates and general structure are as follows.

A set of m clusters having ni observations each one, for i = 1, …, m, are simulated. For balanced designs m = {50, 100} and ni = {10, 20} ∀i. For unbalanced designs there are two scenarios, one with ni = {11, 12, …, 50} and m = 40, and another with ni = {11, 12, …, 90} and m = 80. The variables xil are randomly generated from a uniform distribution U(0, 1), for l = 1, …, p and i = 1, …, n, where p = 3 and . In order to include an intercept term, xi1 = 1. These values are the entries in the design matrix X of dimension n × p. The regression parameters vector is β = (0.8, 1.3, −0.7)′.

The intercept random effects γi, for i = 1, …, m, are generated from a normal distribution with mean 0 and variance , .

In order to include heteroscedasticity, fixed values were proposed for the weights wij. They have been deterministically assigned as wij = (i + 1)/10 + j/1000, for i = 1, …, m and j = 1, …, ni. The error terms vector ϵ is generated from a multivariate normal distribution Nn(0, R), where R = diag(Σ1, …, Σm), , and σ2 = {0.2, 0.4}. Note that, by using properties of the multivariate normal distribution, it is also possible to generate ϵ by the following way: first simulate ϵ* from a multivariate standard normal distribution Nn(0, In×n), or equivalently generate from a univariate standard normal distribution N(0, 1), then do ϵ = R1/2 ϵ* where R1/2 is such that R = R1/2 R1/2, or equivalently .

Finally, the response variable in the logarithmic scale is obtained from (1), this is, by substituting the simulated values in , thus, the response variable Yij is obtained as Yij = exp(log(Yij)).

Fig 1 shows one simulated dataset. These graphics show that the response variable log(Y) has a linear relation with Xβ (graphic in the left). In contrast, as sometimes occurs in practice, a logarithm transformation is needed on Y to get a linear relationship with the explanatory variables (graphic in the right).

Fig 2 shows the simulated response variable log(Y) and Y, compared with their estimated responses. The squared red dots represent the naive estimates without correction terms in (8), and the blue triangles represent the estimated values obtained by using the correction terms in (11). Note that in general the estimates by using the naive estimator are lower than the estimates obtained by using the proposed correction terms in (11), showing that the naive estimator subestimates the real values.

Multiple data sets were generated according to the specifications provided in the above paragraphs, and the model’s performance was analyzed by using the mean squared error (MSE), given by

Table 1 shows the means and standard deviations (sd) associated with the MSE for the one hundred datasets simulated for each scenario defined according to different values of m, ni, σ2, and . The MSE are computed by using different estimates in specific, first by using the naive estimator of (8) (column MSEnaive), and then by using the correction terms of (5), (9), (11), and (13) (columns MSE(5), MSE(9), MSE(11), and MSE(13) respectively). Finally, a GLMM with a gamma distribution and a logarithmic link is fitted (column MSEGamma), being this an alternative to model positive skewed variables avoiding fitting a transformed response in a LMM.

thumbnail
Table 1. Summary of the MSE for different values of m, ni, σ2, and for data simulated from a RIM in a logarithmic scale.

https://doi.org/10.1371/journal.pone.0249910.t001

From these simulation scenarios it is shown that, assuming heteroscedasticity, the best estimations, with the lowest MSE mean and sd, are in general those obtained by using the correction terms given in (11). Standard deviations are always larger in column MSE(9). Moreover, just in one case the means and sd’s in column MSEGamma are lower than others.

Another type of datasets were generated from a GLMM with a gamma distribution associated with the response Yij and logarithmic link. The values of the parameters are similar to the ones used in the previous simulation experiment concerning the RIM in a logarithmic scale, having analogous balanced and unbalanced designs with the same values of m, ni, xil, p, β, and ; and including the heteroscedasticity terms wij. The response variable of the GLMM with gamma distribution and logarithmic link is thus generated from where the probability density function of YGamma(shape = a, scale = s) is given by , a > 0, s > 0, and where . The shape parameter a depends on α, which was chosen as α = {1, 1.5, 5}. The purpose of simulating data based on a GLMM with gamma distribution and logarithmic link was to see how our approach worked even when the true distribution associated with the data was not Gaussian. However, our simulations are based on a model extensively used in positive skewed distributions, being this model an alternative to fitting a LMM on the transformed response. In fact, for some particular values assigned to the shape and scale parameters, the distribution associated with the data was similar as that observed for the LMM in the logarithmic scale.

Table 2 shows the means and standard deviations (sd) associated with the MSE for the one hundred datasets simulated for each scenario, each one defined according to different values of m, ni, α, and ; and assuming heteroscedasticity. From these scenarios, it is shown that when the parameter associated with shape α is much bigger than 1, the best estimations, those having the lowest MSE mean and sd, are in general those obtained by using the correction terms given in (11). Hence, in this case, the estimations by using the RIM in a logarithmic scale and the corrections terms are good, even better than those obtained using a GLMM with a gamma distribution and logarithmic link. However, when α is close to 1 the estimations obtained by using the RIM in a logarithmic scale are worst, which makes sense, since a gamma distribution with parameter α = 1 is an exponential distribution, which completely differs from a log-normal distribution.

thumbnail
Table 2. Summary of the MSE for different values of m, ni, α, and for data simulated from a GLMM with gamma distribution and logarithmic link.

https://doi.org/10.1371/journal.pone.0249910.t002

Income for elderly people data application

Returning to our motivation example, we performed analyses based on the National Household Income and Expenditure Survey (Encuesta Nacional de Ingresos y Gastos de los Hogares, ENIGH) 2016 [26], a biennial study to examine income and its distribution in Mexico. Elderly people were considered (60 or more years old). Quaterly total income, that is the income considering all possible sources of income, was obtained for each person as a response variable. Household and sociodemographic information was considered as well. To avoid presence of outliers, only people with an income between 2,000 and 40,000 Mexican pesos were considered. Hence, a total of n = 18, 512 participants were included in the analyses.

As already mentioned in the Introduction section, a logarithmic scale was used for the response variable. To help deciding which variables to use as explanatory, we first fitted linear regression models. According to the obtained results, some variables were modified (categories collapsed) or generated using information from other questions. The final linear model in which we are based upon has a coefficient of determination of 0.35. The sociodemographic explanatory variables included in the RIM are: sex, indigeneous (1 = Yes, 2 = No), knowing how to read and write a note (1 = Yes, 2 = No), level of education (0 = None to 9 = Ph.D.), marital status (0 = Without a partner, 1 = With a partner), having a health service provider (1 = Yes, 2 = No), work (1 = Looking for a job, 2 = Retired, 3 = Domestic chores, 4 = Other situation, 5 = Can not work, 6 = Working), disability (0 = Without, 1 = With), and contribution to social security in all their lives (1 = Yes, 2 = No). At a household level, explanatory variables are: number of rooms, presence of wc (1 = Yes, 2 = No), number of light bulbs, household ownership (1 = Rented, 2 = Borrowed, 3 = Owner but paying it, 4 = Owner, 5 = Intestated, 6 = Another situation), number of residents, type of the location where the household is in (0 = Rural, 1 = Urban, a location is considered as urban when its size is of 2,500 or more residents), socioeconomic stratum (1 = Low, 2 = Low medium, 3 = High medium, 4 = High), and flooring material (1 = Ground, 2 = Cement, 3 = Wood, mosaic, or another floor recovering).

Since individuals are nested in each of the 32 states, an intercept random effect for state was included, each state having between 400 and 1000 observations. The parameter (fixed effects) estimations associated with the RIM model with homoscedasticity in the error term are shown in Table 3. The estimated standard deviation associated with the random effect, is approximately 0.08, and the corresponding value associated with the error term, is approximately 0.6. A likelihood ratio test comparing the RIM model with a model without the random effect, i.e. was obtained, with an associated p-value of less than 0.05 (this number when divided by two is even smaller, a calculation that must be made since the hypothesis involves a value in the frontier of the parametral space). Hence, a random effect is necessary and a linear regression model (without random effects) should not be fitted, which we defined as a first option to possibly use for this data in the Introduction section.

thumbnail
Table 3. Parameter estimations for the RIM associated with income in a logarithmic scale for elderly people data in 2016.

https://doi.org/10.1371/journal.pone.0249910.t003

Fig 3 shows the histogram and qq-plot associated with the residuals. They are indicative that the normality assumption is satisfied, the same being true when the random effects qq-plot is examined.

thumbnail
Fig 3. Residuals.

Left: Histogram of the residuals. Middle: qq-plot of the residuals. Right: qq-plot of the residuals of the random effects.

https://doi.org/10.1371/journal.pone.0249910.g003

Fig 4 shows the fitted values for the RIM associated with income in a logarithmic scale for the elderly people data in 2016. The squared red dots represent the naive estimates without correction terms, and blue triangles represent the estimated values by using the correction terms in (12), a particular case of (11), and which, according to the simulation results, are the best estimations (with lowest MSE). Note that the estimates derived through the naive estimator are in general lower than those derived through the proposed correction terms in (12), showing that the naive estimator subestimates the data. In terms of the options discussed in the Introduction section, the naive estimates and those including the correction terms corresponded to the second and fourth, respectively. When the naive estimator is obtained and compared with the true values, the squared root of the mean squared error is 7027.784, whereas using the correction factor given in (12), the squared root of the mean squared error is 6829.003, which is an improvement. In terms of the third option discussed in the Introduction, we fitted a GLMM using a gamma distribution for the response variable, a logarithmic link function, and both a penalised quasi-likelihood (PQL) and Laplace approximation methods, we checked that the normality assumption in the estimated random effects is satisfied. We obtained values for the squared root of the mean squared error of 6973.41 and 6979.769 under the PQL and Laplace methods, respectively. Hence, in this example the estimates under the correction term are even more precise that those obtained using a GLMM. We fitted models considering some heteroscedasticity schemes, for instance using the ELL method, the cluster size, or the squared residuals, but only with the former method we obtained an inferior mean squared error than under the homoscedasticity scheme; however, the normality assumption was not satisfied as well.

thumbnail
Fig 4. Fitted values for the RIM associated with income in a logarithmic scale for elderly people data in 2016.

Squared red: naive estimates. Blue triangles: estimates by using the correction terms in (12).

https://doi.org/10.1371/journal.pone.0249910.g004

Mimic simulation example

Validating our proposed correction terms for RIM including heteroscedasticity in a logarithm scale, we did a simulation experiment based on 100 data sets of size 2000. The simulated data approximately mimic the motivating data of the income for elderly people, assuming two types of weights associated with heteroscedasticity: the cluster size and one of the explanatory variables, as sometimes is found in real data. Details are given in the S2 Text.

Our simulation strategy generated the means and standard deviations of the MSE for each one of the corrections terms considering the two types of weights and varying values associated with the variance of the random effects and error terms, see Table 4. The estimations with the lowest MSE corresponded to those obtained using the correction terms associated with Eqs (9) and (11). See details and a table including more values in the Supplementary Material.

thumbnail
Table 4. Summary of the MSE for the mimic simulation example.

https://doi.org/10.1371/journal.pone.0249910.t004

Generalization to linear mixed models and with functions different from the logarithm

In this section, we generalize the correction terms for any LMM and for transformations different from the logarithm. We have seen that the estimators based on the conditional expectancy associated with the random effects given the transformed response have a better performance; thus, we present only this type of estimator for LMM, obtaining a closed formula. A LMM includes q random effects; for instance, we can have random effects associated with some or all the fixed effects. In its matrix form, a LMM corresponds to where X is the design matrix associated with the fixed effects of dimension n × p and β is the corresponding vector of parameters of dimension p. On the other hand, γ = (γ1, …, γm) is a vector of dimension mq of random effects, where γi a vector of dimension q corresponding to all random effects associated with a cluster i, with distribution γNmq(0, G), with G a diagonal matrix of dimension mq × mq, G = diag(D, D, …, D), where D is the variance and covariance matrix of dimension q × q associated with the random effects, which is assumed to be the same for all clusters. This term is multiplied by the matrix U, a block diagonal matrix of dimension n × mq given by U = diag(U1, U2, …, Um), with Ui of dimension ni × q. The vector of errors has distribution ϵNn(0, R), where R is a block diagonal matrix of dimension n × n given by R = diag(Σ1, Σ2, …, Σm), with Σi a diagonal matrix of dimension ni × ni given by . The error terms and random effects are assumed independent. Considering an individual j in a cluster i; i = 1, …, m and j = 1, …, ni, the expression analogous to (1) associated with a LMM is: (14) where uij is the jth row corresponding to matrix Ui.

From the joint distribution of the random effects γ and transformed response log(Y), we obtain (see Proposition 2 in S1 File) that the variance and covariance matrix associated with cluster i, Var(γi|log(Y)), for i = 1, …, m, is (15) and where is the best linear predictor of γi, . Consequently, and using the expected value corresponding to a log-normal distribution: (16)

Thus, to estimate E[Yij] in a cluster i; i = 1, …, m, for an individual j; j = 1, …, ni, where Yij is modeled as in (14), we use the estimator for the random error ϵij, multiplied by the expected value associated with the random effects conditional to the response calculated in (16), and the constant part . The estimator corresponds to: (17)

In (17), all terms are known once substituting the estimated variance and covariance terms for the random effects in D and in Σi, both D and Σi part of Var(γi|log(Y)). These terms and obtained after fitting the model.

For instance, consider a model including random effects associated with the intercept and a variable u. For each cluster i = 1, …, m, γi = (γi1, γi2)′, with γi1 and γi2 scalars corresponding to the random effects for the intercept and variable u, respectively. The values associated with variable u in cluster i can be accommodated in a vectorial form as , thus Ui is a matrix of dimension ni × 2 such that , where corresponds to the intercept. Finally, (18) where and correspond to the variances associated with the random effects for the intercept and variable u, respectively, and is the corresponding covariance. It is easy to derive that in this case (15) corresponds to with = and D given in (18). This equation can be substituted in expression (17) using estimations of , , and , values obtained after fitting the LMM in any statistical software.

We could consider a transformation more general than a logarithm, for instance a Box-Cox transformation g, whose inverse follows a power-normal distribution. Each observation Yij; for j = 1, …, ni and i = 1, …, m, associated with a MLM under a Box-Cox transformation with parameter λ, g(Yij), satisfies that with and . The expected value E[X] of a power-normal distribution, in this case , is calculated in [27] (Lemma 1). After considering the estimated parameters, this expression corresponds to one class of corrected predictions in the original scale, that without conditioning the random effects to the sample. For instance, for λ = 0, the expected value given in [27] is , corresponding to Eq (5) when only one random effect is used.

For an invertible function g(⋅), and considering estimators based on conditioning on the sample, as in (11) for a RIM or (17) for any LMM, a simulation can be used. Assuming that in the transformed scale all normality assumptions are satisfied, we can apply similar results as when a MLM and logarithm transformation were considered, and (19) where Var[γi|g(Y)] corresponds to (15). The expected value of the response in the original scale in a cluster i for an individual j can be approximated with simulations by generating a set of random numbers zl, for l = 1, …, L, according to the distribution given in (19), and obtaining: using or E[ϵij] instead of ϵij, the expected value E[ϵij] could be obtained by simulating the distribution of ϵij.

Conclusion

The correction terms we proposed for a RIM with or without heteroscedasticity with response in a logarithmic scale enable more precise predictions. This is useful since responses in a logarithmic scale are commonly used, specially in financial and poverty analyses, and with our procedure, we can obtain more precise predictions of an economic measure in a population or better simulations of the distribution of the response, or an associated measure, for a new population (by simulating the error term and random effects and using the values of the explanatory variables). As the simulations assuming log-normal distributions and real data show, the best predictions, with lowest MSE, correspond to those including two correction terms, one for the errors and another for the random effects. These correction terms are easy to calculate and implement without the need of special software.

Even though in a GLMM, a distribution different from the normal can be used, it is sometimes desired simply to work in a logarithmic scale when the normal behaviour under this transformation is properly satisfied; or in other words, when a lognormal distribution adequately fits some data. Besides, through simulations with gamma distributions, a commonly used distribution used to model income or similar variables, we showed that the predictions using the two correction terms are more precise than those obtained through a GLMM with a gamma distribution, as long as the parameter α associated with shape, in the gamma distribution is not close to one. And, even when the parameter is one, corresponding to an exponential distribution, as the number of clusters and observations in each cluster increase, the estimations obtained using the correction terms are close to those obtained with the GLMM and a gamma distribution (being in general better the ones using the smearing estimate, specially for lower values associated with the variance of the random effect, and viceversa), and better that those obtained through another correction method or without correction terms. On the other hand, in other type of analyses, as in some small area estimation techniques, it is desirable to preserve a normal distribution since the fit of a RIM is just one first step in a set of processes, all assuming normality; hence, assuming another distribution would change the complete technique; and, without the correction, the estimated poverty measures or any measure associated with a small area might be incorrect. The weights we considered for heteroscedasticity were of the form ; however, a more general form can be used by substituting for in all formulas. If the variance structure is estimated using a function, for instance an exponential variance structure, we estimate the LMM including this structure. Thus, the parameters for the structure are estimated with the fixed and random effects parameters. Any inference should be performed being careful that the degrees of freedom are corrected or appropriate corrections applied, particularly for small sample sizes [28] and non-linear covariance structures [29]. For the predictions in the original scale, the terms can be calculated using the estimated parameters corresponding to the variance structure and then using our formulas. Any further inference should be taken with care considering the variance structure was estimated. In fact, assuming any correlation structure associated with the error for each cluster, i.e. assuming that the matrix Σi is not necessarily diagonal (however, the correlation structure between clusters is still assumed diagonal), for instance when time is involved, Eqs (15) and (16) still hold true, and formula (17) might be used modifying the third term accordingly, though care should be taken if any inference is required.

We also generalized the procedure considering any LMM, being RIM a particular case, and outlined the process that could be followed when a function different from the logarithm is used, though it seems that approximations should be used in this general case. Future work could be to continue working with transformations different from the logarithm to see if better predictions with closed formulas can be obtained. An exact variance estimator of the predicted values is also something desirable, though it seems, from some preliminary calculations, that a closed formula cannot be obtained; however, a better approximation than one using only simulations might be possible. We are working in the implementation of the correction terms in two-part models and their variants, for instance for health expenditure data in which there is concentration in the zero value since some people do not spend money, to see whether our correction terms allow to obtain better predictions as some preliminary analyses have shown.

Supporting information

S1 File. Proposition 1 and Proposition 2 concerning the calculations to obtain Var(γi|log(Y)) and Var(γi|log(Y)) for a RIM and LMM, respectively, and details about the mimic simulation example.

https://doi.org/10.1371/journal.pone.0249910.s001

(PDF)

S2 File. R code associated with the non-simulated data analyzed in the manuscript.

https://doi.org/10.1371/journal.pone.0249910.s002

(R)

S3 File. Source R code that allow to replicate the analysis for the mimic simulation example.

https://doi.org/10.1371/journal.pone.0249910.s003

(R)

S1 Data. Data that allow to replicate the analysis for the mimic simulation example.

https://doi.org/10.1371/journal.pone.0249910.s004

(CSV)

S1 Text. Instructions that allow to replicate the analysis for the mimic simulation example.

https://doi.org/10.1371/journal.pone.0249910.s005

(TXT)

Acknowledgments

We thank to Dra. Mónica Tinajero Bravo from Consejo Nacional de Evaluación de la Política de Desarrollo Social (CONEVAL), Mexico, for her critical remarks and interesting discussion on the proposal of this paper.

References

  1. 1. Ermini L, Hendry DF. Log Income vs. Linear Income: An Application of the Encompassing Principle. Oxford Bulletin of Economics and Statistics. 2008;70(s1):807–827.
  2. 2. Manning WG. The logged dependent variable, heteroscedasticity, and the retransformation problem. Journal of Health Economics. 1998;17(3):283–295. pmid:10180919
  3. 3. Duan N. Smearing Estimate: A Nonparametric Retransformation Method. Journal of the American Statistical Association. 1983;78(383):605–610.
  4. 4. Manning WG, Mullahy J. Estimating log models: to transform or not to transform? Journal of Health Economics. 2001;20(4):461–494. pmid:11469231
  5. 5. Shen H, Zhu Z. Efficient mean estimation in log-normal linear models. Journal of Statistical Planning and Inference. 2008;138(3):552–567.
  6. 6. West BT, Welch K, Galecki AT. Linear mixed models: a practical guide using statistical software. 2nd ed. Boca Raton, Florida: Chapman and Hall CRC; 2015.
  7. 7. Demidenko E. Mixed models: Theory and Applications. Hoboken, New Jersey: John Wiley and Sons; 2004.
  8. 8. Elbers C, Lanjouw JO, Lanjouw P. Micro-level estimation of Welfare. Washington, DC: Poverty Group, Development Research Group, The World Bank; 2002. 2911.
  9. 9. Elbers C, Lanjouw JO, Lanjouw P. Micro-Level Estimation of Poverty and Inequality. Econometrica. 2003;71(1):355–364.
  10. 10. McCulloch CE, Searle SR. Generalized linear and mixed models. New York: Wiley-Interscience; 2001.
  11. 11. Molina I, Rao JNK. Small area estimation of poverty indicators. The Canadian Journal of Statistics / La Revue Canadienne de Statistique. 2010;38(3):369–385.
  12. 12. Graf M, Marín JM, Molina I. A generalized mixed model for skewed distributions applied to small area estimation. TEST. 2019;28(2):565–597.
  13. 13. Beauchamp JJ, Olson JS. Corrections for bias in regression estimates after logarithmic transformation. Ecology. 1973;54(6):1403–1407.
  14. 14. Newman MC. Regression analysis of log-transformed data: statistical bias and its correction. Environmental Toxicology and Chemistry. 1993;12:1129–1133.
  15. 15. Stow CA, Reckhow KH, Qian SS. A Bayesian Approach to Retransformation Bias in Transformed Regression. Ecology. 2006;87(6):1472–1477. pmid:16869423
  16. 16. Gustavsson SM, Johannesson S, Sallsten G, Andersson EM. Linear maximum likelihood regression analysis for untransformed log-normally distributed data. Open Journal of Statistics. 2012;2:389–400.
  17. 17. Lee CY. Comparison of two correction methods for the bias due to the logarithmic transformation in the estimation of biomass. Canada Journal of Forest Research. 1982;12(2):326–331.
  18. 18. Duan N, Manning WG, Morris CN, Newhouse JP. A Comparison of Alternative Models for the Demand for Medical Care. Journal of Business and Economic Statistics. 1983;1(2):115–126.
  19. 19. Jansson M. A comparison of detransformed logarithmic regressions and power function regressions. Geografiska Annaler Series A, Physical Geography. 1985;67(1/2):61–70.
  20. 20. Griswold M, Parmigiani G, Potosky A, Lipscomb J. Analyzing health care costs: a comparison of statistical methods motivated by Medicare colorectal cancer charges. Biostatistics. 2004;1(1):1–23.
  21. 21. Gregori D, Petrinco M, Bo S, Deideri A, Merletti F, Pagano E. Regression models for analyzing costs and their determinants in health care: an introductory review. International Journal for Quality in Health Care. 2011;23(3):331–341. pmid:21504959
  22. 22. Taylor JMG. The retransformed mean after a fitted power transformation. Journal of the American Statistical Association. 1986;81(393):114–118.
  23. 23. Gurka MJ, Edwards LJ, Muller KE, Kupper LL. Extending the Box-Cox Transformation to the Linear Mixed Model. Journal of the Royal Statistical Society Series A (Statistics in Society). 2006;169(2):273–288.
  24. 24. Fitzmaurice GM, Lipsitz SR, Parzen M. Approximate Median Regression via the Box-Cox Transformation. The American Statistician. 2007;61(3):233–238.
  25. 25. Malehi AS, Pourmotahari F, Angali KA. Statistical models for the analysis of skewed healthcare cost data: a simulation study. Health Economics Review. 2015;5(1):11. pmid:26029491
  26. 26. Instituto Nacional de Estadística y Geografía (INEGI). Encuesta Nacional de Ingresos y Gastos de los Hogares 2016, ENIGH. Mexico: Instituto Nacional de Estadística y Geografía; 2016. Available from: https://www.inegi.org.mx/programas/enigh/nc/2016/.
  27. 27. Freeman J, Modarres R. Inverse Box-Cox: The power-normal distribution. Statistics & Probability Letters. 2006;76:764–772.
  28. 28. Kenward MG. Small sample inference for fixed effects from restricted maximum likelihood. Biometrics. 1997;53:983–997. pmid:9333350
  29. 29. Kenward MG, Roger JH. An improved approximation to the precision of fixed effects from restricted maximum likelihood. Computational Statistics and Data Analysis. 2009;53:2583–2595.