Last month we explored how to model a simple relationship between two variables, such as the dependence of weight on height1. In the more realistic scenario of dependence on several variables, we can use multiple linear regression (MLR). Although MLR is similar to simple linear regression, the interpretation of MLR regression coefficients is confounded by the way in which the predictor variables relate to one another.

In simple linear regression1, we model how the mean of variable Y depends linearly on the value of a predictor variable X; this relationship is expressed as the conditional expectation E(Y|X) = β0 + β1X. For more than one predictor variable X1, ..., Xp, this becomes β0 + ΣβjXj. As in simple linear regression, one can use the least-squares estimator (LSE) to determine estimates bj of the βj regression parameters by minimizing the residual sum of squares, SSE = Σ(yi − ŷi)², where ŷi = b0 + Σjbjxij. Using the regression sum of squares, SSR = Σ(ŷi − Ȳ)², the ratio R2 = SSR/(SSR + SSE) is the fraction of variation explained by the regression model; in multiple regression it is called the coefficient of multiple determination.
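
For concreteness, the least-squares fit and the sums of squares above can be computed directly. The following is a minimal sketch in Python (not part of the original column), using arbitrary simulated data and NumPy's least-squares solver.

```python
import numpy as np

# Minimal sketch: LSE of a two-predictor model and the SSE, SSR and R^2
# defined in the text. The data here are arbitrary simulated placeholders.
rng = np.random.default_rng(0)
n = 40
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])   # design matrix with intercept column
b, *_ = np.linalg.lstsq(X, y, rcond=None)   # estimates (b0, b1, b2)
y_hat = X @ b

SSE = np.sum((y - y_hat) ** 2)              # residual sum of squares
SSR = np.sum((y_hat - y.mean()) ** 2)       # regression sum of squares
R2 = SSR / (SSR + SSE)                      # coefficient of multiple determination
print(b, R2)
```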

The slope βj is the change in the mean of Y when predictor Xj is increased by one unit and others are held constant. When the normality and independence assumptions are fulfilled, we can test whether any (or all) of the slopes are zero using a t-test (or regression F-test). Although the interpretation of βj seems identical to its interpretation in the simple linear regression model, the innocuous phrase “and others are held constant” turns out to have profound implications.

To illustrate MLR—and some of its perils—here we simulate predicting the weight (W, in kilograms) of adult males from their height (H, in centimeters) and their maximum jump height (J, in centimeters). We use a model similar to that presented in our previous column1, but we now include the effect of J by modeling the weight as W = βHH + βJJ + β0 + ε, with βH = 0.7, βJ = −0.08, β0 = −46.5 and normally distributed noise ε with zero mean and σ = 1 (Table 1). We set βJ negative because we expect a negative correlation between W and J when height is held constant (i.e., among men of the same height, lighter men will tend to jump higher). For this example we simulated a sample of size n = 40 with H and J normally distributed with means of 165 cm (σ = 3) and 50 cm (σ = 12.5), respectively.
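
The simulation can be sketched as follows (the random seed and the independent sampling of H and J here are our own choices; the column's actual sample will differ).

```python
import numpy as np

# Sketch of the simulated sample: n = 40 men, H ~ N(165, 3^2) cm,
# J ~ N(50, 12.5^2) cm, W = 0.7*H - 0.08*J - 46.5 + noise, noise ~ N(0, 1) kg.
rng = np.random.default_rng(1)   # arbitrary seed
n = 40
H = rng.normal(165, 3, n)        # height (cm)
J = rng.normal(50, 12.5, n)      # maximum jump height (cm), drawn independently of H here
W = 0.7 * H - 0.08 * J - 46.5 + rng.normal(0, 1, n)   # weight (kg)
```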

Table 1 Regression coefficients and R2 for different predictors and predictor correlations

Although the statistical theory for MLR seems similar to that for simple linear regression, the interpretation of the results is much more complex. Problems in interpretation arise entirely as a result of the sample correlation2 among the predictors. We do, in fact, expect a positive correlation between H and J—tall men will tend to jump higher than short ones. To illustrate how this correlation can affect the results, we generated weight values from the model using samples of H and J with differing amounts of correlation.

Let's look first at the regression coefficients estimated when the predictors are uncorrelated, r(H,J) = 0, as evidenced by the zero slope of the association between H and J (Fig. 1a). Here r is the Pearson correlation coefficient2. If we ignore the effect of J and regress W on H, we find Ŵ = 0.71H − 51.7 (R2 = 0.66) (Table 1 and Fig. 1b). Ignoring H, we find Ŵ = −0.088J + 69.3 (R2 = 0.19). If both predictors are fitted in the regression, we obtain Ŵ = 0.71H − 0.088J − 47.3 (R2 = 0.85). This regression fit is a plane in three dimensions (H, J, W) and is not shown in Figure 1. In all three cases, the results of the F-test for zero slopes show high significance (P ≤ 0.005).
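
These three fits can be reproduced in outline as below. This is a sketch on a freshly simulated sample, so the estimates will only approximate those in Table 1, and the sample correlation will only be near zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
H = rng.normal(165, 3, n)
J = rng.normal(50, 12.5, n)                 # drawn independently of H, so r(H,J) is near 0
W = 0.7 * H - 0.08 * J - 46.5 + rng.normal(0, 1, n)

def fit(y, *predictors):
    """Least-squares fit with intercept; returns (coefficients, R^2)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    R2 = 1 - np.sum((y - X @ b) ** 2) / np.sum((y - y.mean()) ** 2)
    return b, R2

print(fit(W, H))       # W on H alone
print(fit(W, J))       # W on J alone
print(fit(W, H, J))    # W on both H and J
```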

Figure 1: The results of multiple linear regression depend on the correlation of the predictors, as measured here by the Pearson correlation coefficient r (ref. 2).

(a) Simulated values of uncorrelated predictors, r(H,J) = 0. The thick gray line is the regression line, and thin gray lines show the 95% confidence interval of the fit. (b) Regression of weight (W) on height (H) and of weight on jump height (J) for uncorrelated predictors shown in a. Regression slopes are shown (bH = 0.71, bJ = −0.088). (c) Simulated values of correlated predictors, r(H,J) = 0.9. Regression and 95% confidence interval are denoted as in a. (d) Regression (red lines) using correlated predictors shown in c. Light red lines denote the 95% confidence interval. Notice that bJ = 0.097 is now positive. The regression line from b is shown in blue. In all graphs, horizontal and vertical dotted lines show average values.

When the sample correlation of the predictors is exactly zero, the regression slopes (bH and bJ) for the “one predictor at a time” regressions and the multiple regression are identical, and the R2 values of the simple regressions sum to the R2 of the multiple regression (0.66 + 0.19 = 0.85; Fig. 2). The intercept changes when we add a predictor with a nonzero mean in order to satisfy the constraint that the least-squares regression line passes through the sample means, which always holds when the regression model includes an intercept.
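
These exact equalities can be checked numerically. In the sketch below we force the sample correlation of the predictors to be exactly zero by removing from J its least-squares projection onto H (an illustrative device of ours, not part of the column's simulation).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
H = rng.normal(165, 3, n)
J0 = rng.normal(50, 12.5, n)
Hc = H - H.mean()
# Remove from J its projection onto centered H so that r(H,J) = 0 exactly.
J = J0 - Hc * np.dot(J0 - J0.mean(), Hc) / np.dot(Hc, Hc)
W = 0.7 * H - 0.08 * J - 46.5 + rng.normal(0, 1, n)

def fit(y, *predictors):
    X = np.column_stack([np.ones(len(y)), *predictors])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    R2 = 1 - np.sum((y - X @ b) ** 2) / np.sum((y - y.mean()) ** 2)
    return b, R2

(b_H, R2_H), (b_J, R2_J), (b_HJ, R2_HJ) = fit(W, H), fit(W, J), fit(W, H, J)
print(b_H[1], b_HJ[1])      # slope for H is identical in the simple and multiple fits
print(b_J[1], b_HJ[2])      # slope for J is identical in the simple and multiple fits
print(R2_H + R2_J, R2_HJ)   # the individual R^2 values sum to the joint R^2
```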

Figure 2: Results and interpretation of multiple regression change with the sample correlation of the predictors.

Shown are the regression coefficient estimates (bH, bJ, b0), the R2 and the significance of the test of whether each coefficient is zero, from 250 simulations at each value of the predictor sample correlation −1 < r(H,J) < 1, for the scenarios in which H, J or both H and J are fitted in the regression. Thick and thin black curves show the coefficient estimate median and the boundaries of the 10th–90th percentile range, respectively. Histograms show the fraction of estimated P values in different significance ranges, and correlation intervals are highlighted in red where >20% of the P values are >0.01. Actual regression coefficients (βH, βJ, β0) are marked on vertical axes. The decrease in significance for bJ when jump height is the only predictor and r(H,J) is moderate (red arrow) is due to insufficient statistical power (bJ is close to zero). When predictors are uncorrelated, r(H,J) = 0, the R2 values of the individual regressions sum to the R2 of the multiple regression (0.66 + 0.19 = 0.85). Panels are organized to correspond to Table 1, which shows estimates from a single trial at two different predictor correlations.

Balanced factorial experiments show a sample correlation of zero among the predictors when their levels have been fixed. For example, we might fix three heights and three jump heights and select two men representative of each combination, for a total of 18 subjects to be weighed. But if we select the samples and then measure the predictors and response, the predictors are unlikely to have zero correlation.
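
As a small illustration (the particular levels are our own), a balanced 3 × 3 design with two men per combination has exactly zero sample correlation between the fixed predictors.

```python
import numpy as np

# Balanced factorial sketch: three fixed heights x three fixed jump heights,
# two subjects per combination = 18 subjects. The levels are illustrative.
heights = [160, 165, 170]   # cm
jumps = [40, 50, 60]        # cm
combos = [(h, j) for h in heights for j in jumps for _ in range(2)]
H = np.array([h for h, _ in combos])
J = np.array([j for _, j in combos])
print(len(combos))              # 18 subjects
print(np.corrcoef(H, J)[0, 1])  # 0.0: a balanced design gives exactly zero sample correlation
```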

When we simulate highly correlated predictors, r(H,J) = 0.9 (Fig. 1c), we find that the regression parameters change depending on whether we use one or both predictors (Table 1 and Fig. 1d). If we consider only the effect of H, the coefficient βH = 0.7 is inaccurately estimated as bH = 0.44. If we include only J, we estimate βJ = −0.08 inaccurately, and even with the wrong sign (bJ = 0.097). When we use both predictors, the estimates are quite close to the actual coefficients (bH = 0.63, bJ = −0.056).
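
A sketch of this correlated scenario (the seed and the exact sample are our own, so individual estimates will vary around the values in Table 1):

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 40, 0.9
# Draw (H, J) from a bivariate normal with the stated means and SDs and correlation 0.9.
cov = [[3**2, r * 3 * 12.5],
       [r * 3 * 12.5, 12.5**2]]
H, J = rng.multivariate_normal([165, 50], cov, size=n).T
W = 0.7 * H - 0.08 * J - 46.5 + rng.normal(0, 1, n)

def fit(y, *predictors):
    X = np.column_stack([np.ones(len(y)), *predictors])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

print(fit(W, H)[1])      # bH from H alone: biased well below 0.7
print(fit(W, J)[1])      # bJ from J alone: typically positive, i.e., the wrong sign
print(fit(W, H, J)[1:])  # both predictors: close to the true (0.7, -0.08)
```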

In fact, as the correlation between predictors r(H,J) changes, the estimates of the slopes (bH, bJ) and intercept (b0) vary greatly when only one predictor is fitted. We show the effects of this variation for all values of predictor correlation (both positive and negative) across 250 trials at each value (Fig. 2). We include negative correlation because although J and H are likely to be positively correlated, other scenarios might use negatively correlated predictors (e.g., lung capacity and smoking habits). For example, if we include only H in the regression and ignore the effect of J, bH steadily decreases from about 1 to 0.35 as r(H,J) increases. Why is this? For a given height, larger values of J (an indicator of fitness) are associated with lower weight. If J and H are negatively correlated, as J increases, H decreases, and both changes result in a lower value of W. Conversely, as J decreases, H increases, and thus W increases. If we use only H as a predictor, J is lurking in the background, depressing W at low values of H and enhancing W at high levels of H, so that the effect of H is overestimated (bH increases). The opposite effect occurs when J and H are positively correlated. A similar effect occurs for bJ, which increases in magnitude (becomes more negative) when J and H are negatively correlated. Supplementary Figure 1 shows the effect of correlation when both regression coefficients are positive.
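
The drift of bH with predictor correlation can be reproduced with a small sweep, sketched below with our own seed and a reduced grid of correlation values.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_trials = 40, 250

def median_bH_alone(r):
    """Median slope of W on H alone over n_trials simulations at predictor correlation r."""
    slopes = []
    for _ in range(n_trials):
        cov = [[9.0, r * 3 * 12.5], [r * 3 * 12.5, 156.25]]
        H, J = rng.multivariate_normal([165, 50], cov, size=n).T
        W = 0.7 * H - 0.08 * J - 46.5 + rng.normal(0, 1, n)
        slopes.append(np.polyfit(H, W, 1)[0])   # slope of the simple regression of W on H
    return np.median(slopes)

for r in (-0.9, -0.5, 0.0, 0.5, 0.9):
    print(r, round(median_bH_alone(r), 2))      # the single-predictor slope shrinks as r increases
```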

When both predictors are fitted (Fig. 2), the regression coefficient estimates (bH, bJ, b0) are centered at the actual coefficients (βH, βJ, β0) with the correct sign and magnitude regardless of the correlation of the predictors. However, the standard error in the estimates steadily increases as the absolute value of the predictor correlation increases.
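
This growing spread can be seen empirically; in the sketch below, the standard deviation of bH across repeated simulations grows roughly as 1/sqrt(1 − r²) (the seed and grid of correlations are our own).

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_trials = 40, 250

def sd_of_bH(r):
    """Empirical spread of the two-predictor estimate bH at predictor correlation r."""
    estimates = []
    for _ in range(n_trials):
        cov = [[9.0, r * 3 * 12.5], [r * 3 * 12.5, 156.25]]
        H, J = rng.multivariate_normal([165, 50], cov, size=n).T
        W = 0.7 * H - 0.08 * J - 46.5 + rng.normal(0, 1, n)
        X = np.column_stack([np.ones(n), H, J])
        b, *_ = np.linalg.lstsq(X, W, rcond=None)
        estimates.append(b[1])
    return np.std(estimates)

for r in (0.0, 0.5, 0.9, 0.99):
    print(r, round(sd_of_bH(r), 3))   # spread of bH grows sharply as |r| approaches 1
```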

Neglecting important predictors has implications not only for R2, which is a measure of the predictive power of the regression, but also for interpretation of the regression coefficients. Unconsidered variables that may have a strong effect on the estimated regression coefficients are sometimes called 'lurking variables'. For example, muscle mass might be a lurking variable with a causal effect on both body weight and jump height. The results and interpretation of the regression will also change if other predictors are added.

Given that missing predictors can affect the regression, should we try to include as many predictors as possible? No, for three reasons. First, any correlation among predictors will increase the standard error of the estimated regression coefficients. Second, having more slope parameters in our model will reduce interpretability and cause problems with multiple testing. Third, the model may suffer from overfitting. As the number of predictors approaches the sample size, we begin fitting the model to the noise. As a result, we may seem to have a very good fit to the data but still make poor predictions.
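
A sketch of the overfitting problem: with almost as many pure-noise predictors as observations, the in-sample fit looks excellent while prediction on new data is poor (all sizes and variable names here are our own choices).

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 40, 35                                # number of predictors close to the sample size

X_train = rng.normal(size=(n, p))            # predictors carry no information about y
y_train = rng.normal(size=n)
X_test = rng.normal(size=(n, p))
y_test = rng.normal(size=n)

D_train = np.column_stack([np.ones(n), X_train])
b, *_ = np.linalg.lstsq(D_train, y_train, rcond=None)

def r2(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(r2(y_train, D_train @ b))                                  # high in-sample R2: the model fits the noise
print(r2(y_test, np.column_stack([np.ones(n), X_test]) @ b))     # typically negative: poor prediction on new data
```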

MLR is powerful for incorporating many predictors and for estimating the effects of a predictor on the response in the presence of other covariates. However, the estimated regression coefficients depend on the predictors in the model, and they can be quite variable when the predictors are correlated. Accurate prediction of the response is not an indication that regression slopes reflect the true relationship between the predictors and the response.