Full Length Article
Diagnosing harmful collinearity in moderated regressions: A roadmap

https://doi.org/10.1016/j.ijresmar.2015.08.004

Highlights

  • Condition numbers do not accurately identify when collinearity is actually harmful to statistical inferences.

  • The variance inflation factor (VIF), constructed from squared correlations, may misdiagnose the extent of collinearity problems.

  • VIF is a confounded measure: it conflates lack of magnitude and/or lack of variability of regressors with collinearity.

  • C2 can indicate the adverse effects of collinearity in terms of distorting statistical inferences.

  • C2 can identify how much collinearity needs to disappear to generate significant results.

Abstract

Collinearity is inevitable in moderated regression models. Marketing scholars use a variety of collinearity diagnostics, including variance inflation factors (VIFs) and condition indices, in order to diagnose the extent of collinearity in moderated models. In this paper, we show that VIFs are likely to misdiagnose the extent of collinearity problems by conflating lack of variability (small variance) or lack of magnitude (small mean) in data with collinearity. Condition indices accurately diagnose collinearity; however, they fail to identify when collinearity is actually harmful to statistical inferences. We propose a new measure, C2, based on raw data, which diagnoses the extent of collinearity in moderated regression models. More importantly, this C2 measure, in conjunction with the t-statistic of the non-significant coefficient, can indicate adverse effects of collinearity in terms of distorting statistical inferences and how much collinearity would have to disappear to generate significant results. The efficacy of C2 over VIFs and condition indices is demonstrated using simulated data, and its usefulness in moderated regressions is illustrated in an empirical study of brand extensions.

Introduction

Moderated regression models are ideal for testing a contingency hypothesis that the effect of one independent variable, say U, is moderated by a second independent variable, say V, by adding a cross-product term, U × V, as an additional explanatory variable (Irwin and McClelland 2001). Moderated regressions, which have become a method of choice for marketing scholars to test multiplicative interactions, have been used in a variety of empirical settings including business-to-business (Fang, Palmatier, Scheer, & Li, 2008), consumer behavior (van Doorn & Verhoef, 2011), international marketing (Leenders & Eliashberg, 2011), product development (Cui & O'Connor, 2012), and retailing (Srinivasan & Moorman, 2005). Owing to the potential for strong linear dependencies among the regressors, U and V, and the interaction term, U × V, researchers fear the existence of high levels of collinearity that may lead to flawed statistical inferences. Indeed, a review of influential marketing journals, including Journal of Marketing, Journal of Marketing Research, Marketing Science, and International Journal of Research in Marketing, for the period 2005–2015 shows that 83 papers that used moderated regressions expressed concerns about collinearity.

The challenges faced by marketing researchers when confronted with collinearity issues can be illustrated using an example of brand extensions. The literature suggests that consumer attitudes toward new brand extensions (Attitude) are determined by quality perceptions of the parent brand (Quality), perceptions of how easily capabilities can be transferred between the parent and extension product class (Transfer), and the interaction between Quality and Transfer (c.f. Aaker & Keller, 1990). In order to test whether these relationships are significantly different from zero, the researcher fits the following moderated regression model:

Attitude = δ0 + δ1·Quality + δ2·Transfer + δ3·(Quality × Transfer) + ε.

Given that the multiplicative interaction term, Quality × Transfer, is constructed from the regressors, Quality and Transfer, the first question that confronts the researcher is whether the data are indeed plagued by collinearity and if so, the nature and severity of the collinearity. The researcher surveys the extant marketing literature and finds that the two rules of thumb – values of variance inflation factors (VIFs), which are based upon correlations between the independent variables, in excess of 10, and values of condition indices in excess of 30 – are predominantly used to judge the existence and strength of collinearity.

Low correlations or low values of VIFs (less than 10) are considered to be indicative that collinearity problems are negligible or non-existent (c.f. Marquardt, 1970). However, VIF is constructed from squared correlations, VIF = 1/(1 − R2), and because correlations are not faithful indicators of collinearity, using traditional rules of thumb for VIF values may lead to misdiagnosis of collinearity problems. Unlike VIFs, a high condition index (> 30) does indicate the presence of collinearity. However, the condition index by itself does not shed light on the root causes, i.e. the offending variables, of the underlying linear dependencies. In such cases wherein collinearity is diagnosed as high by the condition index, it is good practice to examine the variance decomposition proportions (values greater than 0.50 in any row indicate linear dependencies) to identify the specific variables that contributed to the collinearity present in the data (Belsley, Kuh, & Welsch, 1980).
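To make these rules of thumb concrete, here is a minimal sketch that computes both diagnostics for a simulated moderated-regression design. The helper names `vif` and `condition_index` are ours, not from the paper; VIF is computed from mean-centered data, as is conventional, and condition indices come from the singular values of the column-equilibrated raw design matrix, following Belsley, Kuh, and Welsch.

```python
import numpy as np

def vif(X):
    """VIF for each column of X: 1/(1 - R^2) from regressing that
    column on the remaining columns, using mean-centered data."""
    Xc = X - X.mean(axis=0)  # VIF is conventionally based on centered data
    vifs = []
    for j in range(Xc.shape[1]):
        y = Xc[:, j]
        others = np.delete(Xc, j, axis=1)
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - (resid @ resid) / (y @ y)
        vifs.append(1 / (1 - r2))
    return np.array(vifs)

def condition_index(X):
    """Condition indices: ratio of the largest singular value to each
    singular value of the column-scaled (unit-length) design matrix."""
    Xs = X / np.linalg.norm(X, axis=0)
    s = np.linalg.svd(Xs, compute_uv=False)
    return s.max() / s

rng = np.random.default_rng(0)
U = rng.normal(5.0, 1.0, 200)  # ratio-scaled regressor with a large mean
V = rng.normal(5.0, 1.0, 200)
X = np.column_stack([np.ones(200), U, V, U * V])
print(vif(X[:, 1:]))       # VIFs for U, V, and U*V
print(condition_index(X))  # values > 30 signal strong collinearity
```

With ratio-scaled data like this, the interaction column typically shows both a large VIF and a large condition index, which is exactly the setting the paper examines.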

Returning to the brand extension example, let us suppose that the t-statistic for the estimate of the interaction coefficient δ3 is found to be 1.60, considerably below the critical value 1.96. In such a situation, the researcher faces the second critical question: did collinearity adversely affect the interaction coefficient in terms of statistical significance? While variance decomposition metrics help identify the specific variables underlying the potential near-linear dependencies, they do not inform whether collinearity adversely affected the significances of the variables. In other words, none of the current collinearity metrics, including VIFs, condition indices, and variance decomposition proportions, sheds any light on whether collinearity is the culprit that caused the non-significance of the interaction coefficient.

If there were a way to confirm collinearity as the culprit behind the interaction variable's non-significance, the researcher is confronted with a third question: if collinearity in the data could be reduced in some meaningful way through the collection of new or additional data, would the measured t-statistic increase enough to be statistically significant? Alternatively, suppose that the data on Quality and Transfer were constructed from a well-balanced experimental design rather than from a survey: would this experiment reduce collinearity sufficiently to move the interaction effect to statistical significance? An answer to this question would enable the researcher to truly ascertain whether new data collection is needed or whether she should focus her efforts elsewhere to identify the reasons for the insignificant results. Unfortunately, existing metrics for collinearity diagnosis, including correlations, VIF, the condition index, and variance decomposition proportions, do not provide any insight into this issue.

In this paper, we propose a new measure of collinearity, C2, that reflects the quality of the data and remedies the above-mentioned problems. C2 not only helps diagnose the existence of collinearity but also indicates whether collinearity was the reason behind non-significant effects. More importantly, C2 can also indicate whether a non-significant effect would have become significant had the collinearity in the data been reduced, and if so, how much collinearity must be reduced to achieve this significant result.
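If, holding the other determinants of the t-statistic fixed (sample size, effect size, residual standard error, and data magnitude), the t-statistic scales with sqrt(1 − C2), then one can back out how far C2 would have to fall for a non-significant coefficient to clear the critical value. The sketch below rests on that scaling assumption; the helper name `required_c2` and the illustrative observed C2 of 0.90 are ours, not values from the paper.

```python
import math

def required_c2(t_observed, c2_observed, t_critical=1.96):
    """Largest C^2 at which the coefficient would reach significance,
    assuming t scales with sqrt(1 - C^2) while all other data
    characteristics (N, effect size, s, magnitude) stay fixed."""
    scale = (t_critical / t_observed) ** 2
    return 1 - scale * (1 - c2_observed)  # negative => no reduction suffices

# Brand-extension illustration: t = 1.60 with a hypothetical C^2 = 0.90.
# Collinearity would need to drop from 0.90 to about 0.85 for significance.
print(required_c2(1.60, 0.90))
```

A negative return value means that even removing all collinearity would leave the coefficient non-significant, so the researcher should look elsewhere for the cause.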

Section snippets

Collinearity in moderated regression

Consider a moderated variable regression

Y = α0·1 + α1·U + α2·V + α3·(U × V) + ν,

where U and V are ratio-scaled explanatory variables in N-dimensional data vectors and 1 is the N-vector of ones. Correlations refer to the linear co-variability of two variables around their means.
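A small simulation makes the point concrete: even when U and V are statistically independent, the raw product term U × V is strongly linearly related to its components whenever the variables have non-zero means. The coefficient values below are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
U = rng.normal(4.0, 1.0, N)  # independent ratio-scaled regressors
V = rng.normal(4.0, 1.0, N)
Y = 1.0 + 0.5 * U + 0.5 * V + 0.25 * U * V + rng.normal(0.0, 1.0, N)

# OLS fit of Y = a0 + a1*U + a2*V + a3*(U*V) + error
X = np.column_stack([np.ones(N), U, V, U * V])
a, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Even for independent U and V, the interaction correlates strongly
# with each component because of their non-zero means.
print(np.corrcoef(U * V, U)[0, 1])
print(a)  # estimates of (a0, a1, a2, a3)
```

Despite the strong linear dependence, OLS still recovers the coefficients here; whether the accompanying inflation of standard errors is harmful is exactly the question the C2 diagnostic addresses.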

Developing a new collinearity metric: C2

We propose a new collinearity metric for moderated regression models that is derived from the non-centered coefficient of determination and satisfies five major criteria: a) the measure is based upon raw data rather than mean-centered data to avoid the problems that affect correlation and VIF, b) it distinguishes collinearity from other sources of data weaknesses such as lack of variability of the exogenous variables and lack of magnitude, c) it is easily computed, d) it is easily interpreted,
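A minimal sketch of one plausible implementation, assuming C2 is the non-centered coefficient of determination obtained by regressing the raw interaction column on the raw columns 1, U, and V; the exact construction in the paper may differ in its details, and the helper name `c2` is ours.

```python
import numpy as np

def c2(U, V):
    """Non-centered R^2 from regressing the raw interaction U*V on the
    raw columns [1, U, V] (one plausible reading of the C^2 measure)."""
    w = U * V
    X = np.column_stack([np.ones(len(U)), U, V])
    beta, *_ = np.linalg.lstsq(X, w, rcond=None)
    resid = w - X @ beta
    return 1 - (resid @ resid) / (w @ w)  # raw, not mean-centered, sums

rng = np.random.default_rng(2)
U = rng.normal(4.0, 1.0, 300)
V = rng.normal(4.0, 1.0, 300)
print(c2(U, V))  # close to 1: raw-data collinearity is high in moderated models
```

Because the denominator uses the raw sum of squares rather than deviations from the mean, this construction does not reward data that merely lack magnitude or variability, which is the confound the paper attributes to VIF.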

When is collinearity harmful?

The statistical significance of the estimator of coefficient α3 of the variable U × V in the moderated regression (Eq. (2)) is typically evaluated by its t-statistic, which can be expressed in terms of the proposed collinearity score C2:

t3 = (a3/s) · sqrt{ (1 − C2) · [ var(U × V) + (N/(N − 1)) · mean(U × V)^2 ] · (N − 1) }.

(See Theil, 1971, p. 166, for a comparable derivation.) Although the t-statistic is determined by five other factors: numerosity (N), effect size (a3), the standard error of the residuals of the main regression (s), the data magnitude (the mean of U × V), …
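As a numerical check of the reconstructed expression above, the sketch below fits the moderated regression by ordinary least squares and compares the conventional t-statistic for a3 with the value implied by C2. Under the non-centered-R2 reading of C2 (our assumption), the two coincide exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
U = rng.normal(4.0, 1.0, N)
V = rng.normal(4.0, 1.0, N)
w = U * V
Y = 1.0 + 0.5 * U + 0.5 * V + 0.25 * w + rng.normal(0.0, 1.0, N)

# Conventional OLS t-statistic for the interaction coefficient a3
X = np.column_stack([np.ones(N), U, V, w])
k = X.shape[1]
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
e = Y - X @ beta
s2 = e @ e / (N - k)                   # residual variance
cov = s2 * np.linalg.inv(X.T @ X)
t3_ols = beta[3] / np.sqrt(cov[3, 3])

# C^2: non-centered R^2 of w on the remaining raw columns [1, U, V]
Z = X[:, :3]
g, *_ = np.linalg.lstsq(Z, w, rcond=None)
r = w - Z @ g
C2 = 1 - (r @ r) / (w @ w)

# Reconstructed identity:
# t3 = (a3/s) * sqrt{(1 - C^2)[var(w) + N/(N-1) * mean(w)^2](N - 1)}
t3_formula = (beta[3] / np.sqrt(s2)) * np.sqrt(
    (1 - C2) * (np.var(w, ddof=1) + N / (N - 1) * w.mean() ** 2) * (N - 1)
)
print(t3_ols, t3_formula)
```

The agreement follows because (N − 1)·var(w) + N·mean(w)^2 equals the raw sum of squares of w, and (1 − C2) times that sum is exactly the residual sum of squares of w after partialling out the other regressors.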

Empirical application

Brand extension, wherein firms use established brand names to enter completely different product classes (for example, Honda submersible pumps), is a popular strategy to mitigate the risks of introducing completely new products (Aaker & Keller, 1990). We use the brand extension domain as the empirical context and use Aaker and Keller's data to illustrate the impact of collinearity diagnostics.

Conclusions

Grewal et al. (2004) suggest that the literature on collinearity can be organized in three major areas: (i) conditions under which collinearity will occur, (ii) how collinearity can be diagnosed, and (iii) how collinearity should be managed so as to avoid its deleterious effects. To the first area, we note that the occurrence of collinearity is inevitable in moderated regression models because the interaction term naturally contains information similar to the linear terms.

Most marketing


    The names of the authors are listed alphabetically. This is a fully collaborative work.
