Measuring the effects of confounders in medical supervised classification problems: the Confounding Index (CI)

https://doi.org/10.1016/j.artmed.2020.101804

Highlights

  • A novel index is introduced to measure the confounding effect of a categorical variable in classification studies.

  • The index can also be applied to continuously distributed variables by binning their values.

  • The index can rank the effect of various variables, making it possible to determine affordable matching criteria.

  • The index can assess the effectiveness of a normalization procedure and the robustness of a learning model against confounding effects.

  • The validity and usefulness of the index are demonstrated on both simulated and real-world neuroimaging data.

Abstract

Over the years, there has been growing interest in using machine learning techniques for biomedical data processing. When tackling these tasks, one needs to bear in mind that biomedical data depend on a variety of characteristics, such as demographic aspects (age, gender, etc.) or the acquisition technology, which might be unrelated to the target of the analysis. In supervised tasks, failing to match the ground-truth targets with respect to such characteristics, called confounders, may lead to very misleading estimates of the predictive performance. Many strategies have been proposed to handle confounders, ranging from data selection to normalization techniques, up to the use of training algorithms designed for learning with imbalanced data. However, all these solutions require the confounders to be known a priori. To address this need, we introduce a novel index that measures the confounding effect of a data attribute in a bias-agnostic way. This index can be used to quantitatively compare the confounding effects of different variables and to inform correction methods such as normalization procedures or ad-hoc learning algorithms. The effectiveness of this index is validated on both simulated and real-world neuroimaging data.

Introduction

In recent years, there has been growing interest in the use of supervised learning in biomedical contexts. However, such biomedical applications are often subject to the detrimental effects of so-called confounders, i.e., characteristics of the data generation process that do not represent clinically relevant aspects but might nevertheless bias the training process of the predictor [1], [2], [3]. In neuroimaging studies, for instance, the confounding effect of demographic characteristics such as gender and age is amply discussed [4], [5]. Studies on biometric sensor data, instead, have shown that the relationship between features and disease class label learned by the classifier is confounded by the identity of the subjects, because the easier task of subject identification replaces the harder task of disease recognition [6], [2]. Finally, learning algorithms trained on a collection of different databases, a common practice in biomedical applications, suffer from high generalization errors caused by the confounding effects of the different acquisition modalities or recruitment criteria [7]. This phenomenon is often referred to as ‘batch effect’ in gene-expression studies [8], and it has been shown to lead to spurious findings and to hide real patterns [8], [9], [10], [11], [12], [13]. The acknowledgement of these problems has led to a precise definition of a confounder as a variable that affects the features under examination and has an association with the target variable in the training sample that differs from that in the population of interest [4]. In other words, the training set contains a bias with respect to such a confounding variable. The approaches developed to deal with confounders can be grouped into three broad classes. The first and most intuitive one matches training data with respect to the confounder, thus eliminating the bias, at the cost of discarding subjects and impoverishing the dataset [14], [1].
A second approach corrects the data with a normalization procedure, regressing out the contribution of the confounder before estimating the predictive model [14], [15], [16]. However, the dependency of the data on the confounders may not be trivial to capture in a normalization function, and this problem is exacerbated when different confounders are considered together. For example, batch effects cannot be easily eliminated by the most common between-sample normalization methods [10], [17]. Alternatively, confounders have been included as predictors along with the original input features during predictive modeling [18], [14]. However, it has been noted that including in the input data a confounder that is highly associated with the correct response may actually increase its effect, since in this case the confounder alone can be used to predict the response. Recently, a third class of more sophisticated approaches has been developed, operating on the learning model rather than on the data itself, for instance by resorting to Domain Adaptation techniques [7]. Similarly, some attempts have been made with approaches designed to enforce fairness requirements in learning algorithms, so that sensitive information such as ethnicity does not influence the outcome of a predictor [19], [20], [21], [22]. However, in these models too, it is very difficult to correct for multiple confounders, as would be necessary in biomedical studies.
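The regress-out normalization mentioned above can be sketched in a few lines. The following is a minimal illustration, not the procedure used by any of the cited works: it assumes a single confounder with a purely linear effect on each feature, and the function name `regress_out` is ours.

```python
import numpy as np

def regress_out(X, c):
    """Remove the linear effect of a confounder c from each feature column of X.

    X : (n_samples, n_features) feature matrix
    c : (n_samples,) confounder values (e.g. age)
    """
    # Design matrix with an intercept, so only the confounder's slope is removed
    C = np.column_stack([np.ones(len(c)), np.asarray(c, dtype=float)])
    beta, *_ = np.linalg.lstsq(C, X, rcond=None)  # per-feature [intercept, slope]
    return X - np.outer(c, beta[1])               # subtract only the confounder term
```

As the text notes, such a linear correction fails when the dependency on the confounder is non-linear or when several confounders interact; richer regression models would be needed in those cases.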

An effective solution to the confounder problem thus requires combining the three techniques described above: normalizing for the confounders that have a known effect on the data, matching the instances if this does not excessively reduce the sample size, and adopting a learning algorithm able to manage the biases that have not been eliminated earlier. When planning such an articulated approach, it is useful to have an instrument that can quantify the effect of a confounding variable and assess the effectiveness of the possible countermeasures. To this aim, we present in this paper a novel figure of merit, called the ‘Confounding Index’ (CI), which measures the confounding effect of a variable in a binary classification task tackled through Machine Learning (ML) models.

Previous renowned works on this subject are the ‘Back-door’ and ‘Front-door’ criteria, developed in the causal inference framework described in Judea Pearl's work [23], [24] and commonly cited as a way to determine which variables act as confounders. However, neither of these criteria was specifically developed for ML analyses, and both are based on conditional probabilities; thus, they provide a measure of the confounding effect that mainly depends on the specific composition of the dataset under examination. On the contrary, our CI is designed for ML problems and aims at quantifying how easily the chosen algorithm can learn the way a confounder affects the data, relative to the desired classification task, independently of the confounder distribution in the dataset. Furthermore, given that the mentioned criteria do not take into account the algorithm used for the statistical analysis, they cannot be used to evaluate the effectiveness of an algorithm that, for example, has been specifically designed to avoid learning from biases. To our knowledge, there is a single, recent study [1] that, similarly to our purposes, presents a method for quantifying confounder effects in ML studies. However, this measure (thoroughly investigated in Section 3) is again strictly related to the specific biases present in the training set.

The proposed CI is based on measuring the variation of the Area Under the receiver operating characteristic Curve (AUC) obtained using different, engineered biases during training, and thus depends on how the confounder and the class labels affect the input features. The CI ranges from 0 to 1 and makes it possible:

  • to test the effect of a confounding variable on a specific binary classifier;

  • to rank variables with respect to their confounding effect;

  • to anticipate the effectiveness of a normalization procedure and assess the robustness of a training algorithm against confounding effects.
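The idea of engineered training biases can be illustrated with a toy experiment. The sketch below is our own construction, not the paper's exact protocol: a simple mean-difference scorer is trained on data in which the confounder c agrees with the label y with a chosen probability (`bias`), and then evaluated on an unbiased test set; the additive generative model and the strength parameters `ky` and `kc` are assumptions that merely mirror the symbols used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def auc(scores, labels):
    """AUC via the Mann-Whitney U statistic (pairwise comparison of scores)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

def auc_under_bias(ky, kc, bias, n=2000, n_features=5):
    """Train on a set where confounder c agrees with label y with
    probability `bias`; evaluate on an unbiased test set (bias = 0.5)."""
    def sample(b):
        y = rng.integers(0, 2, n)
        c = np.where(rng.random(n) < b, y, 1 - y)   # engineered bias
        X = rng.normal(size=(n, n_features))
        X += ky * y[:, None] + kc * c[:, None]      # additive shifts from y and c
        return X, y
    Xtr, ytr = sample(bias)
    w = Xtr[ytr == 1].mean(0) - Xtr[ytr == 0].mean(0)  # mean-difference scorer
    Xte, yte = sample(0.5)
    return auc(Xte @ w, yte)
```

In this toy setting, a model driven purely by the confounder (ky = 0) collapses to chance-level AUC on the unbiased test set, while a model driven by the class signal keeps a high AUC; variations of this kind, across engineered biases, are what the CI aggregates into a single score.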

While the proposed approach is described for binary confounding variables, it can be applied to discrete variables by computing the CI for every pair of values, and it can be straightforwardly extended to continuous variables by discretizing their values (an example of this is shown in the empirical assessment of our index). In such a scenario, the CI makes it possible to identify the widest range of values over which the effect of such variables can be ignored.
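The pair-of-bins scheme just described can be wired up as follows. This is only a scaffold: `compute_ci` is a placeholder for the index itself (defined later in the paper), and the quantile binning and function name `pairwise_ci` are our own choices.

```python
import itertools
import numpy as np

def pairwise_ci(values, n_bins, compute_ci):
    """Bin a continuous variable (e.g. age) into quantile bins and evaluate
    a confounding score for every pair of bins.

    `compute_ci` stands in for the index: it receives the two boolean masks
    selecting the samples of each bin and returns a score in [0, 1].
    """
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
    # digitize against the internal edges gives bin labels 0 .. n_bins-1
    labels = np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)
    return {(a, b): compute_ci(labels == a, labels == b)
            for a, b in itertools.combinations(range(n_bins), 2)}
```

Ranking the resulting pairs by score is one way to identify the widest range of values over which the variable's confounding effect can be ignored, as described above.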

The biomedical sector is the one we believe to be most suitable for the application of our CI, since biomedical data, far more than other data types, depend in complex ways on many known and hidden factors of the data generation process. However, the proposed CI is general enough to be applied in any supervised classification setup. The remainder of the paper is organized as follows: Section 2 introduces the formalization of the problem and the notation used throughout the paper, and Section 3 discusses in detail the only other related work on this topic in the literature. Sections 4 and 5 describe the CI and its implementation. Sections 6 and 7 report the experimental setup and the results of the analyses performed on both simulated and real-world neuroimaging data, while Section 8 concludes the paper. A summary of the symbols used to describe the CI and the formulation of the confounding problem is reported in Table 1.

Section snippets

Notation

In this section we introduce the notation used in this paper to describe a binary classification framework and the problem of confounders we want to address.

Related works

Previous literature on the effect of confounders in ML analyses is divided mainly between statements of the problem and solution proposals (i.e., normalization procedures, corrected loss functions, etc.). However, every study estimates the confounding effects on its results differently, often making arbitrary decisions about which possible confounders to consider and how.

Our objective is thus to propose our CI as a standardized tool to quantify the effect of a variable that may play as confounder in

Definition of Confounding Index (CI)

In this section, we present our definition of the Confounding Index (CI). This index makes it possible to compare the confounding effects of categorical variables with respect to a defined two-class classification task, with a measure that does not depend on the particular bias present in the training dataset. Basically, it shows how easily the differences due to a possible confounder can be detected by an ML algorithm with respect to the differences due to the classes we want to study. The

Monotonicity evaluation

As already explained in Section 4.2, our CI can be calculated only under the monotonicity conditions of Eq. (9), which must therefore be verified. This can be done with a simple visual inspection of the data, or using various trend analysis methods already described in the literature. In this section we briefly illustrate the method presented in [28], which we have used for all the analyses described in this paper. We chose this method because it makes it possible to evaluate the monotonicity conditions even when the

Materials and methods

In this section we first show the effectiveness of our CI on simulated data and then describe a possible application on real-world data. Artificial data, in fact, allow us to analyze how the CI varies with respect to the differences introduced in the input data due to c and y, while real-world data can give a practical idea of the usefulness of the CI.

The real data used in this study are neuroimaging data [29], [30] which, like all biomedical data, depend on several variables that can have a

CI evaluation when the variables affect different features

The results of the analysis described in Section 6.1.1 are reported in Fig. 5a, in which the CI values are plotted as a function of kc and ky. As the plot shows, our CI depends on both kc and ky. Furthermore, as we would expect, the confounding effect of c is weaker for easier tasks (i.e., the ones with higher ky) and stronger for harder tasks.

Fig. 5c and d show how every point in Fig. 5a is calculated: the CI is the maximum of the two quantities Φ and Φ* shown in the plots.

Conclusions

In this paper we have presented an index for assessing the confounding effect of a categorical variable in a binary classification study.

The study performed on simulated data shows the soundness and sensitivity of our CI, the value of which depends on the intensity with which the confounder and the label influence the features under examination. Furthermore, it has been found that Φ and Φ* differ only when c and y influence the same features. This phenomenon could give valuable insights on how the

References (42)

  • S. Greenland et al., Confounding in health research, Annu Rev Public Health (2001)
  • S. Saeb et al., The need to approximate the use-case in clinical machine learning, Gigascience (2017)
  • H. Zhao et al., Multiple Source Domain Adaptation with Adversarial Learning (2018)
  • C. Lazar et al., Batch effect removal methods for microarray gene expression data integration: a survey, Brief Bioinformatics (2012)
  • A. Scherer, Batch effects and noise in microarray experiments: sources and solutions, vol. 868 (2009)
  • J.T. Leek et al., Tackling the widespread and critical impact of batch effects in high-throughput data, Nat Rev Genet (2010)
  • J.M. Akey et al., On the design and analysis of gene expression studies in human populations, Nat Genet (2007)
  • H.S. Parker et al., The practical effect of batch on genomic prediction, Stat Appl Genet Mol Biol (2012)
  • C. Soneson et al., Batch effect confounding leads to strong bias in performance estimates obtained by cross-validation, PLOS ONE (2014)
  • A. Rao et al., Predictive modelling using neuroimaging data in the presence of confounds, NeuroImage (2017)
  • J. Dukart et al., Age correction in dementia-matching to a healthy brain, PLoS ONE (2011)