Measuring the effects of confounders in medical supervised classification problems: the Confounding Index (CI)
Introduction
In recent years, there has been a growing interest in the use of supervised learning in biomedical contexts. However, such biomedical applications are often subject to the detrimental effects of so-called confounders, i.e., characteristics of the data generation process that do not represent clinically relevant aspects but might nevertheless bias the training process of the predictor [1], [2], [3]. In neuroimaging studies, for instance, the confounding effect of demographic characteristics such as gender and age is widely discussed [4], [5]. Studies on biometric sensor data, instead, have shown that the relationship between features and disease class label learned by the classifier is confounded by the identity of the subjects, because the easier task of subject identification replaces the harder task of disease recognition [6], [2]. Finally, learning algorithms trained on a collection of different databases, a common practice in biomedical applications, suffer from high generalization errors caused by the confounding effects of the different acquisition modalities or recruitment criteria [7]. This phenomenon is often referred to as ‘batch effect’ in gene-expression studies [8], and it has been shown that it can lead to spurious findings and hide real patterns [8], [9], [10], [11], [12], [13]. The acknowledgement of these problems has led to a precise definition of a confounder as a variable that affects the features under examination and has an association with the target variable in the training sample that differs from that in the population of interest [4]. In other words, the training set contains a bias with respect to such a confounding variable.

The approaches developed to deal with confounders can be grouped into three broad classes. The first and most intuitive one matches training data with respect to the confounder, thus eliminating the bias at the cost of discarding subjects and impoverishing the dataset [14], [1].
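A minimal sketch of this matching idea, under assumed names (`y` the binary class label, `c` a binary confounder such as gender): subsample the training set so that every confounder level is equally represented in both classes, accepting that the surplus subjects are discarded.

```python
import numpy as np

def match_confounder(y, c, rng=None):
    """Subsample indices so that each confounder level appears equally
    often in both classes, removing the confounder/label association."""
    rng = np.random.default_rng(rng)
    keep = []
    for level in np.unique(c):
        idx0 = np.flatnonzero((c == level) & (y == 0))
        idx1 = np.flatnonzero((c == level) & (y == 1))
        n = min(len(idx0), len(idx1))  # the surplus subjects are discarded
        keep.extend(rng.choice(idx0, n, replace=False))
        keep.extend(rng.choice(idx1, n, replace=False))
    return np.sort(np.array(keep))

# Toy example: the confounder is strongly unbalanced across the two classes.
y = np.array([0] * 6 + [1] * 6)
c = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1])
idx = match_confounder(y, c, rng=0)  # matched, but much smaller, sample
```

On this toy sample, matching keeps only 4 of the 12 subjects, which illustrates the cost of the approach on strongly biased datasets.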
A second approach corrects the data with a normalization procedure, regressing out the contribution of the confounder before estimating the predictive model [14], [15], [16]. However, the dependency of the data on the confounders may not be trivial to capture in a normalization function, and this problem is exacerbated when different confounders are considered together. For example, batch effects cannot be easily eliminated by the most common between-sample normalization methods [10], [17]. Alternatively, confounders have been included as predictors along with the original input features during predictive modeling [18], [14]. However, it has been noted that including in the input data a confounder that is highly associated with the correct response may actually increase its effect, since in this case the confounder alone can be used to predict the response. Recently, a third class of more sophisticated approaches has been developed, operating on the learning model rather than on the data itself, for instance resorting to Domain Adaptation techniques [7]. Similarly, some attempts have been made using approaches designed to enforce fairness requirements in learning algorithms, so that sensitive information such as ethnicity does not influence the outcome of a predictor [19], [20], [21], [22]. However, in these models too, it is very difficult to correct for multiple confounders, as would be necessary in biomedical studies.
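A common form of this normalization is linear regressing-out: fit a linear model from the confounders to each feature and keep only the residuals. The sketch below (using scikit-learn, with `age` as an illustrative confounder) also makes the limitation stated above explicit: only the linear part of the dependency is removed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def regress_out(X, C):
    """Remove the linear contribution of the confounders C from the
    features X, returning the residuals. Only the linear part of the
    dependency is removed: a non-linear confounding effect survives."""
    model = LinearRegression().fit(C, X)
    return X - model.predict(C)

rng = np.random.default_rng(0)
age = rng.uniform(20, 80, size=(200, 1))    # illustrative confounder
X = 0.05 * age + rng.normal(size=(200, 3))  # age leaks into every feature
X_clean = regress_out(X, age)               # residuals, decorrelated from age
```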
An effective solution to the confounder problem thus requires combining the three techniques described above: normalizing for the confounders that have a known effect on the data, matching the instances if this does not excessively reduce the sample size, and adopting a learning algorithm able to manage the biases that have not been eliminated earlier. When planning such an articulated approach, it is useful to have an instrument that can quantify the effect of a confounding variable and assess the effectiveness of the possible countermeasures. To this aim, we present in this paper a novel figure of merit, called ‘Confounding Index’ (CI), that measures the confounding effect of a variable in a binary classification task tackled through Machine Learning (ML) models.
Previous renowned works on this subject are the ‘Back-door’ and ‘Front-door’ criteria, developed in the causal inference framework described in Judea Pearl's work [23], [24], commonly cited as a way to determine which variables act as confounders. However, neither of these criteria was specifically developed for ML analysis, and both are based on conditional probabilities; thus, they provide a measure of the confounding effect that mainly depends on the specific composition of the dataset under examination. On the contrary, our CI is designed for ML problems and aims at quantifying how easily the way a confounder affects the data can be learned by the chosen algorithm with respect to the desired classification task, independently of the confounder distribution in the dataset. Furthermore, given that the mentioned criteria do not take into account the algorithm used for the statistical analysis, they cannot be used to evaluate the effectiveness of an algorithm that, for example, has been specifically designed to avoid learning from biases. To our knowledge, there is only a single, recent study [1] that, with aims similar to ours, presents a method for quantifying confounder effects in ML studies. However, this measure (thoroughly investigated in Section 3) is again strictly related to the specific biases present in the training set.
The proposed CI is based on measuring the variation of the Area Under the receiver operating characteristic Curves (AUCs) obtained using different, engineered biases during training, and thus depends on how the confounder and the class labels affect the input features. The CI ranges from 0 to 1 and makes it possible:
- to test the effect of a confounding variable on a specific binary classifier;
- to rank variables with respect to their confounding effect;
- to anticipate the effectiveness of a normalization procedure and assess the robustness of a training algorithm against confounding effects.
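The engineered-bias idea above can be illustrated with a small sketch, which is not the authors' exact protocol: a hypothetical `sample` generator plants a class effect (strength `ky`) and a confounder effect (strength `kc`) in different features, the confounder/label association in the training set is progressively increased, and the test set is biased in the opposite direction, so that a classifier that relies on the confounder falls below chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def sample(n, ky, kc, p):
    """Class label y shifts feature 0 by ky; confounder c shifts feature 1
    by kc; p = P(c = y) controls how biased the sample is."""
    y = rng.integers(0, 2, n)
    c = np.where(rng.random(n) < np.where(y == 1, p, 1 - p), 1, 0)
    X = rng.normal(size=(n, 5))
    X[:, 0] += ky * y
    X[:, 1] += kc * c
    return X, y

aucs = []
for bias in [0.5, 0.75, 1.0]:  # 0.5 = unbiased training, 1.0 = fully confounded
    Xtr, ytr = sample(600, ky=0.5, kc=2.0, p=bias)
    Xte, yte = sample(600, ky=0.5, kc=2.0, p=1.0 - bias)  # counter-biased test set
    clf = LogisticRegression().fit(Xtr, ytr)
    aucs.append(roc_auc_score(yte, clf.predict_proba(Xte)[:, 1]))
# The stronger the training bias, the more the classifier relies on the
# confounder, and the further the AUC on the counter-biased test set drops.
```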
The biomedical sector is the one we believe to be the most suitable for the application of our CI, since biomedical data, far more than other data types, depend in complex ways on many known and hidden factors of the data generation process. However, the proposed CI is general enough to be applied in any supervised classification setup. The remainder of the paper is organized as follows: Section 2 introduces the formalization of the problem and the notation used throughout the paper; Section 3 discusses in detail the only other related work on this topic in the literature. Sections 4 and 5 describe the CI and its implementation. Sections 6 and 7 report the experimental setup and the results of the analysis performed on both simulated and real-world neuroimaging data, while Section 8 concludes the paper. A summary of the symbols used to describe the CI and the formulation of the confounding problem is reported in Table 1.
Section snippets
Notation
In this section we introduce the notation used in this paper to describe a binary classification framework and the problem of confounders we want to address.
Related works
Previous literature on the effects of confounders in ML analysis consists mainly of statements of the problem and proposed solutions (i.e., normalization procedures, corrected loss functions, etc.). However, every study estimates the confounding effects on its results differently, often making arbitrary decisions about which possible confounders to consider and how.
Our objective is thus to propose our CI as a standardized tool to quantify the effect of a variable that may act as a confounder in
Definition of Confounding Index (CI)
In this section, we present our definition of the Confounding Index (CI). This index makes it possible to compare the confounding effects of categorical variables with respect to a defined two-class classification task, with a measure that does not depend on the particular bias present in the training dataset. Basically, it shows how easily the differences due to a possible confounder can be detected by an ML algorithm with respect to the differences due to the classes we want to study. The
Monotonicity evaluation
As already explained in Section 4.2, our CI can be calculated only under the monotonicity conditions of Eq. (9), which must therefore be verified. This can be done with a simple visual inspection of the data, or using one of the various trend analysis methods already described in the literature. In this section we briefly illustrate the method presented in [28], which we have used for all the analyses described in this paper. We chose this method because it makes it possible to evaluate the monotonicity conditions even when the
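The snippet below is not the method of [28], but a simple stand-in that conveys what a monotonicity check amounts to: a Mann-Kendall test that flags a statistically significant increasing (+1) or decreasing (-1) monotonic trend in a sequence of values, assuming no ties in the variance formula.

```python
import numpy as np

def mann_kendall(x, z_crit=1.959964):
    """Two-sided Mann-Kendall trend test at the 5% level (z_crit is the
    standard normal quantile): +1 increasing, -1 decreasing, 0 no trend."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var = n * (n - 1) * (2 * n + 5) / 18.0  # variance of S, assuming no ties
    z = (s - np.sign(s)) / np.sqrt(var)     # continuity-corrected z statistic
    return int(np.sign(s)) if abs(z) > z_crit else 0
```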
Materials and methods
In this section we first show the effectiveness of our CI on simulated data and then describe a possible application to real-world data. Artificial data, in fact, make it possible to analyze how the CI varies with respect to the differences introduced in the input data by c and y, while real-world data can give a practical idea of the usefulness of the CI.
The real data used in this study are neuroimaging data [29], [30] which, like all biomedical data, depend on several variables that can have a
CI evaluation when the variables affect different features
The results of the analysis described in Section 6.1.1 are reported in Fig. 5a, in which the CI values are plotted as a function of kc and ky. As the plot shows, our CI depends on both kc and ky. Furthermore, as we would expect, the confounding effect of c is weaker for easier tasks (i.e., those with higher ky) and stronger for harder tasks.
Fig. 5c and d show how every point in Fig. 5a is calculated; in fact, the CI is the maximum of Φ and Φ*, the two quantities shown in the plots.
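These snippets do not reproduce the exact expressions for Φ and Φ*. Purely as an illustration of this final step, one plausible reading (an assumption on our part, not the paper's formula) summarizes each biased-training AUC curve by its normalized deviation from the chance level 0.5 and takes the maximum of the two:

```python
import numpy as np

def ci_from_auc_curves(auc_bias, auc_antibias):
    """Hypothetical CI = max(Phi, Phi*): each term rescales the mean
    absolute deviation of an AUC curve from chance (0.5) to [0, 1].
    This is an assumed placeholder, not the formula from the paper."""
    phi = 2.0 * np.mean(np.abs(np.asarray(auc_bias) - 0.5))
    phi_star = 2.0 * np.mean(np.abs(np.asarray(auc_antibias) - 0.5))
    return max(phi, phi_star)

ci = ci_from_auc_curves([0.52, 0.48, 0.45], [0.55, 0.65, 0.80])
```

With this reading, a CI of 0 corresponds to AUC curves pinned at chance (no confounding effect), and a CI of 1 to curves pinned at 0 or 1.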
Conclusions
In this paper we have presented an index for assessing the confounding effect of a categorical variable in a binary classification study.
The study performed on simulated data shows the validity and sensitivity of our CI, whose value depends on the intensity with which the confounder and the label influence the features under examination. Furthermore, it has been found that Φ and Φ* differ only when c and y influence the same features. This phenomenon could give precious insights into how the
References (42)
- Predictive modelling using neuroimaging data in the presence of confounds, NeuroImage (2017)
- ADHD-200 global competition: diagnosing ADHD using personal characteristic data can outperform resting state FMRI measurements, Front Syst Neurosci (2012)
- The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recogn (1997)
- The area above the ordinal dominance graph and the area below the receiver operating characteristic graph, J Math Psychol (1975)
- Freesurfer, NeuroImage (2012)
- Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain, Neuron (2002)
- Sequence-independent segmentation of magnetic resonance images, NeuroImage (2004)
- Spurious group differences due to head motion in a diffusion MRI study, NeuroImage (2014)
- Using permutations to assess confounding in machine learning applications for digital health (2018)
- Learning disease vs participant signatures: a permutation test approach to detect identity confounding in machine learning diagnostic applications (2017)
- Confounding in health research, Annu Rev Public Health
- The need to approximate the use-case in clinical machine learning, Gigascience
- Multiple Source Domain Adaptation with Adversarial Learning
- Batch effect removal methods for microarray gene expression data integration: a survey, Brief Bioinformatics
- Batch effects and noise in microarray experiments: sources and solutions, vol. 868
- Tackling the widespread and critical impact of batch effects in high-throughput data, Nat Rev Genet
- On the design and analysis of gene expression studies in human populations, Nat Genet
- The practical effect of batch on genomic prediction, Stat Appl Genet Mol Biol
- Batch effect confounding leads to strong bias in performance estimates obtained by cross-validation, PLOS ONE
- Age correction in dementia – matching to a healthy brain, PLoS ONE