Measuring the effects of confounders in medical supervised classification problems: the Confounding Index (CI)
Introduction
In recent years, there has been a growing interest in the use of supervised learning in biomedical contexts. However, such biomedical applications are often subject to the detrimental effects of so-called confounders, i.e., characteristics of the data generation process that do not represent clinically relevant aspects but might nevertheless bias the training process of the predictor [1], [2], [3]. In neuroimaging studies, for instance, the confounding effect of demographic characteristics such as gender and age is widely discussed [4], [5]. Studies on biometric sensor data, instead, have shown that the relationship between features and disease class label learned by the classifier is confounded by the identity of the subjects, because the easier task of subject identification replaces the harder task of disease recognition [6], [2]. Finally, learning algorithms trained on a collection of different databases, a common practice in biomedical applications, suffer from high generalization errors caused by the confounding effects of the different acquisition modalities or recruitment criteria [7]. This phenomenon is often referred to as ‘batch effect’ in gene-expression studies [8], and it has been shown that it can lead to spurious findings and hide real patterns [8], [9], [10], [11], [12], [13]. The acknowledgement of these problems has led to a precise definition of a confounder as a variable that affects the features under examination and has an association with the target variable in the training sample that differs from that in the population of interest [4]. In other words, the training set contains a bias with respect to such a confounding variable.

The approaches developed to deal with confounders can be grouped into three broad classes. The first and most intuitive one matches training data with respect to the confounder, thus eliminating the bias at the cost of discarding subjects and impoverishing the dataset [14], [1].
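A minimal sketch of this matching idea, under assumed names (`y` the binary class label, `c` a binary confounder such as gender): subsample the training set so that every confounder level is equally represented in both classes, accepting that the surplus subjects are discarded.

```python
import numpy as np

def match_confounder(y, c, rng=None):
    """Subsample indices so that each confounder level appears equally
    often in both classes, removing the confounder/label association."""
    rng = np.random.default_rng(rng)
    keep = []
    for level in np.unique(c):
        idx0 = np.flatnonzero((c == level) & (y == 0))
        idx1 = np.flatnonzero((c == level) & (y == 1))
        n = min(len(idx0), len(idx1))  # the surplus subjects are discarded
        keep.extend(rng.choice(idx0, n, replace=False))
        keep.extend(rng.choice(idx1, n, replace=False))
    return np.sort(np.array(keep))

# Toy example: the confounder is strongly unbalanced across the two classes.
y = np.array([0] * 6 + [1] * 6)
c = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1])
idx = match_confounder(y, c, rng=0)  # matched, but much smaller, sample
```

On this toy sample, matching keeps only 4 of the 12 subjects, which illustrates the cost of the approach on strongly biased datasets.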
A second approach corrects the data with a normalization procedure, regressing out the contribution of the confounder before estimating the predictive model [14], [15], [16]. However, the dependency of the data on the confounders may not be trivial to capture in a normalization function, and this problem is exacerbated when different confounders are considered together. For example, batch effects cannot be easily eliminated by the most common between-sample normalization methods [10], [17]. Alternatively, confounders have been included as predictors along with the original input features during predictive modeling [18], [14]. However, it has been noted that including in the input data a confounder that is highly associated with the correct response may actually increase its effect, since in this case the confounder alone can be used to predict the response. Recently, a third class of more sophisticated approaches has been developed, operating on the learning model rather than on the data itself, for instance resorting to Domain Adaptation techniques [7]. Similarly, some attempts have been made using approaches designed to enforce fairness requirements in learning algorithms, so that sensitive information such as ethnicity does not influence the outcome of a predictor [19], [20], [21], [22]. However, in these models too, it is very difficult to correct for multiple confounders, as would be necessary in biomedical studies.
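A common form of this normalization is linear regressing-out: fit a linear model from the confounders to each feature and keep only the residuals. The sketch below (using scikit-learn, with `age` as an illustrative confounder) also makes the limitation stated above explicit: only the linear part of the dependency is removed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def regress_out(X, C):
    """Remove the linear contribution of the confounders C from the
    features X, returning the residuals. Only the linear part of the
    dependency is removed: a non-linear confounding effect survives."""
    model = LinearRegression().fit(C, X)
    return X - model.predict(C)

rng = np.random.default_rng(0)
age = rng.uniform(20, 80, size=(200, 1))    # illustrative confounder
X = 0.05 * age + rng.normal(size=(200, 3))  # age leaks into every feature
X_clean = regress_out(X, age)               # residuals, decorrelated from age
```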
An effective solution to the confounder problem thus requires combining the three techniques described above: normalizing for the confounders that have a known effect on the data, matching the instances if this does not excessively reduce the sample size, and adopting a learning algorithm able to manage the biases that have not been eliminated earlier. When planning such an articulated approach, it is useful to have an instrument that can quantify the effect of a confounding variable and assess the effectiveness of the possible countermeasures. To this aim, we present in this paper a novel figure of merit, called ‘Confounding Index’ (CI), that measures the confounding effect of a variable in a binary classification task tackled through Machine Learning (ML) models.
Previous renowned works on this subject are the ‘Back-door’ and ‘Front-door’ criteria, developed in the causal inference framework described in Judea Pearl's work [23], [24], commonly cited as a way to determine which variables act as confounders. However, neither of these criteria was specifically developed for ML analysis, and both are based on conditional probabilities; thus, they provide a measure of the confounding effect that mainly depends on the specific composition of the dataset under examination. On the contrary, our CI is designed for ML problems and aims at quantifying how easily the way a confounder affects the data can be learned by the chosen algorithm with respect to the desired classification task, independently of the confounder distribution in the dataset. Furthermore, given that the mentioned criteria do not take into account the algorithm used for the statistical analysis, they cannot be used to evaluate the effectiveness of an algorithm that, for example, has been specifically designed to avoid learning from biases. To our knowledge, there is only a single, recent study [1] that, with aims similar to ours, presents a method for quantifying confounder effects in ML studies. However, this measure (thoroughly investigated in Section 3) is again strictly related to the specific biases present in the training set.
The proposed CI is based on measuring the variation of the Area Under the receiver operating characteristic Curves (AUCs) obtained using different, engineered biases during training, and thus depends on how the confounder and the class labels affect the input features. The CI ranges from 0 to 1 and makes it possible:
- to test the effect of a confounding variable on a specific binary classifier;
- to rank variables with respect to their confounding effect;
- to anticipate the effectiveness of a normalization procedure and assess the robustness of a training algorithm against confounding effects.
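The engineered-bias idea above can be illustrated with a small sketch, which is not the authors' exact protocol: a hypothetical `sample` generator plants a class effect (strength `ky`) and a confounder effect (strength `kc`) in different features, the confounder/label association in the training set is progressively increased, and the test set is biased in the opposite direction, so that a classifier that relies on the confounder falls below chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def sample(n, ky, kc, p):
    """Class label y shifts feature 0 by ky; confounder c shifts feature 1
    by kc; p = P(c = y) controls how biased the sample is."""
    y = rng.integers(0, 2, n)
    c = np.where(rng.random(n) < np.where(y == 1, p, 1 - p), 1, 0)
    X = rng.normal(size=(n, 5))
    X[:, 0] += ky * y
    X[:, 1] += kc * c
    return X, y

aucs = []
for bias in [0.5, 0.75, 1.0]:  # 0.5 = unbiased training, 1.0 = fully confounded
    Xtr, ytr = sample(600, ky=0.5, kc=2.0, p=bias)
    Xte, yte = sample(600, ky=0.5, kc=2.0, p=1.0 - bias)  # counter-biased test set
    clf = LogisticRegression().fit(Xtr, ytr)
    aucs.append(roc_auc_score(yte, clf.predict_proba(Xte)[:, 1]))
# The stronger the training bias, the more the classifier relies on the
# confounder, and the further the AUC on the counter-biased test set drops.
```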
The biomedical sector is the one we believe to be the most suitable for the application of our CI, since biomedical data, far more than other data types, depend in complex ways on many known and hidden factors of the data generation process. However, the proposed CI is general enough to be applied in any supervised classification setup. The remainder of the paper is organized as follows: Section 2 introduces the formalization of the problem and the notation used throughout the paper; Section 3 discusses in detail the only other related work on this topic in the literature. Sections 4 and 5 describe the CI and its implementation. Sections 6 and 7 report the experimental setup and the results of the analysis performed on both simulated and real-world neuroimaging data, while Section 8 concludes the paper. A summary of the symbols used to describe the CI and the formulation of the confounding problem is reported in Table 1.
Section snippets
Notation
In this section we introduce the notation used in this paper to describe a binary classification framework and the problem of confounders we want to address.
Related works
Previous literature on the effects of confounders in ML analysis consists mainly of statements of the problem and proposed solutions (i.e., normalization procedures, corrected loss functions, etc.). However, every study estimates the confounding effects on its results differently, often making arbitrary decisions about which possible confounders to consider and how.
Our objective is thus to propose our CI as a standardized tool to quantify the effect of a variable that may act as a confounder in
Definition of Confounding Index (CI)
In this section, we present our definition of the Confounding Index (CI). This index makes it possible to compare the confounding effects of categorical variables with respect to a defined two-class classification task, with a measure that does not depend on the particular bias present in the training dataset. Basically, it shows how easily the differences due to a possible confounder can be detected by an ML algorithm with respect to the differences due to the classes we want to study. The
Monotonicity evaluation
As already explained in Section 4.2, our CI can be calculated only under the monotonicity conditions of Eq. (9), which must therefore be verified. This can be done with a simple visual inspection of the data, or using one of the various trend analysis methods already described in the literature. In this section we briefly illustrate the method presented in [28], which we have used for all the analyses described in this paper. We chose this method because it makes it possible to evaluate the monotonicity conditions even when the
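The snippet below is not the method of [28], but a simple stand-in that conveys what a monotonicity check amounts to: a Mann-Kendall test that flags a statistically significant increasing (+1) or decreasing (-1) monotonic trend in a sequence of values, assuming no ties in the variance formula.

```python
import numpy as np

def mann_kendall(x, z_crit=1.959964):
    """Two-sided Mann-Kendall trend test at the 5% level (z_crit is the
    standard normal quantile): +1 increasing, -1 decreasing, 0 no trend."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var = n * (n - 1) * (2 * n + 5) / 18.0  # variance of S, assuming no ties
    z = (s - np.sign(s)) / np.sqrt(var)     # continuity-corrected z statistic
    return int(np.sign(s)) if abs(z) > z_crit else 0
```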
Materials and methods
In this section we first show the effectiveness of our CI on simulated data and then describe a possible application to real-world data. Artificial data, in fact, make it possible to analyze how the CI varies with respect to the differences introduced in the input data by c and y, while real-world data can give a practical idea of the usefulness of the CI.
The real data used in this study are neuroimaging data [29], [30] which, like all biomedical data, depend on several variables that can have a
CI evaluation when the variables affect different features
The results of the analysis described in Section 6.1.1 are reported in Fig. 5a, in which the CI values are plotted as a function of kc and ky. As the plot shows, our CI depends on both kc and ky. Furthermore, as we would expect, the confounding effect of c is weaker for easier tasks (i.e., those with higher ky) and stronger for harder tasks.
Fig. 5c and d show how every point in Fig. 5a is calculated; in fact, the CI is the maximum of Φ and Φ*, the two quantities shown in the plots.
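These snippets do not reproduce the exact expressions for Φ and Φ*. Purely as an illustration of this final step, one plausible reading (an assumption on our part, not the paper's formula) summarizes each biased-training AUC curve by its normalized deviation from the chance level 0.5 and takes the maximum of the two:

```python
import numpy as np

def ci_from_auc_curves(auc_bias, auc_antibias):
    """Hypothetical CI = max(Phi, Phi*): each term rescales the mean
    absolute deviation of an AUC curve from chance (0.5) to [0, 1].
    This is an assumed placeholder, not the formula from the paper."""
    phi = 2.0 * np.mean(np.abs(np.asarray(auc_bias) - 0.5))
    phi_star = 2.0 * np.mean(np.abs(np.asarray(auc_antibias) - 0.5))
    return max(phi, phi_star)

ci = ci_from_auc_curves([0.52, 0.48, 0.45], [0.55, 0.65, 0.80])
```

With this reading, a CI of 0 corresponds to AUC curves pinned at chance (no confounding effect), and a CI of 1 to curves pinned at 0 or 1.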
Conclusions
In this paper we have presented an index for assessing the confounding effect of a categorical variable in a binary classification study.
The study performed on simulated data shows the validity and sensitivity of our CI, whose value depends on the intensity with which the confounder and the label influence the features under examination. Furthermore, it has been found that Φ and Φ* differ only when c and y influence the same features. This phenomenon could give precious insights into how the
References (42)
- Predictive modelling using neuroimaging data in the presence of confounds, NeuroImage (2017)
- ADHD-200 global competition: diagnosing ADHD using personal characteristic data can outperform resting state FMRI measurements, Front Syst Neurosci (2012)
- The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recogn (1997)
- The area above the ordinal dominance graph and the area below the receiver operating characteristic graph, J Math Psychol (1975)
- Freesurfer, NeuroImage (2012)
- Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain, Neuron (2002)
- Sequence-independent segmentation of magnetic resonance images, NeuroImage (2004)
- Spurious group differences due to head motion in a diffusion MRI study, NeuroImage (2014)
- Using permutations to assess confounding in machine learning applications for digital health (2018)
- Learning disease vs participant signatures: a permutation test approach to detect identity confounding in machine learning diagnostic applications (2017)
- Confounding in health research, Annu Rev Public Health
- The need to approximate the use-case in clinical machine learning, Gigascience
- Multiple Source Domain Adaptation with Adversarial Learning
- Batch effect removal methods for microarray gene expression data integration: a survey, Brief Bioinformatics
- Batch effects and noise in microarray experiments: sources and solutions, vol. 868
- Tackling the widespread and critical impact of batch effects in high-throughput data, Nat Rev Genet
- On the design and analysis of gene expression studies in human populations, Nat Genet
- The practical effect of batch on genomic prediction, Stat Appl Genet Mol Biol
- Batch effect confounding leads to strong bias in performance estimates obtained by cross-validation, PLOS ONE
- Age correction in dementia – matching to a healthy brain, PLoS ONE