research-article
Open Access

Do We Really Need Imputation in AutoML Predictive Modeling?

Published: 12 April 2024


Abstract

Numerous real-world datasets contain missing values, whereas most Machine Learning (ML) algorithms assume complete datasets. For this reason, several imputation algorithms have been proposed to predict and fill in the missing values. Given the advances in predictive modeling algorithms tuned in an Automated Machine Learning (AutoML) setting, a question that naturally arises is to what extent sophisticated imputation algorithms (e.g., Neural Network based) are really needed, or whether we can obtain decent performance using simple methods like Mean/Mode (MM). In this article, we experimentally compare six state-of-the-art representatives of different imputation algorithmic families from an AutoML predictive modeling perspective, including a feature selection step and combined algorithm and hyper-parameter selection. We used a commercial AutoML tool for our experiments, in which we included the selected imputation methods. Experiments ran on 25 real-world binary classification datasets with missing values and 10 complete binary classification datasets in which synthetic missing values are introduced according to different missingness mechanisms, at varying missing frequencies. The main conclusion drawn from our experiments is that the best method on average is the Denoise AutoEncoder on real-world datasets and MissForest on simulated datasets, followed closely by MM. In addition, binary indicator variables encoding missingness patterns actually improve predictive performance, on average. Last, although there are cases where Neural-Network-based imputation significantly improves predictive performance, this comes at a great computational cost and requires measuring all feature values to impute new samples.


1 INTRODUCTION

Real-world data often contain missing values, stemming from faulty sensors, non-responders in questionnaires, incomplete data entry, or other reasons. For example, in the OpenML portal, as of March 2022, 364 of the 3,487 active datasets contain missing values. Unfortunately, most Machine Learning (ML) algorithms demand complete datasets on which to operate.\(^{1}\) To address this problem, a plethora of imputation algorithms, ranging from simple to very advanced, have been developed to predict the missing values and allow the remaining algorithms in the analysis pipeline to complete.

The problem of imputation has been under study for decades [28, 47, 62]. Initially, it was studied in the context of estimating the coefficients of linear models, which we call the estimation perspective. In contrast, we study imputation from a predictive modeling perspective, where the goal is to create an accurate model to predict a specific outcome of interest (target variable) in new samples. There are important differences between these two perspectives. Under the estimation perspective, (a) some methods impute the missing values in the training data but do not create an imputation model that is able to impute test data [15, 77]. Hence, these methods cannot be applied to predictive modeling. In addition, (b) standard guidelines [67] suggest using the outcome in imputing feature values, e.g., to differentiate imputation values in cases vs. controls. This technique is not applicable in predictive modeling, where the outcome is unknown in test samples. Finally, (c) a useful metric of imputation efficacy under the estimation perspective is the imputation accuracy [29, 34], i.e., the accuracy of predicting the missing values. Imputation accuracy is important for estimation purposes but may not be indicative of the impact of imputation on predictive performance.

Under the predictive modeling perspective, several interesting questions arise as follows:

Are advanced predictive modeling algorithms in need of imputation beyond the simple Mean/Mode (MM) technique? A non-linear algorithm could potentially learn a rule of the sort “if a feature value equals its mean (i.e., it is missing), then do not use it but instead rely on other observed features values for prediction.” Hence, it is questionable whether imputation would provide an advantage to such an algorithm.

Is the need for sophisticated imputation further reduced in an Automated Machine Learning (AutoML) context, where combined algorithm and hyper-parameter selection (CASH) optimization [68] searches for the most appropriate combination of algorithm and hyper-parameter values?

Do Binary Indicator (BI) variables (1 if the value of a feature is missing and 0 otherwise) encoding the missingness patterns provide additional information to a classifier to learn a predictive model?

How does the feature selection step interact with imputation? Feature selection aims to reduce the number of features that enter the model without sacrificing predictive performance and leads to more interpretable models by providing insights regarding the underlying data generation. It remains open how the benefits of feature selection are impacted when we impute the missing values.

What is the tradeoff between the computational overhead of imputation and the improvement in predictive performance? Imputation algorithms impute all the missing values, independently of whether they contribute to the predictions of the model. In other words, imputation is unsupervised and not guided by the outcome to predict. Hence, they potentially perform a significant amount of unnecessary computations.

If imputation algorithms indeed improve performance, then are there any characteristics of the datasets (called meta-features) that allow us to predict the value of imputation prior to their analysis and decide whether imputation is worth the computational overhead?

To the best of our knowledge, this is the first empirical study that answers all the above research questions via an experimental evaluation over 25 binary classification real-world datasets, as well as 10 complete datasets in which synthetic missing values are introduced according to different missingness mechanisms, at varying missing frequencies. The MM imputation is used as a baseline and is compared against state-of-the-art representatives of different imputation algorithmic families, namely Discriminative, such as MissForest [66], and Generative, such as SoftImpute [44] and probabilistic principal component analysis (PPCA) [70], which exploit matrix factorization, or Generative Adversarial Imputation Nets (GAIN) [83] and Denoise AutoEncoder (DAE) [21], which are based on Neural Networks. The imputation algorithms are integrated into the Just Add Data Bio (JADBio) AutoML platform [73], which performs CASH and includes a feature selection step.

In summary, the results show that the single best-performing algorithm is DAE and MissForest for the real and the simulated datasets, respectively. For five of the six imputation algorithms studied, the inclusion of BI variables is beneficial, on average. MM, when BI variables are included and CASH is taking place, is a close competitor and places as the second-best algorithm. Advanced imputation methods do offer a significant advantage but only in a few datasets. In contrast, they require the measurements of all feature values to impute new samples, which in some way invalidates the feature selection step and leads to models of high dimensionality. In addition, they require orders of magnitude more computational time. Meta-level analysis has indicated that only one feature is correlated with the relative performance of the algorithms; unfortunately, the correlation is not statistically significant when corrected for multiple testing. More datasets and new meta-features are needed to extract patterns of when sophisticated imputation should be used over the simple MM.

Overall, in an AutoML setting where optimization is taking place and BI variables are included, MM is a reasonable option; other algorithms should be used only if feature selection is not required and computational time is of little importance relative to improving predictive performance.

The article is organized as follows. Section 2 introduces missing data mechanisms and a taxonomy of imputation families. In Section 3, we present the experimental environment, the selected datasets for evaluation, and the metrics and hyper-parameters tuned. Section 4 describes the missing data generation procedure. The experimental results for real-world data with missing values and for simulated missing data are presented in Sections 5 and 6, respectively. In Section 7, we discuss the results of the meta-level analysis on real-world datasets. Related work is discussed in Section 8, followed by the contributions and lessons learned in Section 9. Finally, Section 10 presents the conclusions and limitations of the study. Detailed information about the datasets, the missing value simulation setup, and the experimental results is provided in Appendices A, B, and C, respectively.


2 BACKGROUND AND CONTEXT

2.1 Missingness Mechanisms

The concept of a missing mechanism [62] formalizes the generation process of missing data. In this respect, the BI are modeled as random variables and assigned a distribution. There are three types of underlying mechanisms that generate missing data, namely, missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). For formal definitions of these mechanisms, readers are referred to Reference [40]. Intuitively, MCAR implies that the probability of a value missing is independent of the actual value, the other observed quantities, and any latent variables. MAR implies that the missingness only depends on the observed data (so it can be predicted). MNAR refers to the case that the missing values are related to both the observed and unobserved variables, including the missing value itself. When missingness is MNAR, it is in principle and in general not possible to impute the missing values in a way that follows the unknown underlying data distribution.\(^{2}\)

An illustrative example is given in Figure 1, which is adapted from Reference [46]. The missingness mechanisms can be described using a causal graph. Let us assume A and B are observed random variables and O a latent variable. Each variable is depicted as a node of the graph. Assume that A and B have a direct connection to O, which is the variable (node) of interest. The \(R_o\) node is a mask variable that denotes the missingness inserted into O, which causes \(O^*\). \(O^*\) is a surrogate of O but with missing values inserted in the positions specified by \(R_o\). As seen in Figure 1(a), MCAR missing values do not depend on any of the variables A, B, O. In contrast, missingness depends on B for the MAR mechanism, and on O itself for the MNAR mechanism, as seen in Figure 1(b) and (c).


Fig. 1. Panel (a) illustrates the MCAR missingness mechanism, panel (b) the MAR missingness, and panel (c) the MNAR missingness. A and B are random variables, while O is the observed variable of interest. \(R_o\) is a mask variable that represents the missingness in O. \(O^*\) is the result of applying the \(R_o\) mask to the O variable.
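
The three mechanisms can be illustrated with a small simulation. The sketch below is not the generation procedure used in this study (that procedure is described in Section 4); it is a minimal standard-library illustration. The helper names `make_mask` and `apply_mask`, the rank-based threshold rule, and the 50% missingness rate are assumptions made only for this example, and the toy variables B and O are drawn independently, unlike the connected nodes of Figure 1.

```python
import random

def make_mask(o_values, b_values, mechanism, rate=0.5, seed=0):
    """Return the mask R_o as a list of booleans (True = entry of O is missing).

    Hypothetical helper for illustration only:
      MCAR: each entry is dropped with probability `rate`, ignoring all data.
      MAR:  entries are dropped where the observed variable B is largest.
      MNAR: entries are dropped where O itself is largest.
    """
    rng = random.Random(seed)
    n = len(o_values)
    if mechanism == "MCAR":
        return [rng.random() < rate for _ in range(n)]
    if mechanism == "MAR":
        driver = b_values
    elif mechanism == "MNAR":
        driver = o_values
    else:
        raise ValueError(f"unknown mechanism: {mechanism!r}")
    # Drop the top `rate` fraction of entries, ranked by the driver variable.
    cutoff = sorted(driver)[min(int((1 - rate) * n), n - 1)]
    return [d >= cutoff for d in driver]

def apply_mask(o_values, mask):
    """Build O* from O: masked positions become None."""
    return [None if missing else o for o, missing in zip(o_values, mask)]

rng = random.Random(42)
B = [rng.gauss(0, 1) for _ in range(1000)]
O = [rng.gauss(0, 1) for _ in range(1000)]  # independent of B in this toy setup

for mech in ("MCAR", "MAR", "MNAR"):
    o_star = apply_mask(O, make_mask(O, B, mech))
    observed = [v for v in o_star if v is not None]
    print(mech, "mean of observed O:", round(sum(observed) / len(observed), 2))
```

In this toy setup, the mean of the observed part of O stays close to the full-data mean under MCAR and MAR, but shifts downward under MNAR because the largest values of O are exactly the ones removed. This is the intuition behind the statement that MNAR values cannot, in general, be imputed in a way that follows the underlying distribution.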

2.2 An Imputation Family Taxonomy

There are numerous imputation algorithms and approaches in the literature, and we do not attempt a full review. Readers are encouraged to explore comprehensive surveys available in the field for a more in-depth understanding [2, 4, 12, 13, 24]. Imputation approaches can be partitioned into various distinct families/groups of methods. A taxonomy is attempted in Figure 2. First, imputed values can be decided based on only the feature with the missing value (Univariate imputation) or several features (multivariate imputation). The former methods include Mean/Median/Max imputation for continuous data and Mode imputation for categorical data. Multivariate methods can be partitioned into Iterative and Distance-Based, also known as Hot-Deck methods [6]. Distance-based methods employ a distance or a similarity metric for samples to find neighbors or cluster them. A commonly used algorithm in this category is the K-nearest neighbors imputation (KNNi) [71], which imputes values based on the neighbors of the sample with missing values. K-means-based methods cluster the samples before imputation [37].


Fig. 2. Taxonomy of imputation families: rectangular nodes represent families, while oval nodes represent algorithms of that family.

Iterative methods start with a simple initial guess (e.g., using MM imputation) and, in each iteration, try to improve the imputed values. We further split iterative methods into Discriminative and Generative. Discriminative methods build a predictive model per feature with missing values, given the other features in the dataset. This model is used to predict the missing values of the corresponding feature in each iteration. The Discriminative family can utilize either a (generalized) linear model or a non-linear model. Linear discriminative methods include Multivariate Imputation by Chained Equations (MICE) [76]. Non-linear discriminative methods include the MissForest algorithm [66], which employs Random Forests, and Datawig [9], which can impute continuous, categorical, and text data by employing different loss functions according to the missing features’ datatype.

Generative methods try to model the joint distribution of the data and use the generative model to impute values. They can be split into two categories: methods that employ matrix factorization and methods that use neural networks. The matrix-factorization family includes low-rank matrix decomposition methods: first, missing values are imputed with an initial guess; then the matrix is decomposed (factorized) and the decomposition is used to predict the missing values. Imputation is improved in each cycle via expectation-maximization steps. Examples of this family include PPCA [70], SVDImpute [71], bPCA [10], and SoftImpute [44]. Such algorithms scale better w.r.t. the number of features than MICE or MissForest, which train a different model for each feature with missing values in each iteration. Recently, neural networks have also been tried as generative models. These algorithms are essentially non-linear alternatives to matrix factorization: they start with an initial guess and then train a neural network that learns the joint distribution. This family includes methods based on AutoEncoders (AE), such as DAE [21, 35, 41] and Variational Autoencoder (VAE) [17, 45, 61], as well as generative adversarial networks (GAIN) [64, 83]. Finally, HoloClean, a data cleaning tool, implements an attention-based neural network for imputation, named Aimnet [82]. A detailed comparison of imputation methods is given in Section 8. In Section 2.4, we explain the rationale for our choice to include in our empirical study a subset of the aforementioned imputation methods.

2.3 Description of the Selected Imputation Methods

In this section, we present the main characteristics of the imputation methods given in Table 1 that we included in our testbed. In the analysis of their computational complexity, n denotes the number of samples, m the number of features, k the number of iterations, \(\#comp\) the number of principal components, \(\#sing\) the number of singular values, and \(\#trees\) the number of trees.

Table 1. Comparison of the Imputation Methods and Their Characteristics

| Algorithm | Model Family | Base Model | Learning Procedure | Categorical Handling | Approx. Complexity |
|---|---|---|---|---|---|
| MM | Univariate | | Non-iterative | Native | O(\(n \cdot m\)) |
| SOFT | Generative | SVD | Iterative | One-hot-encoding | O(\(k \cdot n \cdot m \cdot \#sing\)) |
| PPCA | Generative | PCA | Iterative | One-hot-encoding | O(\(k \cdot n \cdot m \cdot \#comp\)) |
| MF | Discriminative | RF | Iterative | Native | O(\(k \cdot m^2 \cdot n \cdot \log (n) \cdot \#\text{trees}\)) |
| GAIN | Generative | GAN | Iterative | Ordinal-Encoding | |
| DAE | Generative | AE | Iterative | One-hot-encoding | |

  • Abbreviations: n is the number of samples, m is the number of features, k is the number of iterations, \(\#trees\) stands for the number of trees in the forest (a hyper-parameter), \(\#comp\) stands for the number of principal components employed in the matrix factorization, and \(\#sing\) for the number of singular values of the SVD.
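
To give a feel for how far apart the expressions in Table 1 are, the following sketch simply evaluates them for one made-up dataset size. The function name `approx_cost`, the example values of n, m, k, \(\#sing\), \(\#comp\), and \(\#trees\), and the choice of a base-2 logarithm are all assumptions for illustration; the results are unitless counts that are meaningful only relative to one another, not running-time predictions.

```python
import math

def approx_cost(method, n, m, k=1, n_sing=1, n_comp=1, n_trees=1):
    """Evaluate the approximate complexity expressions listed in Table 1.

    Returns a unitless operation count for relative comparison only.
    The base-2 logarithm is an assumption; Table 1 does not fix the base.
    """
    if method == "MM":
        return n * m
    if method == "SOFT":
        return k * n * m * n_sing
    if method == "PPCA":
        return k * n * m * n_comp
    if method == "MF":
        return k * m**2 * n * math.log2(n) * n_trees
    # GAIN and DAE have no closed-form entry in Table 1.
    raise ValueError(f"no closed-form estimate listed for {method!r}")

# Hypothetical dataset size and settings, chosen only for illustration.
n, m, k = 1000, 50, 10
baseline = approx_cost("MM", n, m)
for method, kwargs in [("MM", {}), ("SOFT", {"n_sing": 10}),
                       ("PPCA", {"n_comp": 10}), ("MF", {"n_trees": 250})]:
    cost = approx_cost(method, n, m, k=k, **kwargs)
    print(f"{method:5s} ~{cost / baseline:,.0f}x the MM cost")
```

Under these assumed settings, the matrix-factorization methods come out two orders of magnitude above MM, and MissForest several orders of magnitude above that, mainly because of the \(m^2\) and \(\#trees\) factors.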

2.3.1 Mean/Mode.

MM is the most common imputation method in AutoML tools and is included as the baseline methodology. It is an instance of the univariate imputation family. In the MM algorithm, missing values are imputed with the mean in the training data of the corresponding feature if it is continuous and the mode (most frequent value) if it is discrete. MM is the most computationally efficient method as it needs only \(O(n \cdot m)\) to impute the whole dataset. A variation of MM imputation is mentioned in medical literature [51] where missing values of a sample are imputed based on the mean/mode of the class to which it belongs. However, in the case of predictive modeling, this approach becomes problematic as the class of a sample is unknown during inference, as discussed in Section 1.
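
The key point for predictive modeling is that the fill values are learned on the training data and then reused, unchanged, on new samples. A minimal standard-library sketch of this follows; the function names `fit_mean_mode` and `impute`, the use of `None` to mark a missing entry, and the toy data are assumptions for illustration and do not reflect the JADBio implementation.

```python
from collections import Counter
from statistics import mean

def fit_mean_mode(rows, categorical):
    """Learn one fill value per column from training rows (None = missing).

    `categorical` is a set of column indices treated as discrete; all other
    columns are treated as continuous. Assumes every column has at least
    one observed value in the training data.
    """
    fill = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        if j in categorical:
            fill.append(Counter(observed).most_common(1)[0][0])  # mode
        else:
            fill.append(mean(observed))  # mean
    return fill

def impute(rows, fill):
    """Replace missing entries with the fill values learned on training data."""
    return [[fill[j] if v is None else v for j, v in enumerate(r)] for r in rows]

train = [[1.0, "a"], [3.0, None], [None, "a"], [5.0, "b"]]
test = [[None, None], [2.0, "b"]]

fill = fit_mean_mode(train, categorical={1})
print(fill)                # [3.0, 'a']
print(impute(test, fill))  # [[3.0, 'a'], [2.0, 'b']]
```

Both passes touch each cell once, which is consistent with the \(O(n \cdot m)\) cost stated above.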

2.3.2 MissForest.

MissForest (MF) is a discriminative iterative method based on Random Forests [66]. First, the missing values are imputed by Mean/Mode. Subsequently, for each feature with missing values serving as the outcome, the algorithm trains a random forest on the rest of the features and uses it to predict the outcome’s missing values. After imputing all missing values, the algorithm uses the (now) complete dataset to warm-start the new iteration until a stopping criterion is met. MF is one of the slowest methods as it requires building a forest per feature for a number of iterations. The approximate worst case is \(O(k \cdot m^2 \cdot n \cdot \log (n) \cdot \#\text{trees})\). MF encounters scalability issues in datasets with more than 50 features. In addition, MF needs to store a forest for each feature, which creates model-storing issues. To avoid such issues, in our experiments we limit the maximum allowed depth of the Random Forests.

2.3.3 Probabilistic PCA.

PPCA is a statistical iterative method [39]. In each iteration, a principal component analysis (PCA) is performed, which is improved in the next step using maximum likelihood estimation [58] and assuming a multivariate Gaussian distribution of the data. To impute a new sample, the optimal set of principal components found from training is used to identify the missing values that maximize the joint probability of the sample. Categorical features are one-hot-encoded before applying PPCA and are inverse transformed after PPCA returns the imputed data. PPCA is one of the fastest methods, scaling linearly with the number of samples, the number of features, and the number of principal components computed. The approximate complexity for PPCA is \(O(k\cdot n\cdot m\cdot \#\text{comp})\).

2.3.4 SoftImpute.

SoftImpute (SOFT) is a statistical iterative method [44]. It starts by initializing the missing values with the mean. Then, it solves the optimization problem on the complete matrix using a soft-thresholded SVD and repeats this step until a stopping criterion is met. Categorical features are one-hot-encoded before applying SOFT and are inverse transformed after SOFT returns the imputed data. SOFT, like PPCA, is very fast, utilizing an EM approach. The approximate time complexity is \(O(k\cdot n \cdot m \cdot \#\text{sing})\).

2.3.5 Denoise Autoencoder.

DAE is a deep learning algorithm based on autoencoders [21]. The Denoise Autoencoder is based on an overcomplete implementation and a dropout layer. DAE projects the input data to a higher-dimensional subspace where the missing data are recovered by the decoder. The categorical data are one-hot-encoded before DAE is applied. Then the one-hot-encoded data are transformed back to the original representation. The complexity of the DAE is mostly measured by the number of epochs needed for the algorithm to impute the dataset accurately and the hidden layers’ size and depth. For further information see Section 3.8.

2.3.6 Generative Adversarial Imputation Nets.

GAIN is an adaptation of the GAN framework [83]. A generator is used to impute missing data based on the observed data. The discriminator tries to determine which data are observed and which are imputed. The goal of the generator is to provide an accurate imputation, whereas the goal of the discriminator is to distinguish between the observed and the imputed data. The two neural networks are trained in an adversarial process. Categorical data are turned into ordinal features and normalized between 0 and 1. After applying the GANs, we revert the categorical data to the original representation by applying the inverse procedure. The complexity of GAIN is mostly bottlenecked by the number of iterations needed to train the GANs. See Section 3.8 for more details.

2.3.7 Binary Indicators.

BI is not an imputation method but a feature construction method. Specifically, for each feature \(F_j\) with missing values, we construct a new feature \(I_j\), whose value \(I_{jk}\) indicates whether the value of feature j at the kth sample is missing. The idea is to encode the missingness pattern in \(I_j\). BIs may help the classifier learn whether to trust the value \(F_{jk}\) for prediction. BI can complement any imputation method. BI’s complexity is \(O(n \cdot m)\); however, it increases the complexity of subsequent stages of the ML pipeline by increasing the dataset’s dimensionality by a factor of at most 2. Note that BI-extended imputation methods do not use the BIs during the imputation phase; instead, the BIs are appended to the imputed dataset.
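
The construction can be sketched as follows. The helper names `binary_indicators` and `merge`, the use of `None` for missing entries, and the hand-filled "imputed" rows are assumptions for illustration; any imputation method could produce the imputed rows. The indicators are computed from the raw data and only joined to the imputed data afterwards.

```python
def binary_indicators(rows, columns=None):
    """Build 0/1 indicator columns I_j for features with missing values.

    If `columns` is None, it is inferred as the features that have at least
    one missing entry in `rows` (the training data). Pass the returned
    `columns` again when encoding new samples so the layout stays the same.
    """
    if columns is None:
        columns = [j for j in range(len(rows[0]))
                   if any(r[j] is None for r in rows)]
    indicators = [[1 if r[j] is None else 0 for j in columns] for r in rows]
    return indicators, columns

def merge(imputed_rows, indicators):
    """Append the indicator columns to the already imputed feature columns."""
    return [list(x) + ind for x, ind in zip(imputed_rows, indicators)]

raw = [[1.0, None, 7.0],
       [None, 2.0, 8.0],
       [3.0, 4.0, 9.0]]
indicators, cols = binary_indicators(raw)      # cols == [0, 1]

# Output of some imputer applied to `raw` (here: column means, filled by hand).
imputed = [[1.0, 3.0, 7.0],
           [2.0, 2.0, 8.0],
           [3.0, 4.0, 9.0]]
print(merge(imputed, indicators))
# [[1.0, 3.0, 7.0, 0, 1], [2.0, 2.0, 8.0, 1, 0], [3.0, 4.0, 9.0, 0, 0]]

# A new sample is encoded with the columns chosen on the training data.
new_indicators, _ = binary_indicators([[None, 5.0, None]], columns=cols)
print(new_indicators)                          # [[1, 0]]
```

Because only features with missing values receive an indicator, the number of columns can at most double, in line with the bound stated above.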

2.4 Rationale of the Selection of Algorithms

MM imputation is selected as a baseline and as one of the most commonly used methods. MissForest is selected over MICE as a representative of multivariate iterative imputation methods, based on the results on imputation accuracy presented in Reference [79]. Encoding missingness information using BI is also experimentally evaluated, as it performed better than other methods in Reference [43]. Distance-based methods are excluded for two reasons. First, they need to memorize the full dataset to produce imputations, as they do not learn a model. Second, according to other empirical studies, k-means and KNN imputation are outperformed by MissForest [30, 42, 66].

As representatives of matrix completion-based methods, we chose PPCA, which has shown the best performance in previous empirical studies [25, 31, 58]. SoftImpute was also selected based on the experimental results presented in Reference [85]. GAIN and DAE were also included as representatives of neural network–based methods, as they excel in several studies [11, 54]. The former uses a generative adversarial network to learn the probability distribution from which to impute, and the latter uses an autoencoder. VAE and Aimnet were not included because their results are inferior or comparable to those of GAIN and MM, respectively [11, 38]. Finally, Datawig [9] is not included as, according to the results reported in Reference [9], it is outperformed by MissForest for both continuous and categorical data, while it incurs the high cost of fitting one neural network per feature with missing values.

2.5 How Is Imputation Treated in AutoML Platforms

AutoML platforms employ imputation methods, as well as modeling algorithms that directly treat missing values as a separate category. The current versions of JADBio (version v1.4) and AutoSklearn [16] employ MM by default, while DataRobot\(^{3}\) may also include BI variables. AutoSklearn allows the user to specify additional imputation methods to optimize over as part of the pipeline. TPOT employs median imputation for all missing features [36]. Auger.AI\(^{4}\) does iterative regression or mean imputation for numerical features depending on the dataset size and creates a new category for the categorical features. BigML\(^{5}\) by default does not impute the missing values; they are handled internally by its predictive models, which are based on trees only. Autoprognosis optimizes the ML pipeline over a variety of missing data imputation algorithms. Specifically, it employs MICE, MissForest, Bootstrapped Expectation-Maximization imputation, Soft-Impute, and MM [3]. DriverlessAI by H2O creates a new value to express missingness when the XGBoost, LightGBM, and RuleFit algorithms are used. For generalized linear models, it performs MM imputation, while for tensorflow models missing values are treated as outliers [22]. GAMA [20] does not impute missing values by default. Autogluon [14] uses median imputation for continuous features and introduces a new “Unknown” category for categorical features.


3 EXPERIMENTAL SETUP

We now present the design choices for the experimental setup and the comparative evaluation.

3.1 Datasets

Incomplete Real Datasets. There are currently 364 datasets with missing values in the OpenML repository [78]; we restricted our selection to binary classification datasets. We selected 25 binary classification datasets in an effort to cover a range of dataset characteristics. The datasets contain both continuous and discrete features. The number of features ranges from 7 to 69, the sample size ranges from 155 to 31,406, the prevalence of the minority class ranges from 0.06 to 0.48, the number of features with at least 1% missing values ranges from 1 to 32, and, finally, the percentage of missing values ranges from 1.11% to 71.64%. Table 6 presents the characteristics of the datasets, along with their OpenML id.

Table 2. The Set of Values Tried for Each Hyper-Parameter Tuned

| Algorithm | Hyper-parameter | Value |
|---|---|---|
| Mean/Mode | | |
| MissForest | n-trees | 250 |
| | maxDepth | 20, 30 |
| | maxLeafNodes | 30 |
| SoftImpute | variance-explained | 50%, 70%, 90% |
| PPCA | variance-explained | 50%, 70%, 90% |
| DAE | dropout | 0.25, 0.4, 0.5 |
| | batch-size | 64 |
| | \(\theta\) | 5, 7, 10 |
| | epochs | 500 |
| GAIN | alpha | 0.1, 1, 10 |
| | hint-rate | 0.5, 0.9 |
| | batch-size | 64 |
| | epochs | 10,000 |

  • For each algorithm, all combinations of values for its hyper-parameters shown were tried and combined with all other choices for feature selection and modeling by JADBio. There are 48 combinations of imputation algorithms and hyper-parameter values. The default hyper-parameter values are underlined.
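
The count of 48 can be checked by enumerating the grid of Table 2. In the sketch below, the dictionary layout and the function name `configurations` are illustrative choices. The grid itself yields 24 imputer configurations; reading the reported 48 as each configuration being run with and without BI variables is our assumption, not something the table note states explicitly.

```python
from itertools import product

# Hyper-parameter grid transcribed from Table 2 (percentages as fractions;
# the GAIN epochs entry is read as ten thousand).
grid = {
    "Mean/Mode": {},
    "MissForest": {"n-trees": [250], "maxDepth": [20, 30], "maxLeafNodes": [30]},
    "SoftImpute": {"variance-explained": [0.5, 0.7, 0.9]},
    "PPCA": {"variance-explained": [0.5, 0.7, 0.9]},
    "DAE": {"dropout": [0.25, 0.4, 0.5], "batch-size": [64],
            "theta": [5, 7, 10], "epochs": [500]},
    "GAIN": {"alpha": [0.1, 1, 10], "hint-rate": [0.5, 0.9],
             "batch-size": [64], "epochs": [10000]},
}

def configurations(grid):
    """Yield (algorithm, hyper-parameter dict) for every grid combination."""
    for algorithm, params in grid.items():
        names = list(params)
        # product() over zero lists yields one empty combination, so an
        # algorithm without hyper-parameters contributes exactly one entry.
        for values in product(*(params[name] for name in names)):
            yield algorithm, dict(zip(names, values))

configs = list(configurations(grid))
print(len(configs))  # 24

# Assumption: every configuration is tried with and without BI variables.
with_bi = [(alg, hp, use_bi) for alg, hp in configs for use_bi in (False, True)]
print(len(with_bi))  # 48
```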

Table 3. p-values of the Matched t-test and the q-values after FDR Correction (Sorted)

| Base Imp. Method | p-value | q-value |
|---|---|---|
| MF | 0.036 | 0.182 |
| PPCA | 0.064 | 0.182 |
| GAIN | 0.091 | 0.182 |
| MM | 0.148 | 0.222 |
| SOFT | 0.31 | 0.375 |
| DAE | 0.552 | 0.552 |

  • Only MF has p-value < 0.05. GAIN and PPCA have p-value < 0.1. Setting the q-value threshold to 0.25 leads to accepting the hypothesis that BIs are beneficial for four algorithms (MF, GAIN, PPCA, MM), expecting 25% (one of four) of these discoveries to be false on average.
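
The relation between the p-values and q-values in Table 3 can be illustrated with the Benjamini-Hochberg procedure. The source only says "FDR correction", so the choice of Benjamini-Hochberg is an assumption here; it does reproduce the reported q-values up to the rounding of the printed p-values (for SOFT it gives 0.372 from the rounded p-value 0.31, versus the reported 0.375). The function name `benjamini_hochberg` is illustrative.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the same order as `p_values`.

    For the i-th smallest p-value (rank i of m), the raw value is p * m / i;
    q-values are made monotone by taking a running minimum from the largest
    rank downwards.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

# p-values as printed in Table 3.
p = {"MF": 0.036, "PPCA": 0.064, "GAIN": 0.091,
     "MM": 0.148, "SOFT": 0.31, "DAE": 0.552}
q = dict(zip(p, benjamini_hochberg(list(p.values()))))

for name, q_value in q.items():
    print(f"{name:5s} q = {q_value:.3f}")

accepted = [name for name, q_value in q.items() if q_value <= 0.25]
print(accepted)  # ['MF', 'PPCA', 'GAIN', 'MM']
```

Note that MF, PPCA, and GAIN share the same q-value: the running minimum propagates the smallest adjusted value (that of GAIN, 0.091 · 6 / 3) down to the lower-ranked p-values.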

Table 4. Meta-features Used in the Meta-level Analysis

| Name | Category | Description |
|---|---|---|
| inst_to_attr | General | Samples to features ratio |
| Minority Class % | General | % of minority class |
| nr_attr | General | Number of features |
| nr_inst | General | Number of samples |
| n_num | General | Number of numerical features |
| n_cat | General | Number of categorical features |
| % NA | Missing | % of missing values in data |
| % samples /w NA | Missing | % of samples with missing values |
| % features /w NA | Missing | % of features with missing values |
| % NA/Feat. /w (NA 1+%) | Missing | Mean % of missing values per feature with more than 1% missing |
| # components 50% | Clustering | Number of components that explain 50% of data variance |
| # components 70% | Clustering | Number of components that explain 70% of data variance |
| # components 90% | Clustering | Number of components that explain 90% of data variance |
| Slh (k=2) | Clustering | Mean Silhouette Coefficient of all samples when using 2 clusters |
| Slh (k=3) | Clustering | Mean Silhouette Coefficient of all samples when using 3 clusters |
| Slh (k=4) | Clustering | Mean Silhouette Coefficient of all samples when using 4 clusters |

  • The first column contains the name of the meta-feature, the second column denotes the category of the meta-feature, and the third column provides a brief explanation of the meta-feature.

Table 5. An Overview of Related Work on Predictive Modeling

| Study | #Datasets | Mechanism | % Missing values | #Imp. methods | NNs | BI | System | FS | Tuning | #Models | Eval | Metric | Meta |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [38] | 6B | Nat | 7–84% | 3N, 2C, 2M | No | No | Adhoc | No | Imp+Pred | 7 | R-TT(70-30) | ACC-F1 | No |
| [81] | 13B | Nat | 0.6–33.6% | 2N, 1C, 4M | No | No | Adhoc | No | No | 5 | TT(80-20) | AUC-F1 | No |
| [57] | 2B | MC, MR | 0–40%\(^{**}\) | 7C | No | No | Adhoc | No | No | 3 | TT(66.6-33.3) | ACC | No |
| [8] | 5B, 5R | MC-MN | 10–50% | 8M | No | No | Adhoc | No | Imp | 4 | R-TT(50-50) | ACC-R2 | No |
| [30] | 31B, 21R, 17M | MC, MR, MN | 1–50%\(^{*}\) | 6M | Yes | No | Adhoc | No | Imp | 2 | CV(3-5) | RMSE-F1 | No |
| [55] | 10B, 3R | Nat | - | 7M | No | Yes | Adhoc | No | Pred | 3 | NCV | ACC | No |
| [19] | 23B,M | MC | 7% | 6M | No | No | AutoML | Yes | - | 10 | R-TT(75-25) | ACC | No |
| [49] | 5B | Nat | - | 4N, 2C, 1M | No | No | AutoML | - | Imp+Pred | Ensemble | CV(5) | B-ACC | No |
| Ours | 35B | N, MC, MR | 1–72% | 6M | Yes | Yes | AutoML | Yes | Imp+Pred | 4 | TT(50-50) | ACC-F1-AUC | Yes |

  • Most benchmarks use datasets with either simulated or native missing values, but not both. Abbreviations: the # symbol means number; “-” denotes that the paper does not mention any details about the topic. In the #Datasets column, B, M, and R denote binary, multiclass, and regression datasets, respectively. In the Mechanism column, Nat, MC, MR, and MN denote Native, MCAR, MAR, and MNAR. In the #Imp. methods column, N, C, and M denote numerical, categorical, and mixed imputation methods, respectively. The NNs column denotes whether Neural Network imputation methods were included. BI means that methods were extended with BIs, FS means feature selection was included in the pipeline, and Tuning represents whether the study tuned the imputation methods (Imp), the predictive models (Pred), or both (Imp+Pred). The Eval column presents the evaluation methodology: R denotes repeated; TT denotes a train-test split, with the percentages of the train and test sets in parentheses; CV denotes cross-validation, with the number of folds in parentheses; and NCV denotes Nested Cross Validation. The Metric column denotes the metric used in the evaluation: ACC is classification accuracy, B-ACC is balanced accuracy, F1 is F1-score, RMSE is the root mean squared error, and AUC is the Area under the ROC curve. Meta presents whether a study has conducted a meta-level analysis. \(^{*}\)One of the features was made missing. \(^{**}\)Missing values were only generated on the train data.

Table 6.
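| Dataset | ID | Samples | Features | #Numerical | #Categorical | Missing % | Imbalance ratio | #Feat with miss>1% | %Missing/Feature |
|---|---|---|---|---|---|---|---|---|---|
| analcatdata_reviewer | 1,008 | 379 | 7 | 0 | 7 | 51.56 | 0.43 | 7 | 51.56 |
| audiology | 999 | 226 | 69 | 0 | 69 | 2.03 | 0.25 | 6 | 23.23 |
| anneal | 989 | 898 | 38 | 16 | 22 | 64.98 | 0.24 | 29 | 85.15 |
| autoHorse | 840 | 205 | 25 | 17 | 8 | 1.11 | 0.40 | 4 | 6.46 |
| braziltourism | 957 | 412 | 8 | 7 | 1 | 2.91 | 0.23 | 2 | 10.68 |
| bridges | 328 | 107 | 11 | 4 | 7 | 6.03 | 0.41 | 7 | 9.35 |
| cjs | 1,024 | 2,796 | 34 | 32 | 2 | 71.64 | 0.24 | 28 | 86.97 |
| colic | 27 | 368 | 22 | 7 | 15 | 23.80 | 0.37 | 19 | 27.50 |
| colleges_aaup | 897 | 1,161 | 15 | 13 | 2 | 1.47 | 0.30 | 6 | 3.68 |
| cylinder-bands | 6,332 | 540 | 39 | 24 | 15 | 4.74 | 0.42 | 23 | 7.93 |
| dresses-sales | 23,381 | 500 | 12 | 1 | 11 | 13.92 | 0.42 | 5 | 33.04 |
| eucalyptus | 990 | 736 | 19 | 14 | 5 | 3.21 | 0.29 | 6 | 9.95 |
| hepatitis | 55 | 155 | 19 | 6 | 13 | 5.67 | 0.21 | 11 | 9.56 |
| hungarian | 231 | 294 | 13 | 12 | 1 | 20.46 | 0.36 | 5 | 52.93 |
| kdd_el_nino-small | 839 | 782 | 8 | 8 | 0 | 7.45 | 0.35 | 4 | 14.90 |
| mushroom | 24 | 8,124 | 22 | 0 | 22 | 1.39 | 0.48 | 1 | 30.53 |
| pbcseq | 802 | 1,945 | 17 | 13 | 4 | 3.43 | 0.50 | 6 | 9.71 |
| primary-tumor | 1,003 | 339 | 17 | 0 | 17 | 3.90 | 0.25 | 2 | 32.74 |
| profb | 470 | 672 | 9 | 5 | 4 | 19.84 | 0.33 | 2 | 89.29 |
| schizo | 466 | 340 | 14 | 12 | 2 | 17.52 | 0.48 | 11 | 22.30 |
| sick | 38 | 3,772 | 29 | 7 | 22 | 5.54 | 0.06 | 7 | 22.96 |
| soybean | 1,023 | 683 | 35 | 0 | 35 | 9.78 | 0.13 | 32 | 10.68 |
| stress | 42,167 | 199 | 12 | 8 | 4 | 8.29 | 0.20 | 7 | 14.22 |
| vote | 56 | 435 | 16 | 0 | 16 | 5.63 | 0.39 | 16 | 5.63 |
| water-treatment | 940 | 527 | 36 | 36 | 0 | 2.86 | 0.15 | 22 | 4.53 |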
  • The table contains the dataset name, ID, number of samples, number of features, number of numerical and categorical features, the missingness percentage in the whole dataset, the imbalance ratio (proportion of the minority class), the number of features with more than 1% missing values, and the missingness percentage over features with missing values.

Table 6. Binary Classification Real-World Datasets Used in the Comparative Evaluation

Complete Datasets: We selected 10 complete datasets from OpenML, where we introduce and simulate missingness. The number of features ranges from 9 to 135, the sample size ranges from 101 to 5,473, and the prevalence of the minority class ranges from 10% to 49%. Table 7 contains these values for each dataset, along with their OpenML id.

Table 7.
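| Dataset | ID | #Samples | #Features | #Numerical | #Categorical | Minority Class % |
|---|---|---|---|---|---|---|
| Australian | 40,981 | 690 | 14 | 8 | 6 | 0.44 |
| boston | 853 | 506 | 13 | 12 | 1 | 0.41 |
| churn | 40,701 | 5,000 | 20 | 16 | 4 | 0.14 |
| compas-two-years | 42,193 | 5,278 | 13 | 7 | 6 | 0.47 |
| image | 40,592 | 2,000 | 135 | 135 | 0 | 0.21 |
| page-blocks | 1,021 | 5,473 | 10 | 10 | 0 | 0.1 |
| parkinsons | 1,488 | 195 | 22 | 22 | 0 | 0.25 |
| segment | 958 | 2,310 | 19 | 19 | 0 | 0.14 |
| stock | 841 | 950 | 9 | 9 | 0 | 0.49 |
| zoo | 965 | 101 | 16 | 1 | 15 | 0.41 |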
  • The table reports the dataset name, ID, number of samples, number of features, number of numerical and categorical features, and the proportion of the minority class (imbalance ratio).

Table 7. Binary Complete Datasets in Which We Inject Missing Values

3.2 Evaluation Task and Metric

We note that the evaluation concerns only binary classification. The main metric of predictive performance is the Area Under the ROC Curve (AUC). To save space and make interpretation easier, we report classification accuracy and F1-score results in Appendix C. Each dataset is split into a 50% training set and a 50% hold-out test set used only for performance evaluation. Because of the computational cost of the experimental procedure (see Section 3.6), each experiment was conducted only once. To compensate for the lack of repeated experiments, we applied statistical tests to the results before drawing conclusions.
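The protocol above can be sketched in a few lines of plain Python. This is only an illustration: the function names `split_half` and `auc` are hypothetical, the split shown is a simple unstratified shuffle (the text does not say whether stratification was used), and AUC is computed through its pairwise-ranking definition rather than with the tooling used in the experiments.

```python
import random


def split_half(n_samples, seed=0):
    """Return disjoint (train, test) index lists of (almost) equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    half = n_samples // 2
    return indices[:half], indices[half:]


def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical usage: fit the whole pipeline on `train`, score only on `test`.
train, test = split_half(11)
print(len(train), len(test))
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```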

3.3 AutoML Environment

To experiment with different imputation algorithms when CASH optimization is taking place, we employed the JADBio AutoML platform [73]. JADBio is a commercial product (a version of JADBio with basic functionality is freely available) but was offered to us for research purposes. JADBio includes feature selection as part of the ML pipeline and, thus, it can be used to study the effect of feature selection on imputation.

A quick description of JADBio’s architecture now follows. For each dataset to analyze, an internal knowledge base system, called Algorithm and Hyper-Parameter Space selection (AHPS) in Reference [73], selects the feature construction, preprocessing, feature selection, and modeling algorithms to try, along with a set of values for their hyper-parameters. The AHPS also selects the configuration evaluation protocol, e.g., 10-fold cross-validation, repeated cross-validation, or hold-out to estimate the performance of each configuration and select the winning one. The knowledge in AHPS is engineered by experienced analysts but also induced by meta-level learning algorithms.

The choices of the AHPS are based on the meta-features of the dataset (e.g., sample size, number of features), as well as the user preferences. For example, an algorithm that does not scale to the number of samples in the current dataset will not be selected by AHPS. The choice of the evaluation protocol also depends on the meta-features: for a typical-sized dataset, JADBio may run a 10-fold cross-validation; for a large balanced dataset, a hold-out; and for a small-sample or imbalanced dataset, a repeated cross-validation protocol.

Subsequently, JADBio executes all configurations, effectively performing a grid search for CASH optimization. However, JADBio includes pruning heuristics that may drop a configuration in the early folds of cross-validation if it is not deemed promising, departing from a pure grid search strategy [74]. Once all configurations have been executed, the final model is built on all available data using the winning configuration.

The final performance of the model produced with the winning configuration is the cross-validated AUC adjusted for the bias incurred due to multiple tries (called “winner’s curse” in statistics). This adjustment is conceptually equivalent to adjusting p-values in multiple hypotheses testing. JADBio uses the BBC-CV algorithm for the performance estimate adjustment [74]. In Reference [73], experiments on 360 omics datasets of small sample size show that this estimation protocol returns slightly conservative out-of-sample AUC performances of the returned model. Nevertheless, for the purposes of this article, JADBio’s performance estimation was not used; instead, the performances on the 50% held-out set are reported.
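To make the idea of the adjustment concrete, the following is a minimal sketch of a bootstrap bias correction in the spirit of BBC-CV, written from the general description of the method and not taken from JADBio. It assumes that the pooled out-of-sample scores of every configuration are available as `preds[c][i]`; the names `bbc_cv`, `preds`, and `n_boot` are hypothetical. In each bootstrap round the winner is chosen on the in-bag samples and scored on the out-of-bag samples, so the selection step never sees the data used for the estimate.

```python
import random


def auc(y_true, scores):
    """Pairwise-ranking AUC; ties between a positive and a negative count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bbc_cv(y, preds, n_boot=200, seed=0):
    """Average out-of-bag AUC of the configuration that wins on each bootstrap sample."""
    rng = random.Random(seed)
    n = len(y)
    estimates = []
    for _ in range(n_boot):
        in_bag = [rng.randrange(n) for _ in range(n)]
        drawn = set(in_bag)
        out_bag = [i for i in range(n) if i not in drawn]
        y_in = [y[i] for i in in_bag]
        y_out = [y[i] for i in out_bag]
        if len(set(y_in)) < 2 or len(set(y_out)) < 2:
            continue  # AUC is undefined when only one class is present
        # Select the winner on in-bag samples only.
        best = max(range(len(preds)),
                   key=lambda c: auc(y_in, [preds[c][i] for i in in_bag]))
        # Score that winner on samples it was not selected on.
        estimates.append(auc(y_out, [preds[best][i] for i in out_bag]))
    return sum(estimates) / len(estimates)
```

When many configurations are tried, this estimate is typically lower than the naive maximum of the cross-validated AUCs, which is the optimism the adjustment is meant to remove.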

Regarding the settings of JADBio employed in this set of experiments, we note the following. One of the user preferences indirectly controls the execution time and the number of configurations to try; it has the settings Preliminary, Typical, and Extensive, with Extensive trying more configurations and performing a more thorough optimization. All subsequent experiments were run using the Preliminary setting to make the computational requirements manageable. The number of configurations may vary between datasets depending on their meta-features, but in our experiments, it ranges from 900 to over 1,000. The training protocol of JADBio depends on the sample size, the class imbalance, and other factors. For typical-size datasets, JADBio uses a repeated 10-fold cross-validation with the number of repeats ranging from 1 to 20. A heuristic procedure stops repetitions of cross-validation if no progress is detected. Overall, JADBio uses estimation protocols that execute each configuration between 10 and 200 times per dataset to choose the winning configuration and produce a model.

JADBio optimizes over the following set of algorithms. For feature selection, JADBio uses the Lasso [69] and a variant of the SES algorithm [75] with an upper bound on the number of conditional independence tests to perform. For classification, it optimizes over Ridge Logistic Regression, Decision Tree, Random Forests, and Support Vector Machines with polynomial, linear, and radial basis kernels.

To evaluate imputation algorithms, we embedded them into the JADBio configurations as the second step, after the standardization of continuous features and before feature selection, using the API provided. It is important to note that configurations are cross-validated as an atom, and hence, learning to impute is based only on the training data. This is necessary to avoid overestimating the performances of configurations and, correspondingly, of the imputation methods. Each imputation method returns an imputation model that is used to impute the test data before modeling is applied. It is worth noting that, even if the feature selection step selects a small subset S of features, when some values of S are missing in the test set, the imputation model may impute them based on other features. Hence, even if the predictive model requires just the features in S, the predictive pipeline may require more features. Specifically, all multivariate algorithms selected in the article require all features to impute. Hence, the predictive pipeline always requires all features when these algorithms are employed, even with feature selection.
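The requirement that the imputation model is learned on training data only can be illustrated with a small Mean/Mode imputer with optional binary indicators. This is a plain-Python sketch, not the JADBio implementation. The class name `MeanModeImputer`, the encoding of missing values as `None`, and the choice to add indicators only for features that had missing values in the training data are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean


class MeanModeImputer:
    """Fill numeric features with the training mean and other features with the training mode."""

    def __init__(self, add_indicators=True):
        self.add_indicators = add_indicators

    def fit(self, rows):
        # Assumes every feature has at least one observed value in the training rows.
        columns = list(zip(*rows))
        self.fill_ = []
        for column in columns:
            observed = [v for v in column if v is not None]
            if all(isinstance(v, (int, float)) for v in observed):
                self.fill_.append(mean(observed))
            else:
                self.fill_.append(Counter(observed).most_common(1)[0][0])
        # Indicators are created only for features with missing training values.
        self.missing_cols_ = [j for j, column in enumerate(columns)
                              if any(v is None for v in column)]
        return self

    def transform(self, rows):
        out = []
        for row in rows:
            filled = [self.fill_[j] if v is None else v for j, v in enumerate(row)]
            if self.add_indicators:
                filled += [int(row[j] is None) for j in self.missing_cols_]
            out.append(filled)
        return out


# Statistics come from the training fold; the test fold is only transformed.
train = [[1.0, "a"], [3.0, None], [None, "a"]]
test = [[None, None]]
imputer = MeanModeImputer().fit(train)
print(imputer.transform(test))
```

Fitting inside each training fold, and only transforming the held-out fold, is what keeps the performance estimate of the whole configuration free of information from the test samples.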

3.4 Imputation Algorithms Implementations

We used JADBio version 1.4.0 for our experiments. The MM and BI methods were already implemented by the development team of the tool. For PPCA and SoftImpute, we relied on third-party R implementations from the pcaMethods package version 1.64.0 [65] and the softImpute package version 1.4.1, respectively. We implemented MissForest in Python 3.8.4 using the IterativeImputer and Random Forest models from scikit-learn 1.0.1 [53]. PyTorch version 1.7.1 [52] was used for the implementation of GAIN and DAE. We adapted the DAE implementation found at https://github.com/Harry24k/MIDA-pytorch to closely follow the description of DAE by the original authors in Reference [21]. We employed the GAIN implementation from https://github.com/dhanajitb/GAIN-Pytorch.
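As an illustration of the MissForest construction mentioned above, the sketch below wraps a scikit-learn `RandomForestRegressor` inside `IterativeImputer`. It is not the code used in the experiments: the helper name `make_missforest_like` is hypothetical, the `max_depth`, `max_leaf_nodes`, and `max_iter` values are placeholders rather than the values of Table 2, and only numerical features are handled (categorical features would need a classifier and an encoding step).

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor


def make_missforest_like(n_trees=250, max_depth=10, max_leaf_nodes=64, seed=0):
    """Iterative imputer whose per-feature model is a size-restricted random forest."""
    forest = RandomForestRegressor(
        n_estimators=n_trees,
        max_depth=max_depth,            # placeholder limit to keep the stored model small
        max_leaf_nodes=max_leaf_nodes,  # placeholder limit to keep the stored model small
        random_state=seed,
    )
    return IterativeImputer(estimator=forest, max_iter=5, random_state=seed)


# Hypothetical usage: learn the imputation model on training rows only.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
X[rng.random(X.shape) < 0.2] = np.nan
imputer = make_missforest_like(n_trees=5).fit(X[:30])
X_test_imputed = imputer.transform(X[30:])
```

The fitted object keeps one forest per imputed feature and iteration, which is why restricting tree depth and leaf count matters for storage.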

3.5 Machine Specifications

The predictive performance experiments of the article were conducted on a fedora-powered VM using 8-core AMD Threadripper 3970x at 3.7 GHz with 12 GB RAM. The neural networks were trained using CPUs. The execution time results reported were measured on an eight-core AMD Ryzen-3600x at 4.6 GHz with 16 GB RAM and Windows 11 OS.

3.6 Computational Resources Employed

During the experiments, more than 41 days of CPU time have been spent training more than 80,000 configurations to conduct the experiments mentioned in the article.

3.7 Availability of Code

The code is available in the GitHub repository https://github.com/mensxmachina/Imputation_in_AutoML. The repository contains scripts for the plots, the datasets, the meta-level analysis, and the basic implementation of each imputation algorithm.

3.8 Exploring the Hyper-parameter Space of Imputation Algorithms

In the experiments, 24 hyper-parameter (hp) value sets were tried for the imputation algorithms: MM (1 hp set), MissForest (2 hp sets), SoftImpute (3 hp sets), PPCA (3 hp sets), DAE (9 hp sets), and GAIN (6 hp sets). The values tried for each hyper-parameter are shown in Table 2. These choices were based on the defaults and suggestions of each algorithm’s authors. The 24 hp sets were coupled with all other choices of JADBio, multiplying by 24 the number of configurations normally tried. In subsequent experiments, each of these 24 hp sets is run on the original dataset, as well as on the dataset with the BI features included, leading to 48 different combinations. MM has no hyper-parameters and therefore does not need tuning. For MissForest, we train RF models with 250 trees, which offers higher imputation accuracy according to Reference [66]. However, we restrict the maximum depth of each tree and the maximum number of leaf nodes, because the trained model otherwise had memory storage issues (see Section 2.3.2). SoftImpute and PPCA require selecting the number of principal components to use as a hyper-parameter. The majority of papers in the literature fail to report how these methods were tuned, which led us to develop the following heuristic: We select as many components as are required to explain \(x\%\) of the data variance. The values of x are shown in Table 2 as the values of “variance-explained.”

The default hyper-parameters are used for DAE, with the exception of the dropout layer and the hidden layers’ dimensions. The range of the dropout layer is based on Reference [63], while the theta value is tuned within a neighborhood of the authors’ suggested default value. In the current implementation, we have three hidden layers for the encoder and the decoder. For each successive layer in the encoder, \(\theta\) hidden layer nodes are added, and the hyperbolic tangent is used as the activation function, as it produces better results for small and medium-sized datasets [21]. The model is trained using Stochastic Gradient Descent with an adaptive learning rate with a time decay factor of 0.99 and Nesterov’s accelerated gradient. The GAIN architecture consists of three hidden layers for the discriminator and the generator and uses the Rectified Linear Unit as the activation function. For GAIN, we tune two hyper-parameters: alpha and hint rate. These hyper-parameters are considered the most important for GAIN. Alpha balances the loss between the discriminator and the generator, while the hint rate is responsible for the training of the discriminator. Both DAE and GAIN are trained for the number of epochs suggested by their authors and use the sigmoid activation function for the output layer.
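The variance-explained heuristic used for SoftImpute and PPCA can be sketched as follows. The function name `n_components_for_variance` is hypothetical, and the per-component variances (for example, squared singular values of the standardized data) are assumed to be given; how they are estimated in the presence of missing values is not specified here.

```python
def n_components_for_variance(component_variances, target_pct):
    """Smallest number of leading components whose variance share reaches target_pct."""
    total = sum(component_variances)
    cumulative = 0.0
    ordered = sorted(component_variances, reverse=True)
    for k, variance in enumerate(ordered, start=1):
        cumulative += variance
        if 100.0 * cumulative / total >= target_pct:
            return k
    return len(ordered)


# Hypothetical spectrum: the first two components carry 80% of the variance.
print(n_components_for_variance([5.0, 3.0, 1.0, 1.0], 75))
```

Expressing the hyper-parameter as a percentage rather than a component count makes the same grid of values meaningful across datasets with very different numbers of features.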

4 SIMULATING MISSING DATA

To experiment with a range of missingness percentages, as well as different missingness mechanisms, we simulated the presence of missing values in the complete datasets presented in Appendix A.2.

4.1 Simulating Missing Completely at Random Data

Under MCAR, values are missing with a given probability (percentage), independently of any other factors such as the value itself or the values of other features. To simulate missing values at realistic missingness percentages, we sampled 64 real-world datasets from the OpenML repository with varying characteristics (see Section B.1). We then computed the 25%, 50%, and 75% quantiles of the missingness percentages. Features with less than 1% of missing values were excluded from the calculation, as their missing values probably stem from typos or other non-systematic sources. The quantile values turn out to be about 10%, 25%, and 50% of missingness. These values are then used as the missingness percentages in both the MCAR and MAR simulation experiments. We introduced missing values at the given percentages in the 10 complete datasets described in Section 3.1. Even though it is trivial to introduce MCAR missing values, for consistency, we employed the code available in Reference [48], which is also used for the MAR simulations below. To simulate the MCAR mechanism, the software discards values uniformly at random from the dataset at the specified missingness percentage.
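A minimal sketch of MCAR simulation is shown below. It is not the software of Reference [48]: the function name `mcar_mask_values` is hypothetical, missing values are encoded as `None`, and the sketch removes an exact fraction of all cells, whereas an implementation could equally drop each cell independently with the given probability.

```python
import random


def mcar_mask_values(rows, miss_fraction, seed=0):
    """Return a copy of rows in which a uniformly random fraction of cells is None."""
    n_rows, n_cols = len(rows), len(rows[0])
    n_missing = round(miss_fraction * n_rows * n_cols)
    rng = random.Random(seed)
    # Every cell is equally likely to be removed, regardless of its value.
    chosen = rng.sample(range(n_rows * n_cols), n_missing)
    out = [list(row) for row in rows]
    for cell in chosen:
        out[cell // n_cols][cell % n_cols] = None
    return out


complete = [[float(i * 5 + j) for j in range(5)] for i in range(20)]
for fraction in (0.10, 0.25, 0.50):  # the three missingness levels used in the text
    masked = mcar_mask_values(complete, fraction)
    print(fraction, sum(v is None for row in masked for v in row))
```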

4.2 Simulating Missing at Random Data

Under MAR, values are missing with a probability (percentage) that depends (is conditional) on other observed features, i.e., \(P(I_j = 1|F_{k_1}, \ldots , F_{k_m})\). To realistically simulate data under MAR, one needs to decide (a) the number of features upon which the probability depends, (b) the functional form of the conditional probability function, and (c) the set \(\lbrace F_{k_1}, \ldots , F_{k_m}\rbrace\). To answer (a), we needed a realistic estimate of the number m of features in the conditional probability. To that end, in the corpus of the 10 real-world binary datasets of Table 9, we randomly selected one feature with missing values as the target feature and then performed predictive modeling using JADBio including feature selection.6 These experimental results suggest that, on average, a feature with missing values is dependent on 12 features, so we set \(m=12\). Subsequently, for each \(F_j\), we randomly selected a set of m other features with uniform probability. Finally, for the functional form of P, we used a logistic regression model: \(P(I_j = 1|F_{k_1}=f_1, \ldots , F_{k_m}=f_m) = \frac{1}{1+e^{\langle -w, f\rangle }}\), where w is a vector of coefficients drawn at random from a Gaussian distribution, f is the vector of values of the features \(F_{k_l}\), and \(\langle \cdot , \cdot \rangle\) denotes the inner product. For the simulation, the software [48] was also used. The software allows the simulation of MAR missing data, as described above, with prespecified missingness percentages. The same percentages as in the MCAR case were used.
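The logistic MAR mechanism described above can be sketched as follows. This is an illustration, not the software of Reference [48]. The function name `mar_mask` is hypothetical, features are assumed to be numeric and standardized, and the intercept `b`, found by bisection so that the expected missingness matches the requested percentage, is an assumption about how a prespecified percentage can be reached; the text does not describe that step.

```python
import math
import random


def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z}), clamped to avoid overflow."""
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def mar_mask(rows, target, m, miss_pct, rng):
    """Return (mask, predictors): mask[i] is True when feature `target` is removed in row i."""
    others = [j for j in range(len(rows[0])) if j != target]
    predictors = rng.sample(others, m)              # m other features, chosen uniformly
    w = [rng.gauss(0.0, 1.0) for _ in predictors]   # random Gaussian coefficients
    logits = [sum(wi * row[j] for wi, j in zip(w, predictors)) for row in rows]
    # Assumed calibration step: bisection on an intercept b so that the
    # average missingness probability equals miss_pct.
    lo, hi = -50.0, 50.0
    for _ in range(60):
        b = (lo + hi) / 2.0
        rate = sum(sigmoid(z + b) for z in logits) / len(logits)
        if rate < miss_pct:
            lo = b
        else:
            hi = b
    b = (lo + hi) / 2.0
    mask = [rng.random() < sigmoid(z + b) for z in logits]
    return mask, predictors


rng = random.Random(0)
data = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(500)]
mask, predictors = mar_mask(data, target=0, m=3, miss_pct=0.25, rng=rng)
print(sum(mask) / len(mask), predictors)
```

Because the mask depends only on the predictor features and never on the value of the target feature itself, the simulated mechanism is MAR rather than MNAR.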

5 COMPARATIVE EVALUATION ON REAL-WORLD DATASETS WITH MISSING VALUES

The 25 real-world datasets with missing values were analyzed with JADBio, optimizing over configurations that include the imputation algorithms selected and their hyper-parameter values.

5.1 Binary Indicators Improve the Predictive Performance

First, we partition the results by imputation algorithm, as if optimizing over a single imputation algorithm at a time. Specifically, for each imputation algorithm, on a given dataset, the best AUC was selected over all configurations that include the specific algorithm. We will refer to this best AUC simply as the AUC of a given imputation algorithm in all subsequently reported results. Figure 3(a) shows the difference in AUC performance when Binary Indicators are used versus when they are excluded. As we can see, MM and GAIN have the largest average increases, by 0.0074 AUC and 0.0056 AUC, respectively. MF when extended with BI shows an average AUC increase of 0.0046, while PPCA shows an increase of 0.0038 AUC. The lowest average improvement is achieved by SOFT, which improves by 0.00022 AUC. Contrary to the above observations, DAE is the only method that does not benefit from the addition of BIs, with a negligible decrease of 0.0004 AUC when BIs are included. Figure 3(b) offers a complementary view. It illustrates the count of datasets per imputation method where the inclusion of BI is beneficial to the downstream performance. We observe that for every imputation method, including BI is beneficial in most instances. SOFT exhibits improvement in 19 of the 25 datasets, and GAIN, MM, and PPCA in 17 datasets. Finally, including BI leads to improvements for DAE and MF in 16 and 15 datasets, respectively.

Fig. 3. Panel (a) denotes the performance of each imputation method when including binary indicators minus the performance without binary indicators for each dataset. Panel (b) denotes the count of datasets per imputation method that indicators improve or deteriorate the performance.

To determine the statistical significance of the results, we performed a paired t-test for each algorithm, with the null hypothesis H0 being that BI+base performs no better than the base method. The resulting p-values were converted to q-values with the Benjamini/Hochberg [7] method to control for multiple testing. Table 3 shows the results. Using a q-value threshold of 0.25, there are four statistically significant results, leading us to accept the alternative hypotheses that MF, GAIN, PPCA, and MM improve their performance when BIs are present. At the level of \(q=0.25=\frac{1}{4}\), this implies that, in the worst case, we expect one of these four discoveries to be false. While the inclusion of BIs may, in the worst case, double the dimensionality of the dataset, based on the above results, we would recommend their inclusion when the above imputation algorithms are employed.
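For readers unfamiliar with the conversion of p-values to q-values, the following is a short sketch of the Benjamini/Hochberg step-up adjustment in plain Python. The function name `benjamini_hochberg` and the example p-values are made up for illustration; they are not the values of Table 3.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


example_p = [0.01, 0.04, 0.03, 0.5]  # illustrative values only
example_q = benjamini_hochberg(example_p)
discoveries = [q for q in example_q if q <= 0.25]
# Expected number of false discoveries is at most threshold * number of discoveries.
print(example_q, 0.25 * len(discoveries))
```

With four discoveries at a threshold of 0.25, the bound on the expected number of false discoveries is \(0.25 \times 4 = 1\), which is the arithmetic behind the statement in the paragraph above.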

5.2 BI+DAE Is the Best Imputation Method in Real-world Data

Figure 4 shows the average ranking achieved by each algorithm using the Autorank tool [26] (lower ranking is better). To avoid clutter, and based on the results of Section 5.1, we only show results when BIs are included. The horizontal black bars in the graph connect tools with non-statistically different ranks, according to a non-parametric Friedman test and post hoc Nemenyi test.

Fig. 4. The average rank of each imputation method when binary indicators are used. BI+DAE has the lowest average ranking. Rank differences are not statistically significant except for the average rank between BI+DAE and BI+PPCA.

Results show that BI+DAE is the highest ranking algorithm with an average rank of 2.84, followed by BI+MM with a 2.94 average ranking, BI+MF with 3.5, and BI+GAIN with 3.56, although their rank difference is not statistically significant at the 0.05 level. The two lowest-ranked methods are BI+SOFT and BI+PPCA, with 3.74 and 4.42 average rankings, respectively. BI+DAE’s rank is statistically significantly lower compared to BI+PPCA.
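The average ranks reported here can be computed as in the sketch below, which ranks the methods on each dataset (rank 1 for the highest AUC, tied methods sharing the average of their ranks) and then averages over datasets. The function name `average_ranks` and the toy scores are hypothetical, and the sketch does not reproduce the Friedman and Nemenyi tests performed by Autorank.

```python
def average_ranks(scores):
    """scores[method][d] is the AUC of a method on dataset d; lower average rank is better."""
    methods = list(scores)
    n_datasets = len(scores[methods[0]])
    totals = {method: 0.0 for method in methods}
    for d in range(n_datasets):
        values = {method: scores[method][d] for method in methods}
        for method in methods:
            better = sum(v > values[method] for v in values.values())
            tied = sum(v == values[method] for v in values.values())  # includes itself
            # Tied methods share the mean of the rank positions they occupy.
            totals[method] += better + (tied + 1) / 2
    return {method: totals[method] / n_datasets for method in methods}


toy = {"A": [0.9, 0.8], "B": [0.8, 0.8], "C": [0.7, 0.9]}
print(average_ranks(toy))
```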

5.3 BI+MM Is the Best Method When Considering the Efficiency–Effectiveness Tradeoff

We now study the tradeoff between performance effectiveness and computational efficiency of the algorithms. In Figure 5(a), we use BI+MM as the baseline. A point (execution run) corresponds to AutoML predictive modeling on a dataset with a given imputation algorithm. This results in \(5 \times 25 = 125\) points. The x-axis shows the effectiveness ratio, defined as the AUC corresponding to the point divided by the corresponding performance of BI+MM. Similarly, the y-axis shows the efficiency ratio, defined as the training time of the point divided by the corresponding time of BI+MM. Hence, points in the top-left/bottom-right quadrant correspond to runs where BI+MM dominates/is dominated by another algorithm on the same dataset in both time and AUC. Notice that the scale of the y-axis is logarithmic. Larger points correspond to the mean value of an imputation method over all datasets.

Fig. 5. Panel (a) denotes the efficiency-effectiveness tradeoff by using BI+MM as reference algorithm. Panel (b) illustrates the efficiency-effectiveness tradeoff when BI+DAE is used as reference. Each point in panels (a) and (b) represents a dataset. The x-axis shows the effectiveness, defined as the ratio of the AUC achieved by the marked imputation method divided by the corresponding performance of the reference method, for a dataset. Similarly, the y-axis shows the efficiency that is defined as the training time of the marked imputation method divided by the time of the reference method for a dataset.

In total, BI+MM is inferior in terms of predictive performance in 42 cases (16 of the 25 datasets) and, unsurprisingly, is never dominated in both AUC performance and efficiency at the same time. The computational time of the other algorithms is orders of magnitude higher than that of BI+MM. However, in 83 of the 125 points, BI+MM is both more efficient and more effective than the compared method. Only BI+DAE achieves, on average, a higher AUC than BI+MM. All the other imputation methods are, on average, slower to train and worse in terms of predictive performance.

Figure 5(b) shows the same results with BI+DAE as the baseline. In contrast to BI+MM above, BI+DAE dominates the other imputation methods on both predictive performance and training time in only 35 of the 125 combinations. In 41 of the 125 cases, it provides better predictive performance but at a higher computational cost. In 15 points, BI+DAE is faster but has lower predictive performance than the compared imputation method. Finally, in 34 cases, it is dominated in both metrics. In conclusion, if a single imputation algorithm is to be used, BI+MM arguably provides the best tradeoff between computational time and predictive performance.
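The ratios and the four outcomes discussed in this section can be computed as in the following sketch. The function name `tradeoff_counts`, the outcome labels, and the toy numbers are hypothetical. How exact ties (a ratio equal to 1) are assigned is an assumption of the sketch, since the text does not specify it.

```python
def tradeoff_counts(auc, seconds, reference):
    """Count, over all (method, dataset) points, how the reference compares in AUC and time."""
    counts = {
        "reference dominates": 0,                     # other is slower and not more accurate
        "reference dominated": 0,                     # other is faster and more accurate
        "reference faster, less accurate": 0,         # other is slower but more accurate
        "reference slower, at least as accurate": 0,  # other is faster but not more accurate
    }
    for method in auc:
        if method == reference:
            continue
        for d in range(len(auc[method])):
            effectiveness = auc[method][d] / auc[reference][d]      # x-axis ratio
            efficiency = seconds[method][d] / seconds[reference][d]  # y-axis ratio
            if effectiveness <= 1 and efficiency >= 1:
                counts["reference dominates"] += 1
            elif effectiveness > 1 and efficiency < 1:
                counts["reference dominated"] += 1
            elif effectiveness > 1:
                counts["reference faster, less accurate"] += 1
            else:
                counts["reference slower, at least as accurate"] += 1
    return counts


toy_auc = {"ref": [0.8, 0.8], "a": [0.7, 0.9], "b": [0.9, 0.8]}
toy_sec = {"ref": [1.0, 1.0], "a": [10.0, 10.0], "b": [0.5, 0.5]}
print(tradeoff_counts(toy_auc, toy_sec, "ref"))
```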

5.4 Best Imputation Subset for Maximizing AUC Performance Is {BI+MM, BI+DAE}

In this section, we examine the results from a different perspective, trying to answer the question: What is the minimal-size subset of algorithms to try to achieve close-to-maximum AUC performance? To answer this question, we have implemented a simple greedy algorithm, where we assume the analyst starts with the subset \(\lbrace\)BI+MM\(\rbrace\) as an efficient baseline and adds algorithms to consider. In each iteration, the algorithm whose addition leads to the largest AUC improvement of the subset is selected for inclusion. The maximum AUC performance is the maximum AUC achieved on each dataset when including all imputation methods in the optimization pipeline, averaged across all datasets.
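The greedy procedure described above can be sketched as follows. The function name `greedy_subset` and the toy AUC values are hypothetical; the sketch assumes that `auc[method][d]` holds the best AUC of each imputation method on dataset d, as defined in Section 5.1.

```python
def greedy_subset(auc, start):
    """Return [(method added, fraction of maximum average AUC reached)] in order of addition."""
    methods = list(auc)
    n_datasets = len(auc[start])

    def average_best(subset):
        # Per dataset, the subset achieves the best AUC among its members.
        return sum(max(auc[m][d] for m in subset) for d in range(n_datasets)) / n_datasets

    maximum = average_best(methods)  # all imputation methods included
    chosen = [start]
    trace = [(start, average_best(chosen) / maximum)]
    remaining = [m for m in methods if m != start]
    while remaining:
        # Add the method that raises the subset's average AUC the most.
        best = max(remaining, key=lambda m: average_best(chosen + [m]))
        chosen.append(best)
        remaining.remove(best)
        trace.append((best, average_best(chosen) / maximum))
    return trace


toy = {"MM": [0.80, 0.70, 0.90], "DAE": [0.85, 0.70, 0.85], "MF": [0.80, 0.72, 0.90]}
for name, fraction in greedy_subset(toy, "MM"):
    print(name, round(100 * fraction, 2))
```

The fraction reported at each step is the quantity plotted on the y-axis of the corresponding figure, and the procedure always ends at 100% once every method has been added.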

The results are shown in Figure 6 and quantitatively in Table 11. The x-axis shows the imputation algorithms in order of addition to the subset. For each algorithm, several hyper-parameter combinations are tried and combined with all other feature selection and modeling choices by AutoML. Hence the total number of configurations tried is multiplied by this factor. At each tick, the multiplication factor for the whole set is depicted in the parenthesis next to the name of the algorithm added to the set in that step. For example, BI+MM has no hyper-parameters (\(1\times\)), while BI+DAE has 9, so the multiplicative factor of the set \(\lbrace BI+MM, BI+DAE\rbrace\) is \(10\times\). The y-axis is the average (over all datasets) relative AUC achieved when performance is optimized over all algorithms and their hyper-parameters in the corresponding subset.

Table 8.
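| Dataset | ID | Samples | Features | #Numerical | #Categorical | Missing % | Minority Class % | #Feat miss>1% | %Missing/Feature | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| adult | 179 | 48,842 | 14 | 6 | 8 | 0.95 | 0.24 | 3 | 4.41 | Binary |
| albert | 41,147 | 425,240 | 78 | 78 | 0 | 13.64 | 0.50 | 43 | 24.73 | Binary |
| analcatdata_reviewer | 1,008 | 379 | 7 | 0 | 7 | 51.56 | 0.43 | 7 | 51.56 | Binary |
| anneal | 989 | 898 | 38 | 16 | 22 | 64.98 | 0.24 | 29 | 85.15 | Binary |
| aps_failure | 41,138 | 76,000 | 170 | 170 | 0 | 8.35 | 0.02 | 160 | 8.83 | Binary |
| ASP-POTASSCO-classification | 41,705 | 1,294 | 142 | 139 | 3 | 9.94 | 0.02 | 138 | 10.23 | MultiClass |
| ASP-POTASSCO-regression | 41,704 | 14,234 | 142 | 138 | 4 | 9.94 | 0.00 | 138 | 10.23 | Regression |
| audiology | 999 | 226 | 69 | 0 | 69 | 2.03 | 0.25 | 6 | 23.23 | Binary |
| autoHorse | 840 | 205 | 25 | 17 | 8 | 1.11 | 0.40 | 4 | 6.46 | Binary |
| braziltourism | 957 | 412 | 8 | 7 | 1 | 2.91 | 0.23 | 2 | 10.68 | Binary |
| bridges | 328 | 107 | 11 | 4 | 7 | 6.03 | 0.41 | 7 | 9.35 | Binary |
| Census-Income-KDD | 42,750 | 199,523 | 41 | 13 | 28 | 5.08 | 0.06 | 7 | 29.72 | Binary |
| cjs | 1,024 | 2,796 | 34 | 32 | 2 | 71.64 | 0.24 | 28 | 86.97 | Binary |
| Code_Smells_Data_Class | 43,079 | 86,467 | 66 | 66 | 0 | 49.99 | 0.00 | 62 | 53.20 | Regression |
| colic | 27 | 368 | 22 | 7 | 15 | 23.80 | 0.37 | 19 | 27.50 | Binary |
| colleges | 42,727 | 7,063 | 47 | 31 | 16 | 31.42 | 0.00 | 30 | 49.19 | Regression |
| colleges_aaup | 897 | 1,161 | 15 | 13 | 2 | 1.47 | 0.30 | 6 | 3.68 | Binary |
| colleges_usnews | 930 | 1,302 | 33 | 32 | 1 | 18.22 | 0.47 | 25 | 23.96 | Binary |
| cylinder-bands | 6,332 | 540 | 39 | 24 | 15 | 4.74 | 0.42 | 23 | 7.93 | Binary |
| Domainome | 41,533 | 1,623 | 9838 | 9838 | 0 | 82.17 | 0.35 | 9688 | 83.44 | Binary |
| dresses-sales | 23,381 | 500 | 12 | 1 | 11 | 13.92 | 0.42 | 5 | 33.04 | Binary |
| echoMonths | 222 | 130 | 9 | 7 | 2 | 8.29 | 0.00 | 6 | 12.31 | Regression |
| eucalyptus | 990 | 736 | 19 | 14 | 5 | 3.21 | 0.29 | 6 | 9.95 | Binary |
| fishcatch | 232 | 158 | 7 | 7 | 0 | 7.87 | 0.00 | 1 | 55.06 | Regression |
| fps-in-video-games | 42,737 | 425,833 | 44 | 33 | 11 | 6.94 | 0.00 | 12 | 25.44 | Regression |
| hepatitis | 55 | 155 | 19 | 6 | 13 | 5.67 | 0.21 | 11 | 9.56 | Binary |
| house_prices_nominal | 42,563 | 1,460 | 79 | 36 | 43 | 6.04 | 0.00 | 16 | 29.74 | Regression |
| hungarian | 231 | 294 | 13 | 12 | 1 | 20.46 | 0.36 | 5 | 52.93 | Binary |
| ipums_la_97-small | 993 | 7,019 | 60 | 34 | 26 | 11.42 | 0.04 | 18 | 38.06 | MultiClass |
| ipums_la_98-small | 381 | 7,485 | 60 | 34 | 26 | 11.59 | 0.01 | 17 | 40.91 | MultiClass |
| ipums_la_99-small | 378 | 8,844 | 60 | 34 | 26 | 9.71 | 0.02 | 18 | 32.36 | MultiClass |
| jungle_chess_2pcs_endgame_rat_panther | 41,002 | 5,880 | 46 | 18 | 28 | 1.30 | 0.23 | 6 | 10.00 | MultiClass |
| KDD98 | 42,343 | 82,318 | 477 | 358 | 119 | 11.30 | 0.12 | 87 | 61.98 | Binary |
| KDDCup09-Upselling | 1,112 | 50,000 | 15000 | 13,391 | 1609 | 3.35 | 0.07 | 608 | 82.59 | Binary |
| KDDCup09_churn | 42,759 | 50,000 | 230 | 192 | 38 | 69.78 | 0.07 | 205 | 78.28 | Binary |
| kdd_coil_1 | 567 | 316 | 11 | 8 | 3 | 1.61 | 0.00 | 3 | 4.85 | Regression |
| kdd_el_nino-small | 839 | 782 | 8 | 8 | 0 | 7.45 | 0.35 | 4 | 14.90 | Binary |
| kick | 41,162 | 72,983 | 32 | 17 | 15 | 6.39 | 0.12 | 5 | 40.51 | Binary |
| lymphoma_2classes | 1,101 | 45 | 4026 | 4,026 | 0 | 3.28 | 0.49 | 2116 | 6.25 | Binary |
| meta | 566 | 528 | 21 | 18 | 3 | 4.55 | 0.00 | 3 | 31.82 | Regression |
| MiceProtein | 40,966 | 1,080 | 81 | 77 | 4 | 1.60 | 0.10 | 8 | 14.69 | MultiClass |
| Midwest_Survey_nominal | 42,532 | 2,778 | 27 | 1 | 26 | 1.95 | 0.03 | 5 | 10.51 | MultiClass |
| mlr_ranger_rng | 42,458 | 278,863 | 14 | 8 | 6 | 3.56 | 0.00 | 1 | 49.69 | Regression |
| mlr_svm_rng | 42,456 | 540,576 | 13 | 7 | 6 | 9.38 | 0.00 | 2 | 60.95 | Regression |
| Moneyball | 41,021 | 1,232 | 14 | 11 | 3 | 20.87 | 0.00 | 4 | 73.05 | Regression |
| mushroom | 24 | 8,124 | 22 | 0 | 22 | 1.39 | 0.48 | 1 | 30.53 | Binary |
| NewFuelCar | 41,506 | 36,203 | 17 | 17 | 0 | 1.46 | 0.00 | 1 | 24.78 | Regression |
| okcupid-stem | 42,734 | 50,789 | 19 | 3 | 16 | 15.97 | 0.10 | 12 | 25.28 | MultiClass |
| pbc | 524 | 418 | 18 | 17 | 1 | 16.47 | 0.00 | 12 | 24.66 | Regression |
| pbcseq | 802 | 1,945 | 17 | 13 | 4 | 3.43 | 0.50 | 6 | 9.71 | Binary |
| porto-seguro | 42,206 | 595,212 | 37 | 25 | 12 | 3.84 | 0.04 | 5 | 28.21 | Binary |
| primary-tumor | 1,003 | 339 | 17 | 0 | 17 | 3.90 | 0.25 | 2 | 32.74 | Binary |
| profb | 470 | 672 | 9 | 5 | 4 | 19.84 | 0.33 | 2 | 89.29 | Binary |
| rl | 41,160 | 31,406 | 22 | 22 | 0 | 10.45 | 0.10 | 8 | 28.71 | Binary |
| road-safety | 42,803 | 363,243 | 66 | 61 | 5 | 9.10 | 0.05 | 41 | 14.62 | MultiClass |
| SAT11-HAND-runtime-regression | 41,980 | 4,440 | 116 | 113 | 3 | 5.27 | 0.00 | 10 | 61.15 | Regression |
| schizo | 466 | 340 | 14 | 12 | 2 | 17.52 | 0.48 | 11 | 22.30 | Binary |
| sick | 38 | 3,772 | 29 | 7 | 22 | 5.54 | 0.06 | 7 | 22.96 | Binary |
| soybean | 1,023 | 683 | 35 | 0 | 35 | 9.78 | 0.13 | 32 | 10.68 | Binary |
| speeddating | 40,536 | 8,378 | 122 | 61 | 61 | 2.87 | 0.00 | 109 | 3.17 | Binary |
| stress | 42,167 | 199 | 12 | 8 | 4 | 8.29 | 0.20 | 7 | 14.22 | Binary |
| us_crime | 315 | 1,994 | 127 | 126 | 1 | 15.48 | 0.00 | 24 | 81.91 | Regression |
| vote | 56 | 435 | 16 | 0 | 16 | 5.63 | 0.39 | 16 | 5.63 | Binary |
| water-treatment | 940 | 527 | 36 | 36 | 0 | 2.86 | 0.15 | 22 | 4.53 | Binary |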
  • The table contains the dataset name, id, number of samples, number of features, number of categorical and numeric features, Missingness percentage in the whole dataset, Minority Class %, the number of features with missing values over 1%, the missingness percentage over features with missing values, and, finally, the outcome type of each dataset.

Table 8. Datasets Used for Missing Value Simulation Experimental Setup

Table 9.
| Dataset | Missing Feature (target) | #Selected Features |
|---|---|---|
| aps_failure | cn_006 | 25 |
| colleges_aaup | Average_salary-full_professors | 5 |
| colleges_usnews | Out-of-state_tuition | 18 |
| dresses-sales | V3 | 7 |
| eucalyptus | PMCno | 6 |
| hepatitis | ALBUMIN | 6 |
| hungarian | thalach | 3 |
| mushroom | stalk-root | 16 |
| pbcseq | presence_of_asictes | 7 |
| speeddating | attractive | 24 |
  • On average, a missing feature depends on 12 other features.

Table 9. Summary of the Feature Selection Experiments for MAR Simulation

Table 10. Number of Wins, Average Ranking, Average Metric Score, Average Difference from Best per Dataset and Average Difference from a Baseline Method (MM)

Table 11. Table Denotes the Methods That Are Added to the Imputation Set, the % of the Maximum Score for the Specified Metric Reached by Each Set, and the Configuration Complexity Increase for Each Set

Fig. 6. The percentage of maximum AUC achieved by each subset of imputation methods. Each tick in the x-axis shows the algorithm to add to the previous subset. The multiplier next to the name of an algorithm shows the factor by which the total number of configurations tried is multiplied, due to the different combinations of hyper-parameter values of the imputation algorithms. BI+DAE and BI+MM allow us to recover 99.69% of the maximum AUC while increasing the configuration space by 10 times.

BI+MM, by itself, accounts for 98.69% of the maximum AUC. When BI+DAE is added to the mix, relative performance reaches 99.69%. The next best algorithm to add is BI+MF; 100% of the maximum AUC is reached when invoking all imputation algorithms. In summary, the addition of BI+MF, BI+PPCA, BI+GAIN, and BI+SOFT provides only marginal gains over the set \(\lbrace\)BI+MM, BI+DAE\(\rbrace\).

5.5 The Interplay between Feature Selection and Imputation

Feature selection algorithms try to reduce the number of features that enter the model without sacrificing predictive performance. Feature selection is often the primary task in analysis, while the predictive model may be just a side-benefit. For example, a medical doctor may be more interested in the quantities that determine the risk of disease and may reveal new medical knowledge, rather than the risk model itself. Feature selection leads to more interpretable models that provide intuition into the domain. In fact, the solution to the feature selection problem is directly linked to the causal model that underlies data generation [72]. In other circumstances, it is important to reduce the cost of measuring the features to provide predictions. The cost may be measured in monetary units, the computational cost to compute the features or risk to a patient from medical procedures that measure these features.

Figure 7 shows the impact of feature selection for each imputation algorithm on the real-world datasets. It shows the drop in AUC performance when feature selection is enforced in the final configuration vs. not enforced (i.e., optimizing over all configurations). For each algorithm, the drop is about two to three AUC points. In other extensive experiments with hundreds of complete (no missing values), small-sample, high-dimensional omics datasets, JADBio has been shown to reduce the number of features by a factor of 4,000 without a noticeable drop in AUC performance [73]. The results provide evidence that feature selection may be more challenging in the presence of missing values.

Fig. 7. The loss in predictive performance when enforcing feature selection in the pipeline for the real-world data. For all imputation methods, enforcing feature selection leads to a drop in predictive performance (red lines) by less than three AUC points (on average). While feature selection can reduce the required features to measure for an acceptable loss of performance in some applications, it is invalidated by the imputation models that need to measure all features.

In any case, the problem of including both imputation and a feature selection step in the ML pipeline is that imputation invalidates feature selection, in some sense. Let us explain this statement with an example. Assume that the pipeline that produces the final model consists of MF imputation, Lasso feature selection, and RF predictive modeling, and that Lasso selects the features \(\lbrace A, B, C\rbrace\). If any of these values (say the value of A) is missing on a new sample, then the MF imputation model will impute it using its Random Forest for A, which takes some other subset of features as input. If any of those are also missing, then MF will invoke its Random Forests for each value that is missing, and so on, recursively. Hence, if there are missing values on the test samples, one may need to measure an arbitrarily large feature subset, not just the selected features. The storage required to apply the ML pipeline includes both the RF and the MF model, which in turn includes a Random Forest for every feature that may need imputation.
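The recursive argument above can be illustrated with a minimal sketch. It assumes a hypothetical map `imputer_inputs` from each feature to the features that its imputation model reads; in a real MissForest model these dependencies are determined by the fitted forests, and the function name `features_to_measure` is ours, not part of any library. The sketch computes the worst-case set of features that may have to be measured when the selected features can be missing.

```python
from collections import deque


def features_to_measure(selected, imputer_inputs):
    """Worst-case set of features that may need to be measured.

    selected: features chosen by the feature selection step.
    imputer_inputs: hypothetical map from a feature to the features its
        imputation model reads (an illustrative assumption).
    """
    needed = set(selected)
    queue = deque(selected)
    while queue:
        feature = queue.popleft()
        # If `feature` is missing, its imputer needs these inputs, which may
        # themselves be missing and require their own imputers.
        for dep in imputer_inputs.get(feature, ()):
            if dep not in needed:
                needed.add(dep)
                queue.append(dep)
    return needed


# Hypothetical dependency structure, for illustration only.
imputer_inputs = {
    "A": ["D", "E"],
    "B": ["A"],
    "C": ["F"],
    "D": ["G"],
    "E": [],
    "F": [],
    "G": [],
}
selected = ["A", "B", "C"]
closure = features_to_measure(selected, imputer_inputs)
print(sorted(closure))
```

In this toy example, three selected features expand to seven features that may need measuring, which is the sense in which imputation can undo the savings of feature selection.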

6 COMPARATIVE EVALUATION ON DATASETS WITH SIMULATED MISSING VALUES

This section focuses on comparing imputation methods in datasets with generated missing values. To that end, we compare the predictive performance of each imputation method under various missingness mechanisms and percentages. Additionally, we study the effect of feature selection when the missingness increases. The figures in this section illustrate the more general MAR case. The MCAR results are included in Appendix C.2 and are qualitatively similar. Finally, results regarding imputation accuracy can be found in Appendix C.7.

6.1 BI+MF Is the Best Imputation Method in MCAR and MAR Simulated Missing Data

Figure 8 presents the AUC performance results (see Figure 16(b) for MCAR results). The plotted value is the difference between the AUC at the specified missingness percentage and the AUC on the complete dataset. As expected, the average predictive performance of every imputation method decreases as the missingness percentage increases. In particular, increasing the missingness from 25% to 50% leads to a sizable performance drop for all imputation methods. The methods based on linear dimensionality reduction, namely BI+PPCA and BI+SOFT, are the most affected by this increase in missing values.

Fig. 8. MAR data: AUC difference of each imputation method from the complete dataset. BI+MM is the best at 10% missingness. BI+MF and BI+MM are the top methods, tied at 25% missingness. BI+MF has the lowest avg. loss at 50%.

Fig. 9. The percentage of maximum AUC achieved by each subset of imputation methods. Each tick in the x-axis shows the algorithm to add to the previous subset. The multiplier next to the name of an algorithm shows the factor by which the total number of configurations tried is multiplied, due to the different combinations of hyper-parameter values of the imputation algorithms. The set containing BI+MF and BI+MM recovers 99.43% of the maximum AUC for MAR data. The complexity of the configuration space increases by 3 times when including both MM and MF.

Fig. 10. The relative efficiency for each imputation method against BI+MM versus the relative effectiveness in terms of AUC. Larger points indicate the mean values for a given algorithm. One hundred four of 147 pairs are won by BI+MM in both efficiency and effectiveness. Only MF dominates MM on average in terms of effectiveness. However, MF is 23,000\(\times\) slower than MM.

Fig. 11. Panels (a) and (c) illustrate improvement of BI extended imputation methods over the base method. Panels (b) and (d) show the count of datasets where BI extended methods are scoring higher/lower than base methods.

Fig. 12. Panels (a) and (b) show the ranking of imputation methods for F1-score and accuracy score.

Fig. 13. Panels (a) and (b) illustrate the maximum performance achieved by each imputation subset. For both F1 and accuracy, BI+MM and BI+DAE set scores over 99% of the maximum performance.

Fig. 14. Panels (a) and (b) denote the tradeoff in terms of relative effectiveness and relative efficiency between BI+MM and the other imputation methods for F1 and accuracy metrics, respectively.

Fig. 15. Panels (a) and (b) denote the loss in predictive performance when enforcing feature selection in the pipeline for the real-world data. For all imputation methods, enforcing feature selection leads to a drop in predictive performance by less than 5% accuracy (on average).

Fig. 16. Panels (a), (c), and (e) denote the tradeoff in terms of effectiveness and efficiency between BI+MM and the other imputation methods. Panels (b), (d), and (f) show the difference from complete data for each imputation method at various missingness levels. BI+MF is the best method for MCAR data. However, BI+MM exhibits good performance at a fraction of the cost.

Figures 8 and 16(b) illustrate that in both MAR and MCAR data, respectively, MissForest combined with Binary Indicators is, on average, the best-performing method. Additionally, we note that PPCA and SOFT are the two worst imputation methods, especially as the missingness percentage increases. Table 12(a) in Appendix C.2 contains the quantitative results in detail and a detailed discussion on the ranking of the algorithms.

Table 12. The Average Loss from the Complete Datasets by Each Imputation Method when Missing Data Are MCAR (Left) and MAR (Right)

6.2 The Best Imputation Subset for Maximizing AUC Performance Is {BI+MM, BI+MF}

We now identify the minimal-size algorithm subset with close-to-optimal performance for simulated missing data. We again use the simple greedy algorithm introduced in Section 5.4 and apply it to the MCAR and MAR simulated data results. The results for MAR are in Figure 9, which is similar to Figure 6. The quantitative results are shown in Table 13(b). As shown in the figure, the \(\lbrace BI+MM, BI+MF\rbrace\) subset can score over 99% of the total max AUC for MAR data and would be the suggested set of algorithms to run in such problems. The results for MCAR are in Appendix C.2.6 and are qualitatively similar. The results for simulated missing values differ somewhat from those on the real datasets: BI+MF scores better than BI+DAE, which now falls to third place. Possible reasons are discussed in Section 9.
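A greedy subset search of this kind can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Section 5.4 itself: we assume that the score of a subset is the mean over datasets of the best AUC achieved by any method in the subset, and that the "percentage of maximum AUC" is this score divided by the score of the full set. The AUC values and the function names are hypothetical.

```python
def best_auc_mean(auc, subset):
    """Mean over datasets of the best AUC reachable with methods in `subset`."""
    n_datasets = len(next(iter(auc.values())))
    total = 0.0
    for i in range(n_datasets):
        total += max(auc[m][i] for m in subset)
    return total / n_datasets


def greedy_subsets(auc):
    """Greedy forward selection; returns [(method_added, pct_of_max), ...]."""
    remaining = set(auc)
    chosen = []
    ceiling = best_auc_mean(auc, remaining)  # score when all methods are used
    steps = []
    while remaining:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining),
                   key=lambda m: best_auc_mean(auc, chosen + [m]))
        chosen.append(best)
        remaining.remove(best)
        steps.append((best, 100.0 * best_auc_mean(auc, chosen) / ceiling))
    return steps


# Hypothetical per-dataset AUCs (three datasets), for illustration only.
auc = {
    "BI+MM": [0.80, 0.90, 0.70],
    "BI+MF": [0.82, 0.88, 0.75],
    "BI+DAE": [0.79, 0.91, 0.71],
}
steps = greedy_subsets(auc)
for method, pct in steps:
    print(f"add {method}: {pct:.2f}% of maximum AUC")
```

Because each step only adds a method, the percentage is non-decreasing and reaches 100% once every method has been added, which matches the shape of the curves in Figures 6 and 9.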

Table 13. The Methods That Are Added Sequentially to the Imputation Set, the % of Maximum Score for the Specified Metric Reached by Each Set, and the Configuration Complexity Increase for Each Set

6.3 BI+MM Provides the Best Tradeoff between Effectiveness and Efficiency

Figures 10 and 16(a) show the effectiveness vs. efficiency tradeoff of the algorithms. These figures are similar to Figure 5(a) above for the real datasets. We repeat the explanation of the figure: The reference (baseline) algorithm is BI+MM. The x-axis shows the effectiveness ratio, defined as the AUC corresponding to the point divided by the corresponding performance of BI+MM. Similarly, the y-axis shows the efficiency ratio, defined as the training time of the point divided by the corresponding time of BI+MM. Hence, points in the first/fourth quadrant (top-left/bottom-right) correspond to runs where BI+MM dominates/is-dominated-by other algorithms on the same datasets in both time and AUC. Notice that the scale of the y-axis is logarithmic. Larger points correspond to the mean value of an imputation method over all datasets. There are five imputation methods to compare against MM for 10 datasets over 3 percentages of missing values, which would result in 150 points. However, MissForest did not run on three of the datasets (image dataset variations) due to their dimensionality; see Section 2.3.2 for details. The resulting plot therefore consists of 147 points.
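The two ratios and the dominance regions described above can be written down directly. The following sketch uses hypothetical AUC and training-time values; the function name `classify` and the treatment of exact ties as a tradeoff are our assumptions. It also reproduces the point count given in the text.

```python
def classify(auc, train_time, auc_mm, train_time_mm):
    """Place one run relative to the BI+MM baseline on the same dataset."""
    effectiveness = auc / auc_mm            # x-axis: > 1 means better AUC than BI+MM
    efficiency = train_time / train_time_mm  # y-axis: > 1 means slower than BI+MM
    if effectiveness < 1 and efficiency > 1:
        verdict = "BI+MM dominates"          # top-left region
    elif effectiveness > 1 and efficiency < 1:
        verdict = "BI+MM is dominated"       # bottom-right region
    else:
        verdict = "tradeoff"                 # better in one metric, worse in the other
    return effectiveness, efficiency, verdict


# Hypothetical runs: (AUC, training time in seconds) vs. the BI+MM baseline.
print(classify(0.78, 120.0, auc_mm=0.80, train_time_mm=2.0))
print(classify(0.85, 4600.0, auc_mm=0.80, train_time_mm=0.2))

# Point count from the text: 5 methods x 10 datasets x 3 missingness levels,
# minus the 3 runs where MissForest could not be executed.
n_points = 5 * 10 * 3 - 3
print(n_points)
```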

For MAR data, BI+MM is never dominated in both metrics, as it is by far the most efficient method. In 104 of 147 cases, it dominates the opposing imputation method in terms of both effectiveness and efficiency. In the remaining 43 cases, it is dominated in effectiveness only. Only BI+MF has on average better predictive performance than BI+MM; however, it is 23,000 times slower to train on average. All the other imputation methods are worse than BI+MM on average while also taking more time to train. The results for MCAR data are qualitatively similar (see Figure 16(a) and discussion in Appendix C.2).

In total, BI+MM is again found to provide arguably the best tradeoff between efficiency and effectiveness. The results on the simulated data further validate the results on the real-world data, verifying that BI+MM is indeed a decent imputation method all around. BI+MM, on average, is on par with more sophisticated methods such as BI+MF, BI+GAIN, and BI+DAE, while being thousands of times faster to train.

7 META-LEVEL ANALYSIS OF REAL-WORLD RESULTS

In Machine Learning, it is invariably the case that no single algorithm is best for all datasets; there is no one-size-fits-all algorithm. Hence, one needs to optimize over several choices for the dataset at hand. This school of thought is what gave rise to AutoML systems. The field of Meta-Level Learning [18] studies how to predict the most promising algorithm or algorithms to run on a given dataset based on its characteristics. These characteristics are called meta-level features or meta-features of the dataset and include the sample size, the number of features, the type of features, the percentage of missing values, and others [60].

In this section, we try to identify meta-features that correlate with the performance of the imputation algorithms. Such correlations could help predict which algorithms to run on a given dataset. They could also shed light on the dataset properties that enable an algorithm to perform better and lead to the design of better algorithms. Hence, we defined and computed the meta-features in Table 4. The selected meta-features can be split into three categories: (1) General meta-features, which report general characteristics of the dataset such as the number of samples or features. (2) Missing value-related meta-features, which provide insight into the dataset’s missing patterns, such as the missing value percentage of features. (3) Cluster-based meta-features. One such metric is the silhouette coefficient, computed with the k-means algorithm with \(k=2, 3, 4\), as was proposed in Reference [1]. It shows the tendency of the data to cluster. Another such metric is the number of PCA components that explain \(x\%\) of the variance in the data. It shows whether the data are limited to a lower-dimensional subspace and the extent of cross-correlations between features. General meta-features were extracted using the pymfe package [5]. We implemented the missing and clustering-based meta-feature extraction using sklearn [53] and numpy [23]. To apply clustering or PCA, the data are first imputed with MM.

We then correlated (Spearman correlation) these meta-features with the AUC performance of an algorithm relative to the performance of BI+MM as the baseline. A positive (negative) correlation indicates that when the meta-feature increases, the performance of the algorithm increases (decreases), relative to BI+MM. There are five algorithms (excluding BI+MM, which is used as the baseline) and 16 meta-features, leading to 80 correlations over datasets. Only one correlation was found to be significant at significance level 0.1 (p-value = 0.059). Specifically, BI+PPCA relative AUC performance is positively correlated (correlation = 0.383) with the number of categorical variables in a dataset. This means that as the number of categorical variables increases, we expect BI+PPCA to perform better relative to BI+MM. However, when correcting the p-values for multiple testing using the FDR control technique of Benjamini-Hochberg [7], we see that the q-value is 0.991, which means that detecting one such correlation is expected even if all meta-features are uncorrelated with the relative performance. BI+PPCA does not handle categorical features natively, which further makes us believe that the result is probably a false positive. In short, no statistically significant correlations could be found using meta-learning analysis. Further experiments with more datasets and meta-features need to be conducted.
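The two statistical ingredients of this analysis, the Spearman correlation and the Benjamini-Hochberg correction, can be sketched in plain Python. This is an illustrative implementation with hypothetical inputs, not the code used in the study; it omits the computation of the p-value of each correlation, for which a statistics library would normally be used.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical meta-feature values and relative AUCs for four datasets.
rho = spearman([1, 2, 2, 3], [0.98, 0.99, 1.01, 1.02])
# Hypothetical p-values from four tests.
qvalues = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(rho, qvalues)
```

With 80 tests, as in the analysis above, the smallest p-value is multiplied by up to 80 before the monotonicity step, which is why a single p-value of 0.059 does not survive the correction.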

8 RELATED WORK

In this section, we discuss related work on missing values imputation and position our contributions. We focus on empirical studies that compare different imputation methods based on the performance of the predictive models built on imputed datasets rather than the original values of complete datasets [13, 29, 56, 80].

Current literature can be split into two categories: AutoML and ad hoc ML modeling. The first category extends a specific AutoML tool by adding imputation methods, while the latter creates a predictive modeling pipeline that may contain a subset of a modern AutoML tool’s pipeline, such as hyper-parameter optimization, model selection, and pre-processing. AutoML in general is able to optimize the performance over various stages in a pipeline. As we optimize the whole pipeline, we expect the effect of each stage to become less significant, as other stages may compensate. AutoML tools allow us to get more insights on which features are more important for the task (feature selection), optimize the hyper-parameters for each stage of the pipeline (hyper-parameter tuning), and select the best predictive model for each imputation method (model selection). Consequently, imputation methods can be evaluated fairly under this optimization framework.

As shown in Table 5, the majority of related work uses datasets with either native or simulated missing values. The literature mainly focuses on the binary classification task (included in all previous works). Of the eight previous works, two include binary classification and regression tasks [8, 55], and one includes binary and multi-class data [19]. Reference [30] is the only study that includes all three types of outcomes. The most prominent missingness mechanism is MCAR, found in all works that simulate missing values. Reference [30] is the only work that includes deep learning–based imputation methods. Binary Indicators are very prominent in AutoML tools; however, only Reference [55] has studied their effect when extending imputation methods. Finally, Reference [49] is the only work that included ensemble models for the prediction phase, while Reference [19] is the only work that includes feature selection as part of the pipeline. As shown in Table 5, none of the related work has included every step mentioned in the table’s columns.

Summarizing the related work section, the majority of the literature uses datasets with native missing values or generates them through a simulation based on various missingness mechanisms and missingness proportions. However, none of the mentioned studies benchmarks imputation methods on both native and generated missing value datasets. The studies on real-world datasets in general conclude that simple imputation methods such as MM are on par with other more complex methods. Research on datasets with simulated missing values concludes that more complex methods can indeed improve predictive performance on average. However, there is no universal best method proposed by any of the aforementioned benchmarks. The literature mainly focuses on the binary classification task. Accuracy and F1-score are the most prominent metrics in the literature. In the majority of the studies, a hold-out split is used for the evaluation. Some studies use repeated splits or cross-validation to handle randomness. Specific predictive models, for instance Gradient Boosted Trees, could benefit from native handling of missing values compared to simple imputation. However, not all classifiers support missing value handling, making imputation still an essential part of the pre-processing step of ML pipelines. In general, hyper-parameter tuning, model selection, and feature selection are given less importance in previous literature. Most works skip one or more of these steps or fail to mention information about the specific stage. For example, only one predictive model is tuned, or imputation methods are used with default parameters specified by the authors of the methods or the package implementations.

The research closest to ours is Reference [49]. In that paper, Autosklearn was extended to include the data cleaning process; the resulting tool was named AutoClean. Part of the extension was imputation. The study compares mean, median, mode, KNN [71], and Iterative imputation [77] for continuous features. For categorical features, constant, KNNi, and mode imputation were selected. The study used five binary classification datasets with 891 to 10,500 observations and 9 to 39 features that include missing values at low percentages. AutoClean optimized the pipeline by Bayesian hyper-parameter optimization. In Autosklearn, the predictive model is an ensemble of methods. They evaluated the performance using fivefold cross-validation and the balanced accuracy metric. The study concluded that KNNi is a valuable addition to the simpler imputation methods. However, in most cases, simple imputation methods are selected more frequently than KNNi for both continuous and categorical data. Contrary to the aforementioned work, we included feature selection in our experimental setup. Also, we conducted comparisons on datasets with both native and simulated missing values. In general, our evaluation was conducted on more datasets, with a wider range in terms of samples and features. Finally, we included neural network imputation methods and extended imputation with binary indicators.

In Reference [19], the TPOT AutoML tool was extended with imputation methods, specifically mean, median, mode, max, MICE [77], and EM [27]. The median and mode were found to be the best imputation methods based on a restricted simulation study on 23 datasets at 7% MCAR missingness. The data were split multiple times (20) to account for randomness. At each split, 25% of the data were used as a hold-out set. Compared to the mentioned work, we simulated missing values with other mechanisms and missing proportions and also used datasets with native missing values in our experiments. Also, we included recent state-of-the-art methods based on NNs such as DAE and GAIN. We also implemented and measured the effect of binary indicators when coupled with MM and complex methods.

Similarly, missing data imputation has also been researched as part of data cleaning systems. Reference [38] compared deletion, mean, median, mode, new-category, and HoloClean [59] on six datasets with native missing values. For the predictive task, seven models were considered: Logistic Regression, KNN, Decision Tree, Random Forest, AdaBoost, Naive Bayes, and XGBoost. For the evaluation, the data were split into a 70% train and 30% test set, repeated 20 times to account for randomness. They used accuracy and F1-score for evaluation according to the dataset’s imbalance. They concluded that simple imputation methods yield competitive performance to more complex methods such as HoloClean. Contrary to the aforementioned work, we included neural networks, the binary indicator method, and feature selection in the experimental setup. We also conducted experiments on more datasets, covering both generated and real-world missing values.

The benchmark study [81] was conducted on 13 real-world datasets from OpenML and concluded that mean/mode is comparable to more complex imputation methods such as random, SOFT [44], MF [66], KNNi [71], Hot-Deck [33], and MICE [77]. In the study, 20% of the data was kept as a hold-out and the reported measures were AUC and F1-score. Specifically, when measuring the F1-score, MM had the highest average ranking. In contrast, for the AUC score, KNNi was found to be the best-performing method. However, both KNNi and MM are among the three best methods in both metrics. Hyper-parameter tuning was not considered in that article; the imputation and predictive methods used default parameters. In our work, we tune all the steps of the pipeline for a fair comparison. We also included deep learning methods that are the current state-of-the-art for imputation as well as binary indicators.

The largest benchmark study was conducted in Reference [30] on 69 real-world datasets with simulated missing values. The missing mechanisms were MCAR, MAR, and MNAR. The generation of missing values was set at 1%, 10%, 30%, and 50% missingness. They compared MM, KNNi [71], MF [66], custom DL-based imputation inspired by Reference [9], GAIN [83], and variational autoencoders [32] for the imputation problem. Cross-validation scores are reported, with the data split into five folds for all but the deep learning methods, for which three folds were used due to training costs. For regression data, RMSE is the reported metric. For classification data, the F1-score was reported. They concluded that MissForest is the best imputation method. However, they used a single classifier for the prediction phase. We argue that different imputation methods work better with different classifiers, which should be tuned as well. For example, on the cylinder-band dataset, the GAIN imputation method performs best with Ridge Logistic Regression, whereas the DAE imputation method performs best with the RandomForest classifier on the same dataset. Additionally, missing values for the downstream task were generated for a randomly sampled feature in the dataset. In contrast, we simulated missing values uniformly across all features, which is a harder problem to solve for the imputation methods as less observed data exist. In the mentioned work, GAIN had a convergence problem in 33% of the cases, resulting in the worst ranking among the mentioned methods. In our work, GAIN does indeed converge due to different hyperparameter tuning. Finally, we include Binary Indicators as well as the DAE imputation method, which is the best method on average in real-world data with missing values. Simulated missingness results in our work are consistent with the results of the aforementioned work, as MissForest is the best method in both works. However, deep learning methods in our work are among the best methods and not the worst as in the literature mentioned.

Another study [57] compared the predictive performance on two datasets with imputed and incomplete data. Missing values were simulated on categorical features in the training data. They generated MCAR and MNAR missing values from 10% to 40% missingness in categorical features. One-third of the data were kept as a hold-out test set. The accuracy score on the test set is reported. For the imputation of categorical features, they used six imputation models: mode, random, k-NN [71], iterative imputation based on logistic regression, random forest [66], and SVM. For the classification, they used three predictive models: ANNs, decision trees, and random forests. The authors optimized the hyper-parameters for the ANNs only. The imputation models and the other classifiers were not tuned. They did not find any imputation method or classifier to be better than the others; the best choice depends heavily on the nature and proportion of the missing data. However, results indicated that imputation is better than simply creating a new category in the data. In our work, for fair evaluation, we tune both imputation and predictive models. We include binary indicator methods and neural network imputation. Finally, we simulated missingness on both numeric and categorical features.

In Reference [55], the authors compared mean, median, KNNi [71], Iterative Imputer, Iterative Imputer /w Bagging (Multiple Imputation) [77], MIA (the native handling of missing values by Gradient Boosted Trees), and MIA /w bag. All of the previous methods, except MIA, were also extended with binary indicators. The study was conducted on 13 real-world datasets from four databases with native missing values. Nested cross-validation with five outer folds is used for estimating the accuracy score of the downstream task. The predictive models were set to Gradient-boosted trees and linear models. They concluded that MIA is a better alternative to imputation. Also, the indicator method helps improve the performance of the predictive task, which is on par with the results of our work. They conclude that simple imputation using mean or median is on par with KNNi and iterative imputation with linear models. In our work, we included deep learning imputation in our set of imputation methods. Also, we tuned the imputation methods to fairly evaluate the performance of each method, as tuning is important in the performance of some imputation methods.

Last, Reference [8] introduced a group of three methods named OptImpute, focusing on optimizing KNNi and iterative imputation based on SVMs and decision trees. They compared the group against five other imputation methods: mean/mode, K-nearest neighbors [71], iterative known [84], Bayesian PCA [50], and predictive-mean matching [77]. They compared the introduced methods across 84 datasets with simulated missing values, measuring imputation accuracy. They additionally measured the group’s effect on the performance of downstream learning algorithms on 10 datasets. The missing values are generated by the MCAR mechanism with a range from 10% to 50%. The predictive models used for regression tasks are LASSO and SVR, while for the classification tasks they are SVM and Optimal Trees. These datasets range in size, having 150 to 5,875 observations and 4 to 16 features. Data were split 50%–50% into train and test sets. The splits were repeated 100 times to account for randomness. Their group of methods improved the predictive performance of the models. Their method scored 86.1% average accuracy and average R-Squared (R2) of 0.339 compared to 84.4% and 0.315 R2 for the classification and regression data, respectively. However, no neural networks were used and the methods introduced have not been compared individually. Additionally, tuning was applied only to the group of proposed methods and not to the other imputation methods or the predictive models. We tune all of the imputation methods and models. We also extend methods with binary indicators and include NN-based imputation methods in our test bed, as well as MissForest. We also report multiple metrics (Accuracy, F1, and AUC) for the binary classification task. Finally, we have a wider range of datasets, both with native and simulated missing values.

8.1 Synopsis of Contributions Relative to the Related Work

Compared to the related work, we contribute in various ways. Our work can be directly compared to two other works that are conducted in an AutoML tool [19, 49]. Compared to these works, we include more datasets, more missingness mechanisms, neural network methods, and binary indicators in the experimental setup. For the first time, deep learning methods are compared to simple imputation methods in an AutoML predictive setting. One of the deep learning methods (BI+DAE) has the best average performance on real-world data with native missing values. Additionally, for the first time, the effect of the imputation methods on predictive performance is measured on datasets with both generated and native missing values. Until now, comparisons were conducted on only one of the two settings: half of the papers use real-world datasets with native missing values, while the other half use complete datasets with generated missing values. Contrary to the majority of the literature, we tune both imputation and predictive methods to fairly evaluate them. Only two of the eight related works mention tuning both imputation and predictive modeling methods [38, 49]. We also conducted experiments on more datasets compared to the majority of the literature, while unlike [30], our simulation setting is applied to all features in the datasets and not only one. Also, only one of the eight aforementioned works includes feature selection as part of the ML pipeline. Finally, meta-learning is used, for the first time, to identify useful data characteristics that could give insights into the choice of a simple vs. a sophisticated imputation method. In general, as shown in Table 5, our testbed is the most complete overall in terms of dataset selection, missingness selection, imputation method selection, and pipeline steps. This allows us to fairly evaluate imputation methods in a state-of-the-art AutoML environment.

9 LESSONS LEARNED AND CONTRIBUTIONS

The main insights that are drawn from our experimental results are the following:

Including BI in the dataset improves the predictive performance of the machine learning pipeline for most algorithms (see Section 5.1). The inclusion of BIs does increase the dimensionality, and hence the difficulty, of the machine learning task. However, it also encodes which values are missing; this allows a classifier to learn which values to trust. Results indicate that encoding this information turns out to be more beneficial than harmful.

BI+DAE is found to be the single best imputation method in real-world data with native missing values followed by BI+MM, which is the standard in AutoML tools. As seen in Section 5.2, both methods have the same number of wins (when comparing only BI extended methods) across datasets with BI+DAE having higher mean AUC. The worst performance is exhibited by matrix-factorization (linear dimensionality reduction) methods such as PPCA. These methods do scale with the number of features and may be more suitable for high-dimensional, low-sample datasets.

BI+MM exhibits the best tradeoff between efficiency and effectiveness. As expected (see Sections 5.3 and 6.3 and Appendix C.2.4), BI+MM is the fastest method to train and also is more effective in the majority of the comparisons. MF, due to its iterative nature, is the slowest among all closely followed by GAIN. GAIN’s main bottleneck is the number of epochs required to train the network. The authors’ suggestion was 10,000 epochs, which is 20 times more than the 500 epochs suggested by the authors of the DAE method.

Based on the results of Section 5.4, we suggest that practitioners optimize their models over the BI+MM and BI+DAE algorithms. Together, BI+MM and BI+DAE score over 99% of the maximum AUC in real-world data. Specifically, BI+MM scores 98.68% of the maximum AUC. Adding BI+DAE to the pipeline leads to 99.69% of the maximum AUC. However, this comes at the cost of increasing the configuration space by 10\(\times\), as DAE has nine tuning configurations compared to one for BI+MM. Also, to reach 100% of the optimal performance, we have to train 24 times more configurations than by simply using BI+MM.

BI+MF is the best method in datasets with simulated missing values. As shown in Section 6.1, in both MCAR and MAR simulations, BI+MF is on average the best. In contrast, BI+MF is the third best with real-world data, falling behind BI+DAE. Despite our best efforts to realistically simulate missing values, there may still be differences between real-world missing-data generative mechanisms and our simulations. First, we simulated MCAR and MAR missing values, whereas real-world missing values may be MNAR. Second, the missingness probability for MAR data is determined by a generalized linear model (logistic regression model), whereas real-world missing values may follow non-linear models. The majority of the literature employs similar simulations for comparing imputation algorithms. However, as indicated by this study, results with simulated missingness may not generalize to real-world datasets. New simulation methodologies need to be proposed to this end.

Missingness increase leads to a deterioration in predictive performance. As shown in Section 6.1, increasing missingness reduces the AutoML tool’s ability to predict the outcome. Missingness at 10% leads to a 0.024 AUC drop compared to the complete dataset. Similarly, 25% missingness leads to a 0.05 AUC drop, while at 50% we observe an average drop of up to 0.144, as seen in Tables 12(a) and (b).

The set containing BI+MM and BI+MF reaches 99% of the maximum AUC for simulated data, as shown in Section 6.2. BI+MM scores 98.7% of the maximum AUC for MCAR data and 98.99% for MAR data. To surpass 99% of the maximum AUC, the addition of BI+MF is needed. This addition allows the tool to reach 99.62% and 99.43% on MCAR and MAR data, respectively. However, BI+MF has to be tuned, leading to a total 3\(\times\) increase in pipeline complexity.

A meta-learning methodology to correlate meta-features with performance is presented in Section 7. It could allow scientists to select the appropriate sophisticated methods based on meta-features, saving training time and improving overall performance. In addition, it could provide insight into the design choice of an algorithm that leads to better or worse performance on a given dataset. Unfortunately, no statistically significant results were found. This means that either there are no correlations present with the selected meta-features, or these correlations are not strong enough to be found significant with the given sample size of 25 datasets.

There are, of course, several limitations of the study that we would like to point out. The results and conclusions stem from computational experiments with binary classification tasks within a particular range of feature counts, sample sizes, class imbalance, and missingness percentages. The MNAR missingness mechanism is not included in our experiments. Also, the mechanism for generating MAR data is based on a linear model. Results may differ for non-linear missingness generation and MNAR data. Despite the significant computational effort involved (optimizing over thousands of ML pipelines for each dataset), results stem from only 25 real-world datasets with native missing values and 60 complete datasets where missing values were introduced (10 original datasets times 2 missingness mechanisms (MCAR, MAR) times 3 missingness percentages). This fact limits the statistical power of our statistical tests. While JADBio is an effective AutoML tool, results should also be obtained from other AutoML tools to further generalize the conclusions. Another limitation of our work concerns the comparison of methods on only binary classification data. Even though imputation algorithms are unsupervised learning methods and do not use information from the target variable (in our work), results may vary according to the supervised task. Finally, we selected models based on the AUC score in the training set. Optimizing for another metric, such as accuracy or F1-score, during training may yield different results.

10 CONCLUSIONS

In this article, we conducted experiments on real-world datasets with native missing values and simulated missing values. We compared six imputation methods extended by binary indicators on a state-of-the-art AutoML tool. BI+DAE is the best method on real-world datasets with native missing values. However, BI+MM is comparable to, if not better than, the more sophisticated imputation methods in terms of predictive performance and efficiency on real-world data. Increasing missingness leads to predictive performance deterioration. Additionally, simulation data lead to contradicting results compared to real-world datasets. BI+DAE and BI+MM are the best methods on real-world data; however, when simulated data are considered BI+MF is the best method on average followed by BI+MM. Finally, meta-learning was employed but could not successfully find any patterns to predict whether a sophisticated imputation method can be used instead of the simple BI+MM to improve the downstream performance.

The results make us question whether advanced, multivariate imputation algorithms are really necessary for predictive modeling with AutoML. The simple BI+MM imputation is surprisingly effective and computationally efficient when the ML pipeline is properly tuned within an AutoML setting. BI features allow advanced classifiers to learn when to trust a value or not. Multivariate imputation algorithms try to learn the full joint distribution of the dataset, a task that is quite challenging and prone to error with low-sample, imbalanced, or high-dimensional data. It is also a very computationally demanding task. Imputing values for features that are redundant or irrelevant to the final model is a waste of computation. When imputing using multivariate imputation, one needs to store not only the final model (e.g., RF, SVM, or a NN) but also the imputation model to impute test samples. For some imputation models (Deep Neural Networks, or one RF for each feature as in MF) the additional storage may be non-negligible. In addition, the imputation model requires measuring all features and invalidates the efforts of feature selection. Arguably, the research effort that goes into novel and better-performing imputation methods would be more productively spent on novel and better-performing ways to natively handle missing values in our classification and feature selection algorithms.

APPENDICES

A DATASETS APPENDIX

A.1 Real-World Datasets with Native Missing Values

This section presents the 25 real-world binary classification datasets with native missing values. See Table 6 for more details.

A.2 Complete Datasets for Missing Data Simulation

This section presents the 10 complete binary classification datasets used for the simulated missing value experiments. Table 7 contains the dataset names and their characteristics.

B MISSING VALUE SIMULATION SETUP APPENDIX

B.1 Datasets Selected to Determine the Percentage of Missing Values per Feature.

Realistic simulation of missing values requires selecting the missingness percentage for each feature; see Section 4.1. We sampled 64 real-world datasets with missing values from the OpenML repository. Table 8 describes the datasets’ characteristics.

B.2 Determining the Average Number of Features on Which a Missing Feature Depends.

In this section, we present the quantitative results for the experiments regarding the simulation of MAR mechanism presented in Section 4. Table 9 presents the dataset name, the randomly selected feature with missing values (target) and the result of the feature selection (# features selected). On average a missing feature is dependent on 12 features.

C EXPERIMENTAL RESULTS APPENDIX

C.1 Real-world Results

C.1.1 BI Improves Performance across All Metrics.

BI extended methods perform better than their base methods when the AUC score is measured (see Section 5.1). As seen in Figure 11(b) and (d), BI indeed improves the accuracy and F1-score of the downstream task, in the majority of the datasets. Figure 11(a) and (c) illustrate the gain or loss of including BI for each imputation method in the x-axis. BI improves the performance of the downstream task, on average.
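As an illustration of what a BI-extended method feeds to the downstream learner, here is a minimal sketch of BI+MM, written under stated assumptions rather than taken from the AutoML tool. The helper names `fit_mm` and `transform_bi_mm` are hypothetical, missing values are represented by `None`, and one indicator is appended for every column (the tool may restrict indicators to columns that actually contain missing values). The fill values are learned on the training data only and reused for new samples.

```python
from statistics import mean, mode

def fit_mm(rows, categorical):
    """Learn one fill value per column from the observed training values:
    the mode for categorical columns, the mean otherwise."""
    fills = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        fills.append(mode(observed) if j in categorical else mean(observed))
    return fills

def transform_bi_mm(rows, fills):
    """Impute with the learned fill values and append one 0/1 missingness
    indicator per original column (1 means the value was missing)."""
    out = []
    for r in rows:
        imputed = [fills[j] if v is None else v for j, v in enumerate(r)]
        indicators = [1 if v is None else 0 for v in r]
        out.append(imputed + indicators)
    return out

# Toy data (made up): column 0 is continuous, column 1 is categorical.
train = [[1.0, "a"], [3.0, None], [None, "a"], [5.0, "b"]]
fills = fit_mm(train, categorical={1})
train_imp = transform_bi_mm(train, fills)
test_imp = transform_bi_mm([[None, None]], fills)
print(fills)
print(train_imp)
print(test_imp)
```

The indicator columns keep the information that a value was imputed, so a downstream classifier can treat a filled-in mean differently from an observed value.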

C.1.2 BI+DAE Is the Best Method Followed by BI+MM.

Table 10(a), (b), and (c) show the quantitative results of the real-world experiments. Specifically, they report the number of wins (ties included) for each imputation method, as well as the average difference in AUC from the winning imputation method for each dataset. The tables also report the average AUC, average AUC difference from MM (set as baseline), and average ranking per method. For completeness, we include methods without BI as well. BI+DAE and BI+MM are the two best methods across all metrics, in terms of average AUC and average ranking. BI+MM gets the highest number of wins in all metrics, while BI+DAE closely follows with one win for each AUC and two wins for F1 and accuracy. However, BI+DAE is more consistent and has the highest average ranking and highest score for AUC and F1. Finally, BI+DAE exhibits the highest improvement over MM and the lowest difference from the best method in each dataset, on average.

C.1.3 BI+DAE Is the Best Method across All BI Extended Methods.

As seen in Section 5.2, BI+DAE is the highest-ranked method for the AUC metric. Figure 12(a) and (b) illustrate the average ranking of BI extended methods for F1 and accuracy metrics, respectively. BI+DAE is the highest-ranked method for accuracy metric but is the third best for F1. BI+MM is the second-best in accuracy and the best in F1. Overall, across different metrics, the relative order varies slightly. Statistically significant results remain the same for AUC and accuracy. No statistically significant results are found for the F1-score.

C.1.4 BI+MM and BI+DAE Score 99% of the Maximum across All Metrics.

As shown in Tables 11(a), (b), (c), and Figure 13, BI+MM scores over 98% of the maximum performance for each metric. Adding BI+DAE, the next-best method, to the imputation set that already contains BI+MM allows the tool to score over 99.5% of the maximum performance. However, the complexity increases by 10\(\times\). Finally, to reach 100% of the maximum, all imputation methods need to be included in the imputation set. Including all methods in the pipeline of the tool increases the original complexity by a factor of 24.

C.1.5 BI+MM Exhibits the Best Tradeoff between Effectiveness and Efficiency.

This section presents the tradeoff between the effectiveness and efficiency of imputation methods against a baseline (BI+MM). For a detailed explanation of the illustration see Section 5.3. Figure 14(a) shows that BI+MM dominates the other imputation methods in both effectiveness and efficiency in 84 of 125 pairs for F1-score. In 41 pairs, BI+MM is dominated in effectiveness. For the accuracy metric, BI+MM dominates the other methods in relative effectiveness and efficiency in 86 of 125 pairs. As seen in Figure 14(b), BI+MM is dominated in only 39 pairs.
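The pair counts above can be read as a tally over (dataset, competing method) pairs. The sketch below shows one way such a tally could be computed. It is an assumption-laden illustration: the precise definition of dominance is the one given in Section 5.3 and is not reproduced here, the tie handling (`>=` and `<=`) is our own choice, and the scores and training times are invented toy numbers rather than results from the study.

```python
def count_dominance(baseline, others):
    """Tally (dataset, method) pairs by how the baseline compares.

    baseline: {dataset: (score, train_seconds)}
    others:   {method: {dataset: (score, train_seconds)}}
    """
    counts = {"baseline_dominates": 0, "dominated_in_effectiveness": 0, "other": 0}
    for method, results in others.items():
        for dataset, (score, secs) in results.items():
            b_score, b_secs = baseline[dataset]
            if b_score >= score and b_secs <= secs:
                # Baseline is at least as effective and at least as fast.
                counts["baseline_dominates"] += 1
            elif b_score < score:
                # Competitor achieves a higher score on this dataset.
                counts["dominated_in_effectiveness"] += 1
            else:
                # Baseline is at least as effective but slower.
                counts["other"] += 1
    return counts

# Invented toy numbers: two datasets, two hypothetical competitors X and Y.
baseline = {"d1": (0.90, 1.0), "d2": (0.80, 1.0)}
others = {
    "X": {"d1": (0.88, 50.0), "d2": (0.85, 60.0)},
    "Y": {"d1": (0.90, 0.5), "d2": (0.79, 30.0)},
}
counts = count_dominance(baseline, others)
print(counts)
```

With 25 datasets and 5 competing methods, a tally of this kind ranges over the 125 pairs referred to in this section.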

C.1.6 Feature Selection–enforced Pipelines Degrade the Performance.

As seen in Figure 15(a) and (b), feature selection deteriorates the performance of the pipelines. The average absolute difference between feature selection–enforced pipelines and non-enforced is less than 5% for accuracy. On average, the F1-score is lower by 5 points when enforcing feature selection.

C.2 Simulation Results

C.2.1 A Decline in Predictive Performance Is Caused by Increasing Missingness.

In this section, we investigate the average performance drop in terms of multiple metrics compared to the complete dataset. Specifically, Table 12(a), (c), and (e) report the results for MCAR missingness and the AUC, F1, and accuracy scores, respectively. Table 12(b), (d), and (f) present the results for MAR missing data for the AUC, F1, and accuracy scores, respectively. In summary, across all metrics and both missingness mechanisms, the performance of the tool deteriorates as missingness increases. For MCAR data, average AUC drops in absolute terms by up to 0.023 at 10% missingness and by up to 0.05 and 0.144 at 25% and 50%, respectively. When measuring F1, the loss is even larger. At 10%, the loss is up to 0.05, and at 25% it can reach up to 0.078 average absolute difference, while the average F1 loss at 50% missingness can be up to 0.158. The accuracy score deteriorates comparably to AUC. At 10% the loss can be up to 0.03, at 25% up to 0.05, and at 50% up to 0.118. For MAR data, results are similar to MCAR. It is worth noting that at 50% missingness, the performance does not deteriorate as much as for MCAR. This leads us to conclude that most methods in this comparative evaluation can recover information for high MAR missingness better than for MCAR. This is theoretically sound, as every method except MM uses multiple variables for the imputation.

C.2.2 BI+MF Is the Best Method for MCAR Data.

Table 12(a), (c), and (e) and Figure 16(b), (d), and (f) present the results for MCAR missing data at various missingness percentages and multiple metrics (AUC, F1, and Accuracy). Summarizing the results, across all missingness percentages and measured metrics, BI+MF is the best method in terms of average absolute loss to the complete data. The second best method, at 10%, is BI+DAE, while at 25% BI+MM is the second best for F1 and accuracy metrics. The relative order of imputation methods across metrics remains stable until 50% missingness. The order at 50% missingness may vary according to the performance metric. The two worst methods are BI+PPCA and BI+SOFT, while the positions of second-, third-, and fourth-best methods are shared by BI+MM, BI+GAIN, and BI+DAE.

C.2.3 BI+MF Is the Best Method for MAR Data.

As seen in Section 6.1, BI+MF is the best method for MAR data. Table 12(b), (d), and (f) provide an overview of the results for AUC, F1, and accuracy score. The results are robust across all metrics. Figure 17(b) and (d) show that BI+MF has the lowest average loss for 25% and 50% missingness in both F1 and accuracy metric. At 10% missingness, BI+MF has comparable performance to BI+MM, which has the lowest loss at that missingness rate. A detailed review of the results for the AUC metric follows.

Fig. 17. Panels (a) and (c) denote the tradeoff in terms of effectiveness and efficiency between BI+MM and the other imputation methods. Panels (b) and (d) show the difference from complete data for each imputation method at various missingness levels. BI+MF is the best method for MAR data. However, BI+MM exhibits good performance at a fraction of the cost.

C.2.4 BI+MM Exhibits the Best Efficiency vs. Effectiveness Tradeoff for MCAR Missing data.

Regarding the MCAR missing data, it is worth noting that BI+MM dominates in over 90 of 147 pairs, for each performance metric. Specifically, for AUC metric BI+MM dominates in 98 pairs, for F1-score in 90 pairs, and for accuracy in 95 pairs. BI+MM is never dominated in efficiency, as seen in Figure 16(a), (c), and (e). This is not surprising, as BI+MM is significantly faster to train than any other imputation method. BI+MM is only dominated by BI+MF across all metrics in average effectiveness. BI+PPCA and BI+SOFT are on average less effective and less efficient than BI+MM.

C.2.5 BI+MM Exhibits the Best Efficiency vs. Effectiveness Tradeoff for MAR Missing Data.

In Section 6.3, we presented the efficiency–effectiveness tradeoff for the MAR simulated data when AUC is reported. We extend and confirm our conclusion in this section by measuring and computing the tradeoff for the F1-score and classification accuracy metrics. As seen in Figure 17(a) and (c), BI+MM is never dominated in terms of efficiency, as expected. BI+MM dominates the other methods in effectiveness and efficiency in 94 and 98 pairs for the F1-score and accuracy score, respectively. For F1, it is dominated in terms of relative effectiveness in 53 of 147 pairs, while it is dominated for accuracy in 49 pairs. BI+MF is on average more effective than BI+MM for MAR data. However, it is thousands of times less efficient (up to 90,000 times for big datasets).

C.2.6 BI+MM and BI+MF Score over 99% of Maximum Performance for MCAR Data.

The minimal-size subset of algorithms with close-to-optimal performance for MCAR missing data is \(\{BI+MM, BI+MF\}\). We used the simple greedy algorithm introduced in Section 5.4. As illustrated in Figure 18(a), this subset achieves over 99% of the maximum achievable AUC for MCAR data. The subset containing \(\{BI+MM, BI+MF\}\) also scores over 99% of the maximum performance score for both F1 and accuracy, as denoted in Figure 18(b) and (c). Detailed quantitative results are presented in Table 13(a), (c), and (e) for the AUC, F1, and accuracy score, respectively.
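For readers without Section 5.4 at hand, the following is a minimal sketch of a greedy forward selection of this kind. It is written under assumptions and may differ from the algorithm actually used: here the "percentage of maximum" is taken to be the mean of per-dataset best scores within the subset divided by the same mean over all methods, pipeline complexity is not tracked, and the method names and scores are invented toy values.

```python
def greedy_subset(scores, target=0.99):
    """Greedily add the method that most raises the mean of per-dataset best
    scores, until `target` (a fraction of the maximum) is reached.

    scores: {method: [score on dataset 0, score on dataset 1, ...]}
    Returns a list of (added_method, fraction_of_maximum_so_far).
    """
    methods = list(scores)
    n = len(scores[methods[0]])
    # Best achievable mean score when every method is available.
    max_mean = sum(max(scores[m][i] for m in methods) for i in range(n)) / n

    def fraction(subset):
        best = [max(scores[m][i] for m in subset) for i in range(n)]
        return (sum(best) / n) / max_mean

    chosen, trace = [], []
    while len(chosen) < len(methods):
        candidate = max((m for m in methods if m not in chosen),
                        key=lambda m: fraction(chosen + [m]))
        chosen.append(candidate)
        trace.append((candidate, fraction(chosen)))
        if trace[-1][1] >= target:
            break
    return trace

# Invented toy scores for three hypothetical methods on three datasets.
toy = {
    "A": [0.90, 0.75, 0.80],
    "B": [0.80, 0.90, 0.70],
    "C": [0.85, 0.60, 0.85],
}
trace = greedy_subset(toy, target=0.98)
print(trace)
```

In this toy run the method with the best average is picked first, and a second method is added only because it wins on a dataset where the first one is weak, which mirrors how BI+MF complements BI+MM in the results above.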

Fig. 18. MCAR: The percentage of maximum metric (AUC, ACC, F1) achieved by each set of imputation methods versus the increased complexity of the pipelines. On the x-axis, we denote the added method as well as the additional complexity to the configuration space. The addition of BI+MF to the set increases the complexity by 3\(\times\) compared to the original complexity. BI+MF and BI+MM allow us to recover over 99% of the maximum performance for MCAR data.

C.2.7 BI+MM and BI+MF Score over 99% of Maximum Performance for MAR Data.

The minimal-size subset of algorithms with close-to-optimal performance for MAR missing data is \(\{BI+MM, BI+MF\}\), as presented in Section 6.2. In this section, we present results for the F1 and accuracy scores. We used the simple greedy algorithm introduced in Section 5.4. The subset containing \(\{BI+MM, BI+MF\}\) also scores over 99% of the maximum performance score for both F1 and accuracy, as denoted in Figure 19(a) and (b). Detailed quantitative results are presented in Table 13(b), (d), and (f) for the AUC, F1, and accuracy score, respectively.

Fig. 19. MAR: The percentage of maximum metric (ACC, F1) achieved by each set of imputation methods versus the increased complexity of the pipelines. On the x-axis, we denote the added method as well as the additional complexity to the configuration space. The addition of BI+MF to the set increases the complexity by 3 \(\times\) compared to the original complexity. BI+MF and BI+MM allow us to recover over 99% of the maximum metric for MAR data.

C.3 Evaluation of Imputation Accuracy

In this section, we present the results of imputation accuracy experiments. We measure the imputation accuracy for each imputation method, missing mechanism, and missingness percentage, by training the imputation methods on the default configurations (highlighted in Table 2). For each dataset, we measure the average R2-score between the imputed and the complete values (clamped between the 0–1 range) for the continuous features. For the categorical features, we measure the accuracy score between the imputed and the complete values. We measure the imputation scores in both train and test sets.
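The two scores described above can be sketched as follows. This is an illustrative sketch, not the evaluation code of the study: the function names are hypothetical, the scores are assumed to be computed on the values that were masked, and returning 0 for a constant column is our own convention.

```python
def clamped_r2(true_vals, imputed_vals):
    """R2-score between complete and imputed values, clamped to [0, 1]."""
    mean_true = sum(true_vals) / len(true_vals)
    ss_res = sum((t - p) ** 2 for t, p in zip(true_vals, imputed_vals))
    ss_tot = sum((t - mean_true) ** 2 for t in true_vals)
    if ss_tot == 0:
        return 0.0  # assumption: a constant column is scored as 0
    return min(1.0, max(0.0, 1.0 - ss_res / ss_tot))

def categorical_accuracy(true_vals, imputed_vals):
    """Fraction of imputed categorical values that match the complete values."""
    matches = sum(t == p for t, p in zip(true_vals, imputed_vals))
    return matches / len(true_vals)

# Toy values (made up) for one continuous and one categorical feature.
truth = [1.0, 2.0, 3.0, 4.0]
r2_mean_fill = clamped_r2(truth, [2.5, 2.5, 2.5, 2.5])   # constant fill
r2_reversed = clamped_r2(truth, [4.0, 3.0, 2.0, 1.0])    # worse than constant
r2_perfect = clamped_r2(truth, truth)
acc = categorical_accuracy(["a", "b", "a"], ["a", "a", "a"])
print(r2_mean_fill, r2_reversed, r2_perfect, acc)
```

A constant fill explains none of the variance and scores 0, and any imputation that is worse than a constant has a negative raw R2 that is clamped to 0 as well.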

In summary, MF has the highest average imputation accuracy for both categorical and continuous features. For both MCAR and MAR data, MF is the best method. Additionally, we observe that the differences between MF and the other methods are more prominent in MAR data. The results remain largely the same across the train and test sets. MM is one of the worst-performing methods in terms of imputation accuracy. However, MM performs similarly to MF when measuring the downstream task performance, as seen in Sections 5 and 6. This observation further supports our original hypothesis that higher imputation accuracy does not necessarily lead to better downstream task performance.

C.3.1 MF Has the Highest Imputation Accuracy for MAR Data.

For MAR data, MF has the highest average R2 and accuracy score in the train data, as seen in Figure 20(a) and (b). Figure 20(c) and (d) show that MF imputes values more accurately on test MAR data. In general, MF is the most accurate method for all missingness percentages. MM, as expected, has the lowest R2-score as it does not predict any of the variance of the continuous data. One interesting observation is that SOFT imputation fails to generalize on test data. As missingness increases, the imputation methods make worse predictions leading to lower scores.

Fig. 20. Panel (a) depicts the imputation R2-score for the continuous variables in MAR train data. Panel (b) illustrates the imputation accuracy in MAR train data, for categorical variables. Panel (c) shows the R2-score for continuous variables in MAR test data and panel (d) the accuracy score for categorical data in MAR test data.

C.3.2 MF Has the Highest Imputation Accuracy for MCAR Data.

For MCAR data, MF is the most accurate imputation method in the train data, as seen in Figure 21(a) and (b). Figure 21(c) and (d) show that MF has the highest average R2 and accuracy score on test MCAR data, across all missingness percentages. MM, as expected, has the lowest R2-score. SOFT fails to generalize on new unseen data. Finally, as missingness increases, the quality of imputed values deteriorates.

Fig. 21. Panel (a) depicts the imputation R2-score for the continuous variables in MCAR train data. Panel (b) illustrates the imputation accuracy in MCAR train data for categorical variables. Panel (c) shows the R2-score for continuous variables in MCAR test data and panel (d) shows the accuracy score for categorical data in MCAR test data.

C.4 Real World: Downstream Task Results

In this section, we include the quantitative results of the real-world experiments. Table 14 contains the AUC results, Table 15 the F1-score results, and Table 16 the results for the accuracy metric.

Table 14.
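| Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| analcatdata_reviewer-FS | 0.585 | 0.585 | 0.5 | 0.597 | 0.561 | 0.595 | 0.602 | 0.602 | 0.558 | 0.558 | 0.585 | 0.585 |
| analcatdata_reviewer-NOFS | 0.661 | 0.668 | 0.599 | 0.646 | 0.602 | 0.61 | 0.597 | 0.659 | 0.606 | 0.656 | 0.661 | 0.668 |
| analcatdata_reviewer-Overall | 0.661 | 0.668 | 0.605 | 0.643 | 0.607 | 0.63 | 0.597 | 0.659 | 0.606 | 0.656 | 0.661 | 0.668 |
| anneal-FS | 0.883 | 0.988 | 0.761 | 0.97 | 0.916 | 0.973 | 0.987 | 0.967 | 0.995 | 0.986 | 0.975 | 0.968 |
| anneal-NOFS | 0.881 | 0.996 | 0.938 | 0.97 | 0.931 | 0.983 | 0.972 | 0.991 | 0.996 | 0.996 | 0.982 | 0.982 |
| anneal-Overall | 0.883 | 0.996 | 0.943 | 0.969 | 0.896 | 0.991 | 0.972 | 0.991 | 0.996 | 0.996 | 0.975 | 0.982 |
| audiology-FS | 0.986 | 0.986 | 0.98 | 0.98 | 0.98 | 0.974 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 |
| audiology-NOFS | 0.998 | 0.998 | 0.992 | 0.993 | 0.994 | 0.991 | 0.992 | 0.998 | 0.995 | 0.989 | 0.993 | 0.995 |
| audiology-Overall | 0.998 | 0.998 | 0.98 | 0.981 | 0.993 | 0.992 | 0.992 | 0.98 | 0.995 | 0.989 | 0.993 | 0.98 |
| autoHorse-FS | 0.967 | 0.967 | 0.966 | 0.966 | 0.966 | 0.966 | 0.966 | 0.966 | 0.901 | 0.996 | 0.966 | 0.966 |
| autoHorse-NOFS | 0.99 | 0.989 | 0.983 | 0.988 | 0.982 | 0.988 | 0.981 | 0.99 | 0.989 | 0.993 | 0.976 | 0.976 |
| autoHorse-Overall | 0.99 | 0.989 | 0.966 | 0.966 | 0.987 | 0.988 | 0.981 | 0.99 | 0.989 | 0.993 | 0.966 | 0.966 |
| braziltourism-FS | 0.634 | 0.634 | 0.64 | 0.64 | 0.643 | 0.634 | 0.64 | 0.64 | 0.725 | 0.725 | 0.643 | 0.643 |
| braziltourism-NOFS | 0.616 | 0.727 | 0.716 | 0.725 | 0.721 | 0.709 | 0.709 | 0.725 | 0.669 | 0.668 | 0.731 | 0.715 |
| braziltourism-Overall | 0.616 | 0.727 | 0.643 | 0.64 | 0.632 | 0.64 | 0.709 | 0.64 | 0.669 | 0.668 | 0.731 | 0.715 |
| bridges-FS | 0.844 | 0.844 | 0.853 | 0.857 | 0.891 | 0.849 | 0.891 | 0.891 | 0.892 | 0.885 | 0.88 | 0.847 |
| bridges-NOFS | 0.909 | 0.911 | 0.902 | 0.901 | 0.882 | 0.916 | 0.902 | 0.889 | 0.901 | 0.916 | 0.915 | 0.912 |
| bridges-Overall | 0.909 | 0.911 | 0.902 | 0.909 | 0.905 | 0.909 | 0.902 | 0.889 | 0.901 | 0.916 | 0.915 | 0.912 |
| cjs-FS | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.987 | 0.997 | 1.0 | 1.0 |
| cjs-NOFS | 1.0 | 1.0 | 0.994 | 1.0 | 0.985 | 0.998 | 0.996 | 0.991 | 0.987 | 0.99 | 0.999 | 0.999 |
| cjs-Overall | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.987 | 0.997 | 1.0 | 1.0 |
| colic-FS | 0.829 | 0.829 | 0.837 | 0.855 | 0.845 | 0.839 | 0.836 | 0.836 | 0.839 | 0.839 | 0.83 | 0.83 |
| colic-NOFS | 0.839 | 0.838 | 0.853 | 0.863 | 0.845 | 0.872 | 0.842 | 0.865 | 0.848 | 0.858 | 0.852 | 0.856 |
| colic-Overall | 0.829 | 0.829 | 0.849 | 0.881 | 0.846 | 0.862 | 0.836 | 0.865 | 0.839 | 0.858 | 0.83 | 0.83 |
| colleges_aaup-FS | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.996 | 0.996 | 0.999 | 0.999 |
| colleges_aaup-NOFS | 0.999 | 0.999 | 0.997 | 0.999 | 0.997 | 0.997 | 0.998 | 0.997 | 0.998 | 0.998 | 0.998 | 0.997 |
| colleges_aaup-Overall | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.998 | 0.998 | 0.999 | 0.999 |
| cylinder-bands-FS | 0.808 | 0.785 | 0.81 | 0.788 | 0.808 | 0.797 | 0.808 | 0.798 | 0.723 | 0.723 | 0.832 | 0.826 |
| cylinder-bands-NOFS | 0.82 | 0.819 | 0.826 | 0.832 | 0.826 | 0.828 | 0.815 | 0.821 | 0.807 | 0.824 | 0.848 | 0.858 |
| cylinder-bands-Overall | 0.82 | 0.819 | 0.832 | 0.831 | 0.813 | 0.816 | 0.815 | 0.821 | 0.807 | 0.824 | 0.848 | 0.858 |
| dresses-sales-FS | 0.562 | 0.562 | 0.564 | 0.561 | 0.56 | 0.552 | 0.565 | 0.565 | 0.5 | 0.549 | 0.562 | 0.562 |
| dresses-sales-NOFS | 0.619 | 0.605 | 0.6 | 0.597 | 0.569 | 0.583 | 0.597 | 0.601 | 0.545 | 0.539 | 0.631 | 0.62 |
| dresses-sales-Overall | 0.619 | 0.562 | 0.564 | 0.561 | 0.567 | 0.552 | 0.565 | 0.565 | 0.545 | 0.539 | 0.631 | 0.562 |
| eucalyptus-FS | 0.833 | 0.833 | 0.821 | 0.82 | 0.823 | 0.835 | 0.842 | 0.817 | 0.75 | 0.816 | 0.807 | 0.834 |
| eucalyptus-NOFS | 0.777 | 0.777 | 0.778 | 0.778 | 0.778 | 0.777 | 0.779 | 0.844 | 0.82 | 0.824 | 0.849 | 0.855 |
| eucalyptus-Overall | 0.833 | 0.833 | 0.778 | 0.778 | 0.832 | 0.836 | 0.842 | 0.817 | 0.82 | 0.824 | 0.849 | 0.855 |
| hepatitis-FS | 0.748 | 0.748 | 0.834 | 0.83 | 0.803 | 0.807 | 0.673 | 0.673 | 0.799 | 0.799 | 0.826 | 0.826 |
| hepatitis-NOFS | 0.826 | 0.848 | 0.866 | 0.866 | 0.844 | 0.85 | 0.869 | 0.864 | 0.876 | 0.866 | 0.852 | 0.869 |
| hepatitis-Overall | 0.826 | 0.848 | 0.867 | 0.858 | 0.858 | 0.866 | 0.869 | 0.864 | 0.799 | 0.799 | 0.852 | 0.869 |
| hungarian-FS | 0.899 | 0.899 | 0.883 | 0.884 | 0.893 | 0.867 | 0.871 | 0.871 | 0.897 | 0.897 | 0.875 | 0.875 |
| hungarian-NOFS | 0.918 | 0.915 | 0.901 | 0.901 | 0.917 | 0.915 | 0.881 | 0.895 | 0.895 | 0.898 | 0.914 | 0.912 |
| hungarian-Overall | 0.918 | 0.915 | 0.887 | 0.888 | 0.913 | 0.918 | 0.871 | 0.871 | 0.895 | 0.897 | 0.914 | 0.912 |
| kdd_el_nino-small-FS | 0.983 | 0.983 | 0.98 | 0.981 | 0.988 | 0.987 | 0.983 | 0.981 | 0.95 | 0.926 | 0.983 | 0.986 |
| kdd_el_nino-small-NOFS | 0.987 | 0.989 | 0.984 | 0.985 | 0.988 | 0.988 | 0.985 | 0.988 | 0.98 | 0.985 | 0.986 | 0.987 |
| kdd_el_nino-small-Overall | 0.987 | 0.989 | 0.982 | 0.986 | 0.988 | 0.987 | 0.985 | 0.988 | 0.98 | 0.985 | 0.986 | 0.987 |
| mushroom-FS | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| mushroom-NOFS | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| mushroom-Overall | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| pbcseq-FS | 0.849 | 0.849 | 0.851 | 0.852 | 0.84 | 0.849 | 0.836 | 0.831 | 0.849 | 0.849 | 0.846 | 0.843 |
| pbcseq-NOFS | 0.849 | 0.85 | 0.857 | 0.844 | 0.856 | 0.848 | 0.846 | 0.845 | 0.841 | 0.838 | 0.848 | 0.842 |
| pbcseq-Overall | 0.849 | 0.85 | 0.851 | 0.85 | 0.851 | 0.849 | 0.846 | 0.845 | 0.841 | 0.838 | 0.848 | 0.842 |
| primary-tumor-FS | 0.875 | 0.887 | 0.855 | 0.875 | 0.846 | 0.874 | 0.829 | 0.882 | 0.829 | 0.864 | 0.786 | 0.887 |
| primary-tumor-NOFS | 0.88 | 0.892 | 0.875 | 0.875 | 0.871 | 0.88 | 0.877 | 0.889 | 0.861 | 0.87 | 0.88 | 0.892 |
| primary-tumor-Overall | 0.88 | 0.892 | 0.867 | 0.892 | 0.886 | 0.87 | 0.877 | 0.889 | 0.861 | 0.87 | 0.88 | 0.892 |
| profb-FS | 0.642 | 0.642 | 0.642 | 0.642 | 0.642 | 0.642 | 0.642 | 0.642 | 0.631 | 0.631 | 0.642 | 0.642 |
| profb-NOFS | 0.695 | 0.696 | 0.691 | 0.693 | 0.695 | 0.692 | 0.695 | 0.696 | 0.579 | 0.58 | 0.693 | 0.695 |
| profb-Overall | 0.695 | 0.696 | 0.692 | 0.694 | 0.692 | 0.686 | 0.695 | 0.696 | 0.631 | 0.631 | 0.693 | 0.695 |
| schizo-FS | 0.526 | 0.526 | 0.556 | 0.557 | 0.534 | 0.515 | 0.556 | 0.556 | 0.543 | 0.543 | 0.623 | 0.623 |
| schizo-NOFS | 0.719 | 0.684 | 0.764 | 0.765 | 0.717 | 0.74 | 0.76 | 0.76 | 0.56 | 0.559 | 0.794 | 0.805 |
| schizo-Overall | 0.719 | 0.684 | 0.781 | 0.772 | 0.736 | 0.751 | 0.76 | 0.76 | 0.543 | 0.543 | 0.794 | 0.805 |
| sick-FS | 0.984 | 0.984 | 0.992 | 0.993 | 0.993 | 0.992 | 0.977 | 0.969 | 0.991 | 0.991 | 0.992 | 0.992 |
| sick-NOFS | 0.99 | 0.989 | 0.994 | 0.994 | 0.992 | 0.995 | 0.986 | 0.988 | 0.992 | 0.989 | 0.993 | 0.992 |
| sick-Overall | 0.99 | 0.989 | 0.994 | 0.994 | 0.992 | 0.993 | 0.986 | 0.988 | 0.992 | 0.989 | 0.993 | 0.992 |
| soybean-FS | 0.981 | 0.983 | 0.983 | 0.992 | 0.991 | 0.981 | 0.985 | 0.985 | 0.973 | 0.973 | 0.991 | 0.991 |
| soybean-NOFS | 0.991 | 0.994 | 0.989 | 0.986 | 0.989 | 0.992 | 0.987 | 0.991 | 0.991 | 0.985 | 0.989 | 0.993 |
| soybean-Overall | 0.981 | 0.994 | 0.985 | 0.991 | 0.99 | 0.991 | 0.987 | 0.991 | 0.991 | 0.985 | 0.989 | 0.993 |
| stress-FS | 0.916 | 0.916 | 0.932 | 0.932 | 0.932 | 0.932 | 0.932 | 0.932 | 0.933 | 0.933 | 0.932 | 0.932 |
| stress-NOFS | 0.902 | 0.904 | 0.899 | 0.903 | 0.906 | 0.904 | 0.902 | 0.901 | 0.948 | 0.946 | 0.909 | 0.909 |
| stress-Overall | 0.916 | 0.916 | 0.932 | 0.932 | 0.932 | 0.932 | 0.932 | 0.932 | 0.933 | 0.933 | 0.909 | 0.932 |
| vote-FS | 0.983 | 0.985 | 0.992 | 0.995 | 0.986 | 0.99 | 0.991 | 0.991 | 0.989 | 0.991 | 0.978 | 0.986 |
| vote-NOFS | 0.992 | 0.991 | 0.994 | 0.991 | 0.994 | 0.99 | 0.995 | 0.991 | 0.995 | 0.992 | 0.992 | 0.992 |
| vote-Overall | 0.992 | 0.991 | 0.992 | 0.991 | 0.995 | 0.992 | 0.991 | 0.991 | 0.989 | 0.991 | 0.992 | 0.992 |
| water-treatment-FS | 0.916 | 0.988 | 0.986 | 0.987 | 0.958 | 0.988 | 0.943 | 0.943 | 0.5 | 0.5 | 0.962 | 0.979 |
| water-treatment-NOFS | 0.988 | 0.988 | 0.988 | 0.987 | 0.988 | 0.988 | 0.954 | 0.986 | 0.788 | 0.774 | 0.98 | 0.981 |
| water-treatment-Overall | 0.988 | 0.988 | 0.987 | 0.987 | 0.988 | 0.988 | 0.954 | 0.986 | 0.788 | 0.774 | 0.98 | 0.981 |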
  • -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

Table 14. Real-world Results for the AUC Metric

Table 15.
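| Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| analcatdata_reviewer-FS | 0.603 | 0.603 | 0.603 | 0.603 | 0.603 | 0.603 | 0.608 | 0.608 | 0.603 | 0.603 | 0.603 | 0.603 |
| analcatdata_reviewer-NOFS | 0.635 | 0.656 | 0.609 | 0.653 | 0.606 | 0.636 | 0.624 | 0.653 | 0.632 | 0.645 | 0.635 | 0.656 |
| analcatdata_reviewer-Overall | 0.635 | 0.656 | 0.614 | 0.635 | 0.612 | 0.607 | 0.624 | 0.653 | 0.632 | 0.645 | 0.635 | 0.656 |
| anneal-FS | 0.905 | 0.988 | 0.88 | 0.99 | 0.928 | 0.991 | 0.971 | 0.987 | 0.981 | 0.976 | 0.966 | 0.99 |
| anneal-NOFS | 0.903 | 0.99 | 0.93 | 0.987 | 0.942 | 0.977 | 0.961 | 0.99 | 0.985 | 0.993 | 0.975 | 0.983 |
| anneal-Overall | 0.905 | 0.99 | 0.933 | 0.987 | 0.924 | 0.99 | 0.961 | 0.99 | 0.985 | 0.993 | 0.966 | 0.983 |
| audiology-FS | 0.931 | 0.931 | 0.926 | 0.926 | 0.926 | 0.909 | 0.926 | 0.926 | 0.926 | 0.926 | 0.926 | 0.926 |
| audiology-NOFS | 0.966 | 0.966 | 0.931 | 0.945 | 0.931 | 0.931 | 0.933 | 0.966 | 0.949 | 0.918 | 0.931 | 0.966 |
| audiology-Overall | 0.966 | 0.966 | 0.926 | 0.893 | 0.926 | 0.926 | 0.933 | 0.926 | 0.949 | 0.918 | 0.931 | 0.926 |
| autoHorse-FS | 0.976 | 0.976 | 0.968 | 0.968 | 0.968 | 0.968 | 0.968 | 0.968 | 0.889 | 0.984 | 0.968 | 0.968 |
| autoHorse-NOFS | 0.984 | 0.984 | 0.976 | 0.984 | 0.976 | 0.984 | 0.976 | 0.976 | 0.976 | 0.976 | 0.976 | 0.976 |
| autoHorse-Overall | 0.984 | 0.984 | 0.968 | 0.968 | 0.976 | 0.984 | 0.976 | 0.976 | 0.976 | 0.976 | 0.968 | 0.968 |
| braziltourism-FS | 0.893 | 0.893 | 0.885 | 0.885 | 0.893 | 0.891 | 0.885 | 0.885 | 0.895 | 0.895 | 0.893 | 0.893 |
| braziltourism-NOFS | 0.872 | 0.877 | 0.883 | 0.876 | 0.882 | 0.875 | 0.879 | 0.881 | 0.881 | 0.88 | 0.878 | 0.876 |
| braziltourism-Overall | 0.872 | 0.877 | 0.885 | 0.885 | 0.885 | 0.885 | 0.879 | 0.885 | 0.881 | 0.88 | 0.878 | 0.876 |
| bridges-FS | 0.778 | 0.778 | 0.8 | 0.8 | 0.837 | 0.809 | 0.829 | 0.829 | 0.809 | 0.816 | 0.818 | 0.783 |
| bridges-NOFS | 0.8 | 0.824 | 0.829 | 0.8 | 0.792 | 0.83 | 0.83 | 0.808 | 0.816 | 0.821 | 0.83 | 0.8 |
| bridges-Overall | 0.8 | 0.824 | 0.826 | 0.815 | 0.83 | 0.808 | 0.83 | 0.808 | 0.816 | 0.821 | 0.83 | 0.8 |
| cjs-FS | 1.0 | 1.0 | 1.0 | 1.0 | 0.999 | 0.997 | 1.0 | 1.0 | 0.931 | 0.949 | 0.988 | 1.0 |
| cjs-NOFS | 1.0 | 1.0 | 0.984 | 0.981 | 0.973 | 0.974 | 0.974 | 0.956 | 0.931 | 0.943 | 0.976 | 0.981 |
| cjs-Overall | 1.0 | 1.0 | 1.0 | 1.0 | 0.999 | 0.999 | 1.0 | 1.0 | 0.931 | 0.949 | 0.988 | 1.0 |
| colic-FS | 0.906 | 0.906 | 0.891 | 0.895 | 0.886 | 0.899 | 0.876 | 0.876 | 0.877 | 0.877 | 0.896 | 0.896 |
| colic-NOFS | 0.886 | 0.892 | 0.877 | 0.897 | 0.902 | 0.89 | 0.893 | 0.909 | 0.878 | 0.889 | 0.881 | 0.873 |
| colic-Overall | 0.906 | 0.906 | 0.898 | 0.902 | 0.902 | 0.898 | 0.876 | 0.909 | 0.877 | 0.889 | 0.896 | 0.896 |
| colleges_aaup-FS | 0.994 | 0.988 | 0.993 | 0.988 | 0.988 | 0.988 | 0.993 | 0.993 | 0.985 | 0.985 | 0.994 | 0.993 |
| colleges_aaup-NOFS | 0.987 | 0.987 | 0.985 | 0.987 | 0.985 | 0.983 | 0.987 | 0.985 | 0.989 | 0.989 | 0.989 | 0.983 |
| colleges_aaup-Overall | 0.994 | 0.988 | 0.993 | 0.988 | 0.988 | 0.988 | 0.993 | 0.993 | 0.989 | 0.989 | 0.994 | 0.993 |
| cylinder-bands-FS | 0.796 | 0.81 | 0.8 | 0.81 | 0.797 | 0.813 | 0.793 | 0.819 | 0.81 | 0.81 | 0.832 | 0.827 |
| cylinder-bands-NOFS | 0.816 | 0.805 | 0.827 | 0.826 | 0.814 | 0.808 | 0.806 | 0.802 | 0.799 | 0.825 | 0.844 | 0.852 |
| cylinder-bands-Overall | 0.816 | 0.805 | 0.829 | 0.832 | 0.81 | 0.8 | 0.806 | 0.802 | 0.799 | 0.825 | 0.844 | 0.852 |
| dresses-sales-FS | 0.592 | 0.592 | 0.592 | 0.592 | 0.595 | 0.592 | 0.592 | 0.592 | 0.592 | 0.592 | 0.592 | 0.592 |
| dresses-sales-NOFS | 0.597 | 0.601 | 0.592 | 0.598 | 0.594 | 0.592 | 0.605 | 0.601 | 0.593 | 0.596 | 0.602 | 0.602 |
| dresses-sales-Overall | 0.597 | 0.592 | 0.592 | 0.592 | 0.592 | 0.592 | 0.592 | 0.592 | 0.593 | 0.596 | 0.602 | 0.592 |
| eucalyptus-FS | 0.672 | 0.672 | 0.644 | 0.644 | 0.664 | 0.677 | 0.68 | 0.659 | 0.603 | 0.634 | 0.649 | 0.654 |
| eucalyptus-NOFS | 0.638 | 0.638 | 0.638 | 0.638 | 0.638 | 0.638 | 0.64 | 0.678 | 0.65 | 0.652 | 0.697 | 0.691 |
| eucalyptus-Overall | 0.672 | 0.672 | 0.638 | 0.638 | 0.664 | 0.676 | 0.68 | 0.659 | 0.65 | 0.652 | 0.697 | 0.691 |
| hepatitis-FS | 0.912 | 0.912 | 0.902 | 0.932 | 0.932 | 0.917 | 0.912 | 0.912 | 0.896 | 0.896 | 0.932 | 0.932 |
| hepatitis-NOFS | 0.91 | 0.91 | 0.916 | 0.928 | 0.91 | 0.904 | 0.919 | 0.924 | 0.932 | 0.925 | 0.912 | 0.913 |
| hepatitis-Overall | 0.91 | 0.91 | 0.931 | 0.912 | 0.913 | 0.917 | 0.919 | 0.924 | 0.896 | 0.896 | 0.912 | 0.913 |
| hungarian-FS | 0.81 | 0.81 | 0.79 | 0.789 | 0.803 | 0.752 | 0.78 | 0.78 | 0.783 | 0.783 | 0.77 | 0.77 |
| hungarian-NOFS | 0.826 | 0.824 | 0.804 | 0.804 | 0.814 | 0.826 | 0.817 | 0.807 | 0.8 | 0.81 | 0.81 | 0.814 |
| hungarian-Overall | 0.826 | 0.824 | 0.79 | 0.79 | 0.806 | 0.821 | 0.78 | 0.78 | 0.8 | 0.783 | 0.81 | 0.814 |
| kdd_el_nino-small-FS | 0.897 | 0.897 | 0.908 | 0.91 | 0.917 | 0.913 | 0.919 | 0.897 | 0.83 | 0.777 | 0.918 | 0.932 |
| kdd_el_nino-small-NOFS | 0.925 | 0.928 | 0.919 | 0.915 | 0.924 | 0.927 | 0.9 | 0.923 | 0.898 | 0.916 | 0.922 | 0.921 |
| kdd_el_nino-small-Overall | 0.925 | 0.928 | 0.911 | 0.918 | 0.919 | 0.918 | 0.9 | 0.923 | 0.898 | 0.916 | 0.922 | 0.921 |
| mushroom-FS | 1.0 | 1.0 | 0.999 | 1.0 | 1.0 | 1.0 | 0.999 | 0.999 | 0.996 | 0.997 | 1.0 | 1.0 |
| mushroom-NOFS | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.999 | 0.998 | 1.0 | 1.0 |
| mushroom-Overall | 1.0 | 1.0 | 0.999 | 1.0 | 1.0 | 1.0 | 0.999 | 0.999 | 0.996 | 0.998 | 1.0 | 1.0 |
| pbcseq-FS | 0.776 | 0.776 | 0.788 | 0.796 | 0.77 | 0.779 | 0.763 | 0.772 | 0.779 | 0.779 | 0.786 | 0.767 |
| pbcseq-NOFS | 0.775 | 0.779 | 0.787 | 0.779 | 0.785 | 0.792 | 0.779 | 0.78 | 0.773 | 0.774 | 0.785 | 0.775 |
| pbcseq-Overall | 0.775 | 0.779 | 0.781 | 0.783 | 0.785 | 0.777 | 0.779 | 0.78 | 0.773 | 0.774 | 0.785 | 0.775 |
| primary-tumor-FS | 0.753 | 0.703 | 0.688 | 0.692 | 0.705 | 0.711 | 0.652 | 0.704 | 0.652 | 0.673 | 0.585 | 0.703 |
| primary-tumor-NOFS | 0.735 | 0.712 | 0.717 | 0.717 | 0.744 | 0.703 | 0.736 | 0.703 | 0.736 | 0.702 | 0.735 | 0.712 |
| primary-tumor-Overall | 0.735 | 0.712 | 0.714 | 0.738 | 0.744 | 0.699 | 0.736 | 0.703 | 0.736 | 0.702 | 0.735 | 0.712 |
| profb-FS | 0.548 | 0.548 | 0.548 | 0.548 | 0.548 | 0.548 | 0.548 | 0.548 | 0.552 | 0.552 | 0.548 | 0.548 |
| profb-NOFS | 0.573 | 0.573 | 0.562 | 0.576 | 0.562 | 0.571 | 0.573 | 0.573 | 0.514 | 0.51 | 0.572 | 0.573 |
| profb-Overall | 0.573 | 0.573 | 0.569 | 0.576 | 0.568 | 0.56 | 0.573 | 0.573 | 0.552 | 0.552 | 0.572 | 0.573 |
| schizo-FS | 0.653 | 0.653 | 0.653 | 0.653 | 0.645 | 0.65 | 0.648 | 0.648 | 0.658 | 0.658 | 0.648 | 0.648 |
| schizo-NOFS | 0.684 | 0.645 | 0.735 | 0.729 | 0.694 | 0.701 | 0.728 | 0.722 | 0.653 | 0.653 | 0.75 | 0.763 |
| schizo-Overall | 0.684 | 0.645 | 0.719 | 0.722 | 0.696 | 0.71 | 0.728 | 0.722 | 0.658 | 0.658 | 0.75 | 0.763 |
| sick-FS | 0.837 | 0.837 | 0.838 | 0.836 | 0.86 | 0.851 | 0.841 | 0.823 | 0.843 | 0.843 | 0.849 | 0.849 |
| sick-NOFS | 0.839 | 0.839 | 0.847 | 0.849 | 0.842 | 0.851 | 0.855 | 0.851 | 0.843 | 0.839 | 0.845 | 0.848 |
| sick-Overall | 0.839 | 0.839 | 0.837 | 0.854 | 0.842 | 0.849 | 0.855 | 0.851 | 0.843 | 0.839 | 0.845 | 0.848 |
| soybean-FS | 0.807 | 0.839 | 0.889 | 0.901 | 0.918 | 0.833 | 0.843 | 0.843 | 0.833 | 0.833 | 0.897 | 0.905 |
| soybean-NOFS | 0.913 | 0.898 | 0.884 | 0.867 | 0.894 | 0.882 | 0.875 | 0.897 | 0.911 | 0.871 | 0.882 | 0.899 |
| soybean-Overall | 0.807 | 0.898 | 0.86 | 0.889 | 0.887 | 0.884 | 0.875 | 0.897 | 0.911 | 0.871 | 0.882 | 0.899 |
| stress-FS | 0.792 | 0.792 | 0.844 | 0.844 | 0.844 | 0.844 | 0.844 | 0.844 | 0.85 | 0.85 | 0.844 | 0.844 |
| stress-NOFS | 0.75 | 0.75 | 0.769 | 0.756 | 0.756 | 0.75 | 0.756 | 0.75 | 0.878 | 0.837 | 0.75 | 0.75 |
| stress-Overall | 0.792 | 0.792 | 0.844 | 0.844 | 0.844 | 0.844 | 0.844 | 0.844 | 0.85 | 0.85 | 0.75 | 0.844 |
| vote-FS | 0.948 | 0.948 | 0.948 | 0.948 | 0.948 | 0.949 | 0.948 | 0.948 | 0.949 | 0.948 | 0.929 | 0.948 |
| vote-NOFS | 0.943 | 0.954 | 0.948 | 0.949 | 0.949 | 0.953 | 0.959 | 0.955 | 0.959 | 0.954 | 0.948 | 0.959 |
| vote-Overall | 0.943 | 0.954 | 0.943 | 0.955 | 0.955 | 0.965 | 0.948 | 0.948 | 0.949 | 0.948 | 0.948 | 0.959 |
| water-treatment-FS | 0.8 | 0.987 | 0.975 | 0.975 | 0.821 | 0.987 | 0.732 | 0.732 | 0.263 | 0.263 | 0.961 | 0.94 |
| water-treatment-NOFS | 0.987 | 0.987 | 0.987 | 0.975 | 0.987 | 0.987 | 0.785 | 0.95 | 0.571 | 0.5 | 0.951 | 0.918 |
| water-treatment-Overall | 0.987 | 0.987 | 0.975 | 0.975 | 0.987 | 0.987 | 0.785 | 0.95 | 0.571 | 0.5 | 0.951 | 0.918 |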
  • -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

Table 15. Real-world Results for the F1 Metric


Table 16. Real-world Results for the ACC Metric
Dataset  MM  BI+MM  MF  BI+MF  GAIN  BI+GAIN  SOFT  BI+SOFT  PPCA  BI+PPCA  DAE  BI+DAE
analcatdata_reviewer-FS  0.589  0.589  0.568  0.611  0.6  0.6  0.6  0.6  0.568  0.568  0.589  0.589
analcatdata_reviewer-NOFS  0.632  0.647  0.621  0.637  0.626  0.6  0.6  0.647  0.595  0.637  0.632  0.647
analcatdata_reviewer-Overall  0.632  0.647  0.621  0.642  0.6  0.637  0.6  0.647  0.595  0.637  0.632  0.647
anneal-FS  0.849  0.982  0.795  0.984  0.884  0.987  0.955  0.98  0.971  0.964  0.947  0.984
anneal-NOFS  0.844  0.984  0.895  0.98  0.906  0.964  0.94  0.984  0.978  0.989  0.962  0.973
anneal-Overall  0.849  0.984  0.895  0.98  0.88  0.984  0.94  0.984  0.978  0.989  0.947  0.973
audiology-FS  0.965  0.965  0.965  0.965  0.965  0.956  0.965  0.965  0.965  0.965  0.965  0.965
audiology-NOFS  0.982  0.982  0.965  0.973  0.965  0.965  0.965  0.982  0.973  0.956  0.965  0.982
audiology-Overall  0.982  0.982  0.965  0.947  0.965  0.965  0.965  0.965  0.973  0.956  0.965  0.965
autoHorse-FS  0.971  0.971  0.961  0.961  0.961  0.961  0.961  0.961  0.874  0.981  0.961  0.961
autoHorse-NOFS  0.981  0.981  0.971  0.981  0.971  0.981  0.971  0.971  0.971  0.971  0.971  0.971
autoHorse-Overall  0.981  0.981  0.961  0.961  0.971  0.981  0.971  0.971  0.971  0.971  0.961  0.961
braziltourism-FS  0.816  0.816  0.806  0.806  0.816  0.811  0.806  0.806  0.82  0.82  0.816  0.816
braziltourism-NOFS  0.777  0.791  0.801  0.791  0.801  0.791  0.796  0.796  0.796  0.791  0.796  0.786
braziltourism-Overall  0.777  0.791  0.806  0.806  0.806  0.806  0.796  0.806  0.796  0.791  0.796  0.786
bridges-FS  0.796  0.796  0.833  0.815  0.87  0.833  0.87  0.87  0.833  0.833  0.852  0.815
bridges-NOFS  0.833  0.833  0.87  0.833  0.815  0.852  0.833  0.815  0.833  0.87  0.833  0.833
bridges-Overall  0.833  0.833  0.852  0.815  0.833  0.833  0.833  0.815  0.833  0.87  0.833  0.833
cjs-FS  1.0  1.0  1.0  1.0  0.999  0.999  1.0  1.0  0.967  0.976  0.994  1.0
cjs-NOFS  1.0  1.0  0.992  0.991  0.987  0.988  0.987  0.979  0.967  0.973  0.989  0.991
cjs-Overall  1.0  1.0  1.0  1.0  0.999  0.999  1.0  1.0  0.967  0.976  0.994  1.0
colic-FS  0.875  0.875  0.853  0.859  0.848  0.864  0.837  0.837  0.837  0.837  0.864  0.864
colic-NOFS  0.848  0.853  0.837  0.859  0.87  0.853  0.859  0.88  0.837  0.853  0.837  0.837
colic-Overall  0.875  0.875  0.864  0.87  0.87  0.864  0.837  0.88  0.837  0.853  0.864  0.864
colleges_aaup-FS  0.991  0.983  0.99  0.983  0.983  0.983  0.99  0.99  0.979  0.979  0.991  0.99
colleges_aaup-NOFS  0.981  0.981  0.979  0.981  0.979  0.976  0.981  0.979  0.985  0.985  0.985  0.976
colleges_aaup-Overall  0.991  0.983  0.99  0.983  0.983  0.983  0.99  0.99  0.985  0.985  0.991  0.99
cylinder-bands-FS  0.752  0.741  0.748  0.741  0.756  0.741  0.756  0.756  0.73  0.73  0.785  0.778
cylinder-bands-NOFS  0.781  0.77  0.793  0.796  0.781  0.774  0.767  0.763  0.759  0.781  0.807  0.807
cylinder-bands-Overall  0.781  0.77  0.804  0.8  0.774  0.767  0.767  0.763  0.759  0.781  0.807  0.807
dresses-sales-FS  0.632  0.632  0.632  0.632  0.632  0.632  0.632  0.632  0.58  0.596  0.632  0.632
dresses-sales-NOFS  0.644  0.648  0.632  0.624  0.628  0.608  0.644  0.648  0.624  0.612  0.644  0.648
dresses-sales-Overall  0.644  0.632  0.632  0.632  0.632  0.632  0.632  0.632  0.624  0.612  0.644  0.632
eucalyptus-FS  0.791  0.791  0.774  0.774  0.772  0.783  0.783  0.769  0.712  0.761  0.75  0.774
eucalyptus-NOFS  0.769  0.769  0.769  0.769  0.769  0.769  0.758  0.785  0.791  0.791  0.785  0.788
eucalyptus-Overall  0.791  0.791  0.769  0.769  0.777  0.78  0.783  0.769  0.791  0.791  0.785  0.788
hepatitis-FS  0.846  0.846  0.833  0.885  0.885  0.859  0.846  0.846  0.821  0.821  0.885  0.885
hepatitis-NOFS  0.846  0.846  0.859  0.885  0.846  0.833  0.859  0.872  0.885  0.872  0.846  0.859
hepatitis-Overall  0.846  0.846  0.885  0.859  0.859  0.859  0.859  0.872  0.821  0.821  0.846  0.859
hungarian-FS  0.844  0.844  0.83  0.837  0.837  0.823  0.85  0.85  0.844  0.844  0.83  0.83
hungarian-NOFS  0.857  0.857  0.857  0.857  0.857  0.864  0.857  0.864  0.844  0.85  0.85  0.857
hungarian-Overall  0.857  0.857  0.837  0.837  0.857  0.857  0.85  0.85  0.844  0.844  0.85  0.857
kdd_el_nino-small-FS  0.928  0.928  0.934  0.939  0.944  0.936  0.944  0.931  0.88  0.849  0.944  0.954
kdd_el_nino-small-NOFS  0.946  0.949  0.944  0.939  0.946  0.949  0.931  0.946  0.926  0.941  0.944  0.944
kdd_el_nino-small-Overall  0.946  0.949  0.936  0.941  0.944  0.941  0.931  0.946  0.926  0.941  0.944  0.944
mushroom-FS  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  0.996  0.998  1.0  1.0
mushroom-NOFS  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  0.998  1.0  1.0
mushroom-Overall  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  0.996  0.998  1.0  1.0
pbcseq-FS  0.773  0.773  0.779  0.783  0.764  0.78  0.758  0.757  0.782  0.782  0.772  0.769
pbcseq-NOFS  0.776  0.772  0.777  0.776  0.782  0.777  0.776  0.776  0.766  0.764  0.782  0.775
pbcseq-Overall  0.776  0.772  0.781  0.779  0.781  0.781  0.776  0.776  0.766  0.764  0.782  0.775
primary-tumor-FS  0.865  0.871  0.835  0.841  0.853  0.871  0.841  0.876  0.841  0.859  0.812  0.871
primary-tumor-NOFS  0.859  0.876  0.847  0.853  0.871  0.841  0.865  0.871  0.865  0.853  0.859  0.876
primary-tumor-Overall  0.859  0.876  0.847  0.841  0.871  0.859  0.865  0.871  0.865  0.853  0.859  0.876
profb-FS  0.673  0.67  0.67  0.673  0.673  0.673  0.673  0.673  0.676  0.676  0.673  0.673
profb-NOFS  0.705  0.705  0.708  0.702  0.711  0.711  0.705  0.705  0.667  0.67  0.705  0.705
profb-Overall  0.705  0.705  0.705  0.708  0.705  0.702  0.705  0.705  0.676  0.676  0.705  0.705
schizo-FS  0.535  0.535  0.594  0.594  0.576  0.547  0.588  0.588  0.553  0.559  0.659  0.659
schizo-NOFS  0.718  0.706  0.741  0.776  0.735  0.759  0.724  0.741  0.606  0.594  0.776  0.788
schizo-Overall  0.718  0.706  0.753  0.753  0.753  0.765  0.724  0.741  0.553  0.559  0.776  0.788
sick-FS  0.979  0.979  0.981  0.981  0.983  0.981  0.979  0.977  0.98  0.98  0.981  0.981
sick-NOFS  0.979  0.979  0.982  0.981  0.98  0.98  0.981  0.981  0.98  0.98  0.98  0.981
sick-Overall  0.979  0.979  0.98  0.98  0.98  0.98  0.981  0.981  0.98  0.98  0.98  0.981
soybean-FS  0.944  0.956  0.971  0.974  0.98  0.956  0.962  0.962  0.959  0.959  0.974  0.974
soybean-NOFS  0.977  0.971  0.968  0.965  0.971  0.968  0.965  0.974  0.977  0.968  0.968  0.974
soybean-Overall  0.944  0.971  0.962  0.971  0.968  0.971  0.965  0.974  0.977  0.968  0.968  0.974
stress-FS  0.91  0.91  0.93  0.93  0.93  0.93  0.93  0.93  0.94  0.94  0.93  0.93
stress-NOFS  0.9  0.9  0.91  0.9  0.9  0.9  0.89  0.9  0.95  0.93  0.9  0.9
stress-Overall  0.91  0.91  0.93  0.93  0.93  0.93  0.93  0.93  0.94  0.94  0.9  0.93
vote-FS  0.959  0.959  0.959  0.959  0.959  0.959  0.959  0.959  0.959  0.959  0.945  0.959
vote-NOFS  0.954  0.963  0.959  0.959  0.959  0.963  0.968  0.963  0.968  0.963  0.959  0.968
vote-Overall  0.954  0.963  0.954  0.963  0.963  0.972  0.959  0.959  0.959  0.959  0.959  0.968
water-treatment-FS  0.943  0.996  0.992  0.992  0.947  0.996  0.92  0.92  0.848  0.848  0.989  0.981
water-treatment-NOFS  0.996  0.996  0.996  0.992  0.996  0.996  0.936  0.985  0.879  0.864  0.985  0.973
water-treatment-Overall  0.996  0.996  0.992  0.992  0.996  0.996  0.936  0.985  0.879  0.864  0.985  0.973
  • -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

C.5 MCAR: Downstream Task Results

This section presents the downstream-task results for MCAR data at varying levels of missingness and for multiple metrics. Tables 17, 18, and 19 present the AUC results at 10%, 25%, and 50% missingness. Tables 20, 21, and 22 present the F1-score results, and Tables 23, 24, and 25 present the classification accuracy (ACC).
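As a reminder of what the three missingness levels mean, the following is a minimal sketch of an MCAR mask generator, not the code used in the experiments. It assumes that each cell is deleted independently with probability equal to the target rate; the function names `mcar_mask` and `apply_mask` and the toy matrix are hypothetical.

```python
import numpy as np

def mcar_mask(shape, rate, seed=0):
    """Return a boolean mask (True = delete the value).

    Assumption: every cell is masked independently with probability `rate`,
    so missingness depends neither on observed nor on unobserved values.
    """
    rng = np.random.default_rng(seed)
    return rng.random(shape) < rate

def apply_mask(X, mask):
    """Return a float copy of X with the masked cells set to NaN."""
    X_missing = np.array(X, dtype=float, copy=True)
    X_missing[mask] = np.nan
    return X_missing

# Toy complete matrix (hypothetical data): 1000 samples, 4 features.
X = np.arange(4000, dtype=float).reshape(1000, 4)
for rate in (0.10, 0.25, 0.50):
    X_missing = apply_mask(X, mcar_mask(X.shape, rate, seed=42))
    print(rate, round(float(np.isnan(X_missing).mean()), 3))
```

The realized fraction of missing cells fluctuates around the target rate because the cells are sampled independently; a mask with an exact count could be drawn instead by permuting cell indices.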

Table 17. MCAR Results at 10% Missingness for the AUC Metric
Dataset  MM  BI+MM  MF  BI+MF  GAIN  BI+GAIN  SOFT  BI+SOFT  PPCA  BI+PPCA  DAE  BI+DAE
Australian-FS  0.902  0.905  0.922  0.914  0.913  0.919  0.885  0.915  0.89  0.897  0.906  0.906
Australian-NOFS  0.916  0.911  0.917  0.915  0.911  0.904  0.886  0.896  0.892  0.896  0.915  0.91
Australian-Overall  0.902  0.911  0.922  0.914  0.909  0.908  0.885  0.896  0.892  0.896  0.906  0.906
boston-FS  0.933  0.933  0.926  0.926  0.924  0.924  0.912  0.917  0.89  0.89  0.926  0.926
boston-NOFS  0.931  0.928  0.94  0.935  0.927  0.931  0.933  0.908  0.913  0.921  0.928  0.926
boston-Overall  0.933  0.933  0.938  0.941  0.922  0.921  0.912  0.908  0.913  0.921  0.926  0.926
churn-FS  0.914  0.914  0.92  0.916  0.907  0.916  0.887  0.883  0.884  0.885  0.906  0.906
churn-NOFS  0.909  0.912  0.914  0.916  0.91  0.916  0.904  0.91  0.886  0.886  0.906  0.909
churn-Overall  0.909  0.912  0.917  0.916  0.908  0.919  0.904  0.91  0.884  0.886  0.906  0.909
compas-two-years-FS  0.704  0.704  0.698  0.692  0.699  0.705  0.693  0.693  0.688  0.688  0.697  0.697
compas-two-years-NOFS  0.702  0.698  0.703  0.701  0.701  0.698  0.702  0.702  0.703  0.699  0.692  0.692
compas-two-years-Overall  0.702  0.704  0.703  0.696  0.7  0.699  0.693  0.702  0.688  0.688  0.692  0.697
image-FS  0.87  0.845  0.883  0.875  0.85  0.85  0.845  0.849  0.877  0.878
image-NOFS  0.885  0.884  0.881  0.878  0.877  0.877  0.863  0.859  0.892  0.887
image-Overall  0.885  0.884  0.882  0.883  0.877  0.877  0.863  0.859  0.892  0.887
page-blocks-FS  0.99  0.989  0.99  0.989  0.988  0.99  0.983  0.985  0.969  0.969  0.988  0.988
page-blocks-NOFS  0.988  0.988  0.99  0.99  0.989  0.988  0.983  0.982  0.973  0.974  0.987  0.987
page-blocks-Overall  0.988  0.989  0.99  0.99  0.987  0.989  0.983  0.982  0.973  0.974  0.988  0.987
parkinsons-FS  0.85  0.846  0.871  0.871  0.849  0.849  0.85  0.845  0.848  0.861  0.868  0.866
parkinsons-NOFS  0.896  0.896  0.935  0.923  0.896  0.895  0.898  0.892  0.899  0.907  0.919  0.914
parkinsons-Overall  0.896  0.896  0.932  0.925  0.907  0.894  0.898  0.892  0.899  0.907  0.919  0.914
segment-FS  0.999  0.999  1.0  1.0  0.999  0.999  0.996  0.997  0.978  0.977  0.999  0.999
segment-NOFS  0.999  1.0  1.0  1.0  0.999  1.0  0.998  0.999  0.985  0.986  1.0  0.999
segment-Overall  0.999  1.0  1.0  1.0  1.0  1.0  0.998  0.999  0.985  0.986  1.0  0.999
stock-FS  0.99  0.99  0.993  0.993  0.984  0.987  0.975  0.975  0.954  0.954  0.99  0.99
stock-NOFS  0.989  0.99  0.992  0.993  0.99  0.989  0.979  0.98  0.969  0.97  0.991  0.991
stock-Overall  0.99  0.99  0.993  0.993  0.986  0.986  0.979  0.98  0.969  0.97  0.99  0.99
zoo-FS  0.993  0.993  0.895  0.895  0.929  0.929  0.986  0.986  0.898  0.895  0.995  0.995
zoo-NOFS  0.979  0.979  0.992  0.998  1.0  1.0  0.973  0.989  0.897  0.994  0.903  1.0
zoo-Overall  0.979  0.993  0.994  1.0  0.929  0.989  0.986  0.989  0.897  0.994  0.903  1.0
  • -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

Table 18. MCAR Results at 25% Missingness for the AUC Metric
Dataset  MM  BI+MM  MF  BI+MF  GAIN  BI+GAIN  SOFT  BI+SOFT  PPCA  BI+PPCA  DAE  BI+DAE
Australian-FS  0.876  0.877  0.896  0.882  0.872  0.88  0.845  0.864  0.831  0.832  0.887  0.887
Australian-NOFS  0.879  0.884  0.898  0.879  0.881  0.895  0.865  0.875  0.842  0.824  0.882  0.881
Australian-Overall  0.876  0.877  0.891  0.883  0.893  0.884  0.865  0.864  0.842  0.832  0.887  0.887
boston-FS  0.925  0.917  0.916  0.913  0.912  0.917  0.878  0.878  0.911  0.911  0.907  0.907
boston-NOFS  0.919  0.906  0.927  0.917  0.916  0.907  0.888  0.887  0.919  0.909  0.906  0.903
boston-Overall  0.925  0.917  0.917  0.914  0.926  0.923  0.878  0.878  0.919  0.909  0.907  0.907
churn-FS  0.866  0.87  0.874  0.871  0.866  0.874  0.834  0.819  0.85  0.849  0.862  0.86
churn-NOFS  0.864  0.87  0.867  0.865  0.868  0.869  0.847  0.851  0.856  0.859  0.866  0.866
churn-Overall  0.866  0.87  0.872  0.876  0.862  0.867  0.847  0.851  0.856  0.859  0.866  0.866
compas-two-years-FS  0.683  0.676  0.69  0.691  0.687  0.656  0.642  0.645  0.678  0.678  0.67  0.675
compas-two-years-NOFS  0.685  0.685  0.69  0.684  0.68  0.673  0.664  0.663  0.682  0.681  0.677  0.674
compas-two-years-Overall  0.685  0.685  0.69  0.69  0.674  0.678  0.664  0.663  0.682  0.681  0.67  0.675
image-FS  0.801  0.801  0.839  0.845  0.817  0.817  0.851  0.845  0.878  0.878
image-NOFS  0.868  0.866  0.862  0.864  0.837  0.844  0.864  0.856  0.88  0.882
image-Overall  0.868  0.866  0.862  0.866  0.837  0.844  0.864  0.856  0.88  0.882
page-blocks-FS  0.983  0.982  0.986  0.985  0.982  0.981  0.966  0.964  0.912  0.912  0.983  0.979
page-blocks-NOFS  0.982  0.983  0.983  0.985  0.983  0.983  0.966  0.968  0.961  0.966  0.984  0.982
page-blocks-Overall  0.983  0.982  0.983  0.986  0.983  0.983  0.966  0.968  0.961  0.966  0.984  0.982
parkinsons-FS  0.793  0.832  0.847  0.85  0.806  0.838  0.78  0.808  0.833  0.833  0.849  0.933
parkinsons-NOFS  0.893  0.904  0.915  0.919  0.891  0.899  0.86  0.873  0.837  0.887  0.918  0.93
parkinsons-Overall  0.893  0.904  0.919  0.914  0.901  0.882  0.86  0.873  0.833  0.833  0.918  0.933
segment-FS  0.999  0.999  0.999  1.0  0.999  0.999  0.978  0.985  0.94  0.941  0.998  0.998
segment-NOFS  0.999  0.999  1.0  1.0  0.999  0.998  0.994  0.99  0.962  0.967  0.998  0.999
segment-Overall  0.999  0.999  1.0  1.0  0.999  0.999  0.994  0.99  0.962  0.967  0.998  0.999
stock-FS  0.976  0.976  0.988  0.99  0.979  0.98  0.937  0.937  0.93  0.93  0.983  0.981
stock-NOFS  0.983  0.981  0.99  0.992  0.981  0.982  0.958  0.959  0.938  0.939  0.983  0.981
stock-Overall  0.983  0.981  0.991  0.99  0.977  0.98  0.958  0.959  0.938  0.939  0.983  0.981
zoo-FS  0.973  0.974  1.0  1.0  0.973  0.997  0.998  0.998  1.0  1.0  0.99  0.994
zoo-NOFS  0.998  1.0  1.0  1.0  0.99  1.0  1.0  0.989  0.938  1.0  1.0  0.995
zoo-Overall  0.973  1.0  1.0  1.0  0.998  1.0  0.998  0.989  1.0  1.0  1.0  0.995
  • -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

Table 19. MCAR Results at 50% Missingness for the AUC Metric
Dataset  MM  BI+MM  MF  BI+MF  GAIN  BI+GAIN  SOFT  BI+SOFT  PPCA  BI+PPCA  DAE  BI+DAE
Australian-FS  0.864  0.864  0.845  0.855  0.846  0.856  0.806  0.825  0.856  0.827  0.865  0.868
Australian-NOFS  0.843  0.859  0.851  0.858  0.852  0.836  0.792  0.821  0.853  0.838  0.86  0.866
Australian-Overall  0.843  0.859  0.852  0.852  0.854  0.865  0.806  0.825  0.853  0.838  0.865  0.868
boston-FS  0.863  0.863  0.893  0.868  0.839  0.845  0.754  0.754  0.862  0.836  0.854  0.851
boston-NOFS  0.853  0.831  0.885  0.884  0.848  0.851  0.765  0.761  0.86  0.808  0.839  0.842
boston-Overall  0.853  0.863  0.884  0.881  0.853  0.844  0.754  0.761  0.86  0.836  0.854  0.842
churn-FS  0.785  0.785  0.777  0.787  0.76  0.79  0.733  0.739  0.768  0.762  0.777  0.777
churn-NOFS  0.789  0.79  0.785  0.789  0.787  0.787  0.772  0.781  0.776  0.784  0.784  0.785
churn-Overall  0.789  0.79  0.79  0.789  0.789  0.799  0.772  0.781  0.776  0.762  0.784  0.785
compas-two-years-FS  0.626  0.626  0.647  0.639  0.635  0.642  0.594  0.604  0.632  0.632  0.633  0.633
compas-two-years-NOFS  0.648  0.646  0.642  0.639  0.646  0.639  0.611  0.614  0.638  0.639  0.642  0.643
compas-two-years-Overall  0.626  0.646  0.651  0.647  0.649  0.646  0.594  0.614  0.632  0.639  0.633  0.643
image-FS  0.778  0.754  0.781  0.754  0.681  0.671  0.797  0.817  0.823  0.823
image-NOFS  0.826  0.82  0.841  0.824  0.735  0.75  0.845  0.838  0.843  0.847
image-Overall  0.826  0.82  0.82  0.833  0.735  0.75  0.845  0.838  0.843  0.847
page-blocks-FS  0.943  0.934  0.963  0.963  0.952  0.956  0.896  0.879  0.791  0.791  0.955  0.934
page-blocks-NOFS  0.96  0.956  0.969  0.967  0.96  0.958  0.899  0.91  0.906  0.917  0.957  0.958
page-blocks-Overall  0.96  0.956  0.967  0.968  0.959  0.955  0.899  0.91  0.906  0.917  0.957  0.958
parkinsons-FS  0.731  0.704  0.817  0.842  0.747  0.693  0.7  0.652  0.812  0.795  0.816  0.809
parkinsons-NOFS  0.842  0.819  0.839  0.869  0.84  0.843  0.753  0.677  0.821  0.814  0.847  0.822
parkinsons-Overall  0.842  0.819  0.82  0.84  0.837  0.803  0.753  0.652  0.812  0.795  0.847  0.809
segment-FS  0.99  0.992  0.997  0.986  0.986  0.992  0.864  0.909  0.852  0.86  0.992  0.991
segment-NOFS  0.995  0.995  0.995  0.998  0.992  0.994  0.923  0.93  0.912  0.914  0.993  0.992
segment-Overall  0.995  0.995  0.997  0.997  0.994  0.991  0.923  0.93  0.912  0.914  0.993  0.992
stock-FS  0.903  0.903  0.972  0.949  0.949  0.918  0.784  0.784  0.8  0.8  0.935  0.923
stock-NOFS  0.948  0.944  0.965  0.971  0.92  0.938  0.809  0.822  0.858  0.842  0.935  0.95
stock-Overall  0.948  0.944  0.965  0.971  0.933  0.932  0.809  0.822  0.858  0.842  0.935  0.95
zoo-FS  0.965  0.918  0.925  0.81  0.86  0.901  0.921  0.821  0.833  0.818  0.894  0.699
zoo-NOFS  0.97  0.848  0.938  0.946  0.902  0.902  0.841  0.844  0.821  0.819  0.884  0.829
zoo-Overall  0.97  0.848  0.93  0.916  0.854  0.925  0.841  0.844  0.833  0.818  0.884  0.829
  • -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

Table 20. MCAR Results at 10% Missingness for the F1 Metric
Dataset  MM  BI+MM  MF  BI+MF  GAIN  BI+GAIN  SOFT  BI+SOFT  PPCA  BI+PPCA  DAE  BI+DAE
Australian-FS  0.902  0.905  0.922  0.914  0.913  0.919  0.885  0.915  0.89  0.897  0.906  0.906
Australian-NOFS  0.916  0.911  0.917  0.915  0.911  0.904  0.886  0.896  0.892  0.896  0.915  0.91
Australian-Overall  0.902  0.911  0.922  0.914  0.909  0.908  0.885  0.896  0.892  0.896  0.906  0.906
boston-FS  0.933  0.933  0.926  0.926  0.924  0.924  0.912  0.917  0.89  0.89  0.926  0.926
boston-NOFS  0.931  0.928  0.94  0.935  0.927  0.931  0.933  0.908  0.913  0.921  0.928  0.926
boston-Overall  0.933  0.933  0.938  0.941  0.922  0.921  0.912  0.908  0.913  0.921  0.926  0.926
churn-FS  0.914  0.914  0.92  0.916  0.907  0.916  0.887  0.883  0.884  0.885  0.906  0.906
churn-NOFS  0.909  0.912  0.914  0.916  0.91  0.916  0.904  0.91  0.886  0.886  0.906  0.909
churn-Overall  0.909  0.912  0.917  0.916  0.908  0.919  0.904  0.91  0.884  0.886  0.906  0.909
compas-two-years-FS  0.704  0.704  0.698  0.692  0.699  0.705  0.693  0.693  0.688  0.688  0.697  0.697
compas-two-years-NOFS  0.702  0.698  0.703  0.701  0.701  0.698  0.702  0.702  0.703  0.699  0.692  0.692
compas-two-years-Overall  0.702  0.704  0.703  0.696  0.7  0.699  0.693  0.702  0.688  0.688  0.692  0.697
image-FS  0.87  0.845  0.883  0.875  0.85  0.85  0.845  0.849  0.877  0.878
image-NOFS  0.885  0.884  0.881  0.878  0.877  0.877  0.863  0.859  0.892  0.887
image-Overall  0.885  0.884  0.882  0.883  0.877  0.877  0.863  0.859  0.892  0.887
page-blocks-FS  0.99  0.989  0.99  0.989  0.988  0.99  0.983  0.985  0.969  0.969  0.988  0.988
page-blocks-NOFS  0.988  0.988  0.99  0.99  0.989  0.988  0.983  0.982  0.973  0.974  0.987  0.987
page-blocks-Overall  0.988  0.989  0.99  0.99  0.987  0.989  0.983  0.982  0.973  0.974  0.988  0.987
parkinsons-FS  0.85  0.846  0.871  0.871  0.849  0.849  0.85  0.845  0.848  0.861  0.868  0.866
parkinsons-NOFS  0.896  0.896  0.935  0.923  0.896  0.895  0.898  0.892  0.899  0.907  0.919  0.914
parkinsons-Overall  0.896  0.896  0.932  0.925  0.907  0.894  0.898  0.892  0.899  0.907  0.919  0.914
segment-FS  0.999  0.999  1.0  1.0  0.999  0.999  0.996  0.997  0.978  0.977  0.999  0.999
segment-NOFS  0.999  1.0  1.0  1.0  0.999  1.0  0.998  0.999  0.985  0.986  1.0  0.999
segment-Overall  0.999  1.0  1.0  1.0  1.0  1.0  0.998  0.999  0.985  0.986  1.0  0.999
stock-FS  0.99  0.99  0.993  0.993  0.984  0.987  0.975  0.975  0.954  0.954  0.99  0.99
stock-NOFS  0.989  0.99  0.992  0.993  0.99  0.989  0.979  0.98  0.969  0.97  0.991  0.991
stock-Overall  0.99  0.99  0.993  0.993  0.986  0.986  0.979  0.98  0.969  0.97  0.99  0.99
zoo-FS  0.993  0.993  0.895  0.895  0.929  0.929  0.986  0.986  0.898  0.895  0.995  0.995
zoo-NOFS  0.979  0.979  0.992  0.998  1.0  1.0  0.973  0.989  0.897  0.994  0.903  1.0
zoo-Overall  0.979  0.993  0.994  1.0  0.929  0.989  0.986  0.989  0.897  0.994  0.903  1.0
  • -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

Table 21. MCAR Results at 25% Missingness for the F1 Metric
Dataset  MM  BI+MM  MF  BI+MF  GAIN  BI+GAIN  SOFT  BI+SOFT  PPCA  BI+PPCA  DAE  BI+DAE
Australian-FS  0.876  0.877  0.896  0.882  0.872  0.88  0.845  0.864  0.831  0.832  0.887  0.887
Australian-NOFS  0.879  0.884  0.898  0.879  0.881  0.895  0.865  0.875  0.842  0.824  0.882  0.881
Australian-Overall  0.876  0.877  0.891  0.883  0.893  0.884  0.865  0.864  0.842  0.832  0.887  0.887
boston-FS  0.925  0.917  0.916  0.913  0.912  0.917  0.878  0.878  0.911  0.911  0.907  0.907
boston-NOFS  0.919  0.906  0.927  0.917  0.916  0.907  0.888  0.887  0.919  0.909  0.906  0.903
boston-Overall  0.925  0.917  0.917  0.914  0.926  0.923  0.878  0.878  0.919  0.909  0.907  0.907
churn-FS  0.866  0.87  0.874  0.871  0.866  0.874  0.834  0.819  0.85  0.849  0.862  0.86
churn-NOFS  0.864  0.87  0.867  0.865  0.868  0.869  0.847  0.851  0.856  0.859  0.866  0.866
churn-Overall  0.866  0.87  0.872  0.876  0.862  0.867  0.847  0.851  0.856  0.859  0.866  0.866
compas-two-years-FS  0.683  0.676  0.69  0.691  0.687  0.656  0.642  0.645  0.678  0.678  0.67  0.675
compas-two-years-NOFS  0.685  0.685  0.69  0.684  0.68  0.673  0.664  0.663  0.682  0.681  0.677  0.674
compas-two-years-Overall  0.685  0.685  0.69  0.69  0.674  0.678  0.664  0.663  0.682  0.681  0.67  0.675
image-FS  0.801  0.801  0.839  0.845  0.817  0.817  0.851  0.845  0.878  0.878
image-NOFS  0.868  0.866  0.862  0.864  0.837  0.844  0.864  0.856  0.88  0.882
image-Overall  0.868  0.866  0.862  0.866  0.837  0.844  0.864  0.856  0.88  0.882
page-blocks-FS  0.983  0.982  0.986  0.985  0.982  0.981  0.966  0.964  0.912  0.912  0.983  0.979
page-blocks-NOFS  0.982  0.983  0.983  0.985  0.983  0.983  0.966  0.968  0.961  0.966  0.984  0.982
page-blocks-Overall  0.983  0.982  0.983  0.986  0.983  0.983  0.966  0.968  0.961  0.966  0.984  0.982
parkinsons-FS  0.793  0.832  0.847  0.85  0.806  0.838  0.78  0.808  0.833  0.833  0.849  0.933
parkinsons-NOFS  0.893  0.904  0.915  0.919  0.891  0.899  0.86  0.873  0.837  0.887  0.918  0.93
parkinsons-Overall  0.893  0.904  0.919  0.914  0.901  0.882  0.86  0.873  0.833  0.833  0.918  0.933
segment-FS  0.999  0.999  0.999  1.0  0.999  0.999  0.978  0.985  0.94  0.941  0.998  0.998
segment-NOFS  0.999  0.999  1.0  1.0  0.999  0.998  0.994  0.99  0.962  0.967  0.998  0.999
segment-Overall  0.999  0.999  1.0  1.0  0.999  0.999  0.994  0.99  0.962  0.967  0.998  0.999
stock-FS  0.976  0.976  0.988  0.99  0.979  0.98  0.937  0.937  0.93  0.93  0.983  0.981
stock-NOFS  0.983  0.981  0.99  0.992  0.981  0.982  0.958  0.959  0.938  0.939  0.983  0.981
stock-Overall  0.983  0.981  0.991  0.99  0.977  0.98  0.958  0.959  0.938  0.939  0.983  0.981
zoo-FS  0.973  0.974  1.0  1.0  0.973  0.997  0.998  0.998  1.0  1.0  0.99  0.994
zoo-NOFS  0.998  1.0  1.0  1.0  0.99  1.0  1.0  0.989  0.938  1.0  1.0  0.995
zoo-Overall  0.973  1.0  1.0  1.0  0.998  1.0  0.998  0.989  1.0  1.0  1.0  0.995
  • -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

Table 22. MCAR Results at 50% Missingness for the F1 Metric
Dataset  MM  BI+MM  MF  BI+MF  GAIN  BI+GAIN  SOFT  BI+SOFT  PPCA  BI+PPCA  DAE  BI+DAE
Australian-FS  0.864  0.864  0.845  0.855  0.846  0.856  0.806  0.825  0.856  0.827  0.865  0.868
Australian-NOFS  0.843  0.859  0.851  0.858  0.852  0.836  0.792  0.821  0.853  0.838  0.86  0.866
Australian-Overall  0.843  0.859  0.852  0.852  0.854  0.865  0.806  0.825  0.853  0.838  0.865  0.868
boston-FS  0.863  0.863  0.893  0.868  0.839  0.845  0.754  0.754  0.862  0.836  0.854  0.851
boston-NOFS  0.853  0.831  0.885  0.884  0.848  0.851  0.765  0.761  0.86  0.808  0.839  0.842
boston-Overall  0.853  0.863  0.884  0.881  0.853  0.844  0.754  0.761  0.86  0.836  0.854  0.842
churn-FS  0.785  0.785  0.777  0.787  0.76  0.79  0.733  0.739  0.768  0.762  0.777  0.777
churn-NOFS  0.789  0.79  0.785  0.789  0.787  0.787  0.772  0.781  0.776  0.784  0.784  0.785
churn-Overall  0.789  0.79  0.79  0.789  0.789  0.799  0.772  0.781  0.776  0.762  0.784  0.785
compas-two-years-FS  0.626  0.626  0.647  0.639  0.635  0.642  0.594  0.604  0.632  0.632  0.633  0.633
compas-two-years-NOFS  0.648  0.646  0.642  0.639  0.646  0.639  0.611  0.614  0.638  0.639  0.642  0.643
compas-two-years-Overall  0.626  0.646  0.651  0.647  0.649  0.646  0.594  0.614  0.632  0.639  0.633  0.643
image-FS  0.778  0.754  0.781  0.754  0.681  0.671  0.797  0.817  0.823  0.823
image-NOFS  0.826  0.82  0.841  0.824  0.735  0.75  0.845  0.838  0.843  0.847
image-Overall  0.826  0.82  0.82  0.833  0.735  0.75  0.845  0.838  0.843  0.847
page-blocks-FS  0.943  0.934  0.963  0.963  0.952  0.956  0.896  0.879  0.791  0.791  0.955  0.934
page-blocks-NOFS  0.96  0.956  0.969  0.967  0.96  0.958  0.899  0.91  0.906  0.917  0.957  0.958
page-blocks-Overall  0.96  0.956  0.967  0.968  0.959  0.955  0.899  0.91  0.906  0.917  0.957  0.958
parkinsons-FS  0.731  0.704  0.817  0.842  0.747  0.693  0.7  0.652  0.812  0.795  0.816  0.809
parkinsons-NOFS  0.842  0.819  0.839  0.869  0.84  0.843  0.753  0.677  0.821  0.814  0.847  0.822
parkinsons-Overall  0.842  0.819  0.82  0.84  0.837  0.803  0.753  0.652  0.812  0.795  0.847  0.809
segment-FS  0.99  0.992  0.997  0.986  0.986  0.992  0.864  0.909  0.852  0.86  0.992  0.991
segment-NOFS  0.995  0.995  0.995  0.998  0.992  0.994  0.923  0.93  0.912  0.914  0.993  0.992
segment-Overall  0.995  0.995  0.997  0.997  0.994  0.991  0.923  0.93  0.912  0.914  0.993  0.992
stock-FS  0.903  0.903  0.972  0.949  0.949  0.918  0.784  0.784  0.8  0.8  0.935  0.923
stock-NOFS  0.948  0.944  0.965  0.971  0.92  0.938  0.809  0.822  0.858  0.842  0.935  0.95
stock-Overall  0.948  0.944  0.965  0.971  0.933  0.932  0.809  0.822  0.858  0.842  0.935  0.95
zoo-FS  0.965  0.918  0.925  0.81  0.86  0.901  0.921  0.821  0.833  0.818  0.894  0.699
zoo-NOFS  0.97  0.848  0.938  0.946  0.902  0.902  0.841  0.844  0.821  0.819  0.884  0.829
zoo-Overall  0.97  0.848  0.93  0.916  0.854  0.925  0.841  0.844  0.833  0.818  0.884  0.829
  • -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

Table 23. MCAR Results at 10% Missingness for the ACC Metric
Dataset  MM  BI+MM  MF  BI+MF  GAIN  BI+GAIN  SOFT  BI+SOFT  PPCA  BI+PPCA  DAE  BI+DAE
Australian  0.846  0.855  0.878  0.858  0.861  0.841  0.841  0.841  0.829  0.838  0.855  0.855
Australian-FS  0.846  0.849  0.875  0.867  0.864  0.864  0.841  0.861  0.835  0.843  0.855  0.855
Australian-NOFS  0.858  0.855  0.87  0.864  0.861  0.849  0.843  0.841  0.829  0.838  0.858  0.849
boston  0.881  0.881  0.893  0.897  0.874  0.862  0.858  0.862  0.874  0.881  0.881  0.874
boston-FS  0.881  0.881  0.874  0.874  0.881  0.881  0.858  0.862  0.862  0.862  0.881  0.874
boston-NOFS  0.881  0.881  0.885  0.893  0.877  0.874  0.881  0.862  0.874  0.881  0.889  0.874
churn  0.936  0.936  0.944  0.942  0.943  0.943  0.928  0.924  0.924  0.922  0.931  0.932
churn-FS  0.936  0.936  0.944  0.943  0.936  0.943  0.923  0.926  0.924  0.922  0.934  0.94
churn-NOFS  0.936  0.936  0.944  0.942  0.944  0.944  0.928  0.924  0.924  0.922  0.931  0.932
compas-two-years  0.665  0.663  0.658  0.655  0.658  0.653  0.648  0.662  0.65  0.65  0.655  0.653
compas-two-years-FS  0.663  0.663  0.656  0.655  0.653  0.662  0.648  0.648  0.65  0.65  0.653  0.653
compas-two-years-NOFS  0.665  0.659  0.657  0.654  0.663  0.651  0.662  0.662  0.665  0.657  0.655  0.651
image  0.863  0.869  0.865  0.866  0.862  0.866  0.858  0.861  0.873  0.867
image-FS  0.856  0.845  0.863  0.862  0.859  0.846  0.847  0.846  0.864  0.864
image-NOFS  0.863  0.869  0.859  0.872  0.862  0.866  0.858  0.861  0.873  0.867
page-blocks  0.972  0.972  0.974  0.973  0.97  0.972  0.966  0.967  0.961  0.961  0.973  0.973
page-blocks-FS  0.971  0.972  0.974  0.974  0.973  0.972  0.966  0.968  0.959  0.959  0.973  0.971
page-blocks-NOFS  0.972  0.972  0.974  0.972  0.97  0.971  0.966  0.967  0.961  0.961  0.973  0.973
parkinsons  0.867  0.867  0.908  0.918  0.867  0.857  0.878  0.888  0.898  0.888  0.898  0.888
parkinsons-FS  0.837  0.827  0.867  0.867  0.837  0.827  0.867  0.857  0.867  0.888  0.857  0.857
parkinsons-NOFS  0.867  0.867  0.929  0.898  0.867  0.857  0.878  0.888  0.898  0.888  0.898  0.888
segment  0.996  0.997  0.996  0.998  0.997  0.996  0.993  0.995  0.958  0.957  0.997  0.997
segment-FS  0.996  0.994  0.998  0.998  0.993  0.995  0.992  0.992  0.947  0.951  0.994  0.993
segment-NOFS  0.996  0.997  0.995  0.996  0.995  0.997  0.993  0.995  0.958  0.957  0.997  0.997
stock  0.956  0.956  0.96  0.96  0.937  0.943  0.922  0.924  0.901  0.905  0.956  0.96
stock-FS  0.956  0.956  0.96  0.956  0.933  0.947  0.924  0.924  0.872  0.872  0.956  0.96
stock-NOFS  0.947  0.945  0.956  0.958  0.943  0.937  0.922  0.924  0.901  0.905  0.954  0.954
zoo  0.961  0.961  0.961  1.0  0.941  0.941  0.941  0.941  0.902  0.961  0.863  1.0
zoo-FS  0.961  0.961  0.941  0.941  0.941  0.941  0.941  0.941  0.922  0.902  0.98  0.98
zoo-NOFS  0.961  0.922  0.961  0.98  1.0  1.0  0.98  0.941  0.902  0.961  0.863  1.0
  • -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

Table 24. MCAR Results at 25% Missingness for the ACC Metric
Dataset  MM  BI+MM  MF  BI+MF  GAIN  BI+GAIN  SOFT  BI+SOFT  PPCA  BI+PPCA  DAE  BI+DAE
Australian  0.8  0.806  0.823  0.838  0.814  0.809  0.803  0.791  0.771  0.771  0.814  0.814
Australian-FS  0.8  0.806  0.832  0.826  0.809  0.809  0.797  0.791  0.774  0.771  0.814  0.814
Australian-NOFS  0.803  0.806  0.835  0.806  0.817  0.832  0.803  0.812  0.771  0.762  0.809  0.806
boston  0.877  0.866  0.87  0.862  0.866  0.87  0.814  0.814  0.874  0.862  0.846  0.846
boston-FS  0.877  0.866  0.862  0.85  0.858  0.866  0.814  0.814  0.858  0.858  0.846  0.846
boston-NOFS  0.881  0.846  0.881  0.87  0.862  0.846  0.814  0.814  0.874  0.862  0.862  0.85
churn  0.919  0.918  0.922  0.931  0.922  0.919  0.904  0.894  0.914  0.916  0.924  0.921
churn-FS  0.919  0.923  0.922  0.924  0.925  0.924  0.895  0.891  0.907  0.907  0.922  0.923
churn-NOFS  0.924  0.918  0.931  0.925  0.925  0.916  0.904  0.894  0.914  0.916  0.924  0.921
compas-two-years  0.644  0.642  0.651  0.651  0.636  0.641  0.623  0.629  0.638  0.639  0.629  0.638
compas-two-years-FS  0.639  0.637  0.654  0.653  0.645  0.619  0.61  0.609  0.635  0.635  0.629  0.638
compas-two-years-NOFS  0.644  0.642  0.653  0.642  0.643  0.637  0.623  0.629  0.638  0.639  0.638  0.631
image  0.853  0.853  0.853  0.862  0.844  0.846  0.861  0.86  0.866  0.862
image-FS  0.815  0.815  0.849  0.849  0.845  0.845  0.85  0.852  0.862  0.862
image-NOFS  0.853  0.853  0.862  0.86  0.844  0.846  0.861  0.86  0.866  0.862
page-blocks  0.966  0.965  0.965  0.968  0.966  0.965  0.947  0.948  0.95  0.95  0.963  0.963
page-blocks-FS  0.966  0.965  0.967  0.967  0.961  0.966  0.948  0.947  0.936  0.936  0.963  0.961
page-blocks-NOFS  0.965  0.963  0.967  0.967  0.967  0.965  0.947  0.948  0.95  0.95  0.963  0.963
parkinsons  0.878  0.878  0.898  0.908  0.878  0.857  0.898  0.867  0.837  0.837  0.867  0.888
parkinsons-FS  0.827  0.837  0.878  0.878  0.806  0.857  0.857  0.827  0.837  0.837  0.857  0.888
parkinsons-NOFS  0.878  0.878  0.908  0.898  0.888  0.888  0.898  0.867  0.867  0.857  0.867  0.898
segment  0.988  0.991  0.997  0.997  0.992  0.993  0.976  0.973  0.922  0.927  0.988  0.99
segment-FS  0.988  0.991  0.997  0.997  0.99  0.989  0.958  0.965  0.897  0.896  0.985  0.988
segment-NOFS  0.989  0.99  0.997  0.997  0.99  0.99  0.976  0.973  0.922  0.927  0.988  0.99
stock  0.931  0.933  0.949  0.947  0.924  0.926  0.893  0.893  0.865  0.865  0.931  0.924
stock-FS  0.924  0.924  0.952  0.952  0.918  0.924  0.859  0.859  0.853  0.853  0.931  0.933
stock-NOFS  0.931  0.933  0.947  0.952  0.92  0.931  0.893  0.893  0.865  0.865  0.931  0.924
zoo  0.961  1.0  1.0  1.0  0.98  1.0  0.98  0.98  1.0  1.0  1.0  0.98
zoo-FS  0.961  0.961  1.0  1.0  0.98  0.98  0.98  0.98  1.0  1.0  0.98  0.98
zoo-NOFS  0.98  1.0  1.0  1.0  0.98  1.0  1.0  0.98  0.902  1.0  1.0  0.98
  • -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

Table 25. MCAR Results at 50% Missingness for the ACC Metric
Dataset  MM  BI+MM  MF  BI+MF  GAIN  BI+GAIN  SOFT  BI+SOFT  PPCA  BI+PPCA  DAE  BI+DAE
Australian  0.803  0.806  0.826  0.817  0.78  0.812  0.73  0.777  0.8  0.768  0.8  0.817
Australian-FS  0.8  0.8  0.797  0.806  0.783  0.783  0.73  0.777  0.803  0.771  0.8  0.817
Australian-NOFS  0.803  0.806  0.809  0.814  0.768  0.78  0.736  0.774  0.8  0.768  0.803  0.809
boston  0.802  0.81  0.834  0.838  0.806  0.798  0.731  0.719  0.822  0.81  0.783  0.791
boston-FS  0.81  0.81  0.858  0.814  0.791  0.787  0.731  0.731  0.838  0.81  0.783  0.787
boston-NOFS  0.802  0.787  0.85  0.834  0.818  0.818  0.711  0.719  0.822  0.779  0.779  0.791
churn  0.897  0.887  0.898  0.895  0.892  0.889  0.872  0.872  0.888  0.89  0.893  0.887
churn-FS  0.892  0.892  0.899  0.894  0.887  0.89  0.863  0.866  0.89  0.89  0.89  0.89
churn-NOFS  0.897  0.887  0.904  0.895  0.893  0.892  0.872  0.872  0.888  0.885  0.893  0.887
compas-two-years  0.593  0.612  0.615  0.617  0.614  0.609  0.581  0.589  0.606  0.605  0.603  0.606
compas-two-years-FS  0.593  0.593  0.612  0.613  0.603  0.612  0.581  0.579  0.606  0.606  0.603  0.603
compas-two-years-NOFS  0.613  0.612  0.61  0.61  0.615  0.609  0.584  0.589  0.608  0.605  0.612  0.606
image  0.846  0.838  0.837  0.854  0.807  0.809  0.855  0.854  0.85  0.853
image-FS  0.817  0.812  0.826  0.812  0.807  0.804  0.828  0.848  0.844  0.844
image-NOFS  0.846  0.838  0.851  0.843  0.807  0.809  0.855  0.854  0.85  0.853
page-blocks  0.942  0.941  0.953  0.953  0.945  0.944  0.911  0.92  0.933  0.936  0.941  0.946
page-blocks-FS  0.94  0.945  0.955  0.954  0.948  0.941  0.91  0.912  0.923  0.923  0.942  0.942
page-blocks-NOFS  0.942  0.941  0.952  0.953  0.943  0.946  0.911  0.92  0.933  0.936  0.941  0.946
parkinsons  0.796  0.786  0.878  0.806  0.806  0.776  0.776  0.765  0.847  0.816  0.816  0.816
parkinsons-FS  0.786  0.776  0.857  0.806  0.837  0.776  0.786  0.765  0.847  0.816  0.827  0.816
parkinsons-NOFS  0.796  0.786  0.847  0.847  0.827  0.827  0.776  0.765  0.857  0.816  0.816  0.806
segment  0.975  0.974  0.985  0.984  0.976  0.971  0.912  0.919  0.872  0.878  0.967  0.966
segment-FS  0.972  0.97  0.987  0.983  0.965  0.969  0.89  0.902  0.864  0.858  0.964  0.965
segment-NOFS  0.975  0.974  0.986  0.987  0.97  0.973  0.912  0.919  0.872  0.878  0.967  0.966
stock  0.88  0.88  0.916  0.92  0.848  0.859  0.731  0.745  0.766  0.787  0.851  0.874
stock-FS  0.823  0.823  0.916  0.886  0.865  0.842  0.718  0.722  0.722  0.722  0.851  0.844
stock-NOFS  0.88  0.88  0.909  0.918  0.842  0.851  0.731  0.745  0.766  0.787  0.851  0.874
zoo  0.961  0.882  0.902  0.882  0.784  0.882  0.824  0.863  0.784  0.784  0.863  0.863
zoo-FS  0.941  0.882  0.922  0.824  0.843  0.863  0.902  0.765  0.784  0.784  0.882  0.725
zoo-NOFS  0.961  0.882  0.922  0.902  0.824  0.863  0.824  0.863  0.784  0.784  0.863  0.863
  • -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

C.6 MAR: Downstream Task Results

This section presents the downstream-task results for MAR data at varying levels of missingness and for multiple metrics. Tables 26, 27, and 28 present the AUC results at 10%, 25%, and 50% missingness. Tables 29, 30, and 31 present the F1-score results, and Tables 32, 33, and 34 present the classification accuracy (ACC).
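For contrast with MCAR, the following is a minimal sketch of one possible MAR mechanism. It is an illustrative assumption, not the procedure used to produce the tables: one fully observed "driver" column controls the probability that the other columns are deleted. The function name `mar_mask`, the rank-based probability, and the toy data are all hypothetical.

```python
import numpy as np

def mar_mask(X, driver_col, rate, seed=0):
    """Return a boolean mask (True = delete the value) under a simple MAR rule.

    Assumptions of this sketch:
    - column `driver_col` is never masked, so it stays fully observed;
    - in every other column, a row is masked with probability
      2 * rate * r, where r in [0, 1] is the scaled rank of the row's
      driver value; the mean of r is 0.5, so the expected masked
      fraction in those columns equals `rate`;
    - rate must be at most 0.5 so that the probability stays within [0, 1].
    """
    if not 0.0 <= rate <= 0.5:
        raise ValueError("rate must lie in [0, 0.5] for this sketch")
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Scaled rank of each row's driver value: 0 for the smallest, 1 for the largest.
    ranks = np.argsort(np.argsort(X[:, driver_col])) / (n - 1)
    prob = 2.0 * rate * ranks
    mask = rng.random((n, d)) < prob[:, None]
    mask[:, driver_col] = False
    return mask

# Toy complete matrix (hypothetical data): 2000 samples, 5 features.
X = np.random.default_rng(0).normal(size=(2000, 5))
mask = mar_mask(X, driver_col=0, rate=0.25, seed=1)
X_missing = X.copy()
X_missing[mask] = np.nan
print(round(float(mask[:, 1:].mean()), 3))
```

Because the deletion probability depends only on a column that remains observed, the mechanism is MAR rather than MCAR: rows with large driver values lose more entries than rows with small ones.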

Table 26. MAR Results at 10% Missingness for the AUC Metric
Dataset  MM  BI+MM  MF  BI+MF  GAIN  BI+GAIN  SOFT  BI+SOFT  PPCA  BI+PPCA  DAE  BI+DAE
Australian-FS  0.899  0.907  0.893  0.895  0.905  0.907  0.88  0.886  0.873  0.873  0.895  0.911
Australian-NOFS  0.921  0.912  0.91  0.915  0.901  0.918  0.912  0.904  0.886  0.89  0.919  0.899
Australian-Overall  0.899  0.907  0.907  0.915  0.918  0.912  0.912  0.904  0.886  0.89  0.895  0.911
boston-FS  0.94  0.94  0.926  0.926  0.928  0.918  0.918  0.918  0.926  0.926  0.933  0.94
boston-NOFS  0.936  0.932  0.947  0.943  0.938  0.939  0.936  0.929  0.93  0.923  0.94  0.932
boston-Overall  0.94  0.94  0.926  0.921  0.942  0.922  0.918  0.918  0.93  0.923  0.933  0.94
churn-FS  0.893  0.897  0.899  0.896  0.896  0.897  0.89  0.881  0.871  0.873  0.896  0.896
churn-NOFS  0.894  0.896  0.899  0.902  0.898  0.902  0.891  0.894  0.873  0.875  0.902  0.898
churn-Overall  0.893  0.896  0.899  0.897  0.898  0.9  0.891  0.894  0.873  0.875  0.896  0.896
compas-two-years-FS  0.704  0.704  0.694  0.71  0.707  0.702  0.687  0.689  0.705  0.702  0.706  0.693
compas-two-years-NOFS  0.707  0.705  0.711  0.706  0.703  0.7  0.699  0.703  0.703  0.697  0.709  0.702
compas-two-years-Overall  0.704  0.704  0.711  0.71  0.7  0.699  0.699  0.703  0.703  0.697  0.706  0.702
image-FS  0.855  0.858  0.868  0.857  0.873  0.873  0.82  0.82  0.872  0.868
image-NOFS  0.878  0.879  0.879  0.888  0.875  0.875  0.862  0.865  0.882  0.885
image-Overall  0.878  0.879  0.875  0.879  0.875  0.875  0.862  0.865  0.882  0.885
page-blocks-FS  0.987  0.988  0.989  0.988  0.988  0.988  0.985  0.986  0.976  0.954  0.987  0.987
page-blocks-NOFS  0.988  0.988  0.989  0.99  0.988  0.988  0.985  0.985  0.982  0.983  0.987  0.988
page-blocks-Overall  0.988  0.988  0.988  0.988  0.987  0.987  0.985  0.985  0.982  0.983  0.987  0.988
parkinsons-FS  0.873  0.873  0.87  0.871  0.875  0.876  0.874  0.874  0.889  0.801  0.87  0.87
parkinsons-NOFS  0.908  0.911  0.923  0.926  0.925  0.912  0.918  0.918  0.93  0.894  0.917  0.912
parkinsons-Overall  0.873  0.873  0.936  0.87  0.905  0.873  0.918  0.918  0.93  0.894  0.87  0.87
segment-FS  0.999  0.999  1.0  1.0  0.999  1.0  0.999  0.998  0.96  0.954  0.998  0.998
segment-NOFS  0.999  1.0  1.0  1.0  1.0  1.0  0.999  1.0  0.968  0.97  1.0  0.999
segment-Overall  0.999  1.0  1.0  1.0  0.999  1.0  0.999  1.0  0.968  0.97  0.998  0.998
stock-FS  0.993  0.985  0.995  0.995  0.991  0.992  0.991  0.987  0.965  0.965  0.991  0.991
stock-NOFS  0.993  0.994  0.995  0.995  0.993  0.995  0.991  0.989  0.98  0.981  0.994  0.994
stock-Overall  0.993  0.994  0.995  0.994  0.993  0.994  0.991  0.989  0.98  0.981  0.994  0.994
zoo-FS  1.0  1.0  1.0  1.0  0.824  1.0  1.0  1.0  1.0  1.0  0.987  0.993
zoo-NOFS  0.992  1.0  1.0  1.0  1.0  0.997  1.0  0.997  0.99  1.0  0.983  1.0
zoo-Overall  0.992  1.0  1.0  1.0  0.997  1.0  1.0  0.997  0.99  1.0  0.983  0.993
  • -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

Table 27. MAR Results at 25% Missingness for the AUC Metric

Dataset MM BI+MM MF BI+MF GAIN BI+GAIN SOFT BI+SOFT PPCA BI+PPCA DAE BI+DAE
Australian-FS 0.901 0.905 0.888 0.886 0.883 0.89 0.864 0.87 0.877 0.869 0.909 0.909
Australian-NOFS 0.903 0.898 0.887 0.888 0.887 0.896 0.884 0.888 0.87 0.875 0.908 0.902
Australian-Overall 0.903 0.905 0.886 0.884 0.88 0.898 0.864 0.87 0.877 0.869 0.909 0.909
boston-FS 0.907 0.907 0.903 0.901 0.904 0.908 0.841 0.844 0.887 0.887 0.891 0.891
boston-NOFS 0.894 0.891 0.903 0.905 0.9 0.905 0.858 0.852 0.895 0.887 0.901 0.89
boston-Overall 0.894 0.891 0.901 0.91 0.914 0.9 0.858 0.852 0.895 0.887 0.901 0.89
churn-FS 0.841 0.841 0.847 0.841 0.844 0.839 0.804 0.811 0.808 0.816 0.837 0.837
churn-NOFS 0.833 0.836 0.846 0.842 0.843 0.831 0.84 0.827 0.817 0.814 0.832 0.843
churn-Overall 0.833 0.836 0.845 0.846 0.847 0.841 0.84 0.827 0.808 0.814 0.837 0.843
compas-two-years-FS 0.701 0.697 0.688 0.701 0.688 0.696 0.685 0.7 0.685 0.681 0.682 0.697
compas-two-years-NOFS 0.701 0.695 0.693 0.695 0.69 0.69 0.673 0.692 0.692 0.697 0.694 0.691
compas-two-years-Overall 0.701 0.695 0.695 0.693 0.69 0.685 0.673 0.7 0.685 0.681 0.682 0.691
image-FS 0.806 0.806 0.798 0.84 0.835 0.839 0.826 0.823 0.871 0.871
image-NOFS 0.871 0.875 0.871 0.865 0.86 0.862 0.869 0.86 0.878 0.884
image-Overall 0.871 0.875 0.879 0.869 0.86 0.862 0.869 0.86 0.878 0.884
page-blocks-FS 0.976 0.979 0.982 0.984 0.977 0.978 0.95 0.951 0.932 0.94 0.979 0.98
page-blocks-NOFS 0.976 0.98 0.982 0.984 0.975 0.978 0.95 0.956 0.866 0.947 0.978 0.978
page-blocks-Overall 0.976 0.979 0.984 0.982 0.979 0.976 0.95 0.956 0.932 0.947 0.978 0.978
parkinsons-FS 0.889 0.876 0.872 0.87 0.889 0.864 0.803 0.803 0.844 0.829 0.849 0.881
parkinsons-NOFS 0.904 0.892 0.909 0.918 0.898 0.892 0.883 0.871 0.886 0.883 0.905 0.905
parkinsons-Overall 0.904 0.892 0.872 0.876 0.891 0.875 0.883 0.871 0.886 0.883 0.905 0.881
segment-FS 0.994 0.997 1.0 1.0 0.998 0.997 0.974 0.961 0.941 0.963 0.998 0.999
segment-NOFS 0.999 0.999 1.0 1.0 0.997 0.998 0.989 0.995 0.972 0.975 0.999 0.999
segment-Overall 0.999 0.999 1.0 1.0 0.999 0.998 0.989 0.995 0.972 0.975 0.999 0.999
stock-FS 0.979 0.968 0.98 0.98 0.975 0.977 0.882 0.908 0.886 0.903 0.974 0.974
stock-NOFS 0.979 0.978 0.981 0.986 0.973 0.977 0.887 0.933 0.931 0.942 0.976 0.975
stock-Overall 0.979 0.978 0.982 0.981 0.975 0.978 0.887 0.933 0.931 0.942 0.976 0.974
zoo-FS 0.991 0.984 1.0 0.99 0.986 0.967 0.973 0.989 0.96 0.99 0.923 0.925
zoo-NOFS 0.99 0.989 0.973 1.0 0.998 1.0 0.997 0.998 0.986 0.997 0.952 0.99
zoo-Overall 0.99 0.989 0.965 1.0 1.0 1.0 0.973 0.998 0.96 0.997 0.952 0.925
  • -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

Table 28. MAR Results at 50% Missingness for the AUC Metric

Dataset MM BI+MM MF BI+MF GAIN BI+GAIN SOFT BI+SOFT PPCA BI+PPCA DAE BI+DAE
Australian-FS 0.855 0.887 0.867 0.878 0.852 0.84 0.785 0.838 0.821 0.821 0.863 0.884
Australian-NOFS 0.862 0.873 0.879 0.858 0.851 0.861 0.772 0.821 0.822 0.829 0.86 0.874
Australian-Overall 0.855 0.887 0.882 0.861 0.843 0.849 0.785 0.838 0.822 0.821 0.863 0.884
boston-FS 0.867 0.897 0.884 0.884 0.868 0.877 0.814 0.786 0.869 0.856 0.853 0.863
boston-NOFS 0.887 0.897 0.894 0.899 0.885 0.875 0.818 0.832 0.866 0.864 0.849 0.867
boston-Overall 0.867 0.897 0.895 0.893 0.871 0.875 0.818 0.832 0.866 0.856 0.853 0.863
churn-FS 0.758 0.773 0.764 0.774 0.754 0.747 0.704 0.711 0.765 0.776 0.766 0.757
churn-NOFS 0.755 0.766 0.775 0.793 0.756 0.752 0.72 0.737 0.773 0.764 0.759 0.76
churn-Overall 0.755 0.766 0.775 0.775 0.761 0.762 0.72 0.737 0.765 0.764 0.766 0.757
compas-two-years-FS 0.662 0.668 0.665 0.674 0.642 0.664 0.643 0.658 0.663 0.671 0.659 0.666
compas-two-years-NOFS 0.666 0.682 0.663 0.674 0.658 0.678 0.637 0.662 0.666 0.682 0.665 0.679
compas-two-years-Overall 0.666 0.682 0.668 0.676 0.647 0.678 0.643 0.662 0.666 0.682 0.665 0.679
image-FS 0.771 0.76 0.742 0.779 0.721 0.735 0.812 0.812 0.805 0.81
image-NOFS 0.824 0.824 0.836 0.824 0.731 0.749 0.841 0.828 0.857 0.852
image-Overall 0.824 0.824 0.828 0.83 0.731 0.749 0.841 0.828 0.857 0.852
page-blocks-FS 0.963 0.964 0.952 0.959 0.963 0.962 0.89 0.902 0.878 0.87 0.961 0.957
page-blocks-NOFS 0.963 0.962 0.959 0.96 0.96 0.966 0.896 0.911 0.898 0.918 0.961 0.958
page-blocks-Overall 0.963 0.964 0.958 0.961 0.959 0.964 0.896 0.911 0.898 0.918 0.961 0.957
parkinsons-FS 0.809 0.855 0.861 0.88 0.839 0.813 0.8 0.801 0.865 0.884 0.803 0.85
parkinsons-NOFS 0.875 0.847 0.883 0.876 0.855 0.865 0.778 0.77 0.908 0.903 0.887 0.869
parkinsons-Overall 0.875 0.847 0.876 0.87 0.855 0.84 0.8 0.77 0.908 0.903 0.887 0.869
segment-FS 0.995 0.993 0.991 0.991 0.995 0.991 0.877 0.909 0.877 0.895 0.991 0.993
segment-NOFS 0.995 0.993 0.992 0.994 0.992 0.994 0.886 0.903 0.907 0.934 0.993 0.993
segment-Overall 0.995 0.993 0.994 0.997 0.985 0.995 0.886 0.903 0.907 0.934 0.991 0.993
stock-FS 0.962 0.935 0.966 0.973 0.957 0.95 0.876 0.869 0.713 0.803 0.954 0.939
stock-NOFS 0.962 0.959 0.972 0.971 0.956 0.962 0.871 0.897 0.925 0.928 0.956 0.956
stock-Overall 0.962 0.959 0.969 0.975 0.953 0.954 0.871 0.897 0.925 0.928 0.954 0.956
zoo-FS 0.96 0.905 0.96 0.956 0.949 0.99 0.946 0.948 0.896 0.863 0.967 0.884
zoo-NOFS 0.954 0.952 0.973 0.981 0.908 0.924 0.948 0.952 0.943 0.965 0.954 0.968
zoo-Overall 0.954 0.952 0.954 0.957 0.916 0.979 0.948 0.952 0.943 0.965 0.954 0.968
  • -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

Table 29. MAR Results at 10% Missingness for the F1 Metric

Dataset MM BI+MM MF BI+MF GAIN BI+GAIN SOFT BI+SOFT PPCA BI+PPCA DAE BI+DAE
Australian-FS 0.899 0.907 0.893 0.895 0.905 0.907 0.88 0.886 0.873 0.873 0.895 0.911
Australian-NOFS 0.921 0.912 0.91 0.915 0.901 0.918 0.912 0.904 0.886 0.89 0.919 0.899
Australian-Overall 0.899 0.907 0.907 0.915 0.918 0.912 0.912 0.904 0.886 0.89 0.895 0.911
boston-FS 0.94 0.94 0.926 0.926 0.928 0.918 0.918 0.918 0.926 0.926 0.933 0.94
boston-NOFS 0.936 0.932 0.947 0.943 0.938 0.939 0.936 0.929 0.93 0.923 0.94 0.932
boston-Overall 0.94 0.94 0.926 0.921 0.942 0.922 0.918 0.918 0.93 0.923 0.933 0.94
churn-FS 0.893 0.897 0.899 0.896 0.896 0.897 0.89 0.881 0.871 0.873 0.896 0.896
churn-NOFS 0.894 0.896 0.899 0.902 0.898 0.902 0.891 0.894 0.873 0.875 0.902 0.898
churn-Overall 0.893 0.896 0.899 0.897 0.898 0.9 0.891 0.894 0.873 0.875 0.896 0.896
compas-two-years-FS 0.704 0.704 0.694 0.71 0.707 0.702 0.687 0.689 0.705 0.702 0.706 0.693
compas-two-years-NOFS 0.707 0.705 0.711 0.706 0.703 0.7 0.699 0.703 0.703 0.697 0.709 0.702
compas-two-years-Overall 0.704 0.704 0.711 0.71 0.7 0.699 0.699 0.703 0.703 0.697 0.706 0.702
image-FS 0.855 0.858 0.868 0.857 0.873 0.873 0.82 0.82 0.872 0.868
image-NOFS 0.878 0.879 0.879 0.888 0.875 0.875 0.862 0.865 0.882 0.885
image-Overall 0.878 0.879 0.875 0.879 0.875 0.875 0.862 0.865 0.882 0.885
page-blocks-FS 0.987 0.988 0.989 0.988 0.988 0.988 0.985 0.986 0.976 0.954 0.987 0.987
page-blocks-NOFS 0.988 0.988 0.989 0.99 0.988 0.988 0.985 0.985 0.982 0.983 0.987 0.988
page-blocks-Overall 0.988 0.988 0.988 0.988 0.987 0.987 0.985 0.985 0.982 0.983 0.987 0.988
parkinsons-FS 0.873 0.873 0.87 0.871 0.875 0.876 0.874 0.874 0.889 0.801 0.87 0.87
parkinsons-NOFS 0.908 0.911 0.923 0.926 0.925 0.912 0.918 0.918 0.93 0.894 0.917 0.912
parkinsons-Overall 0.873 0.873 0.936 0.87 0.905 0.873 0.918 0.918 0.93 0.894 0.87 0.87
segment-FS 0.999 0.999 1.0 1.0 0.999 1.0 0.999 0.998 0.96 0.954 0.998 0.998
segment-NOFS 0.999 1.0 1.0 1.0 1.0 1.0 0.999 1.0 0.968 0.97 1.0 0.999
segment-Overall 0.999 1.0 1.0 1.0 0.999 1.0 0.999 1.0 0.968 0.97 0.998 0.998
stock-FS 0.993 0.985 0.995 0.995 0.991 0.992 0.991 0.987 0.965 0.965 0.991 0.991
stock-NOFS 0.993 0.994 0.995 0.995 0.993 0.995 0.991 0.989 0.98 0.981 0.994 0.994
stock-Overall 0.993 0.994 0.995 0.994 0.993 0.994 0.991 0.989 0.98 0.981 0.994 0.994
zoo-FS 1.0 1.0 1.0 1.0 0.824 1.0 1.0 1.0 1.0 1.0 0.987 0.993
zoo-NOFS 0.992 1.0 1.0 1.0 1.0 0.997 1.0 0.997 0.99 1.0 0.983 1.0
zoo-Overall 0.992 1.0 1.0 1.0 0.997 1.0 1.0 0.997 0.99 1.0 0.983 0.993
  • -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

Table 30. MAR Results at 25% Missingness for the F1 Metric

Dataset MM BI+MM MF BI+MF GAIN BI+GAIN SOFT BI+SOFT PPCA BI+PPCA DAE BI+DAE
Australian-FS 0.901 0.905 0.888 0.886 0.883 0.89 0.864 0.87 0.877 0.869 0.909 0.909
Australian-NOFS 0.903 0.898 0.887 0.888 0.887 0.896 0.884 0.888 0.87 0.875 0.908 0.902
Australian-Overall 0.903 0.905 0.886 0.884 0.88 0.898 0.864 0.87 0.877 0.869 0.909 0.909
boston-FS 0.907 0.907 0.903 0.901 0.904 0.908 0.841 0.844 0.887 0.887 0.891 0.891
boston-NOFS 0.894 0.891 0.903 0.905 0.9 0.905 0.858 0.852 0.895 0.887 0.901 0.89
boston-Overall 0.894 0.891 0.901 0.91 0.914 0.9 0.858 0.852 0.895 0.887 0.901 0.89
churn-FS 0.841 0.841 0.847 0.841 0.844 0.839 0.804 0.811 0.808 0.816 0.837 0.837
churn-NOFS 0.833 0.836 0.846 0.842 0.843 0.831 0.84 0.827 0.817 0.814 0.832 0.843
churn-Overall 0.833 0.836 0.845 0.846 0.847 0.841 0.84 0.827 0.808 0.814 0.837 0.843
compas-two-years-FS 0.701 0.697 0.688 0.701 0.688 0.696 0.685 0.7 0.685 0.681 0.682 0.697
compas-two-years-NOFS 0.701 0.695 0.693 0.695 0.69 0.69 0.673 0.692 0.692 0.697 0.694 0.691
compas-two-years-Overall 0.701 0.695 0.695 0.693 0.69 0.685 0.673 0.7 0.685 0.681 0.682 0.691
image-FS 0.806 0.806 0.798 0.84 0.835 0.839 0.826 0.823 0.871 0.871
image-NOFS 0.871 0.875 0.871 0.865 0.86 0.862 0.869 0.86 0.878 0.884
image-Overall 0.871 0.875 0.879 0.869 0.86 0.862 0.869 0.86 0.878 0.884
page-blocks-FS 0.976 0.979 0.982 0.984 0.977 0.978 0.95 0.951 0.932 0.94 0.979 0.98
page-blocks-NOFS 0.976 0.98 0.982 0.984 0.975 0.978 0.95 0.956 0.866 0.947 0.978 0.978
page-blocks-Overall 0.976 0.979 0.984 0.982 0.979 0.976 0.95 0.956 0.932 0.947 0.978 0.978
parkinsons-FS 0.889 0.876 0.872 0.87 0.889 0.864 0.803 0.803 0.844 0.829 0.849 0.881
parkinsons-NOFS 0.904 0.892 0.909 0.918 0.898 0.892 0.883 0.871 0.886 0.883 0.905 0.905
parkinsons-Overall 0.904 0.892 0.872 0.876 0.891 0.875 0.883 0.871 0.886 0.883 0.905 0.881
segment-FS 0.994 0.997 1.0 1.0 0.998 0.997 0.974 0.961 0.941 0.963 0.998 0.999
segment-NOFS 0.999 0.999 1.0 1.0 0.997 0.998 0.989 0.995 0.972 0.975 0.999 0.999
segment-Overall 0.999 0.999 1.0 1.0 0.999 0.998 0.989 0.995 0.972 0.975 0.999 0.999
stock-FS 0.979 0.968 0.98 0.98 0.975 0.977 0.882 0.908 0.886 0.903 0.974 0.974
stock-NOFS 0.979 0.978 0.981 0.986 0.973 0.977 0.887 0.933 0.931 0.942 0.976 0.975
stock-Overall 0.979 0.978 0.982 0.981 0.975 0.978 0.887 0.933 0.931 0.942 0.976 0.974
zoo-FS 0.991 0.984 1.0 0.99 0.986 0.967 0.973 0.989 0.96 0.99 0.923 0.925
zoo-NOFS 0.99 0.989 0.973 1.0 0.998 1.0 0.997 0.998 0.986 0.997 0.952 0.99
zoo-Overall 0.99 0.989 0.965 1.0 1.0 1.0 0.973 0.998 0.96 0.997 0.952 0.925
  • -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

Table 31. MAR Results at 50% Missingness for the F1 Metric

Dataset MM BI+MM MF BI+MF GAIN BI+GAIN SOFT BI+SOFT PPCA BI+PPCA DAE BI+DAE
Australian-FS 0.855 0.887 0.867 0.878 0.852 0.84 0.785 0.838 0.821 0.821 0.863 0.884
Australian-NOFS 0.862 0.873 0.879 0.858 0.851 0.861 0.772 0.821 0.822 0.829 0.86 0.874
Australian-Overall 0.855 0.887 0.882 0.861 0.843 0.849 0.785 0.838 0.822 0.821 0.863 0.884
boston-FS 0.867 0.897 0.884 0.884 0.868 0.877 0.814 0.786 0.869 0.856 0.853 0.863
boston-NOFS 0.887 0.897 0.894 0.899 0.885 0.875 0.818 0.832 0.866 0.864 0.849 0.867
boston-Overall 0.867 0.897 0.895 0.893 0.871 0.875 0.818 0.832 0.866 0.856 0.853 0.863
churn-FS 0.758 0.773 0.764 0.774 0.754 0.747 0.704 0.711 0.765 0.776 0.766 0.757
churn-NOFS 0.755 0.766 0.775 0.793 0.756 0.752 0.72 0.737 0.773 0.764 0.759 0.76
churn-Overall 0.755 0.766 0.775 0.775 0.761 0.762 0.72 0.737 0.765 0.764 0.766 0.757
compas-two-years-FS 0.662 0.668 0.665 0.674 0.642 0.664 0.643 0.658 0.663 0.671 0.659 0.666
compas-two-years-NOFS 0.666 0.682 0.663 0.674 0.658 0.678 0.637 0.662 0.666 0.682 0.665 0.679
compas-two-years-Overall 0.666 0.682 0.668 0.676 0.647 0.678 0.643 0.662 0.666 0.682 0.665 0.679
image-FS 0.771 0.76 0.742 0.779 0.721 0.735 0.812 0.812 0.805 0.81
image-NOFS 0.824 0.824 0.836 0.824 0.731 0.749 0.841 0.828 0.857 0.852
image-Overall 0.824 0.824 0.828 0.83 0.731 0.749 0.841 0.828 0.857 0.852
page-blocks-FS 0.963 0.964 0.952 0.959 0.963 0.962 0.89 0.902 0.878 0.87 0.961 0.957
page-blocks-NOFS 0.963 0.962 0.959 0.96 0.96 0.966 0.896 0.911 0.898 0.918 0.961 0.958
page-blocks-Overall 0.963 0.964 0.958 0.961 0.959 0.964 0.896 0.911 0.898 0.918 0.961 0.957
parkinsons-FS 0.809 0.855 0.861 0.88 0.839 0.813 0.8 0.801 0.865 0.884 0.803 0.85
parkinsons-NOFS 0.875 0.847 0.883 0.876 0.855 0.865 0.778 0.77 0.908 0.903 0.887 0.869
parkinsons-Overall 0.875 0.847 0.876 0.87 0.855 0.84 0.8 0.77 0.908 0.903 0.887 0.869
segment-FS 0.995 0.993 0.991 0.991 0.995 0.991 0.877 0.909 0.877 0.895 0.991 0.993
segment-NOFS 0.995 0.993 0.992 0.994 0.992 0.994 0.886 0.903 0.907 0.934 0.993 0.993
segment-Overall 0.995 0.993 0.994 0.997 0.985 0.995 0.886 0.903 0.907 0.934 0.991 0.993
stock-FS 0.962 0.935 0.966 0.973 0.957 0.95 0.876 0.869 0.713 0.803 0.954 0.939
stock-NOFS 0.962 0.959 0.972 0.971 0.956 0.962 0.871 0.897 0.925 0.928 0.956 0.956
stock-Overall 0.962 0.959 0.969 0.975 0.953 0.954 0.871 0.897 0.925 0.928 0.954 0.956
zoo-FS 0.96 0.905 0.96 0.956 0.949 0.99 0.946 0.948 0.896 0.863 0.967 0.884
zoo-NOFS 0.954 0.952 0.973 0.981 0.908 0.924 0.948 0.952 0.943 0.965 0.954 0.968
zoo-Overall 0.954 0.952 0.954 0.957 0.916 0.979 0.948 0.952 0.943 0.965 0.954 0.968
  • -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

Table 32. MAR Results at 10% Missingness for the ACC Metric

Dataset MM BI+MM MF BI+MF GAIN BI+GAIN SOFT BI+SOFT PPCA BI+PPCA DAE BI+DAE
Australian 0.849 0.843 0.855 0.861 0.861 0.852 0.852 0.846 0.832 0.838 0.829 0.841
Australian-FS 0.849 0.843 0.838 0.841 0.852 0.849 0.829 0.835 0.82 0.82 0.829 0.841
Australian-NOFS 0.852 0.852 0.852 0.858 0.829 0.858 0.852 0.846 0.832 0.838 0.855 0.829
boston 0.893 0.893 0.877 0.858 0.889 0.858 0.854 0.854 0.874 0.858 0.87 0.881
boston-FS 0.893 0.893 0.87 0.881 0.877 0.858 0.854 0.854 0.877 0.877 0.87 0.881
boston-NOFS 0.905 0.885 0.897 0.885 0.877 0.889 0.881 0.87 0.874 0.858 0.897 0.87
churn 0.94 0.935 0.94 0.933 0.943 0.936 0.932 0.928 0.921 0.922 0.944 0.944
churn-FS 0.94 0.942 0.93 0.932 0.94 0.943 0.93 0.926 0.92 0.914 0.944 0.944
churn-NOFS 0.941 0.935 0.943 0.943 0.943 0.936 0.932 0.928 0.921 0.922 0.942 0.933
compas-two-years 0.657 0.657 0.662 0.658 0.656 0.66 0.666 0.661 0.656 0.658 0.664 0.662
compas-two-years-FS 0.657 0.657 0.654 0.659 0.658 0.653 0.649 0.652 0.651 0.649 0.664 0.644
compas-two-years-NOFS 0.66 0.665 0.664 0.663 0.652 0.661 0.666 0.661 0.656 0.658 0.666 0.662
image 0.873 0.864 0.872 0.872 0.859 0.86 0.854 0.862 0.865 0.871
image-FS 0.851 0.855 0.857 0.858 0.859 0.859 0.84 0.84 0.869 0.859
image-NOFS 0.873 0.864 0.867 0.869 0.859 0.86 0.854 0.862 0.865 0.871
page-blocks 0.971 0.971 0.972 0.972 0.972 0.972 0.966 0.968 0.965 0.963 0.968 0.972
page-blocks-FS 0.972 0.971 0.971 0.973 0.973 0.972 0.97 0.966 0.961 0.958 0.968 0.973
page-blocks-NOFS 0.971 0.972 0.972 0.973 0.971 0.972 0.966 0.968 0.965 0.963 0.968 0.972
parkinsons 0.878 0.878 0.898 0.878 0.867 0.878 0.908 0.898 0.908 0.857 0.878 0.878
parkinsons-FS 0.878 0.878 0.878 0.878 0.878 0.878 0.878 0.878 0.867 0.857 0.878 0.878
parkinsons-NOFS 0.898 0.867 0.888 0.898 0.898 0.888 0.908 0.898 0.908 0.857 0.898 0.898
segment 0.993 0.994 0.996 0.997 0.994 0.994 0.994 0.996 0.932 0.932 0.994 0.994
segment-FS 0.994 0.994 0.997 0.996 0.995 0.994 0.991 0.989 0.923 0.916 0.994 0.994
segment-NOFS 0.993 0.994 0.997 0.998 0.995 0.995 0.994 0.996 0.932 0.932 0.996 0.995
stock 0.96 0.96 0.973 0.973 0.956 0.964 0.945 0.941 0.924 0.926 0.962 0.962
stock-FS 0.952 0.935 0.975 0.975 0.949 0.962 0.964 0.947 0.901 0.901 0.956 0.956
stock-NOFS 0.96 0.96 0.964 0.966 0.962 0.971 0.945 0.941 0.924 0.926 0.962 0.962
zoo 0.961 1.0 1.0 1.0 0.98 1.0 1.0 0.98 0.961 1.0 0.941 0.941
zoo-FS 1.0 1.0 1.0 1.0 0.843 1.0 1.0 1.0 1.0 1.0 0.941 0.941
zoo-NOFS 0.961 1.0 1.0 1.0 1.0 0.98 1.0 0.98 0.961 1.0 0.941 1.0
  • -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

Table 33. MAR Results at 25% Missingness for the ACC Metric

Dataset MM BI+MM MF BI+MF GAIN BI+GAIN SOFT BI+SOFT PPCA BI+PPCA DAE BI+DAE
Australian 0.832 0.838 0.841 0.835 0.82 0.835 0.8 0.791 0.814 0.812 0.835 0.843
Australian-FS 0.829 0.838 0.841 0.835 0.826 0.826 0.8 0.791 0.814 0.812 0.835 0.843
Australian-NOFS 0.832 0.82 0.835 0.823 0.812 0.812 0.809 0.814 0.809 0.812 0.832 0.826
boston 0.862 0.858 0.85 0.87 0.885 0.858 0.814 0.818 0.854 0.834 0.85 0.834
boston-FS 0.862 0.862 0.85 0.866 0.866 0.854 0.802 0.802 0.834 0.834 0.842 0.842
boston-NOFS 0.862 0.858 0.862 0.858 0.85 0.85 0.814 0.818 0.854 0.858 0.85 0.834
churn 0.921 0.915 0.926 0.922 0.921 0.918 0.898 0.892 0.9 0.9 0.918 0.918
churn-FS 0.919 0.922 0.921 0.92 0.918 0.921 0.901 0.887 0.9 0.901 0.918 0.918
churn-NOFS 0.921 0.915 0.928 0.924 0.92 0.913 0.898 0.892 0.903 0.9 0.918 0.918
compas-two-years 0.656 0.656 0.653 0.648 0.64 0.646 0.641 0.656 0.649 0.652 0.649 0.653
compas-two-years-FS 0.656 0.649 0.653 0.656 0.639 0.649 0.656 0.656 0.649 0.652 0.649 0.647
compas-two-years-NOFS 0.659 0.656 0.649 0.659 0.641 0.654 0.641 0.654 0.647 0.65 0.652 0.653
image 0.861 0.866 0.869 0.862 0.859 0.86 0.861 0.852 0.873 0.869
image-FS 0.838 0.838 0.819 0.846 0.843 0.841 0.843 0.842 0.867 0.867
image-NOFS 0.861 0.866 0.866 0.868 0.859 0.86 0.861 0.852 0.873 0.869
page-blocks 0.962 0.962 0.966 0.964 0.961 0.959 0.945 0.946 0.947 0.953 0.959 0.958
page-blocks-FS 0.962 0.962 0.966 0.965 0.961 0.962 0.946 0.942 0.947 0.949 0.963 0.959
page-blocks-NOFS 0.962 0.961 0.965 0.965 0.96 0.959 0.945 0.946 0.947 0.953 0.959 0.958
parkinsons 0.867 0.878 0.867 0.867 0.888 0.847 0.857 0.857 0.888 0.878 0.888 0.867
parkinsons-FS 0.867 0.857 0.867 0.867 0.867 0.837 0.837 0.837 0.867 0.837 0.847 0.867
parkinsons-NOFS 0.867 0.878 0.888 0.898 0.888 0.888 0.857 0.857 0.888 0.878 0.888 0.888
segment 0.991 0.995 0.997 0.997 0.992 0.99 0.973 0.979 0.941 0.938 0.994 0.993
segment-FS 0.99 0.991 0.997 0.997 0.991 0.99 0.958 0.945 0.897 0.939 0.993 0.993
segment-NOFS 0.991 0.995 0.997 0.997 0.989 0.992 0.973 0.979 0.941 0.938 0.994 0.993
stock 0.926 0.918 0.941 0.939 0.914 0.916 0.813 0.859 0.846 0.861 0.914 0.918
stock-FS 0.926 0.901 0.931 0.931 0.92 0.916 0.794 0.844 0.8 0.825 0.918 0.918
stock-NOFS 0.926 0.918 0.945 0.939 0.916 0.918 0.813 0.859 0.846 0.861 0.914 0.922
zoo 0.961 0.961 0.922 1.0 1.0 1.0 0.941 0.98 0.922 0.98 0.922 0.863
zoo-FS 0.961 0.941 1.0 0.961 0.961 0.941 0.941 0.961 0.922 0.941 0.843 0.863
zoo-NOFS 0.961 0.961 0.922 1.0 0.98 1.0 0.98 0.98 0.98 0.98 0.922 0.98
  • -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

Table 34. MAR Results at 50% Missingness for the ACC Metric

Dataset MM BI+MM MF BI+MF GAIN BI+GAIN SOFT BI+SOFT PPCA BI+PPCA DAE BI+DAE
Australian 0.791 0.823 0.832 0.812 0.794 0.783 0.748 0.791 0.774 0.762 0.814 0.82
Australian-FS 0.791 0.823 0.817 0.809 0.803 0.783 0.748 0.791 0.762 0.762 0.814 0.82
Australian-NOFS 0.806 0.823 0.843 0.812 0.771 0.812 0.722 0.762 0.774 0.774 0.806 0.803
boston 0.798 0.818 0.838 0.818 0.798 0.81 0.763 0.763 0.787 0.791 0.787 0.791
boston-FS 0.798 0.818 0.806 0.822 0.806 0.798 0.767 0.739 0.791 0.791 0.787 0.791
boston-NOFS 0.826 0.834 0.822 0.826 0.822 0.822 0.763 0.763 0.787 0.802 0.794 0.798
churn 0.894 0.889 0.896 0.894 0.888 0.882 0.865 0.866 0.887 0.883 0.89 0.888
churn-FS 0.892 0.89 0.893 0.899 0.89 0.884 0.87 0.869 0.887 0.889 0.89 0.888
churn-NOFS 0.894 0.889 0.892 0.891 0.888 0.882 0.865 0.866 0.888 0.883 0.894 0.883
compas-two-years 0.621 0.632 0.629 0.628 0.608 0.627 0.607 0.614 0.62 0.629 0.621 0.632
compas-two-years-FS 0.611 0.623 0.623 0.624 0.598 0.62 0.607 0.618 0.619 0.622 0.626 0.627
compas-two-years-NOFS 0.621 0.632 0.629 0.626 0.618 0.631 0.604 0.614 0.62 0.629 0.621 0.632
image 0.84 0.839 0.845 0.849 0.809 0.817 0.854 0.847 0.856 0.853
image-FS 0.813 0.815 0.811 0.817 0.804 0.804 0.846 0.846 0.831 0.837
image-NOFS 0.84 0.839 0.846 0.846 0.809 0.817 0.854 0.847 0.856 0.853
page-blocks 0.948 0.949 0.955 0.952 0.949 0.947 0.921 0.92 0.933 0.934 0.949 0.949
page-blocks-FS 0.948 0.949 0.954 0.954 0.95 0.948 0.916 0.921 0.932 0.93 0.949 0.949
page-blocks-NOFS 0.948 0.95 0.953 0.953 0.95 0.949 0.921 0.92 0.933 0.934 0.949 0.949
parkinsons 0.837 0.857 0.857 0.857 0.847 0.837 0.827 0.847 0.878 0.867 0.867 0.857
parkinsons-FS 0.796 0.857 0.816 0.867 0.847 0.837 0.827 0.816 0.847 0.837 0.837 0.837
parkinsons-NOFS 0.837 0.857 0.878 0.867 0.847 0.837 0.806 0.847 0.878 0.867 0.867 0.857
segment 0.982 0.977 0.99 0.99 0.961 0.983 0.91 0.911 0.882 0.898 0.967 0.969
segment-FS 0.983 0.977 0.989 0.99 0.983 0.98 0.899 0.915 0.861 0.872 0.967 0.969
segment-NOFS 0.982 0.977 0.99 0.988 0.983 0.979 0.91 0.911 0.882 0.898 0.971 0.969
stock 0.901 0.891 0.922 0.918 0.863 0.882 0.811 0.827 0.834 0.838 0.882 0.874
stock-FS 0.901 0.857 0.924 0.918 0.897 0.857 0.815 0.798 0.695 0.743 0.882 0.859
stock-NOFS 0.901 0.891 0.924 0.924 0.876 0.884 0.811 0.827 0.834 0.838 0.882 0.874
zoo 0.902 0.902 0.941 0.941 0.863 0.941 0.902 0.922 0.882 0.922 0.922 0.922
zoo-FS 0.902 0.882 0.961 0.922 0.902 0.961 0.882 0.922 0.863 0.843 0.941 0.882
zoo-NOFS 0.902 0.902 0.98 0.961 0.843 0.863 0.902 0.922 0.882 0.922 0.922 0.922
  • -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

C.7 Imputation Accuracy Results

In this section, we report the quantitative results of the experiments on imputation quality. The section is split into two subsections, one for each missingness mechanism. We report results on the train and test sets using the default configuration of each imputation method (for details, see Section C.3).
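As an illustration of how such imputation-quality scores could be computed, the following minimal sketch masks 10% of a synthetic numeric and a synthetic categorical feature completely at random, imputes them with Mean/Mode, and scores only the masked entries with R2 (numeric) and accuracy (categorical). The data, the helper names `r2_on_masked` and `accuracy_on_masked`, and the masking procedure are assumptions made for this example; they are not the exact evaluation code used in the experiments.

```python
import numpy as np


def r2_on_masked(true_vals, imputed_vals):
    """R2 (coefficient of determination) over the entries that were masked."""
    true_vals = np.asarray(true_vals, dtype=float)
    imputed_vals = np.asarray(imputed_vals, dtype=float)
    ss_res = np.sum((true_vals - imputed_vals) ** 2)
    ss_tot = np.sum((true_vals - true_vals.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def accuracy_on_masked(true_vals, imputed_vals):
    """Fraction of masked categorical entries that were imputed exactly."""
    return float(np.mean(np.asarray(true_vals) == np.asarray(imputed_vals)))


# Synthetic stand-in data (hypothetical, not one of the benchmark datasets).
rng = np.random.default_rng(0)
n = 1000
numeric = rng.normal(size=n)
categorical = rng.choice(["a", "b", "c"], size=n, p=[0.6, 0.3, 0.1])

# MCAR mask: every entry is hidden with the same probability (10%).
mask = rng.random(n) < 0.10
n_missing = int(mask.sum())

# Mean/Mode imputation, fitted on the observed entries only.
mean_fill = numeric[~mask].mean()
values, counts = np.unique(categorical[~mask], return_counts=True)
mode_fill = values[np.argmax(counts)]

r2_mm = r2_on_masked(numeric[mask], np.full(n_missing, mean_fill))
acc_mm = accuracy_on_masked(categorical[mask], np.full(n_missing, mode_fill))

print(f"Mean imputation R2 on masked entries: {r2_mm:.3f}")
print(f"Mode imputation accuracy on masked entries: {acc_mm:.3f}")
```

Under this definition, a constant imputation such as the mean cannot achieve a positive R2 on the masked entries, whereas mode imputation reaches an accuracy close to the frequency of the majority category.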

C.7.1 MCAR.

Tables 35 and 36 present the results at 10% missingness for R2 and accuracy scores, respectively. Tables 37 and 38 show the imputation R2 and accuracy score at 25% missingness, while Tables 39 and 40 show these at 50% missingness.

Table 35. Imputation R2-score for MCAR 10% Missingness

Dataset GAIN SOFT MF DAE MM PPCA
Australian-Test 0.073 0.0 0.171 0.08 0.0 0.0
Australian-Train 0.033 0.0 0.174 0.118 0.0 0.0
boston-Test 0.246 0.0 0.636 0.491 0.0 0.302
boston-Train 0.224 0.252 0.651 0.521 0.0 0.336
churn-Test 0.292 0.0 0.474 0.33 0.0 0.348
churn-Train 0.259 0.429 0.46 0.322 0.0 0.337
compas-two-years-Test 0.06 0.0 0.235 0.209 0.0 0.0
compas-two-years-Train 0.058 0.139 0.316 0.225 0.0 0.0
image-Test 0.711 0.0 0.798 0.0 0.827
image-Train 0.46 0.938 0.775 0.0 0.829
page-blocks-Test 0.299 0.0 0.815 0.15 0.0 0.4
page-blocks-Train 0.319 0.309 0.796 0.168 0.0 0.431
parkinsons-Test 0.194 0.04 0.697 0.578 0.0 0.489
parkinsons-Train 0.221 0.592 0.815 0.6 0.0 0.628
segment-Test 0.364 0.0 0.665 0.483 0.053 0.48
segment-Train 0.433 0.41 0.729 0.539 0.053 0.517
stock-Test 0.514 0.0 0.931 0.699 0.0 0.546
stock-Train 0.559 0.656 0.951 0.701 0.0 0.588
zoo-Test 0.0 0.0 0.687 0.0 0.0 0.0
zoo-Train 0.0 0.0 0.0 0.0 0.0 0.0

Table 36. Imputation Accuracy Score for MCAR 10% Missingness

Dataset GAIN SOFT MF DAE MM PPCA
Australian-Test 0.653 0.594 0.745 0.682 0.659 0.36
Australian-Train 0.679 0.743 0.825 0.635 0.605 0.364
boston-Test 0.955 0.818 1.0 1.0 1.0 0.5
boston-Train 0.773 0.364 0.955 0.955 0.955 0.682
churn-Test 0.77 0.642 0.795 0.742 0.689 0.708
churn-Train 0.746 0.747 0.769 0.716 0.673 0.715
compas-two-years-Test 0.858 0.586 0.955 0.766 0.682 0.835
compas-two-years-Train 0.832 0.94 0.953 0.75 0.668 0.821
zoo-Test 0.837 0.572 0.861 0.806 0.689 0.73
zoo-Train 0.774 0.874 0.909 0.863 0.734 0.743

Table 37. Imputation R2-score for MCAR 25% Missingness

Dataset GAIN SOFT MF DAE MM PPCA
Australian-Test 0.043 0.0 0.125 0.097 0.0 0.01
Australian-Train 0.034 0.0 0.101 0.069 0.0 0.001
boston-Test 0.14 0.0 0.585 0.413 0.0 0.201
boston-Train 0.047 0.159 0.604 0.382 0.0 0.175
churn-Test 0.237 0.0 0.378 0.211 0.0 0.283
churn-Train 0.243 0.376 0.391 0.213 0.0 0.286
compas-two-years-Test 0.003 0.0 0.152 0.129 0.0 0.0
compas-two-years-Train 0.007 0.031 0.162 0.153 0.0 0.0
image-Test 0.256 0.0 0.712 0.0 0.757
image-Train 0.181 0.88 0.644 0.0 0.765
page-blocks-Test 0.183 0.0 0.614 0.228 0.0 0.279
page-blocks-Train 0.227 0.292 0.768 0.236 0.0 0.399
parkinsons-Test 0.01 0.005 0.723 0.561 0.0 0.466
parkinsons-Train 0.067 0.54 0.764 0.353 0.0 0.497
segment-Test 0.256 0.0 0.696 0.452 0.053 0.473
segment-Train 0.243 0.412 0.677 0.446 0.053 0.46
stock-Test 0.423 0.0 0.902 0.562 0.0 0.504
stock-Train 0.443 0.496 0.901 0.554 0.0 0.516
zoo-Test 0.059 0.0 0.0 0.107 0.0 0.0
zoo-Train 0.0 0.0 0.421 0.407 0.0 0.0

Table 38. Imputation Accuracy Score for MCAR 25% Missingness

Dataset GAIN SOFT MF DAE MM PPCA
Australian-Test 0.637 0.593 0.754 0.632 0.624 0.622
Australian-Train 0.646 0.635 0.745 0.623 0.616 0.592
boston-Test 0.857 0.814 0.9 0.9 0.9 0.457
boston-Train 0.869 0.443 0.951 0.951 0.951 0.541
churn-Test 0.719 0.617 0.764 0.71 0.703 0.679
churn-Train 0.721 0.704 0.776 0.712 0.706 0.678
compas-two-years-Test 0.831 0.55 0.925 0.683 0.682 0.8
compas-two-years-Train 0.831 0.758 0.922 0.687 0.686 0.814
zoo-Test 0.824 0.555 0.897 0.795 0.661 0.564
zoo-Train 0.759 0.799 0.893 0.842 0.741 0.582

Table 39. Imputation R2-score for MCAR 50% Missingness

Dataset GAIN SOFT MF DAE MM PPCA
Australian-Test 0.02 0.0 0.046 0.039 0.0 0.013
Australian-Train 0.0 0.004 0.032 0.027 0.0 0.01
boston-Test 0.079 0.0 0.403 0.213 0.0 0.192
boston-Train 0.089 0.047 0.385 0.176 0.0 0.186
churn-Test 0.059 0.0 0.25 0.077 0.0 0.02
churn-Train 0.057 0.191 0.243 0.075 0.0 0.019
compas-two-years-Test 0.002 0.0 0.083 0.052 0.0 0.042
compas-two-years-Train 0.008 0.059 0.075 0.051 0.0 0.04
image-Test 0.009 0.0 0.455 0.0 0.574
image-Train 0.011 0.24 0.393 0.0 0.573
page-blocks-Test 0.086 0.0 0.411 0.171 0.0 0.154
page-blocks-Train 0.174 0.113 0.605 0.154 0.0 0.21
parkinsons-Test 0.0 0.0 0.566 0.305 0.0 0.323
parkinsons-Train 0.003 0.228 0.595 0.167 0.0 0.407
segment-Test 0.189 0.0 0.621 0.22 0.053 0.301
segment-Train 0.187 0.188 0.631 0.21 0.053 0.309
stock-Test 0.194 0.0 0.714 0.29 0.0 0.38
stock-Train 0.164 0.224 0.751 0.289 0.0 0.401
zoo-Test 0.0 0.089 0.0 0.0 0.0 0.0
zoo-Train 0.0 0.0 0.0 0.002 0.0 0.0

Table 40. Imputation Accuracy Score for MCAR 50% Missingness

Dataset GAIN SOFT MF DAE MM PPCA
Australian-Test 0.622 0.546 0.723 0.639 0.639 0.599
Australian-Train 0.617 0.553 0.71 0.629 0.629 0.624
boston-Test 0.892 0.677 0.931 0.931 0.931 0.462
boston-Train 0.905 0.448 0.931 0.931 0.931 0.44
churn-Test 0.601 0.599 0.749 0.711 0.711 0.603
churn-Train 0.596 0.638 0.742 0.707 0.707 0.601
compas-two-years-Test 0.658 0.501 0.853 0.681 0.681 0.782
compas-two-years-Train 0.646 0.562 0.845 0.679 0.679 0.777
zoo-Test 0.692 0.511 0.77 0.696 0.689 0.567
zoo-Train 0.699 0.612 0.797 0.687 0.685 0.522

C.7.2 MAR.

Tables 41 and 42 present the results at 10% missingness for R2 and Accuracy scores, respectively. Tables 43 and 44 show the R2-score and accuracy score at 25% missingness, while Tables 45 and 46 show these at 50% missingness.

Table 41. Imputation R2-score for MAR 10% Missingness

Dataset GAIN SOFT MF DAE MM PPCA
Australian-Test 0.004 0.0 0.131 0.064 0.0 0.0
Australian-Train 0.072 0.0 0.246 0.142 0.0 0.0
boston-Test 0.051 0.0 0.554 0.445 0.0 0.171
boston-Train 0.035 0.199 0.717 0.382 0.0 0.278
churn-Test 0.226 0.0 0.493 0.324 0.0 0.344
churn-Train 0.226 0.415 0.47 0.321 0.0 0.336
compas-two-years-Test 0.0 0.0 0.219 0.193 0.0 0.0
compas-two-years-Train 0.0 0.142 0.247 0.207 0.0 0.0
image-Test 0.312 0.0 0.751 0.0 0.816
image-Train 0.03 0.943 0.706 0.0 0.816
page-blocks-Test 0.031 0.0 0.698 0.133 0.0 0.212
page-blocks-Train 0.084 0.286 0.684 0.134 0.0 0.22
parkinsons-Test 0.158 0.005 0.726 0.632 0.0 0.536
parkinsons-Train 0.065 0.633 0.724 0.491 0.0 0.564
segment-Test 0.281 0.0 0.638 0.476 0.053 0.418
segment-Train 0.268 0.437 0.722 0.447 0.053 0.395
stock-Test 0.476 0.0 0.925 0.668 0.0 0.479
stock-Train 0.452 0.573 0.927 0.608 0.0 0.481
zoo-Test 0.0 0.476 0.929 0.381 0.0 0.0
zoo-Train 0.0 0.0 0.48 0.681 0.0 0.0

Table 42. Imputation Accuracy Score for MAR 10% Missingness

Dataset GAIN SOFT MF DAE MM PPCA
Australian-Test 0.61 0.625 0.786 0.703 0.65 0.54
Australian-Train 0.612 0.695 0.76 0.622 0.594 0.518
boston-Test 1.0 0.941 0.941 1.0 1.0 0.471
boston-Train 0.941 0.529 0.941 0.941 0.941 0.353
churn-Test 0.741 0.641 0.791 0.756 0.719 0.695
churn-Train 0.755 0.778 0.809 0.758 0.727 0.731
compas-two-years-Test 0.854 0.608 0.955 0.73 0.704 0.808
compas-two-years-Train 0.815 0.914 0.941 0.724 0.707 0.798
zoo-Test 0.789 0.568 0.888 0.801 0.722 0.471
zoo-Train 0.823 0.805 0.929 0.823 0.72 0.477

Table 43. Imputation R2-score for MAR 25% Missingness

Dataset GAIN SOFT MF DAE MM PPCA
Australian-Test 0.007 0.0 0.123 0.053 0.0 0.01
Australian-Train 0.007 0.0 0.142 0.076 0.0 0.002
boston-Test 0.041 0.0 0.641 0.288 0.0 0.152
boston-Train 0.01 0.057 0.54 0.301 0.0 0.168
churn-Test 0.118 0.0 0.405 0.19 0.0 0.289
churn-Train 0.115 0.38 0.411 0.19 0.0 0.296
compas-two-years-Test 0.0 0.0 0.169 0.121 0.0 0.0
compas-two-years-Train 0.0 0.164 0.157 0.102 0.0 0.0
image-Test 0.0 0.0 0.621 0.0 0.732
image-Train 0.0 0.867 0.54 0.0 0.73
page-blocks-Test 0.026 0.0 0.628 0.174 0.0 0.159
page-blocks-Train 0.013 0.11 0.678 0.162 0.0 0.16
parkinsons-Test 0.016 0.0 0.66 0.536 0.0 0.501
parkinsons-Train 0.005 0.513 0.671 0.301 0.0 0.482
segment-Test 0.0 0.0 0.686 0.346 0.053 0.309
segment-Train 0.0 0.298 0.675 0.344 0.053 0.313
stock-Test 0.028 0.0 0.809 0.346 0.0 0.366
stock-Train 0.021 0.326 0.841 0.346 0.0 0.388
zoo-Test 0.0 0.0 0.358 0.564 0.0 0.003
zoo-Train 0.0 0.0 0.261 0.123 0.0 0.0

Table 44. Imputation Accuracy Score for MAR 25% Missingness

Dataset GAIN SOFT MF DAE MM PPCA
Australian-Test 0.675 0.574 0.756 0.665 0.66 0.558
Australian-Train 0.624 0.646 0.733 0.642 0.64 0.522
boston-Test 0.889 0.889 0.921 0.921 0.921 0.46
boston-Train 0.79 0.548 0.919 0.919 0.919 0.468
churn-Test 0.752 0.615 0.781 0.719 0.704 0.684
churn-Train 0.746 0.696 0.774 0.708 0.699 0.676
compas-two-years-Test 0.701 0.542 0.904 0.663 0.66 0.795
compas-two-years-Train 0.695 0.778 0.903 0.655 0.652 0.807
zoo-Test 0.766 0.512 0.889 0.759 0.678 0.561
zoo-Train 0.755 0.763 0.909 0.754 0.685 0.546

Table 45. Imputation R2-score for MAR 50% Missingness

Dataset GAIN SOFT MF DAE MM PPCA
Australian-Test 0.0 0.0 0.029 0.027 0.0 0.003
Australian-Train 0.0 0.015 0.034 0.036 0.0 0.001
boston-Test 0.0 0.0 0.468 0.093 0.0 0.104
boston-Train 0.0 0.021 0.484 0.096 0.0 0.162
churn-Test 0.019 0.0 0.26 0.064 0.0 0.051
churn-Train 0.017 0.14 0.258 0.064 0.0 0.048
compas-two-years-Test 0.0 0.0 0.115 0.072 0.0 0.027
compas-two-years-Train 0.0 0.029 0.104 0.067 0.0 0.032
image-Test 0.0 0.0 0.364 0.0 0.515
image-Train 0.0 0.176 0.292 0.0 0.514
page-blocks-Test 0.0 0.0 0.437 0.076 0.0 0.116
page-blocks-Train 0.0 0.103 0.449 0.079 0.0 0.11
parkinsons-Test 0.018 0.0 0.6 0.228 0.0 0.304
parkinsons-Train 0.017 0.19 0.608 0.186 0.0 0.342
segment-Test 0.0 0.0 0.59 0.144 0.053 0.098
segment-Train 0.0 0.069 0.593 0.135 0.053 0.097
stock-Test 0.0 0.0 0.751 0.159 0.0 0.118
stock-Train 0.0 0.113 0.725 0.136 0.0 0.098
zoo-Test 0.0 0.032 0.191 0.22 0.0 0.102
zoo-Train 0.0 0.0 0.127 0.0 0.0 0.0

Table 46. Imputation Accuracy Score for MAR 50% Missingness

Dataset GAIN SOFT MF DAE MM PPCA
Australian-Test 0.605 0.541 0.724 0.649 0.649 0.568
Australian-Train 0.621 0.532 0.72 0.625 0.625 0.582
boston-Test 0.923 0.65 0.93 0.937 0.937 0.469
boston-Train 0.928 0.456 0.944 0.944 0.944 0.368
churn-Test 0.725 0.589 0.745 0.697 0.697 0.595
churn-Train 0.732 0.645 0.753 0.704 0.704 0.576
compas-two-years-Test 0.634 0.504 0.853 0.691 0.69 0.67
compas-two-years-Train 0.629 0.581 0.842 0.69 0.689 0.671
zoo-Test 0.6 0.535 0.764 0.625 0.625 0.599
zoo-Train 0.599 0.585 0.846 0.688 0.691 0.613

Footnotes

  1. Although some ML algorithms such as KNN and Naive Bayes are robust to missing values, their implementations in popular platforms like sklearn do not currently support the presence of missing values.
  2. Recent work [46] shows that, when the causal graph of the distribution is known, there are cases where MNAR data can be imputed.
  3. https://www.datarobot.com/
  4. https://auger.ai/
  5. https://bigml.com/
  6. Table 9 contains the dataset name, the missing feature, and the number of features selected for the missing feature as the outcome.

REFERENCES

  1. [1] 2023. (unpublished).
  2. [2] Adhikari Deepak, Jiang Wei, Zhan Jinyu, He Zhiyuan, Rawat Danda B., Aickelin Uwe, and Khorshidi Hadi A. 2022. A comprehensive survey on imputation of missing data in internet of things. ACM Comput. Surv. 55, 7, Article 133 (Dec. 2022), 38 pages.
  3. [3] Alaa Ahmed and Schaar Mihaela van der. 2018. AutoPrognosis: Automated clinical prognostic modeling via bayesian optimization with structured kernel learning. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80), Dy Jennifer and Krause Andreas (Eds.). PMLR, 139–148. https://proceedings.mlr.press/v80/alaa18b.html
  4. [4] Alabadla Mustafa, Sidi Fatimah, Ishak Iskandar, Ibrahim Hamidah, Affendey Lilly Suriani, Ani Zafienas Che, Jabar Marzanah A., Bukar Umar Ali, Devaraj Navin Kumar, Muda Ahmad Sobri, Tharek Anas, Omar Noritah, and Jaya M. Izham Mohd. 2022. Systematic review of using machine learning in imputing missing values. IEEE Access 10 (2022), 44483–44502.
  5. [5] Alcobaça Edesio, Siqueira Felipe, Rivolli Adriano, Garcia Luís P. F., Oliva Jefferson T., and Carvalho André C. P. L. F. de. 2020. MFE: Towards reproducible meta-feature extraction. J. Mach. Learn. Res. 21, 111 (2020), 1–5. http://jmlr.org/papers/v21/19-348.html
  6. [6] Andridge Rebecca R. and Little Roderick J. A. 2010. A review of hot deck imputation for survey non-response. Int. Stat. Rev. 78, 1 (Apr. 2010), 40–64.
  7. [7] Benjamini Yoav and Hochberg Yosef. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Stat. Soc.: Ser. B (Methodol.) 57, 1 (1995), 289–300.
  8. [8] Bertsimas Dimitris, Pawlowski Colin, and Zhuo Ying Daisy. 2018. From predictive methods to missing data imputation: An optimization approach. J. Mach. Learn. Res. 18, 196 (2018), 1–39.
  9. [9] Biessmann Felix, Rukat Tammo, Schmidt Phillipp, Naidu Prathik, Schelter Sebastian, Taptunov Andrey, Lange Dustin, and Salinas David. 2019. DataWig: Missing value imputation for tables. J. Mach. Learn. Res. 20, 175 (2019), 1–6.
  10. [10] Bishop Christopher. 1998. Bayesian PCA. In Advances in Neural Information Processing Systems, Kearns M., Solla S., and Cohn D. (Eds.), Vol. 11. MIT Press.
  11. [11] Camino Ramiro Daniel, Hammerschmidt Christian A., and State Radu. 2019. Improving missing data imputation with deep generative models. arXiv:1902.10666. Retrieved from http://arxiv.org/abs/1902.10666
  12. [12] Carpenter James R. and Smuk Melanie. 2021. Missing data: A statistical framework for practice. Biometr. J. 63, 5 (2021), 915–947.
  13. [13] Emmanuel Tlamelo, Maupong Thabiso, Mpoeleng Dimane, Semong Thabo, Mphago Banyatsang, and Tabona Oteng. 2021. A survey on missing data in machine learning. J. Big Data 8, 1 (27 Oct. 2021), 140.
  14. [14] Erickson Nick, Mueller Jonas, Shirkov Alexander, Zhang Hang, Larroy Pedro, Li Mu, and Smola Alexander. 2020. AutoGluon-Tabular: Robust and accurate AutoML for structured data. arXiv:2003.06505. Retrieved from https://arxiv.org/abs/2003.06505
  15. [15] Faisal Shahla and Tutz Gerhard. 2021. Multiple imputation using nearest neighbor methods. Inf. Sci. 570 (2021), 500–516.
  16. [16] Feurer Matthias, Eggensperger Katharina, Falkner Stefan, Lindauer Marius, and Hutter Frank. 2020. Auto-Sklearn 2.0: Hands-free AutoML via meta-learning.
  17. [17] Fortuin Vincent, Baranchuk Dmitry, Rätsch Gunnar, and Mandt Stephan. 2020. GP-VAE: Deep probabilistic time series imputation. arXiv:1907.04155 [stat.ML]. Retrieved from https://arxiv.org/abs/1907.04155
  18. [18] Gama João and Brazdil Pavel. 1995. Characterization of classification algorithms. 189–200.
  19. [19] Garciarena Unai, Santana Roberto, and Mendiburu Alexander. 2017. Evolving imputation strategies for missing data in classification problems with TPOT. arXiv:1706.01120. Retrieved from http://arxiv.org/abs/1706.01120
  20. [20] Gijsbers Pieter and Vanschoren Joaquin. 2020. GAMA: A general automated machine learning assistant. arXiv:2007.04911. Retrieved from https://arxiv.org/abs/2007.04911
  21. [21] Gondara Lovedeep and Wang Ke. 2018. MIDA: Multiple Imputation Using Denoising Autoencoders. 260–272.
  22. [22] H2O.ai. 2022. DriverlessAI. Retrieved from https://www.h2o.ai/products/h2o-driverless-ai/
  23. [23] Harris Charles R., Millman K. Jarrod, Walt Stéfan J. van der, Gommers Ralf, Virtanen Pauli, Cournapeau David, Wieser Eric, Taylor Julian, Berg Sebastian, Smith Nathaniel J., Kern Robert, Picus Matti, Hoyer Stephan, Kerkwijk Marten H. van, Brett Matthew, Haldane Allan, Río Jaime Fernández del, Wiebe Mark, Peterson Pearu, Gérard-Marchant Pierre, Sheppard Kevin, Reddy Tyler, Weckesser Warren, Abbasi Hameer, Gohlke Christoph, and Oliphant Travis E. 2020. Array programming with NumPy. Nature 585, 7825 (Sept. 2020), 357–362.
  24. [24] Hasan Md. Kamrul, Alam Md. Ashraful, Roy Shidhartho, Dutta Aishwariya, Jawad Md. Tasnim, and Das Sunanda. 2021. Missing value imputation affects the performance of machine learning: A review and analysis of the literature (2010–2021). Inf. Med. Unlock. 27 (2021), 100799.
  25. [25] Hegde Harshad, Shimpi Neel, Panny Aloksagar, Glurich Ingrid, Christie Pamela, and Acharya Amit. 2019. MICE vs PPCA: Missing data imputation in healthcare. Inf. Med. Unlock. 17 (2019), 100275.
  26. [26] Herbold Steffen. 2020. Autorank: A Python package for automated ranking of classifiers. J. of Open Source Softw. 5, 48 (2020), 2173.
  27. [27] Honaker James, King Gary, and Blackwell Matthew. 2011. Amelia II: A program for missing data. J. Stat. Softw. 45, 7 (2011), 1–47.
  28. [28] Huque Md Hamidul, Carlin John B., Simpson Julie A., and Lee Katherine J. 2018. A comparison of multiple imputation methods for missing data in longitudinal studies. BMC Med. Res. Methodol. 18, 1 (12 Dec. 2018), 168.
  29. [29] Jadhav Anil, Pramod Dhanya, and Ramanathan Krishnan. 2019. Comparison of performance of data imputation methods for numeric dataset. Appl. Artif. Intell. 33 (07 2019), 1–21.
  30. [30] Jäger Sebastian, Allhorn Arndt, and Bießmann Felix. 2021. A benchmark for data imputation methods. Front. Big Data 4 (2021).
  31. [31] Ke Jintao, Zhang Shuaichao, Yang Hai, and Chen Xiqun (Michael). 2019. PCA-based missing information imputation for real-time crash likelihood prediction under imbalanced data. Transportmetr. A: Transp. Sci. 15, 2 (2019), 872–895.
  32. [32] Kingma Diederik P. and Welling Max. 2013. Auto-encoding variational Bayes. arXiv:1312.6114. Retrieved from https://arxiv.org/abs/1312.6114
  33. [33] Kowarik Alexander and Templ Matthias. 2016. Imputation with the R package VIM. J. Stat. Softw. 74, 7 (2016), 1–16.
  34. [34] Kyureghian Gayaneh, Capps Oral, and Nayga Rodolfo M. 2011. A Missing Variable Imputation Methodology with an Empirical Application. 313–337.
  35. [35] Lall Ranjit and Robinson Thomas. 2022. The MIDAS touch: Accurate and scalable missing-data imputation with deep learning. Polit. Anal. 30, 2 (2022), 179–196.
  36. [36] Le Trang T., Fu Weixuan, and Moore Jason H. 2020. Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics 36, 1 (2020), 250–256.
  37. [37] Li Dan, Deogun Jitender, Spaulding William, and Shuart Bill. 2004. Towards missing data imputation: A study of fuzzy k-means clustering method. In Rough Sets and Current Trends in Computing, Tsumoto Shusaku, Słowiński Roman, Komorowski Jan, and Grzymała-Busse Jerzy W. (Eds.). Springer, Berlin, 573–579.
  38. [38] Li Peng, Rao Xi, Blase Jennifer, Zhang Yue, Chu Xu, and Zhang Ce. 2019. CleanML: A benchmark for joint data cleaning and machine learning [experiments and analysis]. arXiv:1904.09483. Retrieved from http://arxiv.org/abs/1904.09483
  39. [39] Li Yuebiao, Li Zhiheng, and Li Li. 2014. Missing traffic data: Comparison of imputation methods. Intell. Transp. Syst. IET 8 (02 2014), 51–57.
  40. [40] Little R. J. A. and Rubin D. B. 2002. Statistical Analysis with Missing Data. Wiley. 2002027006
  41. [41] Lu Haw-minn, Perrone Giancarlo, and Unpingco José. 2020. Multiple imputation with denoising autoencoder using metamorphic truth and imputation feedback. arXiv:2002.08338. Retrieved from https://arxiv.org/abs/2002.08338
  42. [42] Malarvizhi Ms. R. 2012. KNN classifier performs better than k-means clustering in missing value imputation. IOSR J. Comput. Eng. 6 (2012), 12–15.
  43. [43] Mamandipoor Behrooz, Majd Mahshid, Moz Monica, and Osmani Venet. 2019. Blood lactate concentration prediction in critical care patients: Handling missing values. arXiv:1910.01473. Retrieved from http://arxiv.org/abs/1910.01473
  44. [44] Mazumder Rahul, Hastie Trevor, and Tibshirani Robert. 2010. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11, 80 (2010), 2287–2322. http://jmlr.org/papers/v11/mazumder10a.html
  45. [45] McCoy John, Kroon Steve, and Auret Lidia. 2018. Variational autoencoders for missing data imputation with application to a simulated milling circuit. IFAC-PapersOnLine 51 (01 2018), 141–146.
  46. [46] Mohan Karthika and Pearl Judea. 2021. Graphical models for processing missing data. J. Am. Stat. Assoc. 116, 534 (2021), 1023–1037.
  47. [47] Musil Carol M., Warner Camille B., Yobas Piyanee Klainin, and Jones Susan L. 2002. A comparison of imputation techniques for handling missing data. West. J. Nurs. Res. 24, 7 (2002), 815–829.
  48. [48] Muzellec Boris, Josse Julie, Boyer Claire, and Cuturi Marco. 2020. Missing data imputation using optimal transport. arXiv:2002.03860 [stat.ML]. Retrieved from https://arxiv.org/abs/2002.03860
  49. [49] Neutatz Felix, Chen Binger, Alkhatib Yazan, Ye Jingwen, and Abedjan Ziawasch. 2022. Data cleaning and AutoML: Would an optimizer choose to clean? Datenb.-Spektr. 22, 2 (01 Jul 2022), 121–130.
  50. [50] Oba Shigeyuki, Sato Masa-aki, and Ishii Shin. 2003. Variational bayes method for mixture of principal component analyzers. Syst. Comput. Jpn. 34, 11 (2003), 55–66.
  51. [51] Orczyk Tomasz and Porwik Piotr. 2013. Influence of missing data imputation method on the classification accuracy of the medical data. J. Med. Inf. Technol. 22 (2013).
  52. [52] Paszke Adam, Gross Sam, Massa Francisco, Lerer Adam, Bradbury James, Chanan Gregory, Killeen Trevor, Lin Zeming, Gimelshein Natalia, Antiga Luca, Desmaison Alban, Kopf Andreas, Yang Edward, DeVito Zachary, Raison Martin, Tejani Alykhan, Chilamkurthy Sasank, Steiner Benoit, Fang Lu, Bai Junjie, and Chintala Soumith. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, Wallach H., Larochelle H., Beygelzimer A., d'Alché-Buc F., Fox E., and Garnett R. (Eds.). Curran Associates, Inc., 8024–8035.
  53. [53] Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M., and Duchesnay E. 2011. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12 (2011), 2825–2830.
  54. [54] Pereira Ricardo Cardoso, Santos Miriam Seoane, Rodrigues Pedro Pereira, and Abreu Pedro Henriques. 2020. Reviewing autoencoders for missing data imputation: Technical trends, applications and outcomes. J. Artif. Intell. Res. 69 (2020), 1255–1285.
  55. [55] Perez-Lebel Alexandre, Varoquaux Gael, Morvan Marine Le, Josse Julie, and Poline Jean-Baptiste. 2022. Benchmarking missing-values approaches for predictive models on health databases. GigaScience 11 (04 2022).
  56. [56] Petrazzini Ben Omega, Naya Hugo, Lopez-Bello Fernando, Vazquez Gustavo, and Spangenberg Lucía. 2021. Evaluation of different approaches for missing data imputation on features associated to genomic data. BioData Mining 14, 1 (03 Sep. 2021), 44.
  57. [57] Poulos Jason and Valle Rafael. 2018. Missing data imputation for supervised learning. Appl. Artif. Intell. 32, 2 (2018), 186–196.
  58. [58] Qu Li, Li Li, Zhang Yi, and Hu Jianming. 2009. PPCA-based missing data imputation for traffic flow volume: A systematical approach. IEEE Trans. Intell. Transport. Syst. 10, 3 (2009), 512–522.
  59. [59] Rekatsinas Theodoros, Chu Xu, Ilyas Ihab F., and Ré Christopher. 2017. HoloClean: Holistic data repairs with probabilistic inference. Proc. VLDB Endow. 10, 11 (Aug. 2017), 1190–1201.
  60. [60] Rivolli Adriano, Garcia Luís P. F., Soares Carlos, Vanschoren Joaquin, and Carvalho André C. P. L. F. de. 2022. Meta-features for meta-learning. Knowl.-Bas. Syst. 240 (2022), 108101.
  61. [61] Roskams-Hieter Breeshey, Wells Jude, and Wade Sara. 2022. Leveraging variational autoencoders for multiple data imputation. arXiv:2209.15321 [stat.ML]. Retrieved from https://arxiv.org/abs/2209.15321
  62. [62] Rubin Donald B. 1976. Inference and missing data. Biometrika 63, 3 (1976), 581–592.
  63. [63] Ryu Seunghyoung, Kim Minsoo, and Kim Hongseok. 2020. Denoising autoencoder-based missing value imputation for smart meters. IEEE Access PP (02 2020), 1–1.
  64. [64] Shahbazian Reza and Trubitsyna Irina. 2022. DEGAIN: Generative-adversarial-network-based missing data imputation. Information 13, 12 (2022).
  65. [65] Stacklies Wolfram, Redestig Henning, Scholz Matthias, Walther Dirk, and Selbig Joachim. 2007. pcaMethods a bioconductor package providing PCA methods for incomplete data. Bioinformatics 23, 9 (03 2007), 1164–1167.
  66. [66] Stekhoven Daniel J. and Bühlmann Peter. 2011. MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 1 (10 2011), 112–118.
  67. [67] Sterne Jonathan A. C., White Ian R., Carlin John B., Spratt Michael, Royston Patrick, Kenward Michael G., Wood Angela M., and Carpenter James R. 2009. Multiple imputation for missing data in epidemiological and clinical research: Potential and pitfalls. Br. Med. J. 338 (2009).
  68. [68] Thornton Chris, Hutter Frank, Hoos Holger H., and Leyton-Brown Kevin. 2012. Auto-WEKA: Automated selection and hyper-parameter optimization of classification algorithms. arXiv:1208.3719. Retrieved from http://arxiv.org/abs/1208.3719
  69. [69] Tibshirani Robert. 1996. Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc.: Ser. B (Methodol.) 58, 1 (1996), 267–288.
  70. [70] Tipping Michael E. and Bishop Christopher M. 1999. Probabilistic principal component analysis. J. Roy. Stat. Soc. Ser. B (Stat. Methodol.) 61, 3 (1999), 611–622. http://www.jstor.org/stable/2680726
  71. [71] Troyanskaya Olga, Cantor Michael, Sherlock Gavin, Brown Pat, Hastie Trevor, Tibshirani Robert, Botstein David, and Altman Russ B. 2001. Missing value estimation methods for DNA microarrays. Bioinformatics 17, 6 (06 2001), 520–525.
  72. [72] Tsamardinos Ioannis and Aliferis Constantin F. 2003. Towards principled feature selection: Relevancy, filters and wrappers. In Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research, Vol. R4), Bishop Christopher M. and Frey Brendan J. (Eds.). PMLR, 300–307.
  73. [73] Tsamardinos Ioannis, Charonyktakis Paulos, Papoutsoglou Georgios, Borboudakis Giorgos, Lakiotaki Kleanthi, Zenklusen Jean Claude, Juhl Hartmut, Chatzaki Ekaterini, and Lagani Vincenzo. 2022. Just add data: Automated predictive modeling for knowledge discovery and feature selection. npj Precis. Oncol. 6, 1 (16 Jun 2022), 38.
  74. [74] Tsamardinos Ioannis, Greasidou Elissavet, Tsagris Michalis, and Borboudakis Giorgos. 2017. Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation. arXiv:1708.07180. Retrieved from http://arxiv.org/abs/1708.07180
  75. [75] Tsamardinos I., Lagani V., and Pappas D. 2012. Discovering multiple, equivalent biomarker signatures. In Proceedings of the 7th Conference of the Hellenic Society for Computational Biology and Bioinformatics (HSCBB ’12). Heraklion.
  76. [76] Buuren S. van and Groothuis-Oudshoorn C. G. M. 1999. Flexible Multivariate Imputation by MICE. Vol. (PG/VGZ/99.054). TNO Prevention and Health, Leiden.
  77. [77] Buuren Stef van and Groothuis-Oudshoorn Karin. 2011. mice: Multivariate imputation by chained equations in R. J. Stat. Softw. 45, 3 (2011), 1–67.
  78. [78] Vanschoren J., Rijn J. N. van, Bischl B., and Torgo L. 2013. OpenML: Networked science in machine learning. SIGKDD Explor. 15, 2 (2013), 49–60.
  79. [79] Waljee Akbar K., Mukherjee Ashin, Singal Amit G., Zhang Yiwei, Warren Jeffrey, Balis Ulysses, Marrero Jorge, Zhu Ji, and Higgins Peter D. R. 2013. Comparison of imputation methods for missing laboratory data in medicine. BMJ Open 3, 8 (2013).
  80. [80] Waljee Akbar K., Mukherjee Ashin, Singal Amit G., Zhang Yiwei, Warren Jeffrey, Balis Ulysses, Marrero Jorge, Zhu Ji, and Higgins Peter D. R. 2013. Comparison of imputation methods for missing laboratory data in medicine. BMJ Open 3, 8 (2013).
  81. [81] Woźnica Katarzyna and Biecek Przemysław. 2020. Does imputation matter? Benchmark for predictive models. arXiv:2007.02837 [stat.ML]. Retrieved from https://arxiv.org/abs/2007.02837
  82. [82] Wu Richard, Zhang Aoqian, Ilyas Ihab, and Rekatsinas Theodoros. 2020. Attention-based learning for missing data imputation in HoloClean. In Proceedings of Machine Learning and Systems, Dhillon I., Papailiopoulos D., and Sze V. (Eds.), Vol. 2. 307–325.
  83. [83] Yoon Jinsung, Jordon James, and Schaar Mihaela van der. 2018. GAIN: Missing data imputation using generative adversarial nets. arXiv:1806.02920. Retrieved from https://arxiv.org/abs/1806.02920
  84. [84] Zhang Shichao. 2012. Nearest neighbor selection for iteratively kNN imputation. J. Syst. Softw. 85, 11 (2012), 2541–2552.
  85. [85] Zhang Xinmeng, Yan Chao, Gao Cheng, Malin Bradley A., and Chen You. 2020. Predicting missing values in medical data via xgboost regression. J. Healthc. Inf. Res. 4, 4 (01 Dec. 2020), 383–394.

Published in

ACM Transactions on Knowledge Discovery from Data, Volume 18, Issue 6
July 2024
760 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/3613684

Copyright © 2024 Copyright held by the owner/author(s).

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 12 April 2024
• Online AM: 16 February 2024
• Accepted: 19 January 2024
• Revised: 26 November 2023
• Received: 28 March 2023
