1 Introduction

A wide variety of important applications where Machine Learning (ML) models are typically applied, such as fraud detection, medical diagnosis, and oil spill detection (Japkowicz & Stephen, 2002), suffer from the problem of class imbalance. The class imbalance problem corresponds to domains where some classes are represented by a large number of instances, while others are represented by only a few. Prior research has shown that class imbalance has a negative impact on the performance of learned ML models, which tend to be overwhelmed by the large classes and to ignore the small ones. This typically happens because ML classifiers operate on data drawn from the same distribution as the training data, adopting the same data representation, and assume that maximizing accuracy is the primary goal (Chawla et al., 2004). Given the frequency of imbalanced learning problems in real applications and the issues they raise when learning ML models, research on methods for handling them has become a significant topic (Japkowicz & Stephen, 2002; Branco et al., 2016).

The approaches proposed by the research community to solve the class imbalance problem include pre-processing methods, as well as learning methods specifically designed for this problem (Japkowicz & Stephen, 2002; Chawla et al., 2004; He & Garcia, 2009; Chawla, 2010; Branco et al., 2016). Pre-processing and resampling approaches are more widely studied as they enable the subsequent adoption of any standard ML classification model. The main idea consists in transforming the original training set, making it more suitable for learning the important class(es), either by reducing the number of instances belonging to the majority classes or by augmenting the number of rare instances through synthetic data generation procedures (He & Garcia, 2009). Well-known examples are the Random Undersampling (Kubat & Matwin, 1997) and the Condensed Nearest Neighbors (Hart, 1968) procedures for data reduction, or Random Oversampling (Kubat & Matwin, 1997) and the Synthetic Minority Oversampling Technique (Chawla et al., 2002) for data augmentation. Many proposals in the literature refine such basic approaches by combining the aforementioned solutions in different fashions or by resorting to advanced Automated Machine Learning (He et al., 2021) approaches. Unfortunately, despite partially addressing the class imbalance problem, many of the most widely used pre-processing approaches suffer from issues related to the removal of majority class instances from sparse regions and to the generation of noisy/erroneous synthetic minority instances (He et al., 2008; Bellinger et al., 2019, 2021; Hassanat et al., 2022). A further limitation of the majority, if not the entirety, of the state-of-the-art approaches is that they (implicitly) exploit the number of instances belonging to a specific class to characterize differences and similarities among instances, even though features should be the ones capturing these differences/similarities.

To overcome the weaknesses of state-of-the-art approaches, we propose froid, a pre-processing framework for Features Reduction and OutlIer Detection that solves the imbalanced learning problem through unsupervised representation learning. froid handles imbalanced learning by facing the problem from a different perspective. Indeed, instead of augmenting the instances of the minority classes or reducing the instances of the majority classes, froid analyzes the relationships among the records in the dataset through unsupervised approaches. The goal of froid is to design attributes, creating an unsupervised data representation that enhances the differences between records belonging to minority and majority classes, such that a ML model can achieve outstanding performance regardless of the class imbalance.

In particular, froid exploits two families of methods to augment the expressiveness of the records in a dataset. The first family is that of Outlier Detection (OD) approaches (Chandola et al., 2009; Hodge & Austin, 2004). An unsupervised OD approach is meant to identify outliers, i.e., instances which deviate significantly from the majority of the data and do not conform to a notion of normal behavior (Chandola et al., 2009). Our intuition is that records belonging to minority classes should be considered outliers with respect to records belonging to majority classes. Therefore, unsupervised OD methods can create attributes capturing the level of outlierness of a record with respect to other records according to a certain OD criterion. A similar intuition was proposed in Zhao and Hryniewicki (2018) to solve the task of supervised OD. However, in Zhao and Hryniewicki (2018) unsupervised OD methods are used to boost a supervised OD problem; thus, it is known from the problem definition that there are outliers in the data. In our case, on the other hand, we are only making the supposition that instances belonging to the minority class can be recognized as outliers by unsupervised OD approaches. The second family of methods comprises Features Reduction (FR), also known as features projection, features extraction, or dimensionality reduction. FR methods transform the data from a high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in Principal Component Analysis (Pearson, 1901), but many nonlinear dimensionality reduction techniques also exist (Cox & Cox, 2008; Van der Maaten & Hinton, 2008; Tenenbaum et al., 2000). Similar to the reasons that led us to rely on OD approaches, our idea is that unsupervised FR techniques might unveil a data representation that better separates instances belonging to different classes. Indeed, rare instances should acquire a reduced data representation substantially different from that of the instances belonging to the regular class, which, on the other hand, should fall in denser areas of the reduced representation. In the literature, there is a limited set of methods relying on FR to address imbalanced learning problems (Naseriparsa & Kashani, 2014; Gopi et al., 2016). However, these approaches rely on a single FR method and combine it with resampling techniques. On the other hand, we adopt a large array of FR approaches, and we do not augment the number of records in the dataset. Also, froid subsequently combines the outcomes of OD and FR approaches through several workflows to create increasingly expressive features that separate records of different classes for the classification task.

We experimented with froid on 64 benchmarking datasets and 2 case studies by training 5 different ML models after pre-processing the data with froid. First, we observed which type of classifier benefits more from the pre-processed data returned by froid. Second, we performed an ablation study of froid showing the impact of every set of features extracted by the various OD and FR approaches. Third, we compared froid with state-of-the-art techniques for imbalanced learning. The results show that (i) on average LightGBM (Ke et al., 2017) is the best classifier exploiting the unsupervised representation returned by froid, (ii) the more features are extracted through froid, the higher the performance of the classifier, and (iii) froid outperforms all the state-of-the-art approaches, at the cost of a non-negligible running time required to extract all the features. Finally, we highlight that, besides imbalanced learning, froid also succeeds in the supervised outlier detection task.

The rest of the paper is organized as follows. Section 2 reviews related work on imbalanced learning and supervised outlier detection. In Sect. 3 we illustrate our proposal to solve imbalanced learning through outlier detection and features projection approaches. Section 4 reports the experimental results on benchmarking datasets as well as on two case studies. Finally, Sect. 5 summarizes our contributions and discusses open research directions.

2 Related works

A large array of approaches has been proposed to address imbalanced learning. In the last twenty years, several surveys and literature reviews have categorized and discussed the peculiarities and characteristics of the various approaches (Japkowicz & Stephen, 2002; Chawla et al., 2004; Su & Tsai, 2011; He & Garcia, 2009; Chawla, 2010; Branco et al., 2016). Recently, most of these approaches have been implemented and are freely available in Python open-source libraries like imblearn (Lemaitre et al., 2017). The two principal strategies to recover from the challenges raised by imbalanced learning contexts consist in modifying the data used to train the classifier, or in altering the classification algorithm itself to account for the misclassification costs of the different classes during the learning process. Random undersampling and oversampling (Kubat & Matwin, 1997) are the two most classic approaches to handling imbalance. It is well known that they suffer from the risk of discarding informative instances and of overfitting the minority instances, respectively. Refinements of undersampling techniques, like the Condensed Nearest Neighbors (CNN) (Hart, 1968) or the Edited Nearest Neighbors (ENN) (Wilson, 1972), brought slight improvements to such issues but no effective solutions. Nowadays, the Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al., 2002) is probably the most widely used, exploited, and extended oversampling approach. For instance, ADASYN (He et al., 2008) is very similar to SMOTE, but it generates a different number of samples depending on an estimate of the local distribution of the class to be oversampled. SVMSMOTE (Nguyen et al., 2011) exploits an SVM algorithm to detect the samples to use for generating the synthetic instances that oversample the minority class.

Besides improving the procedure of these first resampling approaches, one of the most pursued research directions consists in combining them with other Data Mining or Machine Learning approaches, such as clustering algorithms or simple classification models. For instance, the ClustFirstClass undersampling approach (Sobhani et al., 2014) tries to overcome the problem of discarding informative instances by first running k-Means clustering (Tan, 2005) on the majority class and then keeping at least one instance from each cluster. In Sundarkumar and Ravi (2015), the majority class outliers are removed with Reverse k-Nearest Neighborhood (RkNN) (Achtert et al., 2006); then, the selection of support vectors using the One-Class Support Vector Machine (OCSVM) (Schölkopf et al., 1999) is used to undersample the majority class. In Sanguanmak and Hanskunatai (2016), the DBSM approach for simultaneous undersampling and oversampling is presented. The oversampling is performed with SMOTE, while the undersampling is realized by selecting half of the data present in the clusters returned by the DBSCAN method (Ester et al., 1996). A similar solution is presented in Branco et al. (2018), with the idea of biasing the strategies to reinforce some regions of the datasets instead of sampling uniformly. Such biases are applied through random undersampling and SMOTE. In Douzas et al. (2018), k-SMOTE is presented, a refinement of SMOTE that exploits k-Means to avoid the generation of noisy synthetic minority instances erroneously close to a dense area of records belonging to the majority class. SMOTEFUNA is a further refinement of SMOTE presented in Tarawneh et al. (2020). SMOTEFUNA generates synthetic records between a randomly selected instance and its furthest neighbor of the minority class which does not have a nearest neighbor from the majority class. Instances of the majority class are also considered in Koziarski and Wozniak (2017); Koziarski et al. (2021). Indeed, in Koziarski and Wozniak (2017) the CCR algorithm is proposed, which Combines Cleaning of the decision border around minority objects with guided synthetic Resampling to re-balance the dataset. In Koziarski et al. (2019), the Radial-Based Oversampling method (RBO) is proposed, which discovers regions in which the synthetic objects from the minority class should be generated using radial basis functions. An extension of CCR and RBO is presented in Koziarski et al. (2021): Radial-Based CCR exploits the class potential to locate sub-regions of the data-space for synthetic oversampling and adopts radial basis functions. The ClUstered REsampling (CURE) method (Bellinger et al., 2019) uses hierarchical clustering and a newly defined distance measure to guide the resampling procedure. Such clusters take into account the structure of the data. This aspect enables CURE to avoid the generation of synthetic instances in “wrong” regions and allows the undersampling of non-borderline regions of the majority class. In Bellinger et al. (2021), ReMix is proposed, a pre-processing approach that leverages batch resampling and instance mixing to enable the induction of robust deep models from imbalanced and long-tailed datasets by expanding the minority class to reduce predictive bias. The objective of ReMix is not only to improve the predictive performance but also to increase model calibration. In Sharma et al. (2018), SWIM (Sampling WIth the Majority) is presented as a method for synthetic oversampling that exploits the information inherent in the majority class to synthesize minority class records. In a certain sense, we use an idea similar to SWIM because the objective of froid is to enhance the discriminative aspects between minority and majority records by looking at both of them and not only at minority records. Finally, the most widely studied case study for imbalanced learning is fraud detection (Padmaja et al., 2007; Makki et al., 2019; Tran et al., 2021, 2021; Esenogho et al., 2022). All these works, besides experimenting with existing techniques, propose further refinements and extensions, all focusing on data resampling.

From the analysis of these approaches, we noticed the following aspects. First, they all focus only on fixing the issues of previous proposals. Second, the usage of additional mining or learning techniques brings some benefits but also problems related to hyper-parameter tuning. Third, and most importantly, all these methods account only for the “number of instances” dimension of a dataset and leave unaltered the features used to represent the records. In this paper, we define an approach that modifies the data used to train the classifier, therefore falling into this aforementioned category. However, we focus on the features used instead of the instances that should be present in the training set, and we avoid any hyper-parameter tuning.

As stated at the beginning of this section, the second and less followed line of research consists in making crucial changes to the classification algorithm in order to account for imbalanced class scenarios. In Akbani et al. (2004), an upgrade of SVM is proposed, based on a variant of SMOTE combined with the error costs presented in Veropoulos et al. (1999), which penalize misclassifications of minority instances more heavily than misclassifications of majority instances. The DataBoost-IM method (Guo & Viktor, 2004) identifies instances which are hard to classify through boosting approaches. Then it trains an ensemble-based boosting algorithm, generating synthetic instances with information biased toward the hard instances on which the next classifier in the boosting procedure needs to focus. In Wang et al. (2014), an instance-weighted variant of the SVM with both 1-norm and 2-norm formats is presented to deal with imbalanced learning. None of these approaches accounts for the features describing the records either.

Conversely, the following approaches also account for this delicate aspect. In Naseriparsa and Kashani (2014), the Principal Component Analysis (PCA) (Tan, 2005) feature projection approach is combined with SMOTE in a case study on a single dataset. The procedure first applies PCA to the dataset, then SMOTE to each of the minority features, before applying the classification algorithm. In Gopi et al. (2016), a Support Vector Machine-Recursive Feature Elimination (SVM-RFE) wrapper for feature selection is defined. The Automated Imbalanced Classification (ATOMIC) method presented in Moniz and Cerqueira (2021) is an Automated Machine Learning (AutoML (He et al., 2021)) approach for imbalanced classification that extends the features describing the data with additional statistical features. In Ksieniewicz (2019), an ensemble of classifiers is trained on datasets obtained as random subspaces and augmented through SMOTE. In Korycki and Krawczyk (2021), a similar methodology is applied to find the most discriminative low-dimensional representation instead of a random one. Since all these approaches do not increase the number of features and yet improve the performance of “traditional” approaches for imbalanced learning, our idea is to follow this intuition but to augment the features used to represent the data in order to maximize the difference between instances belonging to different classes.

Another research field related to our proposal is outlier detection (Hodge & Austin, 2004; Chandola et al., 2009). Indeed, our intuition is that instances belonging to the minority class can be considered to some extent as outliers w.r.t. instances belonging to the majority class. In supervised outlier detection, a predictive model is trained on a dataset that has labeled instances for normal as well as anomaly classes. Thus, our intuition is in line with the XGBOD method presented in Zhao and Hryniewicki (2018). Indeed, XGBOD uses multiple unsupervised outlier detection algorithms to extract an alternative representation of the instances that augments the predictive capabilities of XGBOOST (Chen & Guestrin, 2016) in solving supervised outlier detection. The same intuition is followed by the geodesic-based outlier detection method presented in Shi et al. (2020), which considers the Global Disconnectivity score and Local real Degree (GDLD) as measures of outlierness. Indeed, GDLD is evaluated in the imbalanced learning setting considering records belonging to the minority class as outliers. In Shimauchi (2021), a semi-supervised outlier detection algorithm is presented that extends XGBOD through the augmentation of the representations with a Generative Adversarial Network (GAN) (Goodfellow et al., 2014). In Fernández et al. (2022), a framework for supervised outlier detection is proposed, formed by a pipeline with an unsupervised outlier detection method followed by a supervised predictive model used to tune the hyper-parameters of the unsupervised outlier detection algorithm. It is also worth mentioning the approach illustrated in Loureiro et al. (2004), which applies, in a case study, an unsupervised outlier detection method based on hierarchical clustering. From our perspective, the interesting aspect of Loureiro et al. (2004) is that it employs an unsupervised clustering-based strategy, similarly to the works previously discussed, to solve imbalanced learning. Thus, it supports our intuition that solving imbalanced learning through approaches used for outlier detection is a viable path. ODBOT, an alternative to XGBOD, i.e., an outlier detection-based oversampling technique for imbalanced dataset learning, is presented in Ibrahim (2021). ODBOT handles multi-class imbalance by finding clusters within the minority class(es) and then generating synthetic samples by considering the outliers detected in these clusters. Finally, we underline how Hassanat et al. (2022) shows that oversampling methods are not reliable. Indeed, the authors of Hassanat et al. (2022) report an experimentation on more than 70 oversampling methods revealing that the studied methods generate minority samples that are most likely to be majority ones. Hence, oversampling methodologies are quite likely to be unreliable in imbalanced learning settings and should be avoided in real-world applications.

While works like (Shi et al., 2020; Ibrahim, 2021; Ksieniewicz, 2019; Korycki & Krawczyk, 2021) already introduced the idea of boosting imbalanced learning methods through outlier detection or feature reduction, we stress the fact that, to the best of our knowledge, our proposal is the first in which they are used simultaneously and in subsequent iterations. Furthermore, our proposal departs from Shi et al. (2020); Ibrahim (2021); Ksieniewicz (2019); Korycki and Krawczyk (2021) because they always augment the number of instances, while froid leaves it unaltered and plays only with different data representations in order to leverage the discriminative power of ML models.

Fig. 1 Utility of representing a dataset through OD and FR scores. Top left: synthetically generated imbalanced dataset with two dimensions and two classes. Top right: decision boundary learned by a Decision Tree trained on the imbalanced dataset. Bottom left: synthetic dataset represented through LOF and PCA. Bottom right: decision boundary of a Decision Tree trained on the unsupervised alternative representation (Color figure online)

3 Methodology

In this paper we present froid, a Features Reduction and OutlIer Detection pre-processing framework for solving imbalanced learning through unsupervised representation learning. froid is a pre-processing framework that takes as input a dataset X and returns a transformed version of it to be used as training for a ML classification algorithm. The main idea of froid is to represent the instances in X through alternative representations aimed at fostering the differences among instances belonging to different classes. Thus, froid relies on Outlier Detection (OD) and on Features Reduction (FR) approaches.

In Fig. 1 we illustrate an example that visualizes our intuition of representing the imbalanced input data through unsupervised OD and FR techniques. The first plot (top left) depicts a synthetically generated imbalanced dataset with two dimensions and two classes with frequencies .95 and .05, respectively. The second plot (top right) illustrates the decision boundary learned by a Decision Tree classifier (Tan, 2005) trained on the imbalanced dataset. We immediately notice how most of the instances belonging to the minority class and located in the range \(X_0 \in [-1,2]\) and \(X_1 \in [-1, 1]\), highlighted by the yellow rectangle, are wrongly classified as majority instances by the Decision Tree. The third plot (bottom left) represents the synthetic dataset using as features the Local Outlier Factor (Breunig et al., 2000) score (LOF), an unsupervised OD approach, and the first Principal Component returned by the Principal Component Analysis (Pearson, 1901) (PCA) approach, which is a FR method. We notice how the instances of the minority class are now displaced along two parallel horizontal directions. The fourth plot (bottom right) shows the decision boundary of a Decision Tree trained on this novel representation with the same parameter setting as the previous one. The rare instances inside the yellow rectangle in the second plot are represented with yellow squares in the fourth plot and are located approximately in the ranges \(LOF \in [-.8,-.4], PCA \in [1.0, 1.5]\) and \(LOF \in [-.8,-.46], PCA \in [-2.5, -.5]\). Among these rare instances, we notice that only three are not covered by decision rules labelling instances as minority class, i.e., green decision boundary areas, and are therefore misclassified as majority class instances. Hence, by representing a two-dimensional dataset through an OD score and an FR dimension, we have improved the performance of an ML model, passing from an F1 measure of .60 to an F1 measure of .64. This simple example simultaneously explains and proves the intuition behind the proposed idea.
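To make this concrete, the following minimal sketch reproduces the intuition with scikit-learn. The synthetic dataset, the tree depth, and all parameter values are illustrative assumptions, not the exact setup used to produce Fig. 1:

```python
# Minimal sketch of the Fig. 1 intuition (illustrative, not the paper's code):
# re-represent a 2D imbalanced dataset through an OD score (LOF) and an FR
# dimension (first principal component), then retrain the same classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: Decision Tree on the original two features.
dt = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
print("original F1:", f1_score(y_te, dt.predict(X_te)))

# Alternative representation: unsupervised LOF score + first principal component.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_tr)
pca = PCA(n_components=1).fit(X_tr)
Z_tr = np.column_stack([lof.negative_outlier_factor_, pca.transform(X_tr)[:, 0]])
Z_te = np.column_stack([lof.score_samples(X_te), pca.transform(X_te)[:, 0]])

dt2 = DecisionTreeClassifier(max_depth=4, random_state=0).fit(Z_tr, y_tr)
print("LOF+PCA F1:", f1_score(y_te, dt2.predict(Z_te)))
```

Note that both trees share the same hyperparameters; only the representation changes, which is exactly the lever froid operates on.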

Fig. 2 Illustration of the froid framework for features extraction. The input dataset X is passed through a set of unsupervised outlier detection functions \(\theta _i\) (in blue) and through a set of unsupervised features reduction functions \(\rho _i\) (in yellow), originating the datasets \(X_o\) and \(X_r\), respectively. Such datasets are passed again through the functions \(\theta _i\) and \(\rho _i\), originating the datasets \(X_{oo}, X_{or}, X_{rr}, X_{ro}\), respectively. Finally, all the datasets are combined, and the best features are selected from the combination with \(\zeta\) (Color figure online)

3.1 Imbalanced learning pre-processing framework

In this section, we define the Features Reduction and OutlIer Detection pre-processing framework froid to solve imbalanced learning through unsupervised representation learning.

Problem setting The froid framework is depicted in Fig. 2 and highlighted with the dashed box. froid works as detailed in the following. Let \(X \in \mathcal {R}^{n \times m}\) denote the original input dataset as a set of n instances described by m features. Each record \(x_i \in X\) has attached a label \(y_i \in \{0, \dots , l-1\}\) indicating the class of the record, where l is the number of classes. However, since froid is an unsupervised representation learning framework, the class y is not used. On the other hand, froid makes use of:

  • A set of u Outlier Detection (OD) functions \(\Theta = \{\theta _1, \dots , \theta _u\}\)

  • A set of v Feature Reduction (FR) functions \(P = \{\rho _1, \dots , \rho _v\}\)

  • A features selection function \(\zeta\)

Outlier detection transformation We define an OD function \(\theta _j\) as a mapping function whose output is a real-valued vector \(\theta _j(X) \in \mathcal {R}^{n \times 1}\) that describes the degree of outlierness of each instance \(x_i \in X\). Outliers are instances that deviate significantly from the majority of the data and do not conform to a notion of normal behavior (Chandola et al., 2009). We include in this representation the cases in which the OD function \(\theta _j\) returns a binary-valued vector \(\theta _j(X) \in \{0, 1\}^{n \times 1}\) indicating whether the \(i^{th}\) instance is an outlier or not. We indicate with \(X_o \in \mathcal {R}^{n \times u}\) the result of the application of the u OD functions on X, i.e., \(X_o = [\theta _1(X), \dots , \theta _u(X)]\). Details of the OD approaches implementing the adopted OD functions are provided in the next section.
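As a hypothetical illustration of this transformation (the actual implementations used are listed in Table 2), the columns of \(X_o\) can be assembled with the pyod library adopted in our experiments; the detector choices and the random data below are placeholders:

```python
# Hypothetical sketch of the OD transformation: every theta_j contributes the
# real-valued outlierness (decision_scores_) and, optionally, the binary
# outlier flag (labels_) that pyod detectors expose after fitting.
import numpy as np
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.lof import LOF

def od_transform(X, detectors):
    cols = []
    for det in detectors:
        det.fit(X)
        cols.append(det.decision_scores_)  # degree of outlierness per record
        cols.append(det.labels_)           # 1 = outlier, 0 = inlier
    return np.column_stack(cols)           # X_o: one or two columns per theta_j

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
X_o = od_transform(X, [KNN(n_neighbors=5), LOF(n_neighbors=20), IForest()])
```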

Features reduction transformation We define an FR function \(\rho _j\) as a mapping function where the output \(\rho _j(X) \in \mathcal {R}^{n \times p}\) describes the representation of each instance \(x_i \in X\) into a p-dimensional space. FR methods transform the data from a high-dimensional to a low-dimensional space. The lower dimensionality aims to capture salient aspects of the higher dimensionality. We indicate with \(X_r \in \mathcal {R}^{n \times (v p)}\) the result of the application of the v FR functions on X, i.e., \(X_r = [\rho _1(X), \dots , \rho _v(X)]\). Details of the FR methods implementing the functions adopted are in the next section.
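A matching sketch of the FR transformation with scikit-learn reducers, assuming \(p=2\) output dimensions as in our experiments; the reducer list and the random data are again placeholders:

```python
# Hypothetical sketch of the FR transformation: each rho_j projects X into
# p = 2 dimensions; stacking the projections side by side yields X_r (n x 2v).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import (MDS, TSNE, Isomap, LocallyLinearEmbedding,
                              SpectralEmbedding)

def fr_transform(X, reducers):
    return np.hstack([r.fit_transform(X) for r in reducers])

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
X_r = fr_transform(X, [PCA(n_components=2), MDS(n_components=2),
                       Isomap(n_components=2),
                       LocallyLinearEmbedding(n_components=2),
                       SpectralEmbedding(n_components=2),
                       TSNE(n_components=2)])
```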

Features selection transformation We define a features selection function \(\zeta\) as a mapping function \(\zeta (X): \mathcal {R}^{n \times c} \rightarrow \mathcal {R}^{n \times k}\) that reduces the dimensionality of a given input \(X \in \mathcal {R}^{n \times c}\) to an output \(X' \in \mathcal {R}^{n \times k}\) with \(k < c\), i.e., \(\zeta\) removes some of the c columns from the input X, yielding \(X' = \zeta (X)\).

Workflow description Given the input dataset X, the OD functions \(\Theta\), the FR functions P, and the features selection function \(\zeta\), we can recognize three phases in the pre-processing performed by froid.

In the first phase, the OD and FR functions \(\Theta\) and P are applied to X obtaining \(X_o\) and \(X_r\), respectively. In the second phase, the OD and FR functions \(\Theta\) and P are applied subsequently on \(X_o\) and \(X_r\) obtaining the following representations:

  • \(X_{oo} = [\theta _1(X_o), \dots , \theta _u(X_o)]\)

  • \(X_{ro} = [\rho _1(X_o), \dots , \rho _v(X_o)]\)

  • \(X_{rr} = [\rho _1(X_r), \dots , \rho _v(X_r)]\)

  • \(X_{or} = [\theta _1(X_r), \dots , \theta _u(X_r)]\)

In the third phase, all the data representations obtained, together with the input dataset X, are concatenated and passed to the features selection operator, resulting in \(X_a = \zeta ([X, X_o, X_r, X_{oo}, X_{ro}, X_{rr}, X_{or}])\). We highlight that froid never augments the number of records in the dataset along the various phases and data transformations, i.e., \(|X |= |X_a |\), differently from all the other state-of-the-art approaches for the class imbalance problem. On the other hand, the result of the unsupervised pre-processing \(X_a \in \mathcal {R}^{n \times m'}\) is a data representation with \(m'\) features, a number unknown a priori because it strictly depends on the data transformation functions \(\Theta , P, \zeta\) employed. In the end, any ML model can be trained on \(X_a, y\). We underline that by selecting heterogeneous sets of OD methods \(\Theta\) and FR methods P, we can guarantee that every record in X will be represented w.r.t. different data-driven criteria. In addition, the risk of creating correlated features is effectively minimized, if not entirely eliminated, through (i) the utilization of a feature selection function, denoted as \(\zeta\), and (ii) the adoption of tree-based approaches as final classifiers. Thus, through this approach, we are able to guarantee that the representation of each record in \(X_a\) is diverse and distinct, enabling ML models to capture various aspects and characteristics of the data to amplify the separation between records of majority and minority classes.
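A compact, hypothetical sketch of the three phases follows; here od_transform and fr_transform stand for single-argument wrappers of the helpers sketched above (e.g., fixing the detector and reducer lists), and zeta for any feature selection function:

```python
import numpy as np

def froid(X, od_transform, fr_transform, zeta):
    # Phase 1: OD and FR on the original data.
    X_o, X_r = od_transform(X), fr_transform(X)
    # Phase 2: OD and FR applied again on the phase-1 representations.
    X_oo, X_ro = od_transform(X_o), fr_transform(X_o)
    X_or, X_rr = od_transform(X_r), fr_transform(X_r)
    # Phase 3: concatenate all representations and select the best features.
    X_a = zeta(np.hstack([X, X_o, X_r, X_oo, X_ro, X_rr, X_or]))
    assert len(X_a) == len(X)  # froid never changes the number of records
    return X_a
```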

In the rest of this section, we illustrate the OD and FR methods we considered to implement the functions of the froid pre-processing framework.

3.2 Outlier detection methods

We consider a large set of OD methods to implement the OD functions \(\Theta\). In particular, we rely on \(u{=}14\) OD methods based on different ideas and strategies to assign an outlier score/label to a given instance. In the following, we briefly describe the selected OD methods, which are highlighted in bold.

A large family of OD methods relies on the notion of locality, i.e., the outlier score is assigned by comparing a record with its neighbors with respect to a distance function and a neighborhood size k. kNN, k-Nearest Neighbor (Tan, 2005), is a supervised ML algorithm frequently used for classification problems. However, it can also be used as an OD method by returning, as the outlier score of a record, the largest distance from the instances in its kNN. LOF, Local Outlier Factor (Breunig et al., 2000), assigns an outlier score by comparing the local density of a record with the local densities of its kNN. If a record lies in an area with a density substantially lower than that of its neighbors, then it is considered an outlier. LoOP, Local Outlier Probability (Kriegel et al., 2009), is a local density-based OD method that extends LOF by measuring the local deviation of the density of a given instance with respect to its neighbors as LOF scores. It can work directly on the input data or on the result of a clustering algorithm by relating the outlier score calculation to the distances from the clusters’ centroids. COF, Connectivity-Based Outlier Factor (Pokrajac et al., 2008), overcomes some limitations of LOF by calculating the outlier score of a record as its degree of connectivity. COF differs from LOF as it uses the chaining distance to calculate the kNN. The chaining distances are the minimum of the total sum of the distances linking all neighbors. The connectivity is then calculated as the ratio between the average chaining distance of the record and the mean average chaining distance of the records in the kNN. An additional possibility we explored for implementing the OD functions is to employ clustering approaches that highlight instances not belonging to any cluster as outliers (Khan et al., 2014). CBLOF, Cluster-Based Local Outlier Factor (He et al., 2003), takes as input both the dataset and a clustering algorithm and labels each cluster as “small” or “large” with respect to two parameters \(\alpha\) and \(\beta\). The outlier score of a certain instance is then calculated w.r.t. the size of the cluster the point belongs to and the distance to the nearest “large” cluster.

Another family of OD approaches exploits global statistical tests and global models to discover anomalous behaviors. The Elliptical Envelope (Rousseeuw & van Driessen, 1999) algorithm (EllEnv) creates a global elliptical area that surrounds input data. Values that fall inside the envelope are considered normal data, and anything outside is considered an outlier. OCSVM, One-Class SVM (Schölkopf et al., 1999) is a variation of Support Vector Machines (SVM) (Tan, 2005) that can be used in an unsupervised setting for OD. The idea of OCSVM is to find a function that is positive for regions with a high density of points, and negative for small densities, considering the records that fall into negative regions of the hyperplane as outliers. MCD, Minimum Covariance Determinant (Hubert & Debruyne, 2010) is commonly applied on Gaussian-distributed data. MCD fits a minimum covariance determinant model (Hubert et al., 2018) and computes the outlier score through the Mahalanobis distance calculation. HBOS, Histogram-Based Outlier Detection (Goldstein & Dengel, 2012) assumes feature independence and calculates the outlier scores by building histograms. COPOD, Copula-Based Outlier Detection method (Li et al., 2020), instead, creates an empirical copula and uses it to predict each record’s tail probabilities to determine its outlier score.

An efficient and effective OD approach consists of using an ensemble of “weak” OD methods. Feature Bagging (FeaBag) (Lazarevic & Kumar, 2005) exploits a set of OD methods, each of them applied on a random set of features selected from the original feature space. Each OD method identifies different outliers and assigns to all instances outlier scores that correspond to their probability of being outliers. The combination of such scores is returned as the final output. Isolation Forest (Liu et al., 2008) (IsoFor) is one of the most famous OD approaches. IsoFor isolates instances by randomly selecting a feature and then randomly selecting a split value in the range of the feature. This process is represented through a tree where the number of splits required to isolate a record equals the path length from the root node to the leaf. Hence, an instance is considered an outlier when a forest collectively produces shorter path lengths for that instance. An extension of HBOS is LODA, Lightweight On-line Detector of Anomalies (Pevný, 2016). LODA approximates the joint probability using a collection of one-dimensional histograms, where every one-dimensional histogram is efficiently constructed on an input space projected onto a randomly generated vector. Even though one-dimensional histograms are weak OD methods, their collection yields a strong OD approach. SUOD, Scalable Unsupervised Outlier Detection (Zhao et al., 2020), is another OD ensemble method. Given in input a dataset and a set of unsupervised OD methods, SUOD randomly projects the original input onto lower-dimensional spaces and speeds up the training through balanced parallel scheduling to assign averaged outlier scores.
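The path-length intuition behind IsoFor can be sketched with scikit-learn (regardless of the specific implementation selected in Table 2); the toy data, with five records shifted away from the bulk, are an illustrative assumption:

```python
# Minimal IsoFor sketch: outliers get shorter average path lengths in the
# random trees, hence lower score_samples values.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(6, 1, (5, 2))])
iso = IsolationForest(random_state=0).fit(X)
scores = iso.score_samples(X)
print(np.argsort(scores)[:5])  # indices of the five most anomalous records
```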

3.3 Features reduction methods

We consider a wide set of FR methods to implement the FR functions P. We rely on \(v=8\) FR methods described in the following and highlighted in bold.

Most of the approaches in the literature are based on the idea of finding novel directions along which the original data should be projected. The various techniques differ in how these directions are built or derived. PCA, Principal Component Analysis (Pearson, 1901; Hasan & Abdulazeez, 2021), is the process of computing the principal components and using them to perform a feature projection of the data along these principal directions. Indeed, a principal component is the direction of the line that best fits the data while being orthogonal to the previous component. Principal components, therefore, are the derived variables formed as a linear combination of the original variables that explains the most variance. MDS, MultiDimensional Scaling (Cox & Cox, 2008), is a process that translates the records of a given high-dimensional dataset into a low-dimensional representation with respect to the pairwise distances observed in the original space: instances which are close in the original space should be close also in the reduced space, and vice-versa. IsoMap, Isometric Features Mapping (Tenenbaum et al., 2000), is a nonlinear dimensionality reduction method. IsoMap estimates the intrinsic geometry of a data manifold by estimating the geodesic distance between all pairs of instances on a weighted graph built with respect to the nearest neighbors identified through a fixed radius. The top eigenvectors of the geodesic distance matrix represent the coordinates in the new reduced space. LLE, Locally Linear Embedding (Roweis & Saul, 2000), is similar to IsoMap, but instead of using the geodesic distance, it uses a distance based on the ability to reconstruct a record with respect to its neighbors. A well-known issue of LLE is the regularization problem. A way to address it is to use different variants of LLE, for instance the Modified LLE (Zhang & Wang, 2006) or the HLLE, Hessian eigenmapping LLE (Donoho & Grimes, 2003). SpectEmb, Spectral Embedding (Bengio et al., 2006), is exploited for non-linear dimensionality reduction using a spectral decomposition of the graph modeling the dataset. Although SpectEmb is similar to IsoMap and LLE, it differs in how the weights are calculated, and it adopts the eigenvectors returned from a Laplacian Matrix as the reduced dimensionality. Finally, we employed t-SNE, t-distributed Stochastic Neighbor Embedding (Van der Maaten & Hinton, 2008), a form of MDS that, besides preserving the distances, also aims at preserving the neighborhoods of the instances by modeling the distances as probability distributions of belonging to a certain neighborhood.

4 Experiments

We report here the experiments carried out to validate froid. First, we illustrate the experimental setting with the datasets used, the classifiers adopted, the implementations and parameters employed, the competitors analyzed, and the evaluation measures tested. Second, we show which ML classifier performs best across the various datasets and the improvement of froid w.r.t. training the models on the original data. Third, we report an ablation study of the unsupervised features adopted by froid. Fourth, we compare froid with state-of-the-art solutions. Fifth, we prove that the pre-processing of froid is beneficial also for supervised outlier detection. Finally, we discuss the most important features adopted by froid in two real case studies.

Table 1 Dataset description aggregated through K-Means clustering: \(n_{ train }\) number of training instances; \(n_{ test }\) number of test instances; m number of features; \(m_{ num }\) number of numerical features; \(p^+_{ train }\) positive rate in the training set; \(p^+_{ test }\) positive rate in the test set; \(FDR\) Maximum Fisher’s Discriminant Ratio; \(FBP\) Fraction of Borderline Points; \(ECP\) Entropy of Class Proportions; \(IR\) Imbalance Ratio

4.1 Experimental setting

In this section, we illustrate the experimental setting with the datasets used, the classifiers adopted, the implementations and parameters employed, the competitors analyzed, and the evaluation measures tested.

4.1.1 Datasets and machine learning classifiers

We ran experiments on a selection of 64 binary classification datasets widely referenced and used for imbalanced learning experiments, publicly available from the UCI, Kaggle, ODDS, KEEL and imblearn repositories and a fraud detection challenge.Footnote 1 For each dataset, the following pre-processing is applied. First, we remove records with null values without replacement, so as not to compromise the originality of the data. Next, we eliminate columns with poor explanatory potential, such as IDs, names, etc. Categorical columns are encoded through one-hot encoding to preserve the semantic meaning of the variables for usage with OD and FR methods based on distances or vectors, and also for correctness w.r.t. the ML models.Footnote 2 Dataset descriptions after this pre-processing, as well as some data complexity measures (Sotoca et al., 2005; Cano, 2013), are available in Table 11 in the Appendix.Footnote 3 We summarize the information contained in Table 11 by running K-Means with \(k=4\) to group the different types of datasets analyzed and provide a brief description of the datasets. Indeed, in Table 1, we report a summary of the datasets through the centroids of the four clusters. We observe that the majority of the datasets are “small-sized” and with the lowest \(FDR\) (cluster A). In contrast, the other larger datasets are further separated either w.r.t. the dimensionality or w.r.t. \(FBP\).

Before training the ML models or running the imbalanced learning pre-processing solutions, we applied to the datasets the Robust Scaler, which normalizes the features using statistics that are robust to outliers.Footnote 4 The Robust Scaler removes the median and scales the data with respect to the Interquartile Range (IQR), i.e., the difference between the \(3^{rd}\) quartile (\(75^{th}\) quantile) and the \(1^{st}\) quartile (\(25^{th}\) quantile). This choice favors a better discrimination among instances belonging to minority or majority classes. Indeed, the Robust Scaler normalizes values, and those far away from the median value and outside the IQR will get values markedly greater/smaller than zero. If a dataset still has to be partitioned into training and test sets, we split it using a stratified hold-out partitioning based on the target class, with 70% of the data used for training and 30% for testing. Otherwise, we keep the original train-test partitioning. To guarantee a statistically valid evaluation, as proposed in Rajkomar et al. (2018), we bootstrapped each test set 100 times, and we report in the manuscript the mean values obtained by the various classifiers over these runs.
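A brief sketch of this preparation step, assuming scikit-learn defaults and synthetic data in place of a real dataset:

```python
# Robust scaling (median / IQR computed on the training split only) followed
# by a stratified 70/30 hold-out partition, as described above.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (rng.random(500) < 0.05).astype(int)  # ~5% minority class

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7,
                                          stratify=y, random_state=0)
scaler = RobustScaler().fit(X_tr)  # centers on the median, scales by the IQR
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
```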

As ML classifiers, due to the proven empirical superiority of ensemble models (Breiman, 2001; Shwartz-Ziv & Armon, 2022), we decided to experiment with Decision Tree (Breiman et al., 1984) (DT), Random Forest (Breiman, 2001) (RF), XGBoost (Chen & Guestrin, 2016), LightGBM (Ke et al., 2017), and CatBoost (Prokhorenkova et al., 2018), as implemented by the sklearn, xgboost, catboost, and lightgbm Python libraries.Footnote 5 If not differently specified, we adopted the default parameter setting proposed by the various libraries to assess to what extent different pre-processing techniques are more effective for solving imbalanced learning with the same hyperparameter values.

Table 2 Mapping between the OD and FR methods used by froid and the selected implementations with the parameters varied: sklrn stands for sklearn, pynml for PyNomaly

4.1.2 Experimental details

We implemented froid in PythonFootnote 6 by relying on the following libraries. The core of the algorithm is realized following the scikit-learn style and adopting the notion of pipeline such that every OD or FR method can be subsequently enabled or disabled. For the OD methods we relied on the implementations of the Python libraries sklearn, pyod (Zhao et al., 2019), and PyNomaly,Footnote 7 while for FR methods on the implementations offered by the Python library sklearn. The mapping between OD and FR methods and the selected implementations with the parameters varied is reported in Table 2. For instance, the LoOP method is implemented with the pynomaly library and used with the parameter number of neighbors \(k \in [1, 5, 10, 20]\). Among the selected implementations of OD functions \(\Theta\), given an instance \(x_i\), all of them can be used both for returning a binary value indicating if \(x_i\) is an outlier or not and for returning the degree of outlierness of \(x_i\). Hence, the number of features extracted by froid through OD methods is theoretically 2u where \(u = |\Theta |\). However, in practice, such a number is higher than 2u and depends on the parameter combinations used for each OD method.Footnote 8 On the other hand, with all the FR methods we projected the input data into \(p=2\) dimensions.Footnote 9 Finally, if not differently specified, the features selection function \(\zeta\) is implemented with the sklearn libraryFootnote 10 that trains a LightGBM model on the dataset and selects only the features having importance higher than the average. As an alternative, we also implement \(\zeta\) with the variance threshold functionFootnote 11 that removes from the input dataset all low-variance features below a certain threshold.Footnote 12
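A hedged sketch of the default \(\zeta\), assuming sklearn's SelectFromModel wrapped around a LightGBM model with the "mean" importance threshold (matching the above-the-average rule described above), together with the variance-threshold alternative; the random data stand in for the concatenated representation of Sect. 3:

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.feature_selection import SelectFromModel, VarianceThreshold

rng = np.random.default_rng(0)
X_concat = rng.normal(size=(400, 50))   # placeholder for [X, X_o, X_r, ...]
y = (rng.random(400) < 0.1).astype(int)

# Default zeta: keep features whose LightGBM importance is above the average.
zeta = SelectFromModel(LGBMClassifier(), threshold="mean")
X_a = zeta.fit_transform(X_concat, y)

# Alternative zeta (the X_sigma variant of Sect. 4.3): variance threshold .2.
X_a_sigma = VarianceThreshold(threshold=0.2).fit_transform(X_concat)
```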

We underline that, for every dataset, a different number of features might be generated by froid because some parameter configurations for the OD and FR methods might be invalid depending on the dataset characteristics. Also, some of the classifier implementations are not able to handle missing and/or too large values. In this case, froid drops all the generated features that meet one of these conditions. On average, we observed that froid generates about 508 features per dataset.

4.1.3 Imbalanced learning competitors

In order to establish to what extent froid is in line with state-of-the-art approaches for solving the imbalanced learning problem, we compared the performance of froid against the following competitors.Footnote 13 Besides standard approaches for imbalanced learning such as Random Undersampling (rund) (Kubat & Matwin, 1997), Random Oversampling (rove) (Kubat & Matwin, 1997), the Synthetic Minority Oversampling Technique (smote) (Chawla et al., 2002), and Adaptive Synthetic (adasyn) (He et al., 2008), we compared froid also against ClUstered REsampling (cure) (Bellinger et al., 2019), Radial-Based Oversampling (Koziarski et al., 2019) (rbo), Combined Cleaning and Resampling (ccr) (Koziarski & Wozniak, 2017), SVMSMOTE (svmsmt) (Nguyen et al., 2011), and Sampling WIth the Majority (swim) (Sharma et al., 2018). Finally, we adopted and re-implemented the eXtreme Gradient Boosting Outlier Detector (xgbod) (Zhao & Hryniewicki, 2018) for the tasks of both imbalanced learning and supervised outlier detection, for which the algorithm is designed.Footnote 14
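Most of the standard competitors are available in imblearn; the sketch below (with an illustrative synthetic dataset) shows how they can be run and that, differently from froid, every sampler changes the number of training records:

```python
from collections import Counter

from imblearn.over_sampling import ADASYN, SMOTE, SVMSMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
for sampler in [RandomUnderSampler(), RandomOverSampler(),
                SMOTE(), ADASYN(), SVMSMOTE()]:
    X_res, y_res = sampler.fit_resample(X, y)  # resampled training set
    print(type(sampler).__name__, Counter(y_res))
```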

4.1.4 Evaluation measures

As evaluation measures, we considered the following metrics (Tan, 2005). Precision is the fraction of relevant instances among the retrieved instances, while Recall, also known as True Positive Rate or Sensitivity, is the fraction of relevant instances that were retrieved, i.e., \(Precision = \frac{ tp }{ tp + fp }\) and \(Recall = TPR = Sensitivity = \frac{ tp }{ tp + fn }\), where \(tp\) is the number of true positives, \(fp\) is the number of false positives, and \(fn\) is the number of false negatives. The F1-measure is the harmonic mean of Precision and Recall, i.e., \(F1=2\frac{ Precision \cdot Recall }{ Precision + Recall }\). Another widely used measure to judge the performance of an ML classifier is the Area Under the ROC Curve \(AUC\), i.e., the area under the curve described by \(FPR\) and \(TPR\), where \(FPR = \frac{fp}{fp+tn}\). An index typically used to evaluate the performance of credit score models (Torrent et al., 2020) is the \(GINI\) coefficient, defined as \(GINI =2 AUC -1\). It ranges from 0 (chance results) to 1.0, which corresponds to perfect discrimination between classes. The Precision-Recall Area Under the Curve \(PRA\) is typically used to judge the performance of ML models on heavily imbalanced datasets because it cares less about the major negative class (Saito & Rehmsmeier, 2015). \(PRA\) can be viewed as the average of \(Precision\) calculated for each \(Recall\) threshold. We highlight that all the aforementioned metrics are designed to evaluate binary classifiers with respect to the positive class. Hence, in our experiments, we report the average score obtained considering every class of the various datasets analyzed as positive, if not differently specified. Finally, we also evaluated the Geometric mean (\(GM\)), the root of the product of class-wise Sensitivity. This measure tries to maximize the accuracy on each of the classes while keeping these accuracies balanced. We judge the classification results with this long list of measures in order to assess the goodness of the various pre-processing techniques with respect to different and complementary evaluation perspectives.Footnote 15 For all measures, the higher the values, the better the results. In the rest of the paper, we report aggregated results in terms of average score of the evaluation metric, average ranking position w.r.t. a certain evaluation metric, and number of wins. Detailed results on the various datasets can be found in the Appendix. We base our observations mainly on \(PRA\) as it is the evaluation measure most widely used for assessing the goodness of imbalanced learning tasks.
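Assuming the sklearn and imblearn implementations of these metrics, the evaluation can be sketched as follows; the helper name and the demo values are illustrative:

```python
# PRA corresponds to average precision, GINI = 2*AUC - 1, and GM is the
# geometric mean of class-wise sensitivities provided by imblearn.
from imblearn.metrics import geometric_mean_score
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

def evaluate(y_true, y_pred, y_score):
    auc = roc_auc_score(y_true, y_score)
    return {
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        "GINI": 2 * auc - 1,
        "PRA": average_precision_score(y_true, y_score),
        "GM": geometric_mean_score(y_true, y_pred),
    }

print(evaluate([0, 0, 1, 1], [0, 1, 1, 1], [0.1, 0.6, 0.7, 0.9]))
```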

Furthermore, we evaluated the time required by the various pre-processing competitors and froid versions to prepare the dataset (P-Time), and its subsequent impact on the training time (T-Time).

Table 3 Average scores, ranks and number of wins of PRA, GINI, F1, GM among various datasets and ML models trained on original data X or on data pre-processed with froid \(X_a\)

4.2 Performance analysis of different ML models

In Tables 12, 13, 14, and 15 in the Appendix we show the detailed comparison of the performance among different ML modelsFootnote 16 w.r.t. \(PRA\) by comparing the performance obtained on the original data X with those obtained on the pre-processing returned by froid, indicated with \(X_a\). Table 3 summarizes these results for PRA, GINI, GM, and F1 by reporting the average scores, average ranks, and the number of wins (between using the froid pre-processing \(X_a\) and not using it, i.e., X).Footnote 17 We notice that for PRA, LightGBM achieves the best performance overall and significantly overcomes the model not using froid. Besides PRA, LightGBM with froid is ranked second for GINI and F1. We also observe how CatBoost and Random Forest do not benefit from the usage of froid, reporting the overall best performance w.r.t. GINI and GM. Since we consider PRA as the most reliable indicator for the class imbalance setting, and in order to avoid the repetition of results due to multiple ML models, if not differently specified, in the rest of the paper we only report the performance related to the LightGBM classifier, which we assume to be the best ML model for the datasets analyzed, also because it is notoriously faster than XGBoost (Ke et al., 2017). The non-parametric Friedman test, which compares the average ranks of learning methods over multiple datasets w.r.t. the various evaluation measures, guarantees that these results are statistically significant, i.e., the null hypothesis that all methods are equivalent is rejected (\(p\text {-}value <.001\)). This result is verified for every table presented in this paper. Thus, we avoid repeating the statistical significance of the experiments in the following sections.
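The test itself can be sketched with scipy, assuming one score vector per method with one entry per dataset; the numbers below are arbitrary illustrative values:

```python
# Friedman test over multiple datasets: reject the null hypothesis that all
# methods are equivalent when the p-value is small (here, < .001 in the paper).
from scipy.stats import friedmanchisquare

scores_m1 = [0.81, 0.75, 0.62, 0.90, 0.55]  # e.g., PRA of method 1 per dataset
scores_m2 = [0.78, 0.70, 0.60, 0.88, 0.50]
scores_m3 = [0.60, 0.55, 0.41, 0.72, 0.39]
stat, p_value = friedmanchisquare(scores_m1, scores_m2, scores_m3)
print(stat, p_value)
```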

Table 4 Average scores, ranks and number of wins of PRA, GINI, F1, GM among various datasets obtained by LightGBM trained on different features combinations
Table 5 Average relative importance and ranks of importances by category of features for froid: X original features, \(X_o\) OD over original, \(X_r\) FR over original, \(X_{oo}\) OD over OD, \(X_{or}\) OD over FR, \(X_{rr}\) FR over FR, \(X_{ro}\) FR over OD
Table 6 Average relative features importance grouped by category of features for froid for different clusters of similar datasets

4.3 Pre-processing with different features combinations

In this section, we analyze the impact of the different features adopted by froid. Table 4 reports the average PRA, GINI, F1, GM, the corresponding average ranks, and the number of wins obtained by LightGBM on the various datasets analyzed for different pre-processing inputs:Footnote 18

  • X is the original dataset,

  • \(X_a\) is the output of froid,

  • \(X_{\lnot \zeta }\) is not using the features selection function \(\zeta\),

  • \(X_{\sigma }\) is using variance threshold as feature selection \(\zeta\) with the threshold set to .2,

  • \(X_f\) is not using the original features of X, but only all the unsupervised features learned by froid,

  • \(X_{l}\) is considering \(X_o\) and \(X_r\) besides X,

  • \(X_{ fo }\) is only considering \(X_o\), \(X_{oo}\), \(X_{or}\),

  • \(X_{ fr }\) is only considering \(X_r\), \(X_{rr}\), and \(X_{ro}\),

  • \(X_{O}\) is considering X, and \(X_{ fo }\),

  • \(X_{R}\) is considering X and \(X_{ fr }\),

where \(X_o\), \(X_r\), \(X_{oo}\), \(X_{rr}\), \(X_{ro}\), \(X_{or}\) are defined as in Sect. 3. What emerges is that froid is firmly ranked first w.r.t. the four measures: it has the best performance w.r.t. GM and F1 and is runner-up for PRA and GINI. Also regarding the number of wins, it is always placed second. The overall champion regarding the number of datasets for which it ranks first is LightGBM on the original data X. However, its performance on the remaining datasets places it among the last positions w.r.t. the rank indicator, constantly and statistically worse than various alternatives using froid.

From the comparison between \(X_a\), \(X_{\lnot \zeta }\), and \(X_{\sigma }\), it emerges that (i) the usage of a feature selection method contributes to increasing the performance, and (ii) the usage of more efficient but less adaptive feature selection functions, like the variance threshold, negatively impacts the performance of the model. Thus, accounting for the class imbalance within the model used by the feature selection method is the aspect that contributes to the increase in performance and to the appropriate usage of the most reliable set of features.

Detailed performance on the various datasets is available in Tables 16, 17, 18, and 19 in the Appendix. Here, among the other evaluation measures, we can observe that, in many cases, froid boosts the PRA on the original dataset with an improvement ranging from .1% to 81.2%, with an average boost of 12%. Therefore, this experiment confirms that it makes sense to consider all the alternative unsupervised features created by froid, appropriately selected together with the original ones, and not only a subset of them.

Furthermore, in Table 5 we report the average features importance obtained by froid (\(X_a\)) over all the datasets, paired with the rank of the features importanceFootnote 19 w.r.t. the different categories of unsupervised features adopted. We notice that the features involving FR not mixed with OD, i.e., \(X_r\) and \(X_{rr}\), are, on average, the most beneficial for the classification, followed by single OD \(X_o\) and by the original features X. However, we notice that there is no marked discrepancy in the usage of the features and that the boost of froid is given by the simultaneous usage of all the categories of features created. The detailed relative importance on the various datasets is available in Table 20 in the Appendix. In Table 6, we report the average relative features importance grouped by category of features for froid for the clusters of similar datasets described in Sect. 4.1. The insights of this table are the following: for datasets in cluster A, we have the general behavior already discussed for Table 5. The original features are consistently more important for datasets in cluster D, while for datasets in cluster B the features in \(X_{rr}\) are beneficial. Finally, for datasets in cluster C, the features in \(X_{oo}\) are not used at all, while those in \(X_{ro}\) are more important. Hence, we can infer that the effectiveness of froid is given by the massive production of unsupervised descriptive features that can be helpful in every situation, independently from the dataset characteristics. This improvement can be effectively exploited only by ML models like LightGBM that can appropriately select the most discriminative and informative features and are not harmed by the curse of dimensionality.

Finally, we studied whether there are sets of features in \(X_o\), \(X_{oo}\), \(X_{or}\), \(X_r\), \(X_{rr}\), and \(X_{ro}\) that are never or scarcely used by the classifiers adopted. An analysis performed at the dataset level highlighted that froid uses the original features X in \(\sim 97\%\) of the datasets, while the features generated by froid, i.e., \(X_o\), \(X_r\), \(X_{rr}\), \(X_{ro}\), \(X_{or}\), and \(X_{oo}\), are used in \(\sim 92\%\), \(\sim 84\%\), \(\sim 87\%\), \(\sim 80\%\), \(\sim 80\%\) and \(\sim 71\%\) of the datasets, respectively. Thus, all the types of features are consistently used in more than half of the datasets analyzed. This result emphasizes how froid self-adapts promptly to each dataset's peculiarities.

Table 7 Average scores, average ranks and number of wins of PRA, GINI, F1, GM
Fig. 3 Critical difference plots with Nemenyi at 95% confidence level for PRA (top-left), GINI (top-right), F1 (bottom-left), GM (bottom-right)

Fig. 4 Scatter plots comparing the performance in terms of PRA between pairs of methods. Every point is a dataset; the closer it is to the diagonal, the more similar the performance

4.4 Comparison with state-of-the-art approaches

In this section we compare froid against state-of-the-art approaches for imbalanced learning. Table 7 reports the average PRA, GINI, F1, GM, the corresponding average ranks among all datasets classified with LightGBM for the various competitors, and the number of wins. What emerges is that, overall, froid is the best pre-processing method with respect to the four evaluation measures. Also, there is no clear second-best performer, even though svmsmt and swim are the best ones for some indicators. Thus, froid appears to be markedly better than the other approaches. The comparison of the ranks of all methods against each other is visually represented in Fig. 3 with Critical Difference (CD) diagrams (Demsar, 2006). Two methods are tied if the null hypothesis that their performance is the same cannot be rejected using the Nemenyi test at \(\alpha =.05\). For PRA, froid is the only pre-processing method not tied with approaches ranked lower than seventh. This means that, w.r.t. PRA, the difference between froid and state-of-the-art approaches like svmsmt, swim, or rbo is not statistically significant. Furthermore, even though froid is always statistically tied with some other method, independently of the evaluation measure considered, it always has the highest number of wins, it is in the top three with respect to the rank for PRA, GM, and F1, and it has the highest average PRA and GINI. No other method guarantees such stability across different evaluation measures. To further highlight the improvement of froid vs orig, froid vs svmsmt, and froid vs swim w.r.t. PRA, we report in Fig. 4 scatter plots in which every point represents a dataset and the x-axis and y-axis report the PRA of the method in the corresponding label. The closer a point is to the diagonal, the more similar the performance. The leftmost scatter plot highlights that only in a few cases is not using froid better than using it and that, in some cases, its usage brings a considerable boost in terms of PRA. The central scatter plot signals that froid is never markedly worse than svmsmt, as all the points below the diagonal are close to it, while in some cases the performance of froid is markedly better than that of svmsmt. The rightmost scatter plot highlights that the performance of froid is correlated with that of swim, but the majority of the points lie above the diagonal, signaling the superiority of our proposal.
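For reference, the quantities behind the CD diagrams of Fig. 3 can be computed as follows: average ranks per method and the Nemenyi critical difference \(CD = q_{\alpha }\sqrt{k(k+1)/(6N)}\), where \(q_{\alpha }\) is the tabulated critical value (2.850 for \(k=6\) methods at \(\alpha =.05\); Demsar, 2006). The scores below are random stand-ins for the per-dataset PRA values.

```python
import numpy as np
from scipy.stats import rankdata

# scores[i, j] = PRA of method j on dataset i (random stand-ins here).
rng = np.random.default_rng(0)
scores = rng.random((30, 6))  # N = 30 datasets, k = 6 methods
N, k = scores.shape

# Average rank of each method; rank 1 is best, so rank the negated scores.
avg_ranks = rankdata(-scores, axis=1).mean(axis=0)

# Nemenyi critical difference: CD = q_alpha * sqrt(k(k+1) / (6N)), with
# q_alpha = 2.850 the tabulated value for k = 6 methods at alpha = .05.
q_alpha = 2.850
cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))

# Two methods are tied when their average ranks differ by less than CD.
print("average ranks:", np.round(avg_ranks, 2), "CD:", round(cd, 3))
```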

The drawback of using froid is the markedly higher time required to prepare the dataset. Indeed, while the training time \(T\) remains in the same order of magnitude for all the methods (except xgbod and rbo), the pre-processing time \(P\) becomes consistently higher for froid, being in the order of hours instead of seconds. However, we underline that the standard deviation of the time required by froid is 6,436.18 s, while the median time is 231.22 s. This great variability in the pre-processing time \(P\) indicates that froid's pre-processing time is markedly impacted by the dimensionality of the dataset analyzed. We recall that froid is designed as a pre-processing method to be used when a good and reliable model should be deployed and the time required to build this model is typically not an issue. Details of the performance on the various datasets are available in Tables 21, 22, 23 and 24 in the Appendix. To mitigate the high computational time required by the froid framework, we developed a streamlined variant that relies solely on the most efficient OD and FR methods. Compared to the original froid, this lightweight version is approximately an order of magnitude faster. However, its effectiveness in addressing the imbalanced learning problem falls short in terms of PRA and GINI when compared to both froid and other state-of-the-art approaches: its average rank for the PRA and GINI metrics is approximately 10th and 9th, respectively. Since this lightweight version cannot achieve comparable accuracy without utilizing the full range of OD and FR methods employed by froid, we do not report its results and consider the complete froid framework the most reliable option for achieving optimal performance in solving the imbalanced learning problem.

Table 8 Average scores, average ranks and wins of PRA, GINI, F1, GM for LightGBM trained after different pre-processing methods with/without hyper-parameter tuning (-hpt) or calibration (-cal)

Furthermore, we analyzed the impact of hyper-parameter tuning and calibration procedures (Niculescu-Mizil & Caruana, 2005; Zadrozny & Elkan, 2002) on the performance of LightGBM for the various datasets (footnote 20). We use the suffixes -hpt and -cal to indicate a training procedure involving hyper-parameter tuning or calibration, respectively. Table 8 illustrates the average PRA, GINI, F1, GM, the corresponding average ranks, and the number of wins among all datasets classified with LightGBM on the original dataset (orig) and after froid pre-processing, with and without hyper-parameter tuning (-hpt) or calibration (-cal). The results show that hyper-parameter tuning or calibration alone does not reach the performance achieved by froid. Also, calibration on top of the models trained after froid pre-processing further improves the performance.
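A minimal sketch of the -hpt and -cal variants follows. The exact tuning and calibration protocols used in the experiments are not reproduced here; scikit-learn's randomized search and sigmoid (Platt-style) calibration are plausible stand-ins, and the parameter grid is illustrative.

```python
from lightgbm import LGBMClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV

# Toy imbalanced data; in the paper this would be the (froid-enriched) dataset.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# -hpt: a small randomized search over plausible LightGBM parameters.
search = RandomizedSearchCV(
    LGBMClassifier(verbose=-1),
    {"n_estimators": [100, 300], "num_leaves": [15, 31, 63],
     "learning_rate": [0.01, 0.1]},
    n_iter=5, scoring="average_precision", cv=3, random_state=0,
).fit(X, y)

# -cal: sigmoid (Platt-style) calibration on top of the tuned model.
calibrated = CalibratedClassifierCV(search.best_estimator_,
                                    method="sigmoid", cv=3).fit(X, y)
print(search.best_params_)
```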

Table 9 PRA for supervised outlier detection datasets obtained by LightGBM trained after different pre-processing methods

4.5 Results on supervised outlier detection

In this section, we demonstrate that the pre-processing of froid is beneficial not only for imbalanced learning but also for supervised outlier detection. As competitors, we consider the same approaches used in the previous section, with xgbod being the actual state of the art in this field (Zhao & Hryniewicki, 2018). Table 9 reports the PRA w.r.t. the label "is outlier" on the datasets having ground truth for the outliers. We observe that froid is the best performer for satellite and the second-best performer for glass. This result further stresses the breakthrough idea we introduced in this paper of representing instances through a varied composition of unsupervised OD and FR approaches.
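Assuming that PRA denotes the area under the precision-recall curve, it can be approximated on a held-out split as in the sketch below; the dataset, the split, and the model settings are illustrative stand-ins.

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Toy stand-in for an outlier dataset with ground-truth "is outlier" labels.
X, y = make_classification(n_samples=2000, weights=[0.97], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LGBMClassifier(verbose=-1).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# Area under the precision-recall curve, estimated from the score ranking.
print("PRA:", round(average_precision_score(y_te, scores), 3))
```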

4.6 Features importance on real case studies

In this section, we experimented with the effectiveness of froid in two real case studies (footnote 21). In particular, the diva dataset is a privately released dataset on tax evasion, periodically issued by the Italian Ministry of Economics. In diva, financial activities of 11,187 citizens are recorded. The 18 features describe different aspects of the taxpayers, including their past financial credit score, declared income and property value, debt, and detailed taxation info. The positive label marks 35 relevant tax evaders, accounting for less than \(.3\%\) of the total labels. The hospital dataset collects 14,390 records describing patients through 15 features after a data cleaning phase, including demographic variables, hospital usage aspects, and past medical history. The positives indicate the 130 discharged patients, accounting for roughly \(.9\%\) of the patients. We applied to these datasets the same pre-processing described in Sect. 4.1. The comparison between the performance of the LightGBM obtained on the original dataset and that obtained after running froid is reported in Table 10. The last line reports the relative improvement of froid over the performance on the original dataset. We observe that, by using froid, we can correctly identify many more fraudulent citizens and discharged patients. Indeed, the GINI (footnote 22) obtained by froid increases up to .97 and .84, an increase of \(7.8\%\) and \(1.2\%\), respectively, compared to orig.

Table 10 Performance on the case study datasets with and without froid
Fig. 5 Normalized LightGBM feature importance for diva (left) and hospital (right)

In Fig. 5, we report the normalized feature importance of the ten most important features for both datasets. We immediately notice that, in both cases, among the ten most important features there are many features generated by froid. In particular, we notice that using OD approaches is quite an effective procedure to design highly discriminative features for the diva dataset. Moreover, for hospital, we see the beneficial effects of FR methods: \(KPCA\) is among the top ten most informative features when applied not only to the original data but also to scores derived from OD and FR methods. This analysis confirms that froid's idea of nesting FR methods into OD methods (and vice-versa) is indeed a winning one, as it helps to derive the most discriminative and important features for the classification in the setting of imbalanced learning. This analysis also reveals a weakness in the usage of froid. Indeed, while the overall performance improves, the usage of froid leads to an inevitable loss of interpretability (Guidotti et al., 2018). In fact, if the original dataset is used, the features used by the selected classifier are only those belonging to the original domain, and therefore their meaning is perfectly understandable by any domain expert. On the other hand, when features created by froid become fundamental for the classification, only machine learning experts can understand their meaning (Tomsett et al., 2018). This weakness does not affect classic approaches like smote or adasyn, which do not modify the features describing the dataset.

5 Conclusion

We have presented froid, an unsupervised pre-processing framework for solving imbalanced learning problems through outlier detection and feature reduction methods. froid augments the dimensions used to represent the input data by combining, in different ways, a wide variety of outlier detection and dimensionality reduction methods. This dimensionality augmentation boosts ML classification models in solving the classification task for imbalanced data. A wide and deep experimentation shows that froid overcomes state-of-the-art pre-processing approaches for imbalanced learning, at the price of a non-negligible time required to build all the novel dimensions. Our insight for such a boost in performance is that froid does not generate any novel synthetic data but only amplifies the expressiveness of the existing records. The effectiveness of froid in its current form lies in creating a mass of descriptive, discriminative features that can be suitable for any dataset, i.e., a given subset might not be helpful for datasets with specific characteristics, but it might be useful for other datasets. Indeed, we might see froid as a sort of "brute-force" approach generating a massive number of descriptive features in the hope that one of them, or better, a combination of some of them, together with the original features, brings a boost to the discriminative power of a ML model.

Starting from the results obtained with froid, several future research directions can be pursued. First, we would like to design a pre-processing step, to be applied before froid, that is responsible for selecting the most appropriate subset of features to generate, simultaneously improving the performance and reducing the running time. Second, inspired by Shi et al. (2020) and Ibrahim (2021), we would like to investigate the performance obtained by combining froid with one of the state-of-the-art oversampling and/or undersampling approaches, as well as with cost-sensitive classifiers. For instance, if smote is applied after froid, then it will generate synthetic minority instances approximating the features learned by froid. On the one hand, this might boost the discriminative power of ML models learned on top of these datasets, enriched both in terms of features and in terms of records. On the other hand, there is the risk that smote would not be able to appropriately generate synthetic instances with the features learned by froid, leading to a degradation of the performance. Third, we would like to design an extremely randomized version of froid that adopts bootstrap samples and random feature selection to simultaneously accelerate the procedure and exploit a sort of ensemble strategy. Finally, we would like to check whether the froid approach can be used to solve the imbalanced learning problem also in multi-class settings and for other data types, such as images or time series, through the usage of autoencoder approaches.