1 Introduction

A wide variety of important applications where Machine Learning (ML) models are typically applied, such as fraud detection, medical diagnosis, and oil spill detection (Japkowicz & Stephen, 2002), suffer from the problem of class imbalance. The class imbalance problem corresponds to domains where some classes are represented by a large number of instances, while others are represented by only a few. Prior research has shown that class imbalance has a negative impact on the performance of learned ML models, which tend to be overwhelmed by the large classes and to ignore the small ones. This typically happens because ML classifiers operate on data drawn from the same distribution as the training data, adopting the same data representation, and assume that maximizing accuracy is the primary goal (Chawla et al., 2004). Given the frequency of imbalanced learning problems in real applications and the issues they raise when learning ML models, research on methods for handling them has become a significant topic (Japkowicz & Stephen, 2002; Branco et al., 2016).

The approaches proposed by the research community to solve the class imbalance problem include pre-processing methods, as well as learning methods specifically designed for this problem (Japkowicz & Stephen, 2002; Chawla et al., 2004; He & Garcia, 2009; Chawla, 2010; Branco et al., 2016). Pre-processing and resampling approaches are more widely studied as they enable the subsequent adoption of any standard ML classification model. The main idea consists in transforming the original training set, making it more suitable for learning the important class(es), either by reducing the number of instances belonging to the majority classes or by augmenting the number of rare instances through synthetic data generation procedures (He & Garcia, 2009). Well-known examples are the Random Undersampling (Kubat & Matwin, 1997) and the Condensed Nearest Neighbors (Hart, 1968) procedures for data reduction, or Random Oversampling (Kubat & Matwin, 1997) and the Synthetic Minority Oversampling Technique (Chawla et al., 2002) for data augmentation. Many proposals in the literature refine such basic approaches by combining the aforementioned solutions in different fashions or by resorting to advanced Automated Machine Learning (He et al., 2021) approaches. Unfortunately, despite partially addressing the class imbalance problem, many of the most widely used pre-processing approaches suffer from issues related to the removal of majority class instances from sparse regions and to the generation of noisy/erroneous synthetic minority instances (He et al., 2008; Bellinger et al., 2019, 2021; Hassanat et al., 2022). A further limitation of the majority, if not the entirety, of the state-of-the-art approaches is that they (implicitly) exploit the number of instances belonging to a specific class to characterize differences and similarities among instances, even though features should be the ones capturing these differences/similarities.

To overcome the weaknesses of state-of-the-art approaches, we propose froid, a pre-processing framework for Features Reduction and OutlIer Detection that solves the imbalanced learning problem through unsupervised representation learning. froid handles imbalanced learning by facing the problem from a different perspective. Indeed, instead of augmenting the instances of the minority classes or reducing the instances of the majority classes, froid analyzes the relationships among the records in the dataset through unsupervised approaches. The goal of froid is to design attributes, creating an unsupervised data representation that enhances the differences between records belonging to minority and majority classes, such that a ML model can achieve outstanding performance regardless of the class imbalance.

In particular, froid exploits two families of methods to augment the expressiveness of the records in a dataset. The first family is that of Outlier Detection (OD) approaches (Chandola et al., 2009; Hodge & Austin, 2004). An unsupervised OD approach is meant to identify outliers, i.e., instances which deviate significantly from the majority of the data and do not conform to a notion of normal behavior (Chandola et al., 2009). Our intuition is that records belonging to minority classes should be considered outliers with respect to records belonging to majority classes. Therefore, unsupervised OD methods can create attributes capturing the level of outlierness of a record with respect to other records according to a certain OD criterion. A similar intuition was proposed in Zhao and Hryniewicki (2018) to solve the task of supervised OD. However, in Zhao and Hryniewicki (2018) unsupervised OD methods are used to boost a supervised OD problem; thus, it is known from the problem definition that there are outliers in the data. In our case, on the other hand, we are only making the supposition that instances belonging to the minority class can be recognized as outliers by unsupervised OD approaches. The second family of methods comprises Features Reduction (FR), also known as features projection, features extraction, or dimensionality reduction. FR methods transform the data from a high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in Principal Component Analysis (Pearson, 1901), but many nonlinear dimensionality reduction techniques also exist (Cox & Cox, 2008; Van der Maaten & Hinton, 2008; Tenenbaum et al., 2000). Similar to the reasons that led us to rely on OD approaches, our idea is that unsupervised FR techniques might unveil a data representation that better separates instances belonging to different classes. Indeed, rare instances should acquire a reduced data representation substantially different from that of the instances belonging to the regular class, which, on the other hand, should fall in denser areas of the reduced representation. In the literature, there is a limited set of methods relying on FR to address imbalanced learning problems (Naseriparsa & Kashani, 2014; Gopi et al., 2016). However, these approaches rely on a single FR method and combine it with resampling techniques. On the other hand, we adopt a large array of FR approaches, and we do not augment the number of records in the dataset. Also, froid subsequently combines the outcomes of OD and FR approaches through several workflows to create increasingly expressive features that separate records of different classes for the classification task.

We experimented with froid on 64 benchmarking datasets and 2 case studies by training 5 different ML models after pre-processing the data with froid. First, we observed which type of classifier benefits more from the pre-processed data returned by froid. Second, we performed an ablation study of froid showing the impact of every set of features extracted by the various OD and FR approaches. Third, we compared froid with state-of-the-art techniques for imbalanced learning. The results show that (i) on average LightGBM (Ke et al., 2017) is the best classifier exploiting the unsupervised representation returned by froid, (ii) the more features are extracted through froid, the higher the performance of the classifier, and (iii) froid outperforms all the state-of-the-art approaches, at the cost of a non-negligible running time required to extract all the features. Finally, we highlight that, besides imbalanced learning, froid also succeeds in the supervised outlier detection task.

The rest of the paper is organized as follows. Section 2 reviews related work on imbalanced learning and supervised outlier detection. In Sect. 3 we illustrate our proposal to solve imbalanced learning through outlier detection and features projection approaches. Section 4 reports the experimental results on benchmarking datasets as well as on two case studies. Finally, Sect. 5 summarizes our contributions and discusses open research directions.

2 Related works

A large array of approaches has been proposed to address imbalanced learning. In the last twenty years, several surveys and literature reviews have categorized and discussed the peculiarities and characteristics of the various approaches (Japkowicz & Stephen, 2002; Chawla et al., 2004; Su & Tsai, 2011; He & Garcia, 2009; Chawla, 2010; Branco et al., 2016). Recently, most of these approaches have been implemented and are freely available in Python open-source libraries like imblearn (Lemaitre et al., 2017). The two principal strategies to recover from the challenges raised by imbalanced learning contexts consist in modifying the data used to train the classifier, or in altering the classification algorithm itself to account for the misclassification costs of the different classes during the learning process. Random undersampling and oversampling (Kubat & Matwin, 1997) are the two most classic approaches to handling imbalance. It is well known that they suffer from the risk of discarding informative instances and of overfitting the minority instances, respectively. Refinements of undersampling techniques, like the Condensed Nearest Neighbors (CNN) (Hart, 1968) or the Edited Nearest Neighbors (ENN) (Wilson, 1972), brought slight improvements to such issues but no effective solutions. Nowadays, the Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al., 2002) is probably the most widely used, exploited, and extended oversampling approach. For instance, ADASYN (He et al., 2008) is very similar to SMOTE, but it generates a different number of samples depending on an estimate of the local distribution of the class to be oversampled. SVMSMOTE (Nguyen et al., 2011) exploits an SVM algorithm to detect the samples to use for generating the synthetic instances that oversample the minority class.

Besides improving the procedure of these first resampling approaches, one of the most pursued research directions consists in combining them with other Data Mining or Machine Learning approaches, such as clustering algorithms or simple classification models. For instance, the ClustFirstClass undersampling approach (Sobhani et al., 2014) tries to overcome the problem of discarding informative instances by first running k-Means clustering (Tan, 2005) on the majority class and then keeping at least one instance from each cluster. In Sundarkumar and Ravi (2015), the majority class outliers are removed with Reverse k-Nearest Neighborhood (RkNN) (Achtert et al., 2006); then, the selection of support vectors using the One-Class Support Vector Machine (OCSVM) (Schölkopf et al., 1999) is used to undersample the majority class. In Sanguanmak and Hanskunatai (2016), the DBSM approach for simultaneous undersampling and oversampling is presented. The oversampling is performed with SMOTE, while the undersampling is realized by selecting half of the data present in the clusters returned by the DBSCAN method (Ester et al., 1996). A similar solution is presented in Branco et al. (2018), with the idea of biasing the strategies to reinforce some regions of the datasets instead of sampling uniformly. Such biases are applied through random undersampling and SMOTE. In Douzas et al. (2018), k-SMOTE is presented, a refinement of SMOTE that exploits k-Means to avoid the generation of noisy synthetic minority instances erroneously close to a dense area of records belonging to the majority class. SMOTEFUNA is a further refinement of SMOTE presented in Tarawneh et al. (2020). SMOTEFUNA generates synthetic records between a randomly selected instance and its furthest neighbor of the minority class which does not have a nearest neighbor from the majority class. Instances of the majority class are also considered in Koziarski and Wozniak (2017); Koziarski et al. (2021). Indeed, in Koziarski and Wozniak (2017) the CCR algorithm is proposed, which Combines Cleaning of the decision border around minority objects with guided synthetic Resampling to re-balance the dataset. In Koziarski et al. (2019), the Radial-Based Oversampling method (RBO) is proposed, which discovers regions in which the synthetic objects from the minority class should be generated using radial basis functions. An extension of CCR and RBO is presented in Koziarski et al. (2021): Radial-Based CCR exploits the class potential to locate sub-regions of the data-space for synthetic oversampling and adopts radial basis functions. The ClUstered REsampling (CURE) method (Bellinger et al., 2019) uses hierarchical clustering and a newly defined distance measure to guide the resampling procedure. Such clusters take into account the structure of the data. This aspect enables CURE to avoid the generation of synthetic instances in “wrong” regions and allows the undersampling of non-borderline regions of the majority class. In Bellinger et al. (2021), ReMix is proposed, a pre-processing approach that leverages batch resampling and instance mixing to enable the induction of robust deep models from imbalanced and long-tailed datasets by expanding the minority class to reduce predictive bias. The objective of ReMix is not only to improve the predictive performance but also to increase model calibration. In Sharma et al. (2018), SWIM (Sampling WIth the Majority) is presented as a method for synthetic oversampling that exploits the information inherent in the majority class to synthesize minority class records. In a certain sense, we use an idea similar to SWIM because the objective of froid is to enhance the discriminative aspects between minority and majority records by looking at both of them and not only at minority records. Finally, the most widely studied case study for imbalanced learning is fraud detection (Padmaja et al., 2007; Makki et al., 2019; Tran et al., 2021, 2021; Esenogho et al., 2022). All these works, besides experimenting with existing techniques, propose further refinements and extensions, all focusing on data resampling.

From the analysis of these approaches, we noticed the following aspects. First, they all focus only on fixing the issues of previous proposals. Second, the usage of additional mining or learning techniques brings some benefits but also problems related to hyper-parameter tuning. Third, and most importantly, all these methods account only for the “number of instances” dimension of a dataset and leave unaltered the features used to represent the records. In this paper, we define an approach that modifies the data used to train the classifier, therefore falling into this aforementioned category. However, we focus on the features used instead of the instances that should be present in the training set, and we avoid any hyper-parameter tuning.

As stated at the beginning of this section, the second and less followed line of research consists in making crucial changes to the classification algorithm in order to account for imbalanced class scenarios. In Akbani et al. (2004), an upgrade of SVM is proposed, based on a variant of SMOTE combined with the error costs presented in Veropoulos et al. (1999), which penalize misclassifications of minority instances more heavily than misclassifications of majority instances. The DataBoost-IM method (Guo & Viktor, 2004) identifies instances which are hard to classify through boosting approaches. Then it trains an ensemble-based boosting algorithm, generating synthetic instances with information biased toward the hard instances on which the next classifier in the boosting procedure needs to focus. In Wang et al. (2014), an instance-weighted variant of the SVM with both 1-norm and 2-norm formats is presented to deal with imbalanced learning. None of these approaches accounts for the features describing the records either.

Conversely, the following approaches also account for this delicate aspect. In Naseriparsa and Kashani (2014), the Principal Component Analysis (PCA) (Tan, 2005) feature projection approach is combined with SMOTE in a case study on a single dataset. The procedure first applies PCA to the dataset, then SMOTE to each of the minority features, before applying the classification algorithm. In Gopi et al. (2016), a Support Vector Machine-Recursive Feature Elimination (SVM-RFE) wrapper for feature selection is defined. The Automated Imbalanced Classification (ATOMIC) method presented in Moniz and Cerqueira (2021) is an Automated Machine Learning (AutoML (He et al., 2021)) approach for imbalanced classification that extends the features describing the data with additional statistical features. In Ksieniewicz (2019), an ensemble of classifiers is trained on datasets obtained as random subspaces and augmented through SMOTE. In Korycki and Krawczyk (2021), a similar methodology is applied to find the most discriminative low-dimensional representation instead of a random one. Since all these approaches do not increase the number of features and yet improve the performance of “traditional” approaches for imbalanced learning, our idea is to follow this intuition but to augment the features used to represent the data in order to maximize the difference between instances belonging to different classes.

Another research field related to our proposal is outlier detection (Hodge & Austin, 2004; Chandola et al., 2009). Indeed, our intuition is that instances belonging to the minority class can be considered to some extent as outliers w.r.t. instances belonging to the majority class. In supervised outlier detection, a predictive model is trained on a dataset that has labeled instances for normal as well as anomaly classes. Thus, our intuition is in line with the XGBOD method presented in Zhao and Hryniewicki (2018). Indeed, XGBOD uses multiple unsupervised outlier detection algorithms to extract an alternative representation of the instances that augments the predictive capabilities of XGBOOST (Chen & Guestrin, 2016) in solving supervised outlier detection. The same intuition is followed by the geodesic-based outlier detection method presented in Shi et al. (2020), which considers the Global Disconnectivity score and Local real Degree (GDLD) as measures of outlierness. Indeed, GDLD is evaluated in the imbalanced learning setting considering records belonging to the minority class as outliers. In Shimauchi (2021), a semi-supervised outlier detection algorithm is presented that extends XGBOD through the augmentation of the representations with a Generative Adversarial Network (GAN) (Goodfellow et al., 2014). In Fernández et al. (2022), a framework for supervised outlier detection is proposed, formed by a pipeline with an unsupervised outlier detection method followed by a supervised predictive model used to tune the hyper-parameters of the unsupervised outlier detection algorithm. It is also worth mentioning the approach illustrated in Loureiro et al. (2004), which applies, in a case study, an unsupervised outlier detection method based on hierarchical clustering. From our perspective, the interesting aspect of Loureiro et al. (2004) is that it employs an unsupervised clustering-based strategy, similarly to the works previously discussed, to solve imbalanced learning. Thus, it supports our intuition that solving imbalanced learning through approaches used for outlier detection is a viable path. ODBOT, an alternative to XGBOD, i.e., an outlier detection-based oversampling technique for imbalanced dataset learning, is presented in Ibrahim (2021). ODBOT handles multi-class imbalance by finding clusters within the minority class(es) and then generating synthetic samples by considering the outliers detected in these clusters. Finally, we underline how Hassanat et al. (2022) shows that oversampling methods are not reliable. Indeed, the authors of Hassanat et al. (2022) report an experimentation on more than 70 oversampling methods revealing that the studied methods generate minority samples that are most likely to be majority ones. Hence, oversampling methodologies are quite likely to be unreliable in imbalanced learning settings and should be avoided in real-world applications.

While works like (Shi et al., 2020; Ibrahim, 2021; Ksieniewicz, 2019; Korycki & Krawczyk, 2021) already introduced the idea of boosting imbalanced learning methods through outlier detection or feature reduction, we stress the fact that, to the best of our knowledge, our proposal is the first in which they are used simultaneously and in subsequent iterations. Furthermore, our proposal departs from Shi et al. (2020); Ibrahim (2021); Ksieniewicz (2019); Korycki and Krawczyk (2021) because they always augment the number of instances, while froid leaves it unaltered and plays only with different data representations in order to leverage the discriminative power of ML models.

Fig. 1 Utility of representing a dataset through OD and FR scores. Top left: synthetically generated imbalanced dataset with two dimensions and two classes. Top right: decision boundary learned by a Decision Tree trained on the imbalanced dataset. Bottom left: synthetic dataset represented through LOF and PCA. Bottom right: decision boundary of a Decision Tree trained on the unsupervised alternative representation (Color figure online)

3 Methodology

In this paper we present froid, a Features Reduction and OutlIer Detection pre-processing framework for solving imbalanced learning through unsupervised representation learning. froid is a pre-processing framework that takes as input a dataset X and returns a transformed version of it to be used as training for a ML classification algorithm. The main idea of froid is to represent the instances in X through alternative representations aimed at fostering the differences among instances belonging to different classes. Thus, froid relies on Outlier Detection (OD) and on Features Reduction (FR) approaches.

In Fig. 1 we illustrate an example that visualizes our intuition of representing the imbalanced input data through unsupervised OD and FR techniques. The first plot (top left) depicts a synthetically generated imbalanced dataset with two dimensions and two classes with frequencies .95 and .05, respectively. The second plot (top right) illustrates the decision boundary learned by a Decision Tree classifier (Tan, 2005) trained on the imbalanced dataset. We immediately notice how most of the instances belonging to the minority class and located in the range \(X_0 \in [-1,2]\) and \(X_1 \in [-1, 1]\), highlighted by the yellow rectangle, are wrongly classified as majority instances by the Decision Tree. The third plot (bottom left) represents the synthetic dataset using as features the Local Outlier Factor (Breunig et al., 2000) score (LOF), an unsupervised OD approach, and the first Principal Component returned by the Principal Component Analysis (Pearson, 1901) (PCA) approach, which is a FR method. We notice how the instances of the minority class are now displaced along two parallel horizontal directions. The fourth plot (bottom right) shows the decision boundary of a Decision Tree trained on this novel representation with the same parameter setting as the previous one. The rare instances inside the yellow rectangle in the second plot are represented with yellow squares in the fourth plot and are located approximately in the ranges \(LOF \in [-.8,-.4], PCA \in [1.0, 1.5]\) and \(LOF \in [-.8,-.46], PCA \in [-2.5, -.5]\). Among these rare instances, we notice that only three are not covered by decision rules labelling instances as minority class, i.e., green decision boundary areas, and are therefore misclassified as majority class instances. Hence, by representing a two-dimensional dataset through an OD score and an FR dimension, we have improved the performance of an ML model, passing from an F1 measure of .60 to an F1 measure of .64. This simple example simultaneously explains and proves the intuition behind the proposed idea.
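To make this concrete, the following minimal sketch reproduces the intuition with scikit-learn. The synthetic dataset, the tree depth, and all parameter values are illustrative assumptions, not the exact setup used to produce Fig. 1:

```python
# Minimal sketch of the Fig. 1 intuition (illustrative, not the paper's code):
# re-represent a 2D imbalanced dataset through an OD score (LOF) and an FR
# dimension (first principal component), then retrain the same classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: Decision Tree on the original two features.
dt = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
print("original F1:", f1_score(y_te, dt.predict(X_te)))

# Alternative representation: unsupervised LOF score + first principal component.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_tr)
pca = PCA(n_components=1).fit(X_tr)
Z_tr = np.column_stack([lof.negative_outlier_factor_, pca.transform(X_tr)[:, 0]])
Z_te = np.column_stack([lof.score_samples(X_te), pca.transform(X_te)[:, 0]])

dt2 = DecisionTreeClassifier(max_depth=4, random_state=0).fit(Z_tr, y_tr)
print("LOF+PCA F1:", f1_score(y_te, dt2.predict(Z_te)))
```

Note that both trees share the same hyperparameters; only the representation changes, which is exactly the lever froid operates on.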

Fig. 2 Illustration of the froid framework for features extraction. The input dataset X is passed through a set of unsupervised outlier detection functions \(\theta _i\) (in blue) and through a set of unsupervised features reduction functions \(\rho _i\) (in yellow), originating the datasets \(X_o\) and \(X_r\), respectively. Such datasets are passed again through the functions \(\theta _i\) and \(\rho _i\), originating the datasets \(X_{oo}, X_{or}, X_{rr}, X_{ro}\), respectively. Finally, all the datasets are combined, and the best features are selected from the combination with \(\zeta\) (Color figure online)

3.1 Imbalanced learning pre-processing framework

In this section, we define the Features Reduction and OutlIer Detection pre-processing framework froid to solve imbalanced learning through unsupervised representation learning.

Problem setting The froid framework is depicted in Fig. 2 and highlighted with the dashed box. froid works as detailed in the following. Let \(X \in \mathcal {R}^{n \times m}\) denote the original input dataset as a set of n instances described by m features. Each record \(x_i \in X\) has attached a label \(y_i \in \{0, \dots , l-1\}\) indicating the class of the record, where l is the number of classes. However, since froid is an unsupervised representation learning framework, the class y is not used. On the other hand, froid makes use of:

  • A set of u Outlier Detection (OD) functions \(\Theta = \{\theta _1, \dots , \theta _u\}\)

  • A set of v Feature Reduction (FR) functions \(P = \{\rho _1, \dots , \rho _v\}\)

  • A features selection function \(\zeta\)

Outlier detection transformation We define an OD function \(\theta _j\) as a mapping function whose output is a real-valued vector \(\theta _j(X) \in \mathcal {R}^{n \times 1}\) that describes the degree of outlierness of each instance \(x_i \in X\). Outliers are instances that deviate significantly from the majority of the data and do not conform to a notion of normal behavior (Chandola et al., 2009). We include in this representation the cases in which the OD function \(\theta _j\) returns a binary-valued vector \(\theta _j(X) \in \{0, 1\}^{n \times 1}\) indicating whether the \(i^{th}\) instance is an outlier or not. We indicate with \(X_o \in \mathcal {R}^{n \times u}\) the result of the application of the u OD functions on X, i.e., \(X_o = [\theta _1(X), \dots , \theta _u(X)]\). Details of the OD approaches implementing the adopted OD functions are provided in the next section.
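As a hypothetical illustration of this transformation (the actual implementations used are listed in Table 2), the columns of \(X_o\) can be assembled with the pyod library adopted in our experiments; the detector choices and the random data below are placeholders:

```python
# Hypothetical sketch of the OD transformation: every theta_j contributes the
# real-valued outlierness (decision_scores_) and, optionally, the binary
# outlier flag (labels_) that pyod detectors expose after fitting.
import numpy as np
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.lof import LOF

def od_transform(X, detectors):
    cols = []
    for det in detectors:
        det.fit(X)
        cols.append(det.decision_scores_)  # degree of outlierness per record
        cols.append(det.labels_)           # 1 = outlier, 0 = inlier
    return np.column_stack(cols)           # X_o: one or two columns per theta_j

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
X_o = od_transform(X, [KNN(n_neighbors=5), LOF(n_neighbors=20), IForest()])
```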

Features reduction transformation We define an FR function \(\rho _j\) as a mapping function where the output \(\rho _j(X) \in \mathcal {R}^{n \times p}\) describes the representation of each instance \(x_i \in X\) into a p-dimensional space. FR methods transform the data from a high-dimensional to a low-dimensional space. The lower dimensionality aims to capture salient aspects of the higher dimensionality. We indicate with \(X_r \in \mathcal {R}^{n \times (v p)}\) the result of the application of the v FR functions on X, i.e., \(X_r = [\rho _1(X), \dots , \rho _v(X)]\). Details of the FR methods implementing the functions adopted are in the next section.
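A matching sketch of the FR transformation with scikit-learn reducers, assuming \(p=2\) output dimensions as in our experiments; the reducer list and the random data are again placeholders:

```python
# Hypothetical sketch of the FR transformation: each rho_j projects X into
# p = 2 dimensions; stacking the projections side by side yields X_r (n x 2v).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import (MDS, TSNE, Isomap, LocallyLinearEmbedding,
                              SpectralEmbedding)

def fr_transform(X, reducers):
    return np.hstack([r.fit_transform(X) for r in reducers])

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
X_r = fr_transform(X, [PCA(n_components=2), MDS(n_components=2),
                       Isomap(n_components=2),
                       LocallyLinearEmbedding(n_components=2),
                       SpectralEmbedding(n_components=2),
                       TSNE(n_components=2)])
```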

Features selection transformation We define a features selection function \(\zeta\) as a mapping function \(\zeta (X): \mathcal {R}^{n \times c} \rightarrow \mathcal {R}^{n \times k}\) that reduces the dimensionality of a given input \(X \in \mathcal {R}^{n \times c}\) to an output \(X' \in \mathcal {R}^{n \times k}\) with \(k < c\), i.e., \(\zeta\) removes some of the c columns from the input X, yielding \(X' = \zeta (X)\).

Workflow description Given the input dataset X, the OD functions \(\Theta\), the FR functions P, and the features selection function \(\zeta\), we can recognize three phases in the pre-processing performed by froid.

In the first phase, the OD and FR functions \(\Theta\) and P are applied to X obtaining \(X_o\) and \(X_r\), respectively. In the second phase, the OD and FR functions \(\Theta\) and P are applied subsequently on \(X_o\) and \(X_r\) obtaining the following representations:

  • \(X_{oo} = [\theta _1(X_o), \dots , \theta _u(X_o)]\)

  • \(X_{ro} = [\rho _1(X_o), \dots , \rho _v(X_o)]\)

  • \(X_{rr} = [\rho _1(X_r), \dots , \rho _v(X_r)]\)

  • \(X_{or} = [\theta _1(X_r), \dots , \theta _u(X_r)]\)

In the third phase, all the data representations obtained, together with the input dataset X, are concatenated and passed to the features selection operator, resulting in \(X_a = \zeta ([X, X_o, X_r, X_{oo}, X_{ro}, X_{rr}, X_{or}])\). We highlight that froid never augments the number of records in the dataset along the various phases and data transformations, i.e., \(|X |= |X_a |\), differently from all the other state-of-the-art approaches for the class imbalance problem. On the other hand, the result of the unsupervised pre-processing \(X_a \in \mathcal {R}^{n \times m'}\) is a data representation with \(m'\) features, a number unknown a priori because it strictly depends on the data transformation functions \(\Theta , P, \zeta\) employed. In the end, any ML model can be trained on \(X_a, y\). We underline that by selecting heterogeneous sets of OD methods \(\Theta\) and FR methods P, we can guarantee that every record in X will be represented w.r.t. different data-driven criteria. In addition, the risk of creating correlated features is effectively minimized, if not entirely eliminated, through (i) the utilization of a feature selection function, denoted as \(\zeta\), and (ii) the adoption of tree-based approaches as final classifiers. Thus, through this approach, we are able to guarantee that the representation of each record in \(X_a\) is diverse and distinct, enabling ML models to capture various aspects and characteristics of the data to amplify the separation between records of majority and minority classes.
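A compact, hypothetical sketch of the three phases follows; here od_transform and fr_transform stand for single-argument wrappers of the helpers sketched above (e.g., fixing the detector and reducer lists), and zeta for any feature selection function:

```python
import numpy as np

def froid(X, od_transform, fr_transform, zeta):
    # Phase 1: OD and FR on the original data.
    X_o, X_r = od_transform(X), fr_transform(X)
    # Phase 2: OD and FR applied again on the phase-1 representations.
    X_oo, X_ro = od_transform(X_o), fr_transform(X_o)
    X_or, X_rr = od_transform(X_r), fr_transform(X_r)
    # Phase 3: concatenate all representations and select the best features.
    X_a = zeta(np.hstack([X, X_o, X_r, X_oo, X_ro, X_rr, X_or]))
    assert len(X_a) == len(X)  # froid never changes the number of records
    return X_a
```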

In the rest of this section, we illustrate the OD and FR methods we considered to implement the functions of the froid pre-processing framework.

3.2 Outlier detection methods

We consider a large set of OD methods to implement the OD functions \(\Theta\). In particular, we rely on \(u{=}14\) OD methods based on different ideas and strategies to assign an outlier score/label to a given instance. In the following, we briefly describe the selected OD methods, which are highlighted in bold.

A large family of OD methods relies on the notion of locality, i.e., the outlier score is assigned by comparing a record with its neighbors with respect to a distance function and a neighborhood size k. kNN, k-Nearest Neighbor (Tan, 2005), is a supervised ML algorithm frequently used for classification problems. However, it can also be used as an OD method by returning, as the outlier score of a record, the largest distance from the instances in its kNN. LOF, Local Outlier Factor (Breunig et al., 2000), assigns an outlier score by comparing the local density of a record with the local densities of its kNN. If a record lies in an area with a density substantially lower than that of its neighbors, then it is considered an outlier. LoOP, Local Outlier Probability (Kriegel et al., 2009), is a local density-based OD method that extends LOF by measuring the local deviation of the density of a given instance with respect to its neighbors as LOF scores. It can work directly on the input data or on the result of a clustering algorithm by relating the outlier score calculation to the distances from the clusters’ centroids. COF, Connectivity-Based Outlier Factor (Pokrajac et al., 2008), overcomes some limitations of LOF by calculating the outlier score of a record as its degree of connectivity. COF differs from LOF as it uses the chaining distance to calculate the kNN. The chaining distances are the minimum of the total sum of the distances linking all neighbors. The connectivity is then calculated as the ratio between the average chaining distance of the record and the mean average chaining distance of the records in the kNN. An additional possibility we explored for implementing the OD functions is to employ clustering approaches that highlight instances not belonging to any cluster as outliers (Khan et al., 2014). CBLOF, Cluster-Based Local Outlier Factor (He et al., 2003), takes as input both the dataset and a clustering algorithm and labels each cluster as “small” or “large” with respect to two parameters \(\alpha\) and \(\beta\). The outlier score of a certain instance is then calculated w.r.t. the size of the cluster the point belongs to and the distance to the nearest “large” cluster.

Another family of OD approaches exploits global statistical tests and global models to discover anomalous behaviors. The Elliptical Envelope (Rousseeuw & van Driessen, 1999) algorithm (EllEnv) creates a global elliptical area that surrounds input data. Values that fall inside the envelope are considered normal data, and anything outside is considered an outlier. OCSVM, One-Class SVM (Schölkopf et al., 1999) is a variation of Support Vector Machines (SVM) (Tan, 2005) that can be used in an unsupervised setting for OD. The idea of OCSVM is to find a function that is positive for regions with a high density of points, and negative for small densities, considering the records that fall into negative regions of the hyperplane as outliers. MCD, Minimum Covariance Determinant (Hubert & Debruyne, 2010) is commonly applied on Gaussian-distributed data. MCD fits a minimum covariance determinant model (Hubert et al., 2018) and computes the outlier score through the Mahalanobis distance calculation. HBOS, Histogram-Based Outlier Detection (Goldstein & Dengel, 2012) assumes feature independence and calculates the outlier scores by building histograms. COPOD, Copula-Based Outlier Detection method (Li et al., 2020), instead, creates an empirical copula and uses it to predict each record’s tail probabilities to determine its outlier score.

An efficient and effective OD approach consists of using an ensemble of “weak” OD methods. Feature Bagging (FeaBag) (Lazarevic & Kumar, 2005) exploits a set of OD methods, each of them applied on a random set of features selected from the original feature space. Each OD method identifies different outliers and assigns to all instances outlier scores that correspond to their probability of being outliers. The combination of such scores is returned as the final output. Isolation Forest (Liu et al., 2008) (IsoFor) is one of the most famous OD approaches. IsoFor isolates instances by randomly selecting a feature and then randomly selecting a split value in the range of the feature. This process is represented through a tree where the number of splits required to isolate a record equals the path length from the root node to the leaf. Hence, an instance is considered an outlier when a forest collectively produces shorter path lengths for that instance. An extension of HBOS is LODA, Lightweight On-line Detector of Anomalies (Pevný, 2016). LODA approximates the joint probability using a collection of one-dimensional histograms, where every one-dimensional histogram is efficiently constructed on an input space projected onto a randomly generated vector. Even though one-dimensional histograms are weak OD methods, their collection yields a strong OD approach. SUOD, Scalable Unsupervised Outlier Detection (Zhao et al., 2020), is another OD ensemble method. Given in input a dataset and a set of unsupervised OD methods, SUOD randomly projects the original input onto lower-dimensional spaces and speeds up the training through balanced parallel scheduling to assign averaged outlier scores.
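The path-length intuition behind IsoFor can be sketched with scikit-learn (regardless of the specific implementation selected in Table 2); the toy data, with five records shifted away from the bulk, are an illustrative assumption:

```python
# Minimal IsoFor sketch: outliers get shorter average path lengths in the
# random trees, hence lower score_samples values.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(6, 1, (5, 2))])
iso = IsolationForest(random_state=0).fit(X)
scores = iso.score_samples(X)
print(np.argsort(scores)[:5])  # indices of the five most anomalous records
```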

3.3 Features reduction methods

We consider a wide set of FR methods to implement the FR functions P. We rely on \(v=8\) FR methods described in the following and highlighted in bold.

Most of the approaches in the literature are based on the idea of finding novel directions along which the original data should be projected. The various techniques differ in how these directions are built or derived. PCA, Principal Component Analysis (Pearson, 1901; Hasan & Abdulazeez, 2021), is the process of computing the principal components and using them to perform a feature projection of the data along these principal directions. Indeed, a principal component is the direction of the line that best fits the data while being orthogonal to the previous component. Principal components, therefore, are the derived variables formed as a linear combination of the original variables that explains the most variance. MDS, MultiDimensional Scaling (Cox & Cox, 2008), is a process that translates the records of a given high-dimensional dataset into a low-dimensional representation with respect to the pairwise distances observed in the original space: instances which are close in the original space should be close also in the reduced space, and vice-versa. IsoMap, Isometric Features Mapping (Tenenbaum et al., 2000), is a nonlinear dimensionality reduction method. IsoMap estimates the intrinsic geometry of a data manifold by estimating the geodesic distance between all pairs of instances on a weighted graph built with respect to the nearest neighbors identified through a fixed radius. The top eigenvectors of the geodesic distance matrix represent the coordinates in the new reduced space. LLE, Locally Linear Embedding (Roweis & Saul, 2000), is similar to IsoMap, but instead of using the geodesic distance, it uses a distance based on the ability to reconstruct a record with respect to its neighbors. A well-known issue of LLE is the regularization problem. A way to address it is to use different variants of LLE, for instance the Modified LLE (Zhang & Wang, 2006) or the HLLE, Hessian eigenmapping LLE (Donoho & Grimes, 2003). SpectEmb, Spectral Embedding (Bengio et al., 2006), is exploited for non-linear dimensionality reduction using a spectral decomposition of the graph modeling the dataset. Although SpectEmb is similar to IsoMap and LLE, it differs in how the weights are calculated, and it adopts the eigenvectors returned from a Laplacian Matrix as the reduced dimensionality. Finally, we employed t-SNE, t-distributed Stochastic Neighbor Embedding (Van der Maaten & Hinton, 2008), a form of MDS that, besides preserving the distances, also aims at preserving the neighborhoods of the instances by modeling the distances as probability distributions of belonging to a certain neighborhood.

4 Experiments

We report here the experiments carried out to validate froid. First, we illustrate the experimental setting with the datasets used, the classifiers adopted, the implementations and parameters employed, the competitors analyzed, and the evaluation measures tested. Second, we show which ML classifier performs best across the various datasets and the improvement of froid w.r.t. training the models on the original data. Third, we report an ablation study of the unsupervised features adopted by froid. Fourth, we compare froid with state-of-the-art solutions. Fifth, we prove that the pre-processing of froid is beneficial also for supervised outlier detection. Finally, we discuss the most important features adopted by froid in two real case studies.

Table 1 Dataset description aggregated through K-Means clustering: \(n_{ train }\) number of training instances; \(n_{ test }\) number of test instances; m number of features; \(m_{ num }\) number of numerical features; \(p^+_{ train }\) positive rate in the training set; \(p^+_{ test }\) positive rate in the test set; \(FDR\) Maximum Fisher’s Discriminant Ratio; \(FBP\) Fraction of Borderline Points; \(ECP\) Entropy of Class Proportions; \(IR\) Imbalance Ratio

4.1 Experimental setting

In this section, we illustrate the experimental setting with the datasets used, the classifiers adopted, the implementations and parameters employed, the competitors analyzed, and the evaluation measures tested.

4.1.1 Datasets and machine learning classifiers

We ran experiments on a selection of 64 binary classification datasets widely referenced and used for imbalanced learning experiments, publicly available from the UCI, Kaggle, ODDS, KEEL and imblearn repositories and a fraud detection challenge.Footnote 1 For each dataset, the following pre-processing is applied. First, we remove records with null values without replacement, so as not to compromise the originality of the data. Next, we eliminate columns with poor explanatory potential, such as IDs, names, etc. Categorical columns are encoded through one-hot encoding to preserve the semantic meaning of the variables for usage with OD and FR methods based on distances or vectors, and also for correctness w.r.t. the ML models.Footnote 2 Dataset descriptions after this pre-processing, as well as some data complexity measures (Sotoca et al., 2005; Cano, 2013), are available in Table 11 in the Appendix.Footnote 3 We summarize the information contained in Table 11 by running K-Means with \(k=4\) to group the different types of datasets analyzed and provide a brief description of the datasets. Indeed, in Table 1, we report a summary of the datasets through the centroids of the four clusters. We observe that the majority of the datasets are “small-sized” and with the lowest \(FDR\) (cluster A). In contrast, the other larger datasets are further separated either w.r.t. the dimensionality or w.r.t. \(FBP\).

Before training the ML models or running the imbalanced learning pre-processing solutions, we applied to the datasets the Robust Scaler, which normalizes the features using statistics that are robust to outliers.Footnote 4 The Robust Scaler removes the median and scales the data with respect to the Interquartile Range (IQR), i.e., the difference between the \(3^{rd}\) quartile (\(75^{th}\) quantile) and the \(1^{st}\) quartile (\(25^{th}\) quantile). This choice favors a better discrimination among instances belonging to minority or majority classes. Indeed, the Robust Scaler normalizes values, and those far away from the median value and outside the IQR will get values markedly greater/smaller than zero. If a dataset still has to be partitioned into training and test sets, we split it using a stratified hold-out partitioning based on the target class, with 70% of the data used for training and 30% for testing. Otherwise, we keep the original train-test partitioning. To guarantee a statistically valid evaluation, as proposed in Rajkomar et al. (2018), we bootstrapped each test set 100 times, and we report in the manuscript the mean values obtained by the various classifiers over these runs.
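A brief sketch of this preparation step, assuming scikit-learn defaults and synthetic data in place of a real dataset:

```python
# Robust scaling (median / IQR computed on the training split only) followed
# by a stratified 70/30 hold-out partition, as described above.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (rng.random(500) < 0.05).astype(int)  # ~5% minority class

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7,
                                          stratify=y, random_state=0)
scaler = RobustScaler().fit(X_tr)  # centers on the median, scales by the IQR
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
```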

As ML classifiers, due to the proven empirical superiority of ensemble models (Breiman, 2001; Shwartz-Ziv & Armon, 2022), we decided to experiment with Decision Tree (Breiman et al., 1984) (DT), Random Forest (Breiman, 2001) (RF), XGBoost (Chen & Guestrin, 2016), LightGBM (Ke et al., 2017), and CatBoost (Prokhorenkova et al., 2018), as implemented by the sklearn, xgboost, catboost, and lightgbm Python libraries.Footnote 5 If not differently specified, we adopted the default parameter setting proposed by the various libraries to assess to what extent different pre-processing techniques are more effective for solving imbalanced learning with the same hyperparameter values.

Table 2 Mapping between the OD and FR methods used by froid and the selected implementations with the parameters varied: sklrn stands for sklearn, pynml for PyNomaly

4.1.2 Experimental details

We implemented froid in PythonFootnote 6 by relying on the following libraries. The core of the algorithm is realized following the scikit-learn style and adopting the notion of pipeline such that every OD or FR method can be subsequently enabled or disabled. For the OD methods we relied on the implementations of the Python libraries sklearn, pyod (Zhao et al., 2019), and PyNomaly,Footnote 7 while for FR methods on the implementations offered by the Python library sklearn. The mapping between OD and FR methods and the selected implementations with the parameters varied is reported in Table 2. For instance, the LoOP method is implemented with the pynomaly library and used with the parameter number of neighbors \(k \in [1, 5, 10, 20]\). Among the selected implementations of OD functions \(\Theta\), given an instance \(x_i\), all of them can be used both for returning a binary value indicating if \(x_i\) is an outlier or not and for returning the degree of outlierness of \(x_i\). Hence, the number of features extracted by froid through OD methods is theoretically 2u where \(u = |\Theta |\). However, in practice, such a number is higher than 2u and depends on the parameter combinations used for each OD method.Footnote 8 On the other hand, with all the FR methods we projected the input data into \(p=2\) dimensions.Footnote 9 Finally, if not differently specified, the features selection function \(\zeta\) is implemented with the sklearn libraryFootnote 10 that trains a LightGBM model on the dataset and selects only the features having importance higher than the average. As an alternative, we also implement \(\zeta\) with the variance threshold functionFootnote 11 that removes from the input dataset all low-variance features below a certain threshold.Footnote 12
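A hedged sketch of the default \(\zeta\), assuming sklearn's SelectFromModel wrapped around a LightGBM model with the "mean" importance threshold (matching the above-the-average rule described above), together with the variance-threshold alternative; the random data stand in for the concatenated representation of Sect. 3:

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.feature_selection import SelectFromModel, VarianceThreshold

rng = np.random.default_rng(0)
X_concat = rng.normal(size=(400, 50))   # placeholder for [X, X_o, X_r, ...]
y = (rng.random(400) < 0.1).astype(int)

# Default zeta: keep features whose LightGBM importance is above the average.
zeta = SelectFromModel(LGBMClassifier(), threshold="mean")
X_a = zeta.fit_transform(X_concat, y)

# Alternative zeta (the X_sigma variant of Sect. 4.3): variance threshold .2.
X_a_sigma = VarianceThreshold(threshold=0.2).fit_transform(X_concat)
```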

We underline that, for every dataset, a different number of features might be generated by froid because some parameter configurations for the OD and FR methods might be invalid depending on the dataset characteristics. Also, some of the classifier implementations are not able to handle missing and/or too large values. In this case, froid drops all the generated features that meet one of these conditions. On average, we observed that froid generates about 508 features per dataset.

4.1.3 Imbalanced learning competitors

In order to establish to what extent froid is in line with state-of-the-art approaches for solving the imbalanced learning problem, we compared the performance of froid against the following competitors.Footnote 13 Besides standard approaches for imbalanced learning such as Random Undersampling (rund) (Kubat & Matwin, 1997), Random Oversampling (rove) (Kubat & Matwin, 1997), the Synthetic Minority Oversampling Technique (smote) (Chawla et al., 2002), and Adaptive Synthetic (adasyn) (He et al., 2008), we compared froid also against ClUstered REsampling (cure) (Bellinger et al., 2019), Radial-Based Oversampling (Koziarski et al., 2019) (rbo), Combined Cleaning and Resampling (ccr) (Koziarski & Wozniak, 2017), SVMSMOTE (svmsmt) (Nguyen et al., 2011), and Sampling WIth the Majority (swim) (Sharma et al., 2018). Finally, we adopted and re-implemented the eXtreme Gradient Boosting Outlier Detector (xgbod) (Zhao & Hryniewicki, 2018) for the tasks of both imbalanced learning and supervised outlier detection, for which the algorithm is designed.Footnote 14
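Most of the standard competitors are available in imblearn; the sketch below (with an illustrative synthetic dataset) shows how they can be run and that, differently from froid, every sampler changes the number of training records:

```python
from collections import Counter

from imblearn.over_sampling import ADASYN, SMOTE, SVMSMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
for sampler in [RandomUnderSampler(), RandomOverSampler(),
                SMOTE(), ADASYN(), SVMSMOTE()]:
    X_res, y_res = sampler.fit_resample(X, y)  # resampled training set
    print(type(sampler).__name__, Counter(y_res))
```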

4.1.4 Evaluation measures

As evaluation measures, we considered the following metrics (Tan, 2005). Precision is the fraction of relevant instances among the retrieved instances, while Recall, also known as True Positive Rate or Sensitivity, is the fraction of relevant instances that were retrieved, i.e., \(Precision = \frac{ tp }{ tp + fp }\) and \(Recall = TPR = Sensitivity = \frac{ tp }{ tp + fn }\), where \(tp\) is the number of true positives, \(fp\) is the number of false positives, and \(fn\) is the number of false negatives. The F1-measure is the harmonic mean of Precision and Recall, i.e., \(F1=2\frac{ Precision \cdot Recall }{ Precision + Recall }\). Another widely used measure to judge the performance of an ML classifier is the Area Under the ROC Curve \(AUC\), i.e., the area under the curve described by \(FPR\) and \(TPR\), where \(FPR = \frac{fp}{fp+tn}\). An index typically used to evaluate the performance of credit score models (Torrent et al., 2020) is the \(GINI\) coefficient, defined as \(GINI =2 AUC -1\). It ranges from 0 (chance results) to 1.0, which corresponds to perfect discrimination between classes. The Precision-Recall Area Under the Curve \(PRA\) is typically used to judge the performance of ML models on heavily imbalanced datasets because it cares less about the major negative class (Saito & Rehmsmeier, 2015). \(PRA\) can be viewed as the average of \(Precision\) calculated for each \(Recall\) threshold. We highlight that all the aforementioned metrics are designed to evaluate binary classifiers with respect to the positive class. Hence, in our experiments, we report the average score obtained considering every class of the various datasets analyzed as positive, if not differently specified. Finally, we also evaluated the Geometric mean (\(GM\)), the root of the product of class-wise Sensitivity. This measure tries to maximize the accuracy on each of the classes while keeping these accuracies balanced. We judge the classification results with this long list of measures in order to assess the goodness of the various pre-processing techniques with respect to different and complementary evaluation perspectives.Footnote 15 For all measures, the higher the values, the better the results. In the rest of the paper, we report aggregated results in terms of average score of the evaluation metric, average ranking position w.r.t. a certain evaluation metric, and number of wins. Detailed results on the various datasets can be found in the Appendix. We base our observations mainly on \(PRA\) as it is the evaluation measure most widely used for assessing the goodness of imbalanced learning tasks.
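Assuming the sklearn and imblearn implementations of these metrics, the evaluation can be sketched as follows; the helper name and the demo values are illustrative:

```python
# PRA corresponds to average precision, GINI = 2*AUC - 1, and GM is the
# geometric mean of class-wise sensitivities provided by imblearn.
from imblearn.metrics import geometric_mean_score
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

def evaluate(y_true, y_pred, y_score):
    auc = roc_auc_score(y_true, y_score)
    return {
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        "GINI": 2 * auc - 1,
        "PRA": average_precision_score(y_true, y_score),
        "GM": geometric_mean_score(y_true, y_pred),
    }

print(evaluate([0, 0, 1, 1], [0, 1, 1, 1], [0.1, 0.6, 0.7, 0.9]))
```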

Furthermore, we evaluated the time required by the various pre-processing competitors and froid versions to prepare the dataset (P-Time), and its subsequent impact on the training time (T-Time).

Table 3 Average scores, ranks and number of wins of PRA, GINI, F1, GM among various datasets and ML models trained on original data X or on data pre-processed with froid \(X_a\)

4.2 Performance analysis of different ML models

In Tables 12, 13, 14, and 15 in the Appendix we show the detailed comparison of the performance among different ML modelsFootnote 16 w.r.t. \(PRA\) by comparing the performance obtained on the original data X with those obtained on the pre-processing returned by froid, indicated with \(X_a\). Table 3 summarizes these results for PRA, GINI, GM, and F1 by reporting the average scores, average ranks, and the number of wins (between using the froid pre-processing \(X_a\) and not using it, i.e., X).Footnote 17 We notice that for PRA, LightGBM achieves the best performance overall and significantly overcomes the model not using froid. Besides PRA, LightGBM with froid is ranked second for GINI and F1. We also observe how CatBoost and Random Forest do not benefit from the usage of froid, reporting the overall best performance w.r.t. GINI and GM. Since we consider PRA as the most reliable indicator for the class imbalance setting, and in order to avoid the repetition of results due to multiple ML models, if not differently specified, in the rest of the paper we only report the performance related to the LightGBM classifier, which we assume to be the best ML model for the datasets analyzed, also because it is notoriously faster than XGBoost (Ke et al., 2017). The non-parametric Friedman test, which compares the average ranks of learning methods over multiple datasets w.r.t. the various evaluation measures, guarantees that these results are statistically significant, i.e., the null hypothesis that all methods are equivalent is rejected (\(p\text {-}value <.001\)). This result is verified for every table presented in this paper. Thus, we avoid repeating the statistical significance of the experiments in the following sections.
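The test itself can be sketched with scipy, assuming one score vector per method with one entry per dataset; the numbers below are arbitrary illustrative values:

```python
# Friedman test over multiple datasets: reject the null hypothesis that all
# methods are equivalent when the p-value is small (here, < .001 in the paper).
from scipy.stats import friedmanchisquare

scores_m1 = [0.81, 0.75, 0.62, 0.90, 0.55]  # e.g., PRA of method 1 per dataset
scores_m2 = [0.78, 0.70, 0.60, 0.88, 0.50]
scores_m3 = [0.60, 0.55, 0.41, 0.72, 0.39]
stat, p_value = friedmanchisquare(scores_m1, scores_m2, scores_m3)
print(stat, p_value)
```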

Table 4 Average scores, ranks and number of wins of PRA, GINI, F1, GM among various datasets obtained by LightGBM trained on different features combinations
Table 5 Average relative importance and ranks of importances by category of features for froid: X original features, \(X_o\) OD over original, \(X_r\) FR over original, \(X_{oo}\) OD over OD, \(X_{or}\) OD over FR, \(X_{rr}\) FR over FR, \(X_{ro}\) FR over OD
Table 6 Average relative features importance grouped by category of features for froid for different clusters of similar datasets

4.3 Pre-processing with different features combinations

In this section, we analyze the impact of the different features adopted by froid. Table 4 reports the average PRA, GINI, F1, GM, the corresponding average ranks, and the number of wins obtained by LightGBM on the various datasets analyzed for different pre-processing inputs:Footnote 18

  • X is the original dataset,

  • \(X_a\) is the output of froid,

  • \(X_{\lnot \zeta }\) is not using the features selection function \(\zeta\),

  • \(X_{\sigma }\) is using variance threshold as feature selection \(\zeta\) with the threshold set to .2,

  • \(X_f\) is not using the original features of X, but only all the unsupervised features learned by froid,

  • \(X_{l}\) is considering \(X_o\) and \(X_r\) besides X,

  • \(X_{ fo }\) is only considering \(X_o\), \(X_{oo}\), \(X_{or}\),

  • \(X_{ fr }\) is only considering \(X_r\), \(X_{rr}\), and \(X_{ro}\),

  • \(X_{O}\) is considering X, and \(X_{ fo }\),

  • \(X_{R}\) is considering X and \(X_{ fr }\),

where \(X_o\), \(X_r\), \(X_{oo}\), \(X_{rr}\), \(X_{ro}\), \(X_{or}\) are defined as in Sect. 3. What emerges is that froid is firmly ranked first w.r.t. the four measures: it has the best performance w.r.t. GM and F1 and is runner-up for PRA and GINI. Also regarding the number of wins, it is always placed second. The overall champion regarding the number of datasets for which it ranks first is LightGBM on the original data X. However, its performance on the remaining datasets places it among the last positions w.r.t. the rank indicator, constantly and statistically worse than various alternatives using froid.

From the comparison between \(X_a\), \(X_{\lnot \zeta }\), and \(X_{\sigma }\), it emerges that (i) the usage of a feature selection method contributes to increasing the performance, and (ii) the usage of more efficient but less adaptive feature selection functions, like the variance threshold, negatively impacts the performance of the model. Thus, accounting for the class imbalance within the model used by the feature selection method is the aspect that contributes to the increase in performance and to the appropriate usage of the most reliable set of features.

Detailed performance on the various datasets is available in Tables 16, 17, 18, and 19 in the Appendix. Here, among the other evaluation measures, we can observe that, in many cases, froid boosts the PRA on the original dataset with an improvement ranging from .1% to 81.2%, with an average boost of 12%. Therefore, this experiment confirms that it makes sense to consider all the alternative unsupervised features created by froid, appropriately selected together with the original ones, and not only a subset of them.

Furthermore, in Table 5 we report the average features importance obtained by froid (\(X_a\)) over all the datasets, paired with the rank of the features importanceFootnote 19 w.r.t. the different categories of unsupervised features adopted. We notice that the features involving FR not mixed with OD, i.e., \(X_r\) and \(X_{rr}\), are, on average, the most beneficial for the classification, followed by single OD \(X_o\) and by the original features X. However, we notice that there is no marked discrepancy in the usage of the features and that the boost of froid is given by the simultaneous usage of all the categories of features created. The detailed relative importance on the various datasets is available in Table 20 in the Appendix. In Table 6, we report the average relative features importance grouped by category of features for froid for the clusters of similar datasets described in Sect. 4.1. The insights of this table are the following: for datasets in cluster A, we have the general behavior already discussed for Table 5. The original features are consistently more important for datasets in cluster D, while for datasets in cluster B the features in \(X_{rr}\) are beneficial. Finally, for datasets in cluster C, the features in \(X_{oo}\) are not used at all, while those in \(X_{ro}\) are more important. Hence, we can infer that the effectiveness of froid is given by the massive production of unsupervised descriptive features that can be helpful in every situation, independently from the dataset characteristics. This improvement can be effectively exploited only by ML models like LightGBM that can appropriately select the most discriminative and informative features and are not harmed by the curse of dimensionality.

Finally, we studied whether there are sets of features in \(X_o\), \(X_{oo}\), \(X_{or}\), \(X_r\), \(X_{rr}\), and \(X_{ro}\) that are never or scarcely used by the classifiers adopted. An analysis performed at the dataset level highlighted that froid uses the original features X in \(\sim 97\%\) of the datasets, while the features generated by froid, i.e., \(X_o\), \(X_r\), \(X_{rr}\), \(X_{ro}\), \(X_{or}\), and \(X_{oo}\), are used in \(\sim 92\%\), \(\sim 84\%\), \(\sim 87\%\), \(\sim 80\%\), \(\sim 80\%\) and \(\sim 71\%\) of the datasets, respectively. Thus, all the types of features are consistently used in more than half of the datasets analyzed. This result emphasizes how froid self-adapts promptly to each dataset's peculiarities.

Table 7 Average scores, average ranks and number of wins of PRA, GINI, F1, GM
Fig. 3 Critical difference plots with Nemenyi at 95% confidence level for PRA (top-left), GINI (top-right), F1 (bottom-left), GM (bottom-right)

Fig. 4 Scatter plots comparing the performance in terms of PRA between pairs of methods. Every point is a dataset; the closer it is to the diagonal, the more similar the performance

4.4 Comparison with state-of-the-art approaches

In this section we compare froid against state-of-the-art approaches for imbalanced learning. Table 7 reports the average PRA, GINI, F1, GM, the corresponding average ranks among all datasets classified with LightGBM for the various competitors, and the number of wins. What emerges is that, overall, froid is the best pre-processing method with respect to the four evaluation measures. Also, there is no clear second-best performer, even though svmsmt and swim are the best ones for some indicators. Thus, froid appears to be markedly better than the other approaches. The comparison of the ranks of all methods against each other is visually represented in Fig. 3 with Critical Difference (CD) diagrams (Demsar, 2006). Two methods are tied if the null hypothesis that their performance is the same cannot be rejected using the Nemenyi test at \(\alpha =.05\). For PRA, froid is the only pre-processing method not tied with approaches ranked lower than seventh. This means that, w.r.t. PRA, the difference between froid and state-of-the-art approaches like svmsmt, swim, or rbo is not statistically significant. Furthermore, even though froid is always statistically tied with some other method, independently of the evaluation measure considered, it always has the highest number of wins, it is in the top three with respect to the rank for PRA, GM, and F1, and it has the highest average PRA and GINI. No other method guarantees such stability across different evaluation measures. To further highlight the improvement of froid vs orig, froid vs svmsmt, and froid vs swim w.r.t. PRA, we report in Fig. 4 scatter plots in which every point represents a dataset and the x-axis and y-axis report the PRA of the method in the corresponding label. The closer a point is to the diagonal, the more similar the performance. The leftmost scatter plot highlights that only in a few cases is not using froid better than using it and that, in some cases, its usage brings a considerable boost in terms of PRA. The central scatter plot signals that froid is never markedly worse than svmsmt, as all the points below the diagonal are close to it, while in some cases the performance of froid is markedly better than that of svmsmt. The rightmost scatter plot highlights that the performance of froid is correlated with that of swim, but the majority of the points lie above the diagonal, signaling the superiority of our proposal.
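For reference, the quantities behind the CD diagrams of Fig. 3 can be computed as follows: average ranks per method and the Nemenyi critical difference \(CD = q_{\alpha }\sqrt{k(k+1)/(6N)}\), where \(q_{\alpha }\) is the tabulated critical value (2.850 for \(k=6\) methods at \(\alpha =.05\); Demsar, 2006). The scores below are random stand-ins for the per-dataset PRA values.

```python
import numpy as np
from scipy.stats import rankdata

# scores[i, j] = PRA of method j on dataset i (random stand-ins here).
rng = np.random.default_rng(0)
scores = rng.random((30, 6))  # N = 30 datasets, k = 6 methods
N, k = scores.shape

# Average rank of each method; rank 1 is best, so rank the negated scores.
avg_ranks = rankdata(-scores, axis=1).mean(axis=0)

# Nemenyi critical difference: CD = q_alpha * sqrt(k(k+1) / (6N)), with
# q_alpha = 2.850 the tabulated value for k = 6 methods at alpha = .05.
q_alpha = 2.850
cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))

# Two methods are tied when their average ranks differ by less than CD.
print("average ranks:", np.round(avg_ranks, 2), "CD:", round(cd, 3))
```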

The drawback of using froid is the markedly higher time required to prepare the dataset. Indeed, while the training time \(T\) remains in the same order of magnitude for all the methods (except xgbod and rbo), the pre-processing time \(P\) becomes consistently higher for froid, being in the order of hours instead of seconds. However, we underline that the standard deviation of the time required by froid is 6,436.18 s, while the median time is 231.22 s. This great variability in the pre-processing time \(P\) indicates that froid's pre-processing time is markedly impacted by the dimensionality of the dataset analyzed. We recall that froid is designed as a pre-processing method to be used when a good and reliable model should be deployed and the time required to build this model is typically not an issue. Details of the performance on the various datasets are available in Tables 21, 22, 23 and 24 in the Appendix. To mitigate the high computational time required by the froid framework, we developed a streamlined variant that relies solely on the most efficient OD and FR methods. Compared to the original froid, this lightweight version is approximately an order of magnitude faster. However, its effectiveness in addressing the imbalanced learning problem falls short in terms of PRA and GINI when compared to both froid and other state-of-the-art approaches: its average rank for the PRA and GINI metrics is approximately 10th and 9th, respectively. Since this lightweight version cannot achieve comparable accuracy without utilizing the full range of OD and FR methods employed by froid, we do not report its results and consider the complete froid framework the most reliable option for achieving optimal performance in solving the imbalanced learning problem.

Table 8 Average scores, average ranks and wins of PRA, GINI, F1, GM for LightGBM trained after different pre-processing methods with/without hyper-parameter tuning (-hpt) or calibration (-cal)

Furthermore, we analyzed the impact of hyper-parameter tuning and calibration procedures (Niculescu-Mizil & Caruana, 2005; Zadrozny & Elkan, 2002) on the performance of LightGBM for the various datasets (footnote 20). We use the suffixes -hpt and -cal to indicate a training procedure involving hyper-parameter tuning or calibration, respectively. Table 8 illustrates the average PRA, GINI, F1, GM, the corresponding average ranks, and the number of wins among all datasets classified with LightGBM on the original dataset (orig) and after froid pre-processing, with and without hyper-parameter tuning (-hpt) or calibration (-cal). The results show that hyper-parameter tuning or calibration alone does not reach the performance achieved by froid. Also, calibration on top of the models trained after froid pre-processing further improves the performance.
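A minimal sketch of the -hpt and -cal variants follows. The exact tuning and calibration protocols used in the experiments are not reproduced here; scikit-learn's randomized search and sigmoid (Platt-style) calibration are plausible stand-ins, and the parameter grid is illustrative.

```python
from lightgbm import LGBMClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV

# Toy imbalanced data; in the paper this would be the (froid-enriched) dataset.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# -hpt: a small randomized search over plausible LightGBM parameters.
search = RandomizedSearchCV(
    LGBMClassifier(verbose=-1),
    {"n_estimators": [100, 300], "num_leaves": [15, 31, 63],
     "learning_rate": [0.01, 0.1]},
    n_iter=5, scoring="average_precision", cv=3, random_state=0,
).fit(X, y)

# -cal: sigmoid (Platt-style) calibration on top of the tuned model.
calibrated = CalibratedClassifierCV(search.best_estimator_,
                                    method="sigmoid", cv=3).fit(X, y)
print(search.best_params_)
```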

Table 9 PRA for supervised outlier detection datasets obtained by LightGBM trained after different pre-processing methods

4.5 Results on supervised outlier detection

In this section, we demonstrate that the pre-processing of froid is beneficial not only for imbalanced learning but also for supervised outlier detection. As competitors, we consider the same approaches used in the previous section, with xgbod being the actual state of the art in this field (Zhao & Hryniewicki, 2018). Table 9 reports the PRA w.r.t. the label "is outlier" on the datasets having ground truth for the outliers. We observe that froid is the best performer for satellite and the second-best performer for glass. This result further stresses the breakthrough idea we introduced in this paper of representing instances through a varied composition of unsupervised OD and FR approaches.
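Assuming that PRA denotes the area under the precision-recall curve, it can be approximated on a held-out split as in the sketch below; the dataset, the split, and the model settings are illustrative stand-ins.

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Toy stand-in for an outlier dataset with ground-truth "is outlier" labels.
X, y = make_classification(n_samples=2000, weights=[0.97], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LGBMClassifier(verbose=-1).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# Area under the precision-recall curve, estimated from the score ranking.
print("PRA:", round(average_precision_score(y_te, scores), 3))
```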

4.6 Features importance on real case studies

In this section, we experimented with the effectiveness of froid in two real case studies (footnote 21). In particular, the diva dataset is a privately released dataset on tax evasion, periodically issued by the Italian Ministry of Economics. In diva, financial activities of 11,187 citizens are recorded. The 18 features describe different aspects of the taxpayers, including their past financial credit score, declared income and property value, debt, and detailed taxation info. The positive label marks 35 relevant tax evaders, accounting for less than \(.3\%\) of the total labels. The hospital dataset collects 14,390 records describing patients through 15 features after a data cleaning phase, including demographic variables, hospital usage aspects, and past medical history. The positives indicate the 130 discharged patients, accounting for roughly \(.9\%\) of the patients. We applied to these datasets the same pre-processing described in Sect. 4.1. The comparison between the performance of the LightGBM obtained on the original dataset and that obtained after running froid is reported in Table 10. The last line reports the relative improvement of froid over the performance on the original dataset. We observe that, by using froid, we can correctly identify many more fraudulent citizens and discharged patients. Indeed, the GINI (footnote 22) obtained by froid increases up to .97 and .84, an increase of \(7.8\%\) and \(1.2\%\), respectively, compared to orig.

Table 10 Performance on the case study datasets with and without froid
Fig. 5 Normalized LightGBM feature importance for diva (left) and hospital (right)

In Fig. 5, we report the normalized feature importance of the ten most important features for both datasets. We immediately notice that, in both cases, among the ten most important features there are many features generated by froid. In particular, we notice that using OD approaches is quite an effective procedure to design highly discriminative features for the diva dataset. Moreover, for hospital, we see the beneficial effects of FR methods: \(KPCA\) is among the top ten most informative features when applied not only to the original data but also to scores derived from OD and FR methods. This analysis confirms that froid's idea of nesting FR methods into OD methods (and vice-versa) is indeed a winning one, as it helps to derive the most discriminative and important features for the classification in the setting of imbalanced learning. This analysis also reveals a weakness in the usage of froid. Indeed, while the overall performance improves, the usage of froid leads to an inevitable loss of interpretability (Guidotti et al., 2018). In fact, if the original dataset is used, the features used by the selected classifier are only those belonging to the original domain, and therefore their meaning is perfectly understandable by any domain expert. On the other hand, when features created by froid become fundamental for the classification, only machine learning experts can understand their meaning (Tomsett et al., 2018). This weakness does not affect classic approaches like smote or adasyn, which do not modify the features describing the dataset.

5 Conclusion

We have presented froid, an unsupervised pre-processing framework for solving imbalanced learning problems through outlier detection and feature reduction methods. froid augments the dimensions used to represent the input data by combining, in different ways, a wide variety of outlier detection and dimensionality reduction methods. This dimensionality augmentation boosts ML classification models in solving the classification task for imbalanced data. A wide and deep experimentation shows that froid overcomes state-of-the-art pre-processing approaches for imbalanced learning, at the price of a non-negligible time required to build all the novel dimensions. Our insight for such a boost in performance is that froid does not generate any novel synthetic data but only amplifies the expressiveness of the existing records. The effectiveness of froid in its current form lies in creating a mass of descriptive, discriminative features that can be suitable for any dataset, i.e., a given subset might not be helpful for datasets with specific characteristics, but it might be useful for other datasets. Indeed, we might see froid as a sort of "brute-force" approach generating a massive number of descriptive features in the hope that one of them, or better, a combination of some of them, together with the original features, brings a boost to the discriminative power of a ML model.

Starting from the results obtained with froid, several future research directions can be pursued. First, we would like to design a pre-processing step, to be applied before froid, that is responsible for selecting the most appropriate subset of features to generate, simultaneously improving the performance and reducing the running time. Second, inspired by Shi et al. (2020) and Ibrahim (2021), we would like to investigate the performance obtained by combining froid with one of the state-of-the-art oversampling and/or undersampling approaches, as well as with cost-sensitive classifiers. For instance, if smote is applied after froid, then it will generate synthetic minority instances approximating the features learned by froid. On the one hand, this might boost the discriminative power of ML models learned on top of these datasets, enriched both in terms of features and in terms of records. On the other hand, there is the risk that smote would not be able to appropriately generate synthetic instances with the features learned by froid, leading to a degradation of the performance. Third, we would like to design an extremely randomized version of froid that adopts bootstrap samples and random feature selection to simultaneously accelerate the procedure and exploit a sort of ensemble strategy. Finally, we would like to check whether the froid approach can be used to solve the imbalanced learning problem also in multi-class settings and for other data types, such as images or time series, through the usage of autoencoder approaches.