Effective Class-Imbalance Learning Based on SMOTE and Convolutional Neural Networks

Joloudari, Javad Hassannataj; Marefat, Abdolreza; Nematollahi, Mohammad Ali; Oyelere, Solomon Sunday; Hussain, Sadiq

doi:10.3390/app13064006


¹ Department of Computer Engineering, Faculty of Engineering, University of Birjand, Birjand 9717434765, Iran

² Department of Artificial Intelligence, Technical and Engineering Faculty, South Tehran Branch, Islamic Azad University, Tehran 1477893780, Iran

³ Department of Computer Sciences, Fasa University, Fasa 7461686131, Iran

⁴ Department of Computer Science, Electrical and Space Engineering, Luleå University of Technology, 93187 Skellefteå, Sweden

⁵ Examination Branch, Dibrugarh University, Dibrugarh 786004, Assam, India

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(6), 4006; https://doi.org/10.3390/app13064006

Submission received: 15 February 2023 / Revised: 10 March 2023 / Accepted: 18 March 2023 / Published: 21 March 2023


Abstract:

Imbalanced Data (ID) is a problem that deters Machine Learning (ML) models from achieving satisfactory results. ID occurs when the number of samples belonging to one class outnumbers that of the other by a wide margin, biasing the learning process of such models towards the majority class. In recent years, several solutions have been put forward to address this issue, which opt for either synthetically generating new data for the minority class or reducing the number of majority class samples to balance the data. Hence, in this paper, we investigate the effectiveness of methods based on Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) mixed with a variety of well-known imbalanced data solutions, namely oversampling and undersampling. Then, we propose a CNN-based model in combination with SMOTE to effectively handle imbalanced data. To evaluate our methods, we have used KEEL, breast cancer, and Z-Alizadeh Sani datasets. In order to achieve reliable results, we conducted our experiments 100 times with randomly shuffled data distributions. The classification results demonstrate that the mixed Synthetic Minority Oversampling Technique (SMOTE)-Normalization-CNN outperforms different methodologies, achieving 99.08% accuracy on the 24 imbalanced datasets. Therefore, the proposed mixed model can be applied to imbalanced binary classification problems on other real datasets.

Keywords:

imbalanced data; resampling; normalization; deep neural network; convolutional neural network

1. Introduction

Learning a classifier from an imbalanced dataset is an important topic and still a complicated problem in supervised learning. In other words, class imbalance is a long-standing challenge in classification problems [1,2,3,4,5], in which a dataset contains a disproportionately larger number of samples of the majority class. Imbalanced datasets appear in many real-world research areas, such as life sciences [6], facial age approximation [7], anomaly detection [8], detecting counterfeit credit card transactions [9], medical imaging [10], DNA sequence identification [11], and so forth. For an imbalanced binary classification problem, samples are typically characterized by two classes, namely majority and minority.

1.1. Context of the Study

In general terms, the minority class often represents samples of higher importance and interest than the majority class. Nevertheless, the majority class usually has considerably more samples than the minority class, and in some cases the imbalance is extreme.

Different situations can occur in confronting the imbalanced datasets, and four common cases are depicted in Figure 1, where the blue-filled circles represent the samples of the majority class; in contrast, the red circles denote the minority class [1]. It has been shown that the type of data complexity is the principal determining factor of classification performance reduction [2].

Most of the classical classification methods, such as decision trees [2,3,4], KNN [5,6], and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [7,8], usually train models that maximize overall accuracy, sometimes ignoring the minority class [9,10,11]. Hence, several techniques have been designed and implemented to handle imbalanced binary classification problems. Among these techniques, oversampling and undersampling are well-known [12,13,14,15]. These common undersampling and oversampling algorithms modify the initial class distribution of the dataset by excluding majority class samples or adding minority class samples.

Cost-sensitive learning algorithms are among the solutions for the above-mentioned issues of imbalanced data [16,17,18]. Such algorithms assign misclassification costs to the classes, typically lower costs for the samples of the majority class and higher costs for the minority class. In addition, Bagging [19] and Boosting [20] methods, which are based on ensemble learning, are among the other commonly used methods to handle imbalanced class problems [7,8,21,22].
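As a rough illustration of the cost-sensitive idea, the sketch below assigns each class a cost that is inversely proportional to its frequency. This "balanced" weighting is only one common heuristic and is an assumption here; the function name `inverse_frequency_costs` and the toy labels are hypothetical and do not come from the cited works.

```python
from collections import Counter


def inverse_frequency_costs(labels):
    """Return a cost per class that grows as the class becomes rarer."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    # Assumed heuristic: cost = n_samples / (n_classes * class_count)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy example: 90 majority samples (label 0) and 10 minority samples (label 1)
labels = [0] * 90 + [1] * 10
costs = inverse_frequency_costs(labels)
```

With these toy labels, errors on the minority class are penalized nine times more heavily than errors on the majority class.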

In this paper, we use several undersampling and oversampling methods in the process of implementing our methodology, which are briefly introduced below:

RUS: Among undersampling methods, random undersampling (RUS) is the simplest one, in which the samples of the majority class are randomly removed until suitable balanced data are obtained [23].
Tomek Links: Some of the undersampling techniques focus on overlap elimination. For example, the Tomek Links [24] method, which is a modification of the Condensed Nearest Neighbor rule, is one of these methods.
One-Sided Selection: As a development of the Tomek Links algorithm, one can refer to the One-Sided Selection or briefly OSS method [25] that merges Tomek Links and the Condensed Nearest Neighbor algorithms.
NearMiss: Another popular undersampling method, which removes majority class samples according to their distances to minority class samples. When two samples belonging to different classes are very close to each other, it removes the sample belonging to the larger class [5].
ROS: Among the oversampling algorithms, random oversampling (ROS) is the simplest one that merely selects and copies the samples from the minority class randomly, leading to more balanced data [26].
SMOTE: The best-known oversampling method is the Synthetic Minority Oversampling Technique (SMOTE) [14,27,28], which leverages the kNN algorithm to identify the neighbors of minority class samples and generates each new sample by interpolating between a minority class sample and one of its k nearest neighbors, selected randomly [29].

1.2. Research Gap

It is worth noting that the methods mentioned above may cause some unexpected issues. For example, undersampling techniques may discard some valuable data, which could be vital for training a classifier. In contrast, oversampling algorithms may cause overfitting. Furthermore, for cost-sensitive learning techniques, it is not straightforward to determine the exact misclassification cost, and different misclassification costs might result in different induced outcomes. Moreover, Bagging and Boosting algorithms may exclude some valuable data because they apply sampling in every single iteration, and they may face an overfitting problem. Consequently, the classification results obtained by these methods are not stable.

1.3. Motivation and Contribution

In order to address these problems, this paper proposes two deep learning (DL)-based methods mixed with different resampling methods for better tackling the issue of an imbalanced dataset. The existing DL-based methods, especially CNN architectures, have been employed in a wide variety of challenges, and they have proven to be extremely powerful in learning from balanced datasets. However, their efficacy has not been satisfactorily investigated when tackling imbalanced datasets [30].

CNNs are architectures that contain convolutional blocks and can provide an end-to-end classification algorithm. These blocks are stacks of different layers, namely convolutional layers, pooling layers, and activation functions. The most significant attributes of such models are their learning capacity with fewer parameters and their translational invariance with respect to the input data. In CNNs, the input data are fed to multiple convolutional blocks, which as a whole are usually called the backbone, followed by a sequence of fully connected layers that perform the classification.

The training procedure is performed using Focal Loss (FL), which optimizes the abstraction learned by the models to handle complex samples better.

In particular, the main contributions of this paper are threefold. First, 24 popular imbalanced datasets from KEEL Dataset Repository, breast cancer dataset from KDD Cup, and Z-Alizadeh Sani are chosen. The proposed pipeline is trained and validated 100 times to achieve more reliable results. Second, this paper is the first research work that extensively investigates the most efficient mix of Deep Neural Networks (DNN) with the famous resampling method, SMOTE, for imbalanced data. Lastly, the mixed SMOTE-Normalization-CNN methodology has proved to produce superior results in comparison with related research works in terms of accuracy, precision, recall, Geometric mean (G-Mean), specificity, Area Under the Receiver Operating Characteristic Curve (AUC-ROC), and Kappa.

The rest of the paper is outlined as follows: Section 2 provides a brief review of the existing works on imbalanced datasets; Section 3 presents the details of our proposed methodology; Section 4 includes the implementation setup and the evaluation process and then focuses on the results obtained by the proposed methods; Section 5 presents a discussion on comparing our method to the others, and the last section draws a reasoned conclusion and future work.

2. Related Work

To better handle imbalanced data, Li et al. [31] proposed a developed AdaBoost algorithm called Adaboost-A, which is based on the AUC evaluation metric. In fact, by considering the impact of misclassification probabilities and AUC, the algorithm mentioned above betters the computational performance of the Adaboost algorithm. Furthermore, this study proposed an ensemble learning algorithm named PSOPD-AdaBoost-A, by which the multipliers of Adaboost weak classifiers were optimized.

In [32], the authors provide a detailed exploratory comparison between the problem of handling class overlapping and class imbalances using a full range of class overlaps along with a large scale of class imbalance degrees. The rest of this study contains a thorough review of the current methods and solutions for handling the imbalanced data classification problem, characterized by two categories: distribution-based and class overlap-based algorithms.

The study [33] proposed an undersampling technique that mainly uses the well-known Naïve Bayes classifier. Based on a random primary selection, this classifier is leveraged to select the most informative samples from the existing training dataset. At first, the model is trained on a small training set, and after that, the base model is taught by an iterative teaching method over the current samples. The practical outcomes showed that the proposed undersampling technique is comparable to other resampling techniques.

In [34], Dablain et al. proposed a novel deep oversampling method called DeepSMOTE, which contains three principal parts and mainly uses the properties of the effective SMOTE method. Despite being simple, the method is efficient and powerful. Its three parts are an encoder/decoder structure, SMOTE-based oversampling, and an improved loss function. The results of this study show the advantages of the proposed method, especially in GAN-based oversampling cases.

The research study [35] investigates the effects of resampling on the performances of multi-class Artificial Neural Networks. Different resampling techniques (both undersampling and oversampling) were examined on several cybersecurity datasets. Furthermore, to determine the results of the proposed methods, various evaluation metrics were used. Finally, four observed patterns were reported that compare the impacts of resampling on the evaluation metrics and model training duration.

In another study [36], the authors dealt with the issue of the probable bias and tendency of a learning classifier toward the majority of samples for an imbalanced dataset. This work implemented an innovative three-dimensional framework that includes a discriminator, a generator, and a classifier, together with decision boundary regularization. The remarkable aspect of the proposed method is training a generator in association with a classifier. The reported results show that the technique performs better than the existing methods.

To improve the efficiency and functioning of undersampling methods for imbalanced data, Xie et al. [37] proposed a new undersampling technique that leverages consecutive density peaks to gradually take out samples from the majority class of imbalanced data. In order to determine the importance of the samples of the majority class, two factors were considered, which generate a sequence of samples for learning classifiers. The study compared the implemented algorithm to six well-known undersampling methods over 40 public benchmarks, and the results verify the outperformance of the proposed technique.

In order to design learning classifiers that provide stable performances on imbalanced datasets, the study [38] proposed three different methods. These methods are mainly based on genetic algorithms which automatically specify the ratios of samples for oversampling, undersampling, and hybrid sampling techniques. The implemented algorithms were examined on 14 imbalanced datasets, and the results show that they achieved the best AUC compared to random sampling methods.

To decrease the domination of the majority class samples, the study [39] implemented a novel hybrid method called CDSMOTE, which uses class decomposition and oversampling on the minority class samples. Contrary to general undersampling algorithms, this proposed method keeps the majority class samples, leading to more balanced data. The algorithm was examined on 60 imbalanced public datasets, and the results show comparable performance compared to the existing algorithms.

By introducing a new algorithm called SMOTE-LOF, Maulidevi and Surendro [40] attempted to refine SMOTE. This method distinguishes the noise that arises when dealing with imbalanced datasets by adding the Local Outlier Factor (LOF). By examining the proposed algorithm over different imbalanced datasets, the results were compared to SMOTE. Unlike small data samples, for a large-scale dataset with a small imbalance ratio, SMOTE-LOF outperforms SMOTE.

In order to eliminate the overlap between the majority class and the minority class in an imbalanced dataset and obtain a balanced and normalized class distribution, the study [29] implemented two innovative density-based methods. These methods were density-based undersampling (DB_US) and density-based hybrid sampling (DB_HS). The first method applies merely an undersampling algorithm, while the second implements both undersampling and oversampling approaches. In addition, the balanced datasets were modeled employing Random Forest (RF) and Support Vector Machine (SVM) classifiers. As a result, the two proposed methods eliminated high-density samples from the majority class and removed noise from both classes. The performance of these methods was examined on 16 imbalanced datasets.

In the literature [41], the authors proposed a novel classification method called the Bagging Supervised Autoencoder Classifier (BSAC) to model credit scoring problems. This algorithm essentially leverages the superlative implementation of a supervised autoencoder based on the axioms of multi-task learning. Furthermore, BSAC tackles the issue of imbalanced datasets by engaging a variation of the Bagging procedure based on undersampling techniques. The examinations of benchmark and real-world credit scoring datasets show the robustness and efficiency of BSAC.

To improve the performance of the basic antlion optimization (ALO), a novel modified antlion optimization method (MALO) was introduced in [42]. This algorithm adds an extra variable that depends on the step size of the ants when revising the antlion position. Furthermore, MALO is adapted to the problem of sample reduction to achieve better performance according to various metrics. MALO was examined on several benchmarks and on balanced and imbalanced datasets. The results show that MALO outperforms the basic ALO method and some other comparable algorithms.

In [43], Yang et al. implemented a sampling-level technique called the gravitational balanced multiple kernel learning (GBMKL) algorithm, which incorporates the gravity approach to produce the gravitation-balanced midpoint samples (GBMS) placed on the classification boundary. Moreover, to improve the generalization efficiency, the classification boundary was modified according to the nearest neighbors of the boundary (NNB) samples. Finally, two regularization terms that correspond to GBMS and NNB were formulated to prevent overfitting. The resulting method was examined on 54 artificial and real-life imbalanced datasets, and the outcomes show the dominance of the implemented method.

Tanimoto et al. in [44] studied the near-miss positive samples in the class of imbalanced datasets. They showed that if the true positive samples are severely limited, the accuracy of the proposed model could be increased by obtaining modified label-like side information positivity to identify near-miss samples from true negatives. Furthermore, the proposed method follows learning using privileged information that leverages side information for training the desired model devoid of predicting the side information itself. The results of the experiments show the outperformance of the method in contrast to the existing algorithms.

The research study [45] proposed a new development of SMOTE by merging it with the Kalman filter. After applying SMOTE to the given dataset, the implemented algorithm, called Kalman-SMOTE (KSMOTE), excludes the noisy samples in the resultant dataset, which contains both the initial data and the synthetically added samples. The method was examined on a broad range of datasets, and the results show that the implemented algorithm outperforms the existing methods.

Since oversampling techniques cannot usually achieve high performance in the presence of noise, the study [46] implemented an innovative oversampling algorithm called IR-SMOTE that handles this issue. By sorting the majority class samples and applying the k-means clustering algorithm, the noise in minority class clusters is eliminated. After that, using the kernel density estimation method, the number of synthetic samples is adaptively assigned to each cluster. Finally, building on random-SMOTE, the algorithm was improved to add new samples with ensured diversity.

The literature [30] studied the performance of convolutional neural networks (CNNs) in the presence of imbalanced data for classification problems. In order to explore this probable impact, the research used MNIST, CIFAR-10, and ImageNet as benchmarks, alongside undersampling, oversampling, two-phase training, and thresholding. The results show that imbalanced data have a detrimental effect on the performance of the proposed method. Furthermore, one should implement oversampling to the level that removes the imbalance, while the extent of the imbalance determines the ideal undersampling ratio. In addition, oversampling does not lead to the overfitting of CNNs.

Fault diagnosis of complex equipment, which plays an important role in industries, is a crucial technology, and CNN is a general tool for this purpose. In this case, faults are not common, which leads to imbalanced data, and therefore, one cannot apply CNN methods directly. In order to address this problem, a hierarchical training CNN is implemented in [47]. At first, the method uses a number-resampling technique to balance data. Then, a magnet-loss pretraining algorithm is provided to handle the overlap between diverse faults. The proposed method was examined on the public dataset CWRU with an accuracy of 94.28%.

3. Methodology

In this paper, we have applied our methods to various datasets collected from benchmark repositories such as the KEEL (https://sci2s.ugr.es/keel/imbalanced.php, accessed on 17 March 2023), breast cancer (https://www.kdd.org/kdd-cup, accessed on 17 March 2023), and Z-Alizadeh Sani (https://archive.ics.uci.edu/ml/datasets/Z-Alizadeh+Sani, accessed on 17 March 2023) datasets in order to address the class imbalance problem. Figure 2 demonstrates an overview of our proposed methodology, whose details are included in this section.

Based on Figure 2, the main steps in our methodology include preprocessing, classification, and analysis of models.

3.1. Dataset Preprocessing

As stated before, the most acute problem in classifying imbalanced data is that classifiers become biased toward the majority class. There are several methods to overcome this issue which are generally called resampling techniques. By adding minority class samples or removing samples from the majority class, resampling turns the data into a more balanced one. In this regard, there are two principal methods: oversampling and undersampling.

Oversampling algorithms generate new samples, duplicated or synthetic, that belong to the minority class. In contrast, the undersampling techniques delete samples that belong to the majority class to afford balance to the dataset [23].

As a preprocessing step in our methodology, we have utilized various well-known oversampling and undersampling techniques for balancing the dataset. Normalization and dataset splitting are the next steps in data preprocessing. These are elaborated on in the following.

3.1.1. Oversampling Techniques

Random Over-Sampling (ROS)

The first and simplest method in this field is random oversampling (ROS), which aims to balance the class distribution by increasing the number of samples in the minority class until the class distributions tend to balance. This approach is non-heuristic, meaning that it does not rely on any intelligent decision rule. Random oversampling is usually applied to the level that removes the imbalance. By merely duplicating samples from the minority class, ROS balances the data used to train the model. However, duplicating similar samples may lead to the problem of overfitting, particularly for the samples belonging to the minority classes [26,48]. Figure 3 shows an illustration of the oversampling technique.
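Random oversampling can be sketched in a few lines. The following is a minimal illustration rather than the implementation used in this study; the function name `random_oversample`, the fixed seed, and the toy data are assumptions made for the example.

```python
import random


def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority samples until both classes have equal size."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    n_majority = len(y) - len(minority)
    # Draw minority indices with replacement to fill the gap between the classes
    extra = [rng.choice(minority) for _ in range(n_majority - len(minority))]
    X_res = list(X) + [X[i] for i in extra]
    y_res = list(y) + [y[i] for i in extra]
    return X_res, y_res


# Toy example: five majority samples (label 0) and one minority sample (label 1)
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = [0, 0, 0, 0, 0, 1]
X_res, y_res = random_oversample(X, y, minority_label=1)
```

Because the added samples are exact copies, the resampled minority class carries no new information, which is the source of the overfitting risk noted above.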

Synthetic Minority Oversampling Technique

Synthetic Minority Oversampling Technique (SMOTE) [14,49,50,51] is another resampling technique that aims to increase the number of minority class samples by creating synthetic samples in the minority class and is applied for balancing datasets with a high imbalance ratio. In order to avoid the issue of overfitting, new samples are generated synthetically rather than by simply duplicating existing ones.

The main idea behind SMOTE is to generate new samples of data in the minority class by interpolation between samples of this class that are in close vicinity of each other [4,52]. Thus, SMOTE increases the number of minority class examples within an imbalanced dataset and consequently enables the classifier to achieve better generalizability. The formal procedure for SMOTE can be explained as follows. Firstly, N, the desired amount of oversampling, should be set to an integer. This number can be chosen so that the dataset becomes balanced with a ratio of 1:1 between the classes. Then, three main steps are taken iteratively: (1) a sample that belongs to the minority class is selected randomly; (2) the K (default 5) nearest neighbors of this sample are identified; (3) N of these K neighbors are selected randomly for interpolation and generating new samples [53]. An intuition of how SMOTE works is shown in Figure 4.
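The interpolation step can be illustrated with a simplified sketch. It draws one neighbor per synthetic point instead of N neighbors per base sample, so it is a variant of the procedure described above and not the exact algorithm of [53]; the function name `smote`, the seed, and the toy points are assumptions, and at least two minority samples are required.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample, excluding the sample itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # relative position on the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with four two-dimensional samples
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6, k=3)
```

Every synthetic point lies on a line segment between two existing minority samples, so the new data stay inside the region already occupied by the minority class.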

3.1.2. Undersampling Techniques

Random Under-Sampling

The simplest technique among undersampling methods is Random Under-Sampling (RUS), which is a data-level approach. Here, the algorithm tries to reduce the number of the majority class samples to balance data. In RUS, we randomly select samples within the majority class and delete them, which makes the distribution of a highly imbalanced dataset more balanced. RUS is a non-heuristic approach that does not behave as smartly as some other algorithms. Its main drawback is the high probability of losing valuable information within a dataset [4]. More precisely, the principal issue with this method is that there is no control over what information about the majority class is being thrown away. As a result, the samples that contain information and details about the decision boundary may be removed, and that valuable information is lost [23]. An overview of RUS is shown in Figure 5.
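A minimal sketch of random undersampling is given below. It is illustrative only; the function name `random_undersample`, the seed, and the toy data are assumptions, and the sketch always reduces the majority class to exactly the size of the minority class.

```python
import random


def random_undersample(X, y, majority_label, seed=0):
    """Keep all minority samples and an equally sized random subset of the majority."""
    rng = random.Random(seed)
    majority = [i for i, label in enumerate(y) if label == majority_label]
    minority = [i for i, label in enumerate(y) if label != majority_label]
    # Sample majority indices without replacement; all others are discarded
    kept_majority = rng.sample(majority, len(minority))
    keep = sorted(kept_majority + minority)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy example: five majority samples (label 0) and two minority samples (label 1)
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.8], [0.9]]
y = [0, 0, 0, 0, 0, 1, 1]
X_res, y_res = random_undersample(X, y, majority_label=0)
```

The discarded majority samples are chosen blindly, which is why informative samples near the decision boundary can be lost.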

Tomek Links

Tomek Links (TL) [24] is another effective undersampling technique used for balancing the data. TLs are pairs of samples that are very close to each other but belong to different classes. These samples lie next to the borderline between classes. In mathematical terms, given a pair of samples $(S_i, S_j)$ from different classes of the dataset, we say that there is a TL between the two samples if there is no other sample $S_l$ for which at least one of the following two inequalities is satisfied:

$$\delta(S_i, S_l) < \delta(S_i, S_j) \quad \text{or} \quad \delta(S_j, S_l) < \delta(S_i, S_j) \tag{1}$$

where $\delta(x, y)$ is the distance between $x$ and $y$ [54].

Generally, one of the two samples that form a TL is considered a noisy sample, or the two samples together are considered borderline [4]. In this case, by eliminating the samples of the majority class that belong to the pairs forming TLs, the distance between the two classes increases, and the dataset becomes more balanced [23]. See Figure 6, which shows how TLs can be used to reduce the number of samples in the majority class.
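The sketch below finds Tomek links as pairs of opposite-class samples that are each other's nearest neighbor, and then removes the majority member of every link. It is a brute-force illustration under the assumption of Euclidean distance; the function names and the toy data are hypothetical.

```python
import math


def tomek_links(X, y):
    """Return index pairs (i, j) of opposite-class mutual nearest neighbours."""
    def nearest(i):
        # Index of the sample closest to X[i], excluding i itself
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))

    links = []
    for i in range(len(X)):
        j = nearest(i)
        if i < j and y[i] != y[j] and nearest(j) == i:
            links.append((i, j))
    return links


def remove_majority_in_links(X, y, majority_label):
    """Drop the majority-class member of every Tomek link."""
    drop = {idx for pair in tomek_links(X, y) for idx in pair
            if y[idx] == majority_label}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy one-dimensional data: the samples at 2.9 (class 0) and 3.1 (class 1) form a link
X = [[0.0], [1.0], [2.0], [2.9], [3.1], [5.0]]
y = [0, 0, 0, 0, 1, 1]
links = tomek_links(X, y)
X_res, y_res = remove_majority_in_links(X, y, majority_label=0)
```

In this toy example the only link is the pair at indices 3 and 4, and removing the majority sample at 2.9 widens the gap between the two classes.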

One-Sided Selection (OSS)

One-Sided Selection (OSS) [25] is proposed as an undersampling technique whose main idea is to combine TL and Condensed Nearest Neighbor Rule. In order to address the issue of imbalanced datasets, this approach leaves the minority class samples completely intact. It filters out the redundant samples in the majority class through a modification of the condensed nearest-neighbor rule [55].

In OSS, $\delta(x, y)$ is supposed to be a distance value that meets the requirements for being a TL, where $x$ is chosen from the majority class and $y$ is selected from the minority class. This way, two scenarios can happen: (1) a TL is found to be on the class boundary when both $x$ and $y$ lie in the right class regions; (2) a TL is found to be inside one of the class regions when either $x$ or $y$ lies in the wrong region. OSS was introduced to decrease the number of majority class samples by omitting the data points which are borderline or noisy [56]. Figure 7 illustrates a diagram of the OSS technique.

NearMiss

The last undersampling technique used in this study and introduced here is NearMiss [5]. This method is based on the K-nearest neighbors algorithm and has three variants: NearMiss-1, NearMiss-2, and NearMiss-3. The main idea behind NearMiss is to consider the mean distances of samples from the majority class to the samples from the minority class.

Contrary to randomly removing samples from the majority class, these methods choose the samples to eliminate in an informed way. NearMiss-1 retains the majority class samples whose mean distances to the three nearest samples of the minority class are minimal and removes the rest. Similarly, NearMiss-2 retains the samples from the majority class with minimal average distances to the three farthest minority samples. Finally, NearMiss-3 selects a certain number of the closest samples of the majority class for every minority class sample [57].

As claimed in [5], the results of experiments showed that NearMiss-2 has a better performance than NearMiss-1 and NearMiss-3. Furthermore, it outperforms the RUS technique [23].

It is worth noting that NearMiss can be tuned in two respects: the variant, which can be chosen from 1, 2, and 3, and the number of neighbors considered when calculating the mean distances, which is three by default. An outline of the NearMiss algorithms is shown in Figure 8.
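A small sketch of the NearMiss-1 variant follows. In this sketch, as in common implementations, the majority samples with the smallest mean distance to their nearest minority samples are the ones kept, and all other majority samples are removed until the classes are balanced. The function name `near_miss_1`, the Euclidean distance, and the toy data are assumptions for illustration.

```python
import math


def near_miss_1(X, y, majority_label, n_neighbors=3):
    """Keep the majority samples closest, on average, to their nearest minority samples."""
    majority = [i for i, label in enumerate(y) if label == majority_label]
    minority = [i for i, label in enumerate(y) if label != majority_label]

    def mean_dist_to_nearest_minority(i):
        # Mean distance from sample i to its n_neighbors closest minority samples
        dists = sorted(math.dist(X[i], X[j]) for j in minority)[:n_neighbors]
        return sum(dists) / len(dists)

    # Retain as many majority samples as there are minority samples
    kept = sorted(majority, key=mean_dist_to_nearest_minority)[:len(minority)]
    keep = sorted(kept + minority)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy one-dimensional data: six majority samples (label 0), three minority (label 1)
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [6.5], [7.0]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
X_res, y_res = near_miss_1(X, y, majority_label=0)
```

Here the three majority samples at 3.0, 4.0, and 5.0 survive because they are closest to the minority class, while the samples at 0.0, 1.0, and 2.0 are discarded.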

3.1.3. Normalization

Normalization is one of the most crucial preprocessing steps in machine learning. It can be performed by scaling or transforming the original data to balance the contributions of different features in data samples. In this study, we have normalized the input data so that the values lie between zero and one.
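One common way to map features to the range between zero and one is min-max scaling, sketched below. The text does not state which scaling was used, so treating it as per-feature min-max scaling is an assumption, as are the function name `min_max_normalize` and the toy values.

```python
def min_max_normalize(X):
    """Rescale every feature (column) of X linearly to the interval [0, 1]."""
    columns = list(zip(*X))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # Constant features are mapped to 0.0 to avoid division by zero
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in X
    ]


# Two features with very different scales
X = [[10.0, 200.0], [20.0, 400.0], [30.0, 300.0]]
X_scaled = min_max_normalize(X)
```

After scaling, both features span the same range, so neither dominates purely because of its units.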

3.1.4. Split Dataset

Further, because the low number of samples in the datasets makes the classification results extremely unstable, we have trained and evaluated our models for 100 runs. In each run, we first randomly shuffle and split the data into training and testing sets, then train the model on the training set for 2000 epochs, and finally evaluate it on the testing set.
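The repeated shuffle-and-split protocol can be sketched as a loop that averages a score over many random splits. The test fraction of 0.2, the function names, and the trivial majority-class baseline that stands in for model training are all assumptions made to keep the example self-contained.

```python
import random


def repeated_holdout(X, y, evaluate, n_runs=100, test_fraction=0.2, seed=0):
    """Shuffle, split, and evaluate n_runs times; return the mean score and all scores."""
    rng = random.Random(seed)
    n_test = max(1, int(len(X) * test_fraction))
    scores = []
    for _ in range(n_runs):
        order = list(range(len(X)))
        rng.shuffle(order)  # a new random split in every run
        test_idx, train_idx = order[:n_test], order[n_test:]
        train = ([X[i] for i in train_idx], [y[i] for i in train_idx])
        test = ([X[i] for i in test_idx], [y[i] for i in test_idx])
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores), scores


def majority_baseline(train, test):
    """Stand-in for train-then-evaluate: always predict the most frequent training label."""
    _, y_train = train
    _, y_test = test
    guess = max(set(y_train), key=y_train.count)
    return sum(label == guess for label in y_test) / len(y_test)


# Toy imbalanced data: 45 majority and 5 minority samples
X = [[float(i)] for i in range(50)]
y = [0] * 45 + [1] * 5
mean_accuracy, scores = repeated_holdout(X, y, majority_baseline)
```

Even this trivial baseline reaches high accuracy on imbalanced data, which illustrates why accuracy alone is reported alongside metrics such as recall and G-Mean.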

3.2. Models

In this section, we introduce our two proposed Artificial Neural Network (ANN)-based models, including Deep Neural Network and Convolutional Neural Network.

3.2.1. Proposed Deep Neural Network

Deep Neural Networks (DNNs) have recently become among the favorite approaches in various fields in the domain of Artificial Intelligence [58]. These networks, which are commonly called models, are characterized by several layers that contain a huge number of computational units. These units, which are interconnected, meaning that the output of one unit is the input of another, are conceived as an imitation of the structure of the biological brain. In mathematical terms, they are a set of parametrized linear and non-linear transformations capable of being adjusted in order to output abstractions of the input data [59]. This capability comes from the amalgamation of multiple layers full of perceptrons. Although a single perceptron cannot handle data that are not linearly separable, perceptrons are the basis of Multi-Layer Perceptrons (MLPs), whose ability to transform highly non-linear data makes them a powerful and efficient tool in machine learning [60].

The first proposed method in this paper is a DNN-based model. The architecture of this model is demonstrated in Figure 9.

As observed in Figure 9, our DNN model comprises a fully connected layer followed by an activation function and a batch normalization layer; then a second fully connected layer, an activation function, and a batch normalization layer; and finally a dropout layer and a single-neuron fully connected layer.
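The layer ordering described above can be sketched as a framework-free forward pass. The layer widths (4, 8, 8, 1), the ReLU activation, the sigmoid on the output neuron, the dropout rate, and the random weights are all assumptions made for illustration; they are not taken from Figure 9, and the batch normalization here omits learned scale and shift parameters.

```python
import math
import random


def linear(batch, weights, bias):
    # weights[j] holds the input weights of output unit j
    return [[sum(x * w for x, w in zip(row, unit)) + b
             for unit, b in zip(weights, bias)] for row in batch]


def relu(batch):
    return [[max(0.0, v) for v in row] for row in batch]


def batch_norm(batch, eps=1e-5):
    # Normalise each feature over the batch to zero mean and unit variance
    cols = list(zip(*batch))
    means = [sum(c) / len(c) for c in cols]
    variances = [sum((v - m) ** 2 for v in c) / len(c) for c, m in zip(cols, means)]
    return [[(v - m) / math.sqrt(var + eps)
             for v, m, var in zip(row, means, variances)] for row in batch]


def dropout(batch, rate, rng, training):
    if not training:
        return batch  # dropout is the identity at evaluation time
    return [[0.0 if rng.random() < rate else v / (1.0 - rate) for v in row]
            for row in batch]


def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dnn_forward(batch, params, rng, training=False, rate=0.5):
    (w1, b1), (w2, b2), (w3, b3) = params
    h = batch_norm(relu(linear(batch, w1, b1)))   # FC -> activation -> batch norm
    h = batch_norm(relu(linear(h, w2, b2)))       # FC -> activation -> batch norm
    h = dropout(h, rate, rng, training)           # dropout
    logits = linear(h, w3, b3)                    # single-neuron FC layer
    return [1.0 / (1.0 + math.exp(-row[0])) for row in logits]


rng = random.Random(0)
params = [init_layer(4, 8, rng), init_layer(8, 8, rng), init_layer(8, 1, rng)]
batch = [[rng.random() for _ in range(4)] for _ in range(5)]
probs = dnn_forward(batch, params, rng)
```

The output is one probability per input sample, which matches the binary classification setting of this study.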

3.2.2. Proposed Convolutional Neural Network

Convolutional operations are the main components in Convolutional Neural Network (CNN)-based models. These operations enable CNNs to extract and learn the salient features present in the input data [61]. A CNN comprises different layers that output feature maps, which result from sliding different kernels over the input and applying activation functions [62]. The major advantage of CNNs over DNNs is their capability to reduce the computational cost in each layer. The convolved features extracted by these models are compact representations of the input data, which can be further used in downstream tasks such as classification [63].
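To make the sliding-kernel idea concrete, the sketch below applies a single one-dimensional kernel to a feature vector and then a ReLU activation. As in most deep learning frameworks, the kernel is not flipped, so the operation is strictly a cross-correlation; the kernel values and the input are arbitrary choices for illustration.

```python
def conv1d(signal, kernel, bias=0.0):
    """Slide the kernel over the signal (stride 1, no padding) and return the responses."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(signal) - k + 1)]


def relu(values):
    return [max(0.0, v) for v in values]


# A six-element input and a three-element kernel that responds to increasing values
features = [1.0, 3.0, 2.0, 5.0, 4.0, 1.0]
kernel = [-1.0, 0.0, 1.0]
feature_map = relu(conv1d(features, kernel))
```

The same three kernel weights are reused at every position, which is where the parameter savings of convolutional layers come from.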

In this paper, the second method proposed for the binary classification of the input data is a CNN-based model. The architecture of this model is shown in Figure 10.

As can be seen in Figure 10, our proposed model consists of four layers (two 1-dimensional convolutional layers and two fully connected layers). After each hidden layer, a non-linear activation function (ReLU) is applied to the output. To make our training process more efficient, we experimented with several loss functions, among which Focal Loss (FL) [64] provided better supervision of the network. FL was designed specifically to address the issue of class imbalance. It belongs to the cost-sensitive methods and was originally introduced for object detection, where the imbalance between background and salient objects is frequent. FL modifies the cross-entropy loss so that, during training, the neural network incurs a higher cost for wrongly predicting hard training samples.

More precisely, the cross-entropy loss function, which originates from information theory, is among the most common loss functions in deep learning. It is essentially identical to the negative log-likelihood loss function, and for binary classification problems, the binary cross-entropy loss function, denoted by $l_{BCE}$, is as follows:

l_{BCE} (y, \hat{y}) = - (y \log (\hat{y}) + (1 - y) \log (1 - \hat{y}))

(2)

Here $y \in {\{0, 1\}}^{N}$ and $\hat{y} \in {[0, 1]}^{N}$, where N is the number of samples, $\hat{y}$ is the predicted value, and $y$ denotes the ground truth label.

The problem with the cross-entropy loss function is that, in imbalanced classification, the larger class overwhelms the loss by dominating the gradient [65]. Hence, to obtain the Focal Loss function, one can rewrite Equation (2) in terms of the predicted probability $p$ of the class $y = 1$ in the following way:

CE (p, y) = \begin{cases} - \log (p), & \text{if } y = 1 \\ - \log (1 - p), & \text{if } y = 0 \end{cases}

(3)

Let $p_{t}$ denote the predicted probability of the ground truth class, defined as:

p_{t} = \begin{cases} p, & \text{if } y = 1 \\ 1 - p, & \text{if } y = 0 \end{cases}

(4)

Therefore, $l_{BCE}$ can be rewritten and simplified as:

l_{BCE} (p, y) = CE (p_{t}) = - \log (p_{t})

(5)

Finally, FL multiplies the binary cross-entropy loss function by a modulating factor $α_{t} {(1 - p_{t})}^{γ}$, where $γ > 0$ is a tunable focusing parameter and $α_{t}$ is a class-balancing weight, which yields the following equation:

FL (p_{t}) = - α_{t} {(1 - p_{t})}^{γ} \log (p_{t})

(6)
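The effect of the modulating factor can be checked numerically. The following sketch evaluates Equations (5) and (6) for a single sample in plain Python. It assumes that $α_{t}$ equals α for y = 1 and 1 − α for y = 0, which is the convention of the original focal loss formulation; the function names and the example probabilities are illustrative, and the default values α = 0.25 and γ = 2 are the ones reported in Section 4.1.

```python
import math


def binary_cross_entropy(p, y):
    """Equation (5): -log(p_t), where p is the predicted probability of class 1."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Equation (6): -alpha_t * (1 - p_t)^gamma * log(p_t).

    Assumption: alpha_t = alpha for y = 1 and 1 - alpha for y = 0.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.95, 1)   # well-classified positive sample
hard = focal_loss(0.10, 1)   # badly misclassified positive sample

# With gamma = 0 and alpha_t = 1, the focal loss reduces to Equation (5).
reduced = focal_loss(0.30, 1, alpha=1.0, gamma=0.0)
```

For the well-classified sample, the factor (1 − 0.95)² = 0.0025 shrinks the loss almost to zero, while the misclassified sample keeps a factor of (1 − 0.10)² = 0.81, so hard samples dominate the total loss.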

4. Experimental Results

This section comprises the simulation setup, dataset description, dataset split, evaluation metrics, and classification results.

4.1. Simulation Setup

This section includes the implementation details of our proposed methods. The tools used in this paper are listed in Table 1.

Moreover, in our implementation, we used the Adam algorithm to optimize the models’ parameters with a learning rate of 0.001. For the loss function, FL is used with the alpha parameter set to 0.25 and the gamma parameter set to 2. Further, the hyperparameters of the DNN and CNN models and the parameters of the oversampling techniques are described in detail in Table 2, Table 3 and Table 4.

4.2. Dataset Description

In order to examine our proposed methods, we used the KEEL [66] dataset repository, breast cancer, and Z-Alizadeh Sani datasets. As shown in Table 5, these comprise a variety of imbalanced datasets for classification tasks.

In Table 5, the first column indicates the number of attributes of each dataset. The second column gives the total number of samples, i.e., the sum of positive and negative samples. The third column reports the imbalance ratio between the minority (positive) and majority (negative) classes, which is obtained by dividing the number of negative samples by the number of positive samples. As described in Section 3.1.3 (split dataset), each dataset was randomly shuffled and split into training and testing sets, and each model was trained for 2000 epochs. The generated models were trained and evaluated for 100 runs.
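The imbalance ratio described above can be computed directly from the class labels. The snippet below is a small sketch with hypothetical toy labels (90 negative, 10 positive); it does not correspond to any dataset in Table 5, and the function name is an assumption.

```python
def imbalance_ratio(labels, positive=1):
    """Number of negative (majority) samples divided by positive (minority) samples."""
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive samples: the imbalance ratio is undefined")
    return n_neg / n_pos


# Hypothetical toy labels, not one of the datasets used in the paper.
labels = [0] * 90 + [1] * 10
ratio = imbalance_ratio(labels)
```

A ratio of 1 corresponds to a perfectly balanced dataset, and larger values indicate stronger imbalance.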

4.3. Evaluation Metrics

This section elaborates on the metrics used to evaluate the performance of our proposed models. A fundamental tool for classification metrics is the confusion matrix, which shows the number of samples correctly and incorrectly predicted by a classifier. It is usually a table that compares the actual and predicted classes of the samples. Figure 11 depicts such a matrix for a binary classifier. The matrix includes four items, namely True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN), which are defined as follows:

True Positive (TP): The number of samples that belong to the positive class and are correctly predicted as positive by the classifier
True Negative (TN): The number of samples that belong to the negative class and are correctly predicted as negative by the classifier
False Positive (FP): The number of samples that belong to the negative class but are predicted as positive by the classifier
False Negative (FN): The number of samples that belong to the positive class but are predicted as negative by the classifier

Figure 11. Confusion matrix.

Based on Figure 11, the minority and majority classes are denoted as the positive and negative classes, respectively. The confusion matrix is used to obtain performance metrics for the models on the imbalanced datasets. We utilized eight metrics, namely accuracy, precision, recall, F1-score, G-Mean, specificity, AUC-ROC, and Kappa, for evaluating the DNN and CNN models [29,42,67,68,69,70,71].
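The four entries of the confusion matrix can be counted from the actual and predicted labels as in the following sketch. The label vectors are made-up examples, and the function name is an assumption; the positive label stands for the minority class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN), treating `positive` as the minority class."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1   # positive sample predicted as positive
            else:
                fn += 1   # positive sample predicted as negative
        else:
            if predicted == positive:
                fp += 1   # negative sample predicted as positive
            else:
                tn += 1   # negative sample predicted as negative
    return tp, tn, fp, fn


# Made-up labels: three positive and seven negative samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
```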

4.3.1. Accuracy

Accuracy is the ratio of the number of correctly predicted samples to the total number of input samples, as formulated in Equation (7).

Accuracy = \frac{TP + TN}{FP + FN + TP + TN}

(7)

4.3.2. Specificity

The specificity is the proportion of true-negative samples to the overall number of true-negative and false-positive samples. The specificity or True Negative Rate (TNR) of a classifier is calculated using Equation (8).

Specificity = \frac{TN}{TN + FP}

(8)

4.3.3. Recall

Recall, also called sensitivity, is the ratio of correctly predicted positive samples to all relevant samples, meaning the samples that are actually positive. Recall is a significant metric for imbalanced datasets, as it reflects how accurately the positive class has been learned. It is calculated by Equation (9).

Recall = Sensitivity = \frac{TP}{TP + FN}

(9)

4.3.4. G-Mean

The G-mean is used as an accuracy metric because it gauges the accuracy rates of both the majority and minority classes. It is obtained by Equation (10).

G\text{-}Mean = \sqrt{Sensitivity \times Specificity}

(10)
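To see why accuracy alone can be misleading on imbalanced data, the sketch below evaluates Equations (7) to (10) for a hypothetical classifier that labels every sample as negative on a toy set of 90 negative and 10 positive samples. The counts and function names are assumptions made for the example.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Equation (7)."""
    return (tp + tn) / (tp + tn + fp + fn)


def specificity(tn, fp):
    """Equation (8), the true negative rate."""
    return tn / (tn + fp)


def recall(tp, fn):
    """Equation (9), also called sensitivity."""
    return tp / (tp + fn)


def g_mean(tp, tn, fp, fn):
    """Equation (10): geometric mean of sensitivity and specificity."""
    return math.sqrt(recall(tp, fn) * specificity(tn, fp))


# Hypothetical classifier that predicts "negative" for all 100 samples.
tp, tn, fp, fn = 0, 90, 0, 10
acc = accuracy(tp, tn, fp, fn)
gm = g_mean(tp, tn, fp, fn)
```

The accuracy is 0.9 even though no positive sample is detected, whereas the G-mean drops to 0 because the recall is 0.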

4.3.5. Precision

Precision shows how well a classifier performs in terms of predicting positive samples. As Equation (11) shows, it is calculated by dividing the number of true positives by the total number of samples predicted as positive.

Precision = \frac{TP}{TP + FP}

(11)

4.3.6. F1-Score

F1-Score, which is also called F-score or F-measure, is the harmonic mean of recall and precision for a classifier. The closer it is to one, the higher both precision and recall are. F1-Score can be obtained by Equation (12).

F1\text{-}Score = \frac{2 TP}{2 TP + FP + FN}

(12)
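The count-based form of Equation (12) agrees with the harmonic mean of precision and recall, which the following sketch verifies on hypothetical confusion counts. The counts and function names are assumptions.

```python
def precision(tp, fp):
    """Equation (11)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Equation (9)."""
    return tp / (tp + fn)


def f1_score(tp, fp, fn):
    """Equation (12), expressed directly in confusion-matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical counts for the positive (minority) class.
tp, fp, fn = 8, 2, 4
p = precision(tp, fp)
r = recall(tp, fn)
f1 = f1_score(tp, fp, fn)

# Harmonic mean of precision and recall, for comparison with Equation (12).
harmonic_mean = 2 * p * r / (p + r)
```

Because the harmonic mean is pulled towards the smaller of its two arguments, a high F1-Score requires both precision and recall to be high.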

4.3.7. Kappa

The Kappa metric evaluates the obtained classification accuracy relative to the accuracy of a random classification model. It is an important metric that indicates whether the accuracy of the classifier is reliable. The values of Kappa range from −1 to 1. Three reliability levels of Kappa have been used to assess the accuracy, as follows:

Kappa $\geq 0.75$: Robust consistency, highly reliable accuracy.
$0.4 \leq$ Kappa < 0.75: The accuracy has a general (moderate) level of reliability.
Kappa < 0.4: Accuracy is unreliable.

The Kappa formula is specified in Equation (13), where random denotes the accuracy expected from a random classifier.

Kappa = \frac{Accuracy - random}{1 - random}

(13)
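Equation (13) does not spell out how the random term is obtained. The sketch below assumes the standard definition of Cohen's Kappa, in which the chance agreement is computed from the row and column totals of the confusion matrix. The counts and the function name are illustrative.

```python
def cohen_kappa(tp, tn, fp, fn):
    """Equation (13), with `random` taken as the chance agreement of Cohen's Kappa.

    Assumption: chance agreement is derived from the marginal totals,
    i.e. P(pred +) * P(actual +) + P(pred -) * P(actual -).
    """
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:
        raise ValueError("Kappa is undefined when the chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical classifier that predicts "negative" for every sample.
majority_only = cohen_kappa(tp=0, tn=90, fp=0, fn=10)

# Hypothetical classifier that detects most positive samples.
mixed = cohen_kappa(tp=8, tn=85, fp=5, fn=2)
```

Under this definition, the all-negative classifier has an accuracy of 0.9 but a Kappa of 0, so its accuracy falls in the unreliable band.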

4.3.8. AUC-ROC

The AUC-ROC is a crucial measurement for evaluating the performance of the generated classification models. A ROC plot represents the trade-off between the true positive rate and the false positive rate, that is, the relationship between recall and 1 − specificity. Furthermore, the AUC quantifies the separability power of the classifier. The AUC ranges from 0 to 1; the higher the AUC, the better the model is at distinguishing the minority and majority classes.
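One way to compute the AUC without drawing the ROC curve is to use its rank interpretation: the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample, with ties counted as one half. The sketch below implements this pairwise count. The scores are made-up values, and the function name is an assumption.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0      # positive ranked above negative
            elif p == q:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up predicted scores for two negative and two positive samples.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for small test sets; a sort-based computation would be preferable for large ones.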

4.4. Classification Results

In this section, we present our experimental results based on the evaluation metrics, namely accuracy, precision, recall, F1-score, G-Mean, specificity, AUC, and Kappa. The results report the average of each metric on the imbalanced datasets for classification tasks, which are drawn from three sources: the KEEL repository, breast cancer, and Z-Alizadeh Sani.

The results are given in Table 6, Table 7, Table 8, Table 9, Table 10 and Table 11 for six models, namely SMOTE + Normalization (NORM.) + CNN/DNN, TL + NORM. + CNN/DNN, OSS + NORM. + CNN/DNN, NearMiss + NORM. + CNN/DNN, ROS + NORM. + CNN/DNN, and RUS + NORM. + CNN/DNN, respectively. The best results are marked in boldface.

According to the results obtained, the proposed SMOTE + NORM. + CNN model outperforms the other models in terms of the eight metrics on the datasets. This demonstrates that using SMOTE with our CNN model enhances the overall performance.

Moreover, the ROC plots based on the best AUC scores obtained by the models on the datasets are shown in Figure 12a–z. According to Table 6, Table 7, Table 8, Table 9, Table 10 and Table 11 and the ROC plots, the mixed SMOTE-NORM.-CNN model has the best AUC value.

Table 6. SMOTE + NORM. + CNN/DNN.

Dataset	Acc CNN	Acc DNN	Pre CNN	Pre DNN	Rec CNN	Rec DNN	F1 CNN	F1 DNN	G-Mean CNN	G-Mean DNN	Spe CNN	Spe DNN	AUC CNN	AUC DNN	Kap CNN	Kap DNN
ecoli1	99.11	99.38	99.13	99.40	99.11	99.38	99.11	99.38	99.11	99.38	98.83	99.17	99.00	99.25	98.21	98.77
ecoli2	99.60	99.76	99.60	99.77	99.60	99.76	99.60	99.76	99.60	99.76	99.44	99.54	99.03	99.43	99.19	99.53
ecoli3	99.53	99.64	99.54	99.65	99.53	99.64	99.53	99.64	99.53	99.64	99.21	99.39	99.20	99.31	99.06	99.29
ecoli-0_vs_1	99.91	99.83	99.92	99.84	99.91	99.83	99.91	99.83	99.91	99.83	99.93	99.79	99.27	99.10	99.83	99.66
glass0	99.26	99.19	99.27	99.26	99.26	99.19	99.26	99.19	99.26	99.19	99.27	99.00	99.31	99.53	99.27	98.38
glass1	99.31	99.23	99.32	99.25	99.31	99.23	99.31	99.23	99.31	99.23	99.32	99.64	99.25	99.34	99.30	98.46
glass6	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Haberman	95.60	95.44	95.61	96.30	95.60	95.77	95.60	96.03	95.60	96.10	95.61	96.45	97.03	98.05	95.00	96.01
iris0	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	99.99	99.99	100.00	100.00
new-thyroid1	99.95	99.96	99.96	99.96	99.95	99.96	99.60	99.96	99.96	99.96	99.95	99.92	99.90	99.95	99.96	99.92
new-thyroid2	99.95	99.97	99.96	99.97	99.95	99.97	99.95	99.97	99.95	99.97	99.96	99.94	99.38	99.61	99.94	99.94
page-blocks0	99.42	99.39	99.42	99.39	99.42	99.39	99.42	99.39	99.42	99.39	99.38	99.25	99.27	99.28	99.23	98.97
pima	99.35	99.15	99.36	99.15	99.35	99.14	99.35	99.15	99.35	99.21	99.30	99.37	99.81	99.25	99.46	99.03
segment0	99.99	99.99	99.99	99.99	99.99	99.99	99.99	99.99	99.99	99.99	99.99	99.99	99.67	99.31	99.99	99.99
vehicle0	99.95	99.97	99.95	99.97	99.95	99.97	99.95	99.97	99.95	99.97	99.89	99.95	99.83	99.60	99.88	99.95
vehicle1	99.81	99.76	99.81	99.76	99.81	99.76	99.81	99.76	99.81	99.76	99.87	99.67	99.90	99.15	99.90	99.52
vehicle2	99.99	99.99	99.99	99.99	99.99	99.99	99.99	99.99	99.99	99.99	99.99	99.99	99.84	99.57	99.97	99.98
vehicle3	99.85	99.74	99.86	99.75	99.85	99.74	99.85	99.74	99.85	99.74	99.82	99.62	99.80	99.20	99.91	99.48
wisconsin	99.80	99.84	99.81	99.85	99.80	99.84	99.80	99.84	99.80	99.84	99.80	99.75	99.46	99.51	99.64	99.60
yeast1	98.98	98.71	98.98	98.72	98.98	98.71	98.98	98.71	98.98	98.71	98.98	99.81	98.80	98.62	98.67	98.35
yeast3	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	99.90	99.94
yeast-2_vs_4	99.95	99.91	99.96	99.92	99.95	99.91	99.95	99.91	99.95	99.91	99.94	99.92	99.80	99.45	99.73	99.67
penbased	99.99	99.99	99.99	99.99	99.99	99.99	99.99	99.99	99.99	99.99	99.88	99.81	99.98	99.96	99.99	99.99
nursery	88.71	87.42	88.72	87.43	88.71	87.42	88.71	87.42	88.71	87.42	88.62	87.30	90.00	89.92	88.40	87.30
breast cancer	99.67	98.84	99.68	98.85	99.67	98.84	99.67	98.84	99.67	98.84	99.46	98.67	99.42	99.14	99.48	98.62
Z-Alizadeh Sani	98.57	97.91	98.58	97.92	98.57	97.91	98.57	97.91	98.57	97.85	98.42	97.84	99.14	99.02	98.21	98.04
Average	99.08	98.96	99.09	99.00	99.08	98.97	99.09	98.98	99.08	98.98	99.03	98.99	99.08	99.02	98.92	98.78