1 Introduction

In machine learning, imbalanced data is a very common problem when training a classifier. A dataset is imbalanced if the number of training examples in one class differs from the number in the other class. This difference can lead to machine learning algorithms being biased towards predicting unseen samples as members of the majority class, a problem which also affects deep neural networks [1]. The degree of data imbalance can be measured by the imbalance ratio, defined as the number of examples in the majority class divided by the number in the minority class. Even at the same imbalance ratio, the smaller the number of minority class samples, the harder it is for machine learning algorithms to learn an effective distribution and classification boundaries.

In traditional machine learning, imbalanced data has been well-studied [2, 3]. Approaches include oversampling [4], undersampling [3], cost-sensitive learning [5], and ensemble learning [6, 7]. Such methods have also been applied to deep learning [1]. For example, [8] report a GAN-based minority sample synthesis method following the paradigm of oversampling.

Recently, imbalanced dataset processing methods have been designed specifically for deep learning, for example those based on loss functions such as focal loss [9], gradient harmonizing [10], weighted softmax loss [11] and class imbalance loss [12]. Modified optimizers [13], such as DM-SGD [14] and ABSGD [15], are another technical solution. The principle of all these methods is to adjust the gradient so that the deep learning method can cope with the imbalanced dataset. However, they all require data preprocessing or additional hyper-parameters. Additionally, most of them only work in situations in which minority samples are not extremely scarce.

There is therefore a need for an end-to-end neural network method supporting extremely imbalanced datasets, one that does not require data preprocessing or additional hyper-parameters, and that is also compatible with existing neural network systems. We propose the Batch Balance Wrapper framework (BBW) to meet this need: it is not sensitive to the imbalance ratio, it learns well, and it is extremely fast. In addition, it can handle situations in which there are only a few minority samples in the dataset.

In BBW, stratified sampling is first used to divide the training and test sets of an imbalanced dataset. Then, two new layers are added to the start of an existing neural network. The first layer performs an adaptive normalization of the input, improving the expressiveness of the minority sample distribution. The second layer diversifies the normalized samples, alleviating overfitting of the minority class. Finally, the neural network, complete with the two added layers, is trained using a Batch Balance (BB) algorithm which samples training data in such a way that data items in each batch are always balanced during the learning process. In this way, the learning method is not biased towards a certain class.

We also propose the evaluation metrics Valuable Convergence Ratio (VCR) and Average False-positive rate of Valuable Convergence (AFVC). These measures are intended to evaluate neural networks on an imbalanced dataset with few minority samples by using the ‘leave-one-out’ method for cross-validation. Here, the leave-one-out method is defined so that it only applies to minority samples.

We tested BBW on three imbalanced binary datasets with few minority samples: the CHB-MIT Scalp EEG Dataset (CHB-MIT) [16], the resampled University of Bonn EEG time series dataset (BonnEEG) [17], and the resampled First Affiliated Hospital of Xi’an Jiao Tong University Tuberculosis Chest Radiograph Dataset (FAHXJU) [18]. The maximum imbalance ratio reaches 1167:1 with only 16 positive samples (CHB-MIT), and 200:1 with just 2 positive samples (BonnEEG).

We carried out six experiments, all of which use the DenseNet121 architecture [19] as a baseline model. First, we demonstrated the feasibility of the proposed approach by training the baseline both with and without BBW. Second, we compared BBW to the baseline using the proposed modified leave-one-out method for cross-validation. Third, BBW was compared to six existing approaches for training with imbalanced data (downsampling [3], oversampling [4], class weights [1], focal loss [9], weighted softmax loss (WSL) [11], and class imbalance loss (CIL) [12]). Fourth, we carried out an ablation study to show the contribution of different parts of the BBW framework. Fifth, we re-ran the second experiment using the normal definition of epoch. Sixth and finally, we measured the performance of the baseline, this time adopting the normal epoch definition and constraining learning counts to those used by BBW.

Experiment 1 illustrated the improved learning behavior of the BBW-wrapped neural network in comparison to the baseline. Experiment 2 showed that BBW attained 14-40% higher VCR and 9-15% lower AFVC. Experiment 3 found that BBW was a better approach than downsampling, oversampling, class weights, focal loss, weighted softmax loss, and class imbalance loss. Experiment 4 demonstrated by ablation that each component of BBW improves the results, and that the overall BBW is better than its individual parts. Experiments 5 and 6 suggested that BBW is 16.39 times faster while also delivering better performance.

In this work, our primary contributions are:

  • We propose the Batch Balance Wrapper Framework (BBW) for adapting general DNNs, allowing them to be well trained from extremely imbalanced datasets with few minority samples.

  • We propose the input adaptive normalization layer, sample diversification layer, and batch balance strategy, which perform the mechanisms of trainable normalization, sample dynamic synthesis, and sample dynamic balance. They can improve the expressiveness of the sample distribution of minority samples and alleviate the overfitting of the minority class.

  • BBW is not sensitive to imbalance ratio, and is extremely fast. It does not require data preprocessing, additional hyper-parameters, or even data normalization.

  • We propose the evaluation measures Valuable Convergence Ratio (VCR) and Average False-positive-rate of Valuable Convergence (AFVC). Combined with the proposed minority-class-only leave-one-out cross validation, they can fairly evaluate neural networks on an imbalanced dataset with few minority samples.

  • Using three different datasets, with an imbalance ratio as high as 1167:1 (CHB-MIT) and as few as 2 positive examples (BonnEEG), we demonstrate that BBW achieves better performance compared to existing methods (downsampling, oversampling, class weights, focal loss, weighted softmax loss, and class imbalance loss).

2 Related work

Processing methods for imbalanced datasets have been well-studied in traditional machine learning. These methods can be divided into two main groups: dataset preprocessing-based methods and algorithm modification-based methods [20, 21]. The main idea of dataset preprocessing-based methods, such as oversampling and downsampling, is to preprocess the dataset to alleviate its imbalance. Zhang et al. [3] propose several multi-class imbalance learning methods. The oversampling method attempts to create more minority samples to alleviate the problem [4]. The simplest implementation is to randomly repeat minority samples until the dataset reaches an acceptable imbalance ratio. The downsampling method attempts to drop majority samples in order to balance the dataset. A more sophisticated approach is to delete majority samples which are far away from the classification boundary. An interesting development of the resampling idea is the SMOTE method [22] and its variants [23, 24], which create plausible samples based on clustering and interpolation. Sun et al. [25] propose an imbalanced classification algorithm combining SMOTE and Support Vector Machines, which embeds SMOTE into the iteration of ADASVM-TW to synthesize samples during processing. Following the oversampling paradigm, Zhou et al. [8] propose a sample synthesis method based on generative adversarial networks to augment minority samples of original imbalanced datasets.

Approaches based on algorithm modification attempt to improve the classification method to allow it to support imbalanced datasets, such as cost-sensitive learning [5] and the ensemble method [6, 7]. Cost-sensitive learning assigns different costs to the different misclassifications, thereby adjusting the classification results by minimizing the total cost. The ensemble method in imbalanced dataset processing divides the original dataset into a series of balanced subdatasets, then assembles all the sub-classifiers to boost the final result [26, 27]. Hayashi et al. [28] report an imbalanced learning algorithm focusing on the main class, based on a cluster-based zero-shot classifier. There are also some classification methods that are inherently insensitive to imbalance ratio, such as the decision tree method [29].

The above methods are designed for traditional machine learning, but they can also be used in deep learning. Buda et al. [1] report the effect of applying traditional imbalanced processing methods to deep neural networks, such as the paradigms of oversampling, downsampling, cost-sensitive learning, ensemble learning, etc. Taherkhani et al. [30] propose a transfer learning based multi-class imbalanced classification method by combining an adaptive boosting algorithm and neural networks. Pérez-Hernández et al. [31] describe binarization techniques on neural networks, which convert a multi-class task into several binary tasks to reduce multi-class imbalance problems. In recent years, some interesting imbalanced dataset processing methods specifically for deep learning have been developed. The methods can be categorized into two main groups, loss-based methods and optimizer-based methods. The principle of both categories is to adjust the gradient to let the deep learning method support imbalanced datasets. The loss-based method assigns different weights to each class/sample in order to adjust the loss. In recent work, this is the most widely followed method for neural networks supporting imbalanced data. The most successful instantiation of this idea is focal loss as proposed by [9], which can automatically decide the weights of each sample via predicted probability. Following this, Li et al. [10] propose the Gradient Harmonizing mechanism, which can adjust the gradient using gradient density. Jia et al. [11] propose weighted softmax loss, adaptively parameterized by maximum multi-class imbalance ratio. Zhang et al. [12] devise class imbalance loss to improve the cross-entropy loss on imbalanced datasets.

Optimizer-based methods modify neural networks in order to support imbalanced datasets. Zhang et al. [14] propose DM-SGD, supporting imbalanced datasets by actively selecting samples. Qi et al. [15] develop ABSGD, supporting imbalanced classification by weighting gradients based on an attention mechanism.

In conclusion, the above-mentioned methods all require data preprocessing or additional hyper-parameters, which creates the need for extra computation or parameter tuning. There is a need for an end-to-end neural network method supporting imbalanced datasets, one that does not require data preprocessing or additional hyper-parameters, and is compatible with all existing neural network systems. The method we propose meets these needs, and is described next.

3 Methodology

In this section, we will explain the proposed Batch Balance Wrapper framework (BBW). This allows deep neural networks to learn from extremely imbalanced datasets. BBW is designed to work with the Batch Balance algorithm (BB) and stabilize it. The framework is illustrated in Fig. 1.

Fig. 1 Batch balance wrapper framework

As shown in Fig. 1, BBW has five parts that are organized in logical order. The first part of BBW is the stratified sampling method, which we use to divide the data into the training set and test set. After this, two new neural network layers are added to the start of the existing DNN. The first of these added neural network layers is the input adaptive normalization layer. The second is the sample diversity layer. After this comes the pre-existing classifier DNN whose training performance on imbalanced data we wish to improve. In our experiments, we use DenseNet121 [19] as an example, but it can be replaced with any DNN which is a classifier. The final part of BBW is BB, which makes sure the samples in each batch are always balanced during the learning process. In the following subsections, we will show the details and implementation of each part of the framework.

3.1 Stratified sampling based dataset division

In a highly imbalanced dataset with a small number of minority samples, random sampling, which is normally used to divide a balanced dataset into training and test sets, can leave the test set with very few minority samples, possibly even none. This makes any evaluation metric computed on the test set invalid. Therefore, we use stratified sampling for the division, so that the test set maintains the imbalance ratio of the original dataset. This makes the evaluation metric more credible, and avoids a situation in which it cannot be calculated from the test set at all.

In the implementation, we first shuffle the samples of each class. Then, we divide the shuffled samples into training subsets and test subsets using a fixed ratio, e.g. 7:3. Finally, we concatenate the samples of classes in the training subsets or test subsets to obtain a training set and a test set, and then shuffle them again.
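As an illustration, the split can be sketched as follows (a minimal NumPy version; the array names and the 70/30 ratio are illustrative assumptions, not part of the framework):

```python
import numpy as np

def stratified_split(samples, labels, train_ratio=0.7, seed=0):
    """Per-class shuffle-and-split so the test set keeps the original imbalance ratio."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)                      # shuffle the samples of this class
        cut = int(len(idx) * train_ratio)     # fixed ratio, e.g. 7:3
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    # concatenate the per-class subsets, then shuffle again
    train_idx = rng.permutation(train_idx)
    test_idx = rng.permutation(test_idx)
    return (samples[train_idx], labels[train_idx],
            samples[test_idx], labels[test_idx])
```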

It should be noted that if the number of minority samples is too small to divide, the leave-one-out method should be used to select the minority sample for the test set, and the imbalance ratio of the original dataset should be used to determine the number of majority samples in the test set. In this situation, the majority samples should be randomly selected from the original dataset, and the rest assigned to the training set. These changes ensure that the training set has enough minority samples for training. Additionally, compared with the standard leave-one-out procedure, this approach does not require as many models to be trained and evaluated. The details of this modified leave-one-out method are described in Algorithm 2 in Section 5.

3.2 Input adaptive normalization

In traditional input normalization, sufficient training data is always required, and it is assumed that this training data correctly expresses the real distribution of the test data. Therefore, the parameters of the normalization method can be determined from the training set and applied directly to the test samples. However, in an extremely imbalanced dataset, the number of minority samples may be too small to express the real distribution of the test data, even when the dataset as a whole is very large. If the parameters determined from the training set are used directly, samples in the test set will not be correctly normalized when their values are too large, too small, or follow a different statistical profile from the training set. In such cases, those inputs may produce an unrepresentative output.

Therefore, our intuition is that if the input normalization process can be trained together with the neural network, the whole system will be more robust in cases where the minority samples cannot express the correct distribution. We refer to this as input adaptive normalization, because the neural network can simultaneously ‘think about’ the input normalization and feature extraction together. We believe that neural networks can better adapt to outliers in the testing set by using the normalization process and forward process together, even though the number of minority samples is too few to express the correct distribution.

To implement this process, a parameterizable normalization method [32] is used as the first layer of the neural network. Equations (1) to (3) show the principle of input adaptive normalization, where xi denotes the input of the adaptive normalization layer, μ is the mean of the inputs, σ2 is the variance, w and b are the trainable parameters of the linear transformation, and yi denotes the output of the layer. The process can be described as normalizing x to zero mean and unit variance, then applying the trainable linear transformation. Therefore, we can use the input adaptive normalization layer to normalize the input of the neural network, and allow its trainable parameters to learn the distribution of the input data. Most importantly, those parameters are trained together with the neural network.

$$ \mu = \frac{1}{m}\sum\limits^{m}_{i=1}x_{i} $$
(1)
$$ \sigma^{2} = \frac{1}{m}\sum\limits^{m}_{i=1}(x_{i} - \mu)^{2} $$
(2)
$$ y_{i} = w\frac{x_{i} - \mu}{\sqrt{\sigma^{2} + \epsilon}} + b $$
(3)

As an alternative to (1) to (3), other trainable normalization methods could also be considered. However, this approach works well with our implementation.
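For concreteness, the following Keras-style sketch transcribes (1) to (3) literally as a custom first layer. The epsilon value and the choice of one (w, b) pair per input feature are illustrative assumptions; in practice a built-in trainable normalization layer (e.g. batch normalization, which also tracks moving statistics for use at inference time) can play the same role.

```python
import tensorflow as tf

class InputAdaptiveNormalization(tf.keras.layers.Layer):
    """Trainable input normalization (Eqs. 1-3): normalize to zero mean and unit
    variance over the batch, then apply a learned linear transform y = w * x_hat + b."""

    def __init__(self, epsilon=1e-3, **kwargs):
        super().__init__(**kwargs)
        self.epsilon = epsilon

    def build(self, input_shape):
        # one trainable (w, b) pair per input feature, learned with the network
        self.w = self.add_weight(name="w", shape=(input_shape[-1],),
                                 initializer="ones", trainable=True)
        self.b = self.add_weight(name="b", shape=(input_shape[-1],),
                                 initializer="zeros", trainable=True)

    def call(self, x):
        mu = tf.reduce_mean(x, axis=0, keepdims=True)                    # Eq. (1)
        var = tf.reduce_mean(tf.square(x - mu), axis=0, keepdims=True)   # Eq. (2)
        x_hat = (x - mu) / tf.sqrt(var + self.epsilon)
        return self.w * x_hat + self.b                                   # Eq. (3)
```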

3.3 Sample diversification

In BB, the minority samples will be used many more times than the majority samples throughout the entire training process. This will cause the neural network to overfit the minority class before the majority class is properly learned. It can also cause the loss to be insensitive to the minority class, because the majority class, which is reused fewer times, can always provide a larger loss. Therefore, our idea is to apply a random transformation to the reused input data, to ensure that the samples are not exactly the same each time they are fed into the neural network. Hence, we use a noise function as the second layer of the network to perform a random transformation of the adaptively normalized input data, which is why we refer to it as the sample diversification layer.

In the implementation, we select the Gaussian noise function as the sample diversification function, which adds zero-mean, unit-variance Gaussian noise to the normalized input data. Equation (4) shows this principle, where X is the adaptively normalized input data, \(X^{\prime }\) is the noised normalized input data, and R is a random tensor that obeys the Gaussian distribution.

$$ X^{\prime} \leftarrow X + R, \quad \text{where} \quad R \sim \mathcal{N}(0, 1) $$
(4)

It should be noted that if the noise is too complex, it will inhibit the neural network’s ability to learn the patterns in the original data. Similarly, if the noise is too simple, such that it is easy for the neural network to find the patterns of the noise, the noise layer will lose the ability to diversify samples. We believe that the best choice could be for the noise to be changed over time, because it would then be hard for the neural network to find such patterns. However, in this work, the Gaussian noise is sufficient for our implementation.
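As a sketch of how (4) can be realized in a Keras setting, the built-in GaussianNoise layer adds zero-mean noise only during training, so reused minority samples differ from batch to batch while test-time inputs pass through unchanged; the unit standard deviation mirrors (4).

```python
import tensorflow as tf

# Sample diversification layer (Eq. 4): add zero-mean, unit-variance Gaussian noise
# to the adaptively normalized input. The layer is only active during training.
diversify = tf.keras.layers.GaussianNoise(stddev=1.0, name="sample_diversification")
# usage inside a functional model: x = diversify(normalized_inputs)
```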

3.4 Batch balance

We consider that there are two problems that restrict the training of neural networks on extremely imbalanced datasets. The first is the invalid sampling problem, namely the probability that a batched training set contains no minority sample at all. The second is that the majority samples in the training set can always provide a much larger total loss than the minority samples, because majority samples greatly outnumber minority samples. Our solution is to keep the samples balanced in the batched training set, so that the loss is fair across classes. BB achieves this, while the side effects of forcing the data to be balanced are handled by the rest of the BBW framework. A batch, here, is defined as the sample set that is fed to a deep neural network in one gradient-updating iteration.

Algorithm 1 shows the process of BB. The algorithm works on one ‘epoch’. Normally, an epoch is defined as a stage of training in which all training samples are fed to a neural network. However, in BB, we define one epoch as a stage of training in which all minority training samples are fed to the network; we use this definition because, in Algorithm 1, the number of majority samples used per epoch is determined by the number of minority samples.

Algorithm 1

The meaning of the input and output variables in Algorithm 1 can be interpreted literally. The algorithm is defined on an epoch and returns a model trained on the current epoch. In one batch of an epoch, we first randomly sample \(\frac {batchSize}{2}\) samples from the minority training set without replacement to obtain the batched minority training subset. Then, we randomly sample \(\frac {batchSize}{2}\) samples from the majority training set to obtain the batched majority training subset. Next, we concatenate the batched minority training subset and batched majority training subset together to obtain the batched training set. Then, we shuffle the batched training set, and find the corresponding label of each sample. Finally, the shuffled batched training set is fed to the neural network to finish the training process of the current batch.

Something else of note in Algorithm 1 is that the number of minority samples in the training set should be divisible by \(\frac {batchSize}{2}\). The algorithm is stated for a binary classification task; if we want BB to work with a multi-class classification task, we simply sample data from every class and change \(\frac {batchSize}{2}\) to \(\frac {batchSize}{numberOfClasses}\).
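A sketch of the batching loop of Algorithm 1 for the binary case is shown below. The generator interface, array names, and the choice to draw the majority subset without replacement within each batch are illustrative assumptions; the minority count is assumed to be divisible by batchSize/2, as noted above.

```python
import numpy as np

def batch_balance_epoch(x_min, y_min, x_maj, y_maj, batch_size, rng):
    """One BB 'epoch': every minority sample is used exactly once, and every batch
    is half minority, half majority (Algorithm 1, binary classification)."""
    half = batch_size // 2
    min_order = rng.permutation(len(x_min))      # minority sampled without replacement
    for start in range(0, len(min_order), half):
        min_idx = min_order[start:start + half]
        maj_idx = rng.choice(len(x_maj), size=half, replace=False)  # fresh majority subset
        xb = np.concatenate([x_min[min_idx], x_maj[maj_idx]])
        yb = np.concatenate([y_min[min_idx], y_maj[maj_idx]])
        order = rng.permutation(len(xb))         # shuffle the batched training set
        yield xb[order], yb[order]

# e.g. for xb, yb in batch_balance_epoch(x_min, y_min, x_maj, y_maj, 32, np.random.default_rng(0)):
#          model.train_on_batch(xb, yb)
```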

Finally, we will outline the BBW framework as a whole (Fig. 1). In BB, we can deduce that the minority samples will be reused many more times than the majority samples before the training processes converge. Therefore, BBW is designed to stabilize the training process. From the perspective of the minority class, BBW looks like an upsampling technique. However, from the perspective of the majority class, BBW is a downsampling one. The diversification layer is responsible for sample synthesis, but the BB process carries out class-based downsampling. The input adaptive normalization layer normalizes the input, and cooperates with the feature extracting process. BBW is thus a combination of components which are able to cooperate with each other. It is designed for an extremely imbalanced dataset.
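To make the wiring concrete, the sketch below wraps an existing Keras classifier with the two added layers, using the built-in BatchNormalization and GaussianNoise layers as stand-ins for the two layers described above; DenseNet121 is used as the example backbone, as in our experiments, while the input shape, optimizer, and loss are illustrative assumptions.

```python
import tensorflow as tf

def wrap_with_bbw(input_shape, num_classes=2):
    """Prepend the input adaptive normalization and sample diversification layers
    to an existing classifier DNN (DenseNet121 shown as the example backbone)."""
    inputs = tf.keras.Input(shape=input_shape)
    x = tf.keras.layers.BatchNormalization(name="input_adaptive_norm")(inputs)  # Eqs. (1)-(3)
    x = tf.keras.layers.GaussianNoise(1.0, name="sample_diversification")(x)    # Eq. (4)
    backbone = tf.keras.applications.DenseNet121(
        include_top=True, weights=None, input_tensor=x, classes=num_classes)
    model = tf.keras.Model(inputs=inputs, outputs=backbone.output, name="bbw_densenet121")
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

# e.g. model = wrap_with_bbw((64, 64, 23))  # hypothetical (time, frequency, channel) shape
```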

4 Datasets

4.1 CHB-MIT scalp EEG dataset (CHB-MIT)

The CHB-MIT Scalp EEG Dataset [16] is an electroencephalogram (EEG) dataset with the task of detecting the occurrence of epilepsy from an EEG signal. This dataset was chosen to evaluate BBW because the multichannel brainwave data is imbalanced, and in particular the minority samples are extremely scarce. Another benefit is that the patient specificity of EEG is significant, which allows BBW to be tested on a variety of learning patterns and lets us analyze the principles behind BBW in depth. Figure 2 shows the schematic diagram of CHB-MIT. One blue wavy line represents the brainwave of one channel recorded against time, and there are 23 channels. The time interval marked in red represents the occurrence of epilepsy during this time. The \({S^{S}_{n}} - {S^{E}_{n}}\) pair in the figure represents the start and end points of the seizure.

Fig. 2 CHB-MIT dataset schematic

There are 24 cases in this dataset, but cases 12 and 13 were excluded because they frequently changed their channel definition during the recording. In each case, there are dozens of hours of EEG data, but only a few minutes of seizure onset EEG data. The dataset is thus extremely imbalanced and has few minority samples, making it an ideal one on which to test BBW.

This seizure detection task is a typical classification problem, but the data is continuous. To segment the data into fixed lengths, we used the non-overlapping sliding window method to cut the continuous data, as depicted in Fig. 3. tW is the sliding window size, and Wn represents the data fragment denoted by window n. We labeled a data fragment True if it contained any seizure within its time window, and False otherwise.

Fig. 3 Non-overlapping sliding window

In this work, we set tW to 30 seconds. Statistical information about the fragments after data selection is shown in Table 1. There are thousands of samples, but few of them are positive. The imbalance ratio can be as high as 1167.31:1, and there are as few as 8 positive samples. This dataset is therefore extremely imbalanced and has few minority samples.
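A minimal sketch of this segmentation and labeling step is shown below (NumPy); the function and argument names are ours, the sampling rate fs is assumed to be known, and the seizure intervals are assumed to be given in seconds.

```python
import numpy as np

def segment_recording(eeg, seizure_intervals, fs, t_w=30):
    """Cut a continuous multichannel recording (channels x samples) into non-overlapping
    t_w-second windows; label a window True if it overlaps any (start, end) seizure interval."""
    win = int(t_w * fs)
    fragments, labels = [], []
    for start in range(0, eeg.shape[1] - win + 1, win):
        end = start + win
        fragments.append(eeg[:, start:end])
        t0, t1 = start / fs, end / fs
        labels.append(any(s < t1 and e > t0 for s, e in seizure_intervals))
    return np.stack(fragments), np.array(labels)
```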

Table 1 Data fragment statistical information for each case

BBW is compatible with all types of neural networks, and we use a convolutional neural network (CNN) for this task. To feed the data fragments to the CNN, we must express them in matrix or tensor form. A data fragment is already a matrix with a time and channel axis, but it only shows the time and spatial domain information. To expose the frequency domain information that is hidden in the data fragment, we used the short-time Fourier transform (STFT) to calculate the spectrogram of each channel in the data fragment, as shown in Fig. 4. By stacking the spectrograms of all channels, we convert the data fragment into a 3D tensor whose three dimensions are time, frequency, and channel. It can express the time domain, frequency domain, and spatial domain (channel) information at the same time, and we believe this representation is clear enough for the CNN to learn its features.

Fig. 4 Using a short-time Fourier transform to expose frequency domain information
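A sketch of this conversion using scipy.signal.stft is given below; the STFT window length (nperseg) is an illustrative assumption.

```python
import numpy as np
from scipy.signal import stft

def fragment_to_tensor(fragment, fs, nperseg=256):
    """Convert one data fragment (channels x samples) into a (time, frequency, channel)
    tensor by stacking the per-channel spectrogram magnitudes."""
    spectrograms = []
    for channel in fragment:                    # one STFT per EEG channel
        _, _, zxx = stft(channel, fs=fs, nperseg=nperseg)
        spectrograms.append(np.abs(zxx).T)      # (frequency, time) -> (time, frequency)
    return np.stack(spectrograms, axis=-1)      # -> (time, frequency, channel)
```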

4.2 Bonn EEG Dataset (BonnEEG)

The University of Bonn EEG time series dataset (BonnEEG) was released by [17] for EEG-based epilepsy detection. It contains 500 EEG samples, labeled with five subject statuses. The EEG data were collected from five healthy human subjects and five human subjects with epilepsy. All EEG segments are single-channel signals with an acquisition duration of 23.6 seconds, 4,097 sampling points, and a sampling rate of 173.61 Hz. The epilepsy detection task is abstracted as a binary classification task. To emphasize extreme imbalance with few minority samples, the dataset was resampled to only 2 minority samples (epilepsy) and 400 majority samples (normal).

4.3 First affiliated hospital of Xi’an Jiao Tong University Tuberculosis chest radiograph dataset (FAHXJU)

The FAHXJU dataset was collected at the First Affiliated Hospital of Xi’an Jiao Tong University [18]. It contains 1,403 chest radiographs, labeled by the types of tuberculosis. The task is to classify the two types of tuberculosis from the chest radiograph, which is a binary classification problem. There are 1,345 majority samples (cavity) and only 58 minority samples (exudation) in the dataset.

5 Evaluation

In extremely imbalanced datasets with few minority samples, those samples are too few to evaluate the performance of the method. In this situation, the leave-one-out method is the only appropriate method to conduct the cross-validation. However, this method is very time consuming, especially in a situation in which the total number of samples is very large. In addition, it is hard to calculate evaluation metrics such as F-measure, P-R curve, ROC curve, and AUC for extremely imbalanced data when using the leave-one-out method.

Our solution is to only use the leave-one-out method for the minority class, and to randomly sample the majority samples according to the imbalance ratio. Then, the leave-one-out positive sample and randomly sampled negative sample set are concatenated to obtain the test set. Finally, we calculate the average evaluation score of all leave-one-out models, as in the normal leave-one-out process. The detailed core process is shown in Algorithm 2. leaveOneOutPositiveSample is the minority sample which we ‘left out’.

Algorithm 2
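The core of this split can be sketched as follows (NumPy, binary case); the array names are ours, and an integer imbalance ratio is assumed for simplicity.

```python
import numpy as np

def modified_leave_one_out(x_min, x_maj, rng):
    """Yield (train, test) splits: each minority sample is 'left out' in turn as the only
    positive test sample; negatives are drawn at random so the test set keeps the original
    imbalance ratio; everything else forms the training set (cf. Algorithm 2)."""
    ratio = len(x_maj) // len(x_min)                  # imbalance ratio of the dataset
    for i in range(len(x_min)):
        test_pos = x_min[i:i + 1]                     # leaveOneOutPositiveSample
        neg_idx = rng.choice(len(x_maj), size=ratio, replace=False)
        train_min = np.delete(x_min, i, axis=0)
        train_maj = np.delete(x_maj, neg_idx, axis=0)
        yield (train_min, train_maj), (test_pos, x_maj[neg_idx])
```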

The evaluation measures we chose were True Positive Rate (TPR) and False Positive Rate (FPR). The calculations are shown in (5) and (6), where TP is the number of true positive examples, P is the total number of positive examples, FP is the number of false positive examples, and N is the total number of negative examples.

$$ TPR = \frac{TP}{P} $$
(5)
$$ FPR = \frac{FP}{N} $$
(6)

Although it is easy to calculate the average TPR and FPR over all leave-one-out models, a flaw concerning the stop criterion of this method is that TPR = 1 can always appear at some point during training, which may bias our evaluation: stopping as soon as TPR = 1 will improperly pull up the FPR if TPR = 1 only occurs in the earlier epochs. In this situation, instead of triggering the stop criterion, we prefer to consider that the model cannot correctly predict the leave-one-out sample, and that the neural network has not converged at all. Therefore, we define a filter to handle this situation, which we refer to as Valuable Convergence. A valuable convergence must satisfy TPR = 1 and FPR < 0.5. The condition TPR = 1 means the model must correctly predict the leave-one-out sample, which is the only positive sample in the test set; FPR < 0.5 means the predictive ability for negative samples in the test set is at least better than random prediction.

Next, we use the Valuable Convergence Ratio (VCR) and Average FPR of Valuable Convergence (AFVC) to evaluate the performance of the learning method. The formulas for VCR and AFVC are shown in (7) and (8), where Nvc is the number of valuable convergences, Nlm is the number of leave-one-out models, Nms is the number of minority samples in the dataset, and each FPRvc is the FPR value of a valuable convergence. This evaluation method is suitable for datasets that are extremely imbalanced and have only very few minority samples.

$$ VCR = \frac{N_{vc}}{N_{lm}} = \frac{N_{vc}}{N_{ms}} $$
(7)
$$ AFVC = \frac{\sum FPR_{vc}}{N_{vc}} $$
(8)

In general, the VCR and AFVC measures are defined to evaluate models fairly on such a dataset with few minority samples when using the leave-one-out method. Furthermore, the leave-one-out method is modified to only apply to minority samples.
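The two measures reduce to a small calculation over the per-model (TPR, FPR) pairs, sketched below; the list-of-pairs interface is an illustrative assumption.

```python
def vcr_afvc(results):
    """Compute VCR and AFVC (Eqs. 7-8) from the (TPR, FPR) pair of each
    minority-class-only leave-one-out model."""
    valuable = [fpr for tpr, fpr in results if tpr == 1.0 and fpr < 0.5]
    vcr = len(valuable) / len(results)                     # N_vc / N_lm
    afvc = sum(valuable) / len(valuable) if valuable else float("nan")
    return vcr, afvc

# e.g. vcr_afvc([(1.0, 0.12), (1.0, 0.61), (0.0, 0.05)]) -> (0.333..., 0.12)
```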

6 Experiments

6.1 Outline

We carried out six experiments to demonstrate the effectiveness of our proposed BBW framework for learning effectively with extremely imbalanced data containing few minority samples. Experiment 1 compares the learning behavior of the DenseNet121 [19] neural network architecture when used with BBW and without. Experiment 2 follows the same setting, but the test set is created using the modified leave-one-out method proposed in Section 5. Experiment 3 measures the performance of six previous methods for dealing with imbalanced data containing few minority samples. Each is trained in an identical setting so that results may be compared directly with that of BBW in Experiment 2. Experiment 4 is an ablation study to show the effect on training of components within BBW. Experiment 5 repeats the setting of Experiment 2 but uses a different definition of epoch. Finally, Experiment 6 measures the performance of DenseNet121, once again using the standard definition of epoch, but this time restricted to the same learning counts as BBW in Experiment 2.

6.2 Experiment 1 - verification of BBW learning

The aim was to test the learning abilities of BBW by comparing it with a baseline implementation, in order to demonstrate the differences in learning behavior between them.

Only the CHB-MIT dataset was used for this experiment. We set both the training and test sets to the whole dataset, then calculated the True Positive Rate (TPR) and False Positive Rate (FPR) on the test set to observe the learning ability of each learning method. We did not use the TPR and FPR calculated on the training set because some neural network structures behave differently during training and testing, such as the noise layer and dropout layer, which are only active during training. By setting the training set and test set to the same dataset, we obtain a more objective TPR and FPR to describe the learning ability of each learning method.

The basic architecture used for the experiment is a modified DenseNet121 [19], an important convolutional neural network backbone for automatic feature extraction. This is then wrapped by the BBW framework. As a baseline for comparison, the DenseNet121 network is used alone, with no BBW wrapping. For the test to be fair, the baseline model used the same batch size and number of epochs as the BBW version. We also normalized the input data with a min-max method for DenseNet121 alone, even though the wrapped version did not require this step.

The resulting learning ability for Case 1 alone (the first case in the dataset) is shown in Fig. 5. In the figure, we can observe that the TPR and FPR of both models fluctuate up and down in the early stage. However, from about the 40th epoch, the TPR and FPR of the baseline DenseNet121 both go down with occasional fluctuations, which indicates that the majority class has begun to overwhelm the minority class. In other words, the total loss of the majority samples is much larger than the total loss of the minority samples, which results in the neural network preferring the majority class in order to keep the total loss of each batch low. By contrast, the TPR of the BBW-wrapped model gradually approaches 1, and its FPR gradually approaches 0. This demonstrates the good learning ability of BBW.

Fig. 5 Experiment 1: Learning ability (Case 1 from CHB-MIT dataset shown)

6.3 Experiment 2 - BBW performance using leave-one-out

The aim was to demonstrate the better performance of BBW when compared to the baseline. However, in this case the test set was created using the modified leave-one-out method discussed earlier.

All three datasets were used: CHB-MIT, BonnEEG and FAHXJU. This time, performance metrics were VCR, AFVC (both discussed earlier), and Time.

The general setup was very similar to that of Experiment 1, i.e. DenseNet121 wrapped with BBW was compared with DenseNet121 alone. However, the test set in this experiment was created by the modified leave-one-out method proposed in Section 5.

The results can be seen in Table 2. The results of BBW are much better than those of the baseline on all three datasets. This is as expected, since Experiment 1 showed that a neural network wrapped by BBW has better learning ability on extremely imbalanced datasets with few minority samples.

Table 2 Experiment 2: BBW and baseline DenseNet121 compared - All datasets

More detailed results for CHB-MIT can be seen in Table 3, itemized by case. The figures given are the average performances of the leave-one-out models in each case.

Table 3 Experiment 2: BBW and baseline DenseNet121 compared - CHB-MIT dataset

We can observe in the table that almost all of BBW’s results are much better than those of the baseline, except for Case 19 at epoch 300. This might be because the pattern in Case 19 is simpler, leading to over-fitting of our method. We know from Fig. 5 that the learning curve of the baseline is more unstable, and such unpredictable fluctuations might have produced the better result for the Experiment 2 baseline model in Case 19. Overall, the average VCR of BBW is 14-40% higher than the baseline, and the average AFVC of our method is 9-15% lower.

We can also observe in Table 3 that, for the baseline model, the results of some cases improve steadily as the number of epochs increases. Indeed, the average performance of the baseline model becomes better as the epoch count grows. This happens when the patterns in the data are not very difficult for a network to learn, and the performance then depends on the imbalance ratio of the dataset: the larger the imbalance ratio, the more epochs are needed. If the patterns are not easy for the underlying neural network, the performance will not improve even if the number of epochs is increased; this can be observed in Cases 16, 17, etc. However, the performance of BBW is independent of the imbalance ratio; in other words, BBW is insensitive to the ratio. Additionally, BBW converges more quickly, and is more stable and more robust.

6.4 Experiment 3 - comparison of BBW with existing methods

The aim was to compare the performance of BBW with six popular imbalanced data processing methods, using the same datasets and the same underlying model: downsampling [3], oversampling [4], class weights [1], focal loss [9], weighted softmax loss (WSL) [11], and class imbalance loss (CIL) [12].

All three datasets were used: CHB-MIT, BonnEEG and FAHXJU. The performance metrics were VCR, AFVC, Precision, and F1.

The six methods were implemented by adapting standard code and then trained using the DenseNet121 model. All settings were the same as Experiment 2.

Results across all three datasets are shown in Table 4 for epoch 100; we compare them with the results for BBW in Table 2. Table 4 indicates that downsampling is powerless in this situation, as expected, because too few samples remain for training after downsampling. Class weights, focal loss, WSL, and CIL even had the opposite effect compared with the baseline in Experiment 2, because they are not designed for this extremely imbalanced situation. Oversampling is a challenging competitor: it does improve the performance compared with the baseline in Experiment 2, but there is still a large performance gap compared with BBW. These results do not prove that these popular methods do not work; they only show that the methods cannot work well on extremely imbalanced datasets with few minority samples, because they are not designed for such an extreme situation. Our method works well precisely because it is specially designed for such datasets.

Table 4 Experiment 3 - BBW and previous methods compared

6.5 Experiment 4 - Ablation study on BBW

The aim was to measure the performance of BBW when certain components of it are removed, in order to demonstrate their individual contributions.

The CHB-MIT dataset was used, and the performance metrics were VCR, AFVC, Precision, and F1.

The basic setup was the same as in Experiment 2. Four model configurations were trained – the baseline DenseNet121 from Experiment 2, baseline with BB added, baseline with adaptive normalization & diversification, and finally the complete BBW system.

The results are shown in Table 5, which demonstrates the gain in performance caused by the components of BBW. The table also shows that the performance of the complete BBW is better than its individual components. Firstly, we observe that adding BB to the baseline DenseNet121 slightly improves the classification performance. This is because BB arbitrarily corrects the imbalance, but it inevitably leads to instability in the learning process and overfitting of the minority class. Adding just the input adaptive normalization layer and the sample diversification layer does not significantly improve the classification performance, since the training is still extremely imbalanced. However, we see a clear improvement when wrapping the model with the full BBW framework. The result of the ablation study meets our expectations.

Table 5 Experiment 4: Ablation experiments on BBW

6.6 Experiment 5 - Normal definition of epoch

The aim was to investigate how imbalance ratios and sample patterns influence the loss and thereby the final learning results. The normal definition of epoch was used. In all other respects, the model was the same as the baseline in Experiment 2. The CHB-MIT dataset was used, and the performance metrics were VCR, AFVC, and Time.

The results of this experiment when epoch = 100 are shown in Table 6. We can see that the behavior is much more unstable, but some typical scenarios can still be analyzed by comparison with the other experiments. Comparing Tables 6 and 3, more than \(\frac{2}{3}\) of the VCR values in Experiment 2 (BBW) are better than the VCR values of Experiment 5, and in the remaining third almost all of the AFVC values in Experiment 2 are lower than those of Experiment 5. There are only four cases in Experiment 5 in which the results are better than those in Experiment 2, if we relax the constraints on epoch. Even so, the total processing time of Experiment 5 is 16.39 times longer than the total processing time of Experiment 2, i.e. BBW is 16.39 times faster.

Table 6 Experiment 5: Baseline using normal epoch definition

In Table 6, we can observe that some cases converge sufficiently (such as Case 22), while others cannot converge to an acceptable result (such as Case 6). This phenomenon can be explained as follows. The learning method will learn the patterns of the majority class quickly if the patterns are easy. The total loss of the majority samples will then be less than the total loss of the minority samples, which have not yet been well-learned. As learning proceeds, the discriminative ability for the minority class will catch up with that for the majority class, and the learning method will reach an acceptable convergence at some epoch. If the patterns of the samples are a little difficult for the learning method, the total loss of the majority class will remain greater than the total loss of the minority class, even after the learning method has preferentially learned the majority class. In fact, the learning method will always try to learn the majority class first, because it can always provide a larger total loss at the beginning of the learning process. Another situation that can lead to learning failure is when the patterns of the samples are too easy for the learning method, leading to premature overfitting of the majority class. To solve these problems, BB, stabilized by the BBW framework, forces the number of samples in each class in one batch to be balanced, and thereby also balances the total loss of each class in one batch, ensuring the learning method is not biased towards some class.

6.7 Experiment 6 - Normal epoch definition and training counts limited

The aim was to measure the performance of the baseline in Experiment 2 when using the normal definition of epoch and also limiting sample learning times.

The CHB-MIT dataset was used. The performance metrics were VCR, AFVC, and Time.

After conducting the previous experiment, we were curious about the performance of a DenseNet121 model that uses the normal epoch definition and the same total sample learning times as BBW in Experiment 2. We ensured that the total sample learning times in this experiment were the same as those of Experiment 2 by controlling the number of epochs. The epoch adjustment is calculated by (9) and (10), where Nms is, as before, the number of minority samples, Ns is the total number of samples in the dataset, ri is the imbalance ratio of the dataset, and ra denotes the epoch adjustment ratio, computed separately for programming convenience. The epoch in the formula is the epoch of Experiment 2, and epochnew is the epoch for this experiment, chosen so that both have the same total sample learning times. The ceiling function is used to prevent epochnew = 0.

$$ r_{a} = \frac{N_{ms} \times 2 - 2}{N_{s} - 1 - r_{i}} $$
(9)
$$ epoch_{new} = \lceil r_{a} \times epoch \rceil $$
(10)
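For reference, (9) and (10) can be computed as in the following sketch; the function and argument names are ours.

```python
import math

def adjusted_epochs(epoch, n_samples, n_minority, imbalance_ratio):
    """Epoch count for Experiment 6 (Eqs. 9-10): scale the Experiment 2 epoch so that a
    normal-epoch run sees the same total number of sample presentations as BBW."""
    r_a = (n_minority * 2 - 2) / (n_samples - 1 - imbalance_ratio)   # Eq. (9)
    return math.ceil(r_a * epoch)                                    # Eq. (10); ceiling prevents 0
```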

The results are shown in Table 7. We can observe that the system did not conduct valuable learning at all. However, the performance improved with the increase in the epoch. This indicates that the learning method is still in the early stage of the learning process. It again proves the efficiency of BBW. The total processing time here is slightly less than that for Experiment 2 because it has fewer cross-validation steps.

Table 7 Experiment 6: DenseNet121 using normal epoch definition and same total sample learning times as for BBW in Experiment 2

Lastly, it should be noted that almost all traditional machine learning and deep learning methods that support imbalanced datasets require either data preprocessing or additional hyper-parameters. Our method is an end-to-end method that supports imbalanced datasets and requires neither additional data preprocessing nor hyper-parameters. Additionally, BBW displayed the best performance in this extreme situation. The experiments designed in this work demonstrate the ability of the proposed BBW framework to perform well on extremely imbalanced datasets with few minority samples, and verify its core principles.

7 Conclusion

In summary, the proposed BBW framework can adapt general DNNs to be trained better on extremely imbalanced datasets with few minority samples. In essence, BBW performs downsampling of majority samples, and oversampling of minority samples. In addition, it carries out sample synthesis within the sample diversification layer. The input adaptive normalization layer in BBW allows DNNs to perform the normalization process automatically and natively. Moreover, BBW does not require data preprocessing or additional hyper-parameters, not even data normalization. Experimental results in this paper demonstrate the performance and efficiency of BBW.

In our early studies, we found that the ability to discriminate the minority class is very sensitive to the dropout method. This phenomenon needs further evidence and study. We will also attempt in future work to enrich the BBW framework and to combine it with other methods.