Introduction

Due to the growth of high-dimensional data and the limited number of samples, the "big P, small n" paradigm has become a major challenge in biomedical data mining1,2. In microarray profile datasets in particular, the number of genes is much larger than the number of samples, yet only a few feature genes are closely related to cancer3,4. Feature selection can remove irrelevant and redundant genes, improve the classification and diagnosis of cancer, and help improve its treatment5,6. According to their interaction with classifiers, feature selection methods can be divided into four categories: filter, embedded, wrapper, and hybrid methods7,8,9,10. Filter methods rank genes by the relevance of individual genes or by their ability to distinguish the target categories11,12. Embedded methods perform feature selection as part of model training, quickly selecting an optimal feature gene subset13,14. Wrapper methods usually combine a classification model with a heuristic search algorithm and select the optimal feature subset according to classification performance15,16,17. Although wrapper methods are less computationally efficient than filter methods, their classification performance is usually better18.

Hybrid methods generally combine a filter method with a wrapper method19,20. First, the filter method quickly removes irrelevant features on a large scale, reducing the candidate set; then the wrapper method selects the optimal feature gene subset. Hybrid methods can thus combine the computational efficiency of filters with the high classification performance of wrappers21. For example, Su et al. combined the K-S test with CFS and compared it with four advanced methods; the results show that the hybrid method is effective8. Pashaei22, Deng23, and Pirgazi24 also adopted hybrid filter-wrapper methods and achieved good results on many public cancer datasets. In recent years, an increasing number of researchers have considered hybrid filter-wrapper methods to select features from gene expression data25.

This paper presents a three-stage hybrid feature selection method, VEW, which combines the filter and wrapper approaches. In the first stage, we use a variance filter to remove genes that do not meet a variance threshold. In the second stage, we use the extremely randomized trees (ERT) algorithm to rank the importance of the genes retained in the first stage and further reduce the feature gene subset. In the third stage, we feed the resulting gene subset into the whale optimization algorithm (WOA) to obtain the optimal feature gene subset. Analysis and comparison of the experimental results verify that VEW has clear advantages in feature gene selection performance, the number of selected genes, and computation time. The main finding of this paper is that a three-stage hybrid algorithm combining filter and wrapper methods yields a significant performance improvement and is easy to implement.

The rest of this paper is organized as follows. First, we review related work and the principles of the variance filter, ERT, and WOA, and introduce the hybrid VEW algorithm in detail. In the Results section, based on 7 published cancer gene expression datasets, we compare the VEW method with 11 related feature selection algorithms as well as other advanced feature selection algorithms from the literature. Finally, we summarize the experimental results and directions for future work.

Methods

Variance filter

The variance filter is a simple filter method that can quickly remove low-variance genes with poor discriminative power. Michal Marczyk removed redundant feature genes from high-throughput data with an adaptive variance filter, which effectively improved cancer classification performance26. In this paper, we set the variance threshold to 0.05 to quickly screen feature genes over a large range.
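A minimal sketch of this first stage using scikit-learn's VarianceThreshold, assuming the expression matrix has samples in rows and genes in columns (the placeholder data and shapes are illustrative, not the paper's):

```python
# First-stage variance filter: drop genes whose variance is at or below 0.05.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.random.rand(100, 5000)            # placeholder expression matrix (samples x genes)
selector = VarianceThreshold(threshold=0.05)
X_filtered = selector.fit_transform(X)   # keeps genes with variance > 0.05
kept_genes = selector.get_support(indices=True)
print(X_filtered.shape, len(kept_genes))
```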

ERT

ERT is similar to a random forest in that it is a machine-learning algorithm composed of multiple decision trees. Unlike a random forest, ERT uses the entire training set to build each decision tree and splits nodes at randomly chosen cut points. Liang et al.27 identified promoters and their strength through feature selection with ERT.
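A sketch of the second-stage importance ranking with scikit-learn's ExtraTreesClassifier; the placeholder data and the tree count are illustrative, not the paper's exact settings:

```python
# Second-stage ERT screening: rank genes by importance, drop zero-importance genes.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

X_filtered = np.random.rand(100, 800)    # placeholder: genes kept by the variance filter
y = np.random.randint(0, 2, 100)         # placeholder class labels

# By default (bootstrap=False), each tree is trained on the full training set,
# and split thresholds are drawn at random, matching the ERT description above.
ert = ExtraTreesClassifier(n_estimators=100, random_state=0)
ert.fit(X_filtered, y)
kept = np.flatnonzero(ert.feature_importances_ > 0)
X_reduced = X_filtered[:, kept]
```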

WOA

Mirjalili (2016) proposed a swarm intelligence optimization algorithm based on the predatory behaviour of humpback whales: the WOA28. The WOA performs optimization by simulating the hunting behaviour of humpback whales in nature, including group search, encirclement, pursuit, and attack of prey. The WOA is divided into an exploration phase and an exploitation phase. In the exploration phase, whales search randomly for prey. In the exploitation phase, whales adopt two hunting modes: shrinking encirclement and the spiral bubble net. Figure 1 shows the workflow of the WOA. In the exploitation phase, whales hunt in the direction of the current optimal position. In shrinking encirclement mode, the optimal position in the whale population is taken as the prey, and the other individuals in the population tighten their encirclement and approach it. The position update is shown in Formulas (1) and (2):

$$D=\left|C\cdot {X}_{q,t}-{X}_{i,t}\right|$$
(1)
$${X}_{i,t+1}={X}_{q,t}-A\cdot D$$
(2)

where \({X}_{q,t}\) is the current optimal solution, \({X}_{i,t}\) is the current whale individual, and \(D\) is the distance between the current whale individual and the current optimal solution.

Figure 1. WOA workflow.

Here, \(A\) is the convergence factor and \(C\) is the disturbance factor; their calculation formulas are shown in (3) and (4):

$$A=2a\cdot {rand}_{1}-a$$
(3)
$$C=2\cdot {rand}_{2}$$
(4)

where \({rand}_{1}\) and \({rand}_{2}\) are random numbers in \([0,1]\). The coefficient \(a=2-2t/T\) decreases linearly from 2 to 0, where \(t\) is the current iteration number and \(T\) is the maximum iteration number.

In spiral bubble-net mode, the whale calculates its distance to the prey and then moves along a spiral path, spitting out bubbles to corral the prey. The spiral position update is shown in (5) and (6):

$${D}^{\prime}=\left|{X}_{q,t}-{X}_{i,t}\right|$$
(5)
$${X}_{i,t+1}={X}_{q,t}+{D}^{\prime}\cdot {e}^{bl}\cos\left(2\pi l\right)$$
(6)

where \(D^{\prime}=|X_{q,t}-X_{i,t}|\) is the distance between the individual whale and the prey, \(b\) is a constant defining the shape of the spiral, and \(l\) is a random number in \([-1,1]\). The whale encircles its prey while spiralling inward. The algorithm chooses between these two modes through the random variable \(p\) and updates the whale position as shown in Formula (7):

$${X}_{i,t+1}=\begin{cases}{X}_{q,t}+{D}^{\prime}\cdot {e}^{bl}\cos\left(2\pi l\right), & p\ge 0.5\\ {X}_{q,t}-A\cdot D, & p<0.5\end{cases}$$
(7)

In the exploration phase, humpback whales do not know the location of the prey and instead select a random individual in the population as the target of the search (in the standard WOA, this occurs when \(|A|\ge 1\)). The random search position update is shown in Formulas (8) and (9):

$$D=\left|{C\cdot X}_{rand,t}-{X}_{i,t}\right|$$
(8)
$${X}_{i,t+1}={X}_{rand,t}-A\cdot D$$
(9)

where \({X}_{rand,t}\) is the position of the randomly selected whale and \(D\) is the distance from the current whale to the randomly selected whale.
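A compact sketch of one WOA position update implementing Formulas (1)-(9). In the original WOA, \(A\) and \(C\) are drawn per dimension; here they are scalars for brevity, and positions are clipped to \([0,1]\) to match the encoding rule described below. The spiral constant default is illustrative:

```python
# One WOA iteration over a population X of shape (N, D); X_best is the current
# best position, t the iteration index, T the maximum number of iterations.
import numpy as np

def woa_update(X, X_best, t, T, b=1.0):
    N, D = X.shape
    a = 2 - 2 * t / T                       # a decreases linearly from 2 to 0
    X_new = np.empty_like(X)
    for i in range(N):
        A = 2 * a * np.random.rand() - a    # Formula (3)
        C = 2 * np.random.rand()            # Formula (4)
        p = np.random.rand()
        if p >= 0.5:                        # spiral bubble net, Formulas (5)-(6)
            l = np.random.uniform(-1, 1)
            D_prime = np.abs(X_best - X[i])
            X_new[i] = X_best + D_prime * np.exp(b * l) * np.cos(2 * np.pi * l)
        elif abs(A) < 1:                    # shrinking encirclement, Formulas (1)-(2)
            D_enc = np.abs(C * X_best - X[i])
            X_new[i] = X_best - A * D_enc
        else:                               # random exploration, Formulas (8)-(9)
            X_rand = X[np.random.randint(N)]
            D_rand = np.abs(C * X_rand - X[i])
            X_new[i] = X_rand - A * D_rand
    return np.clip(X_new, 0, 1)             # keep positions in [0, 1] for encoding
```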

Coding rules

We set the whale individual position as \(X=\{x_{1},\cdots ,x_{n}\}\), \(x_{i}\in [0,1]\), and convert \(X\) to the binary position \(X^{\prime}=\{x^{\prime}_{1},\cdots ,x^{\prime}_{n}\}\), \(x^{\prime}_{i}\in \{0,1\}\), of length \(n\). Here, \(x^{\prime}_{i}=1\) indicates that the \(i\)-th feature is selected, and \(x^{\prime}_{i}=0\) indicates that it is not. The WOA adopts the binary encoding shown in Formula (10):

$${x}^{\prime}_{i}=\begin{cases}1, & if\ rand<{x}_{i}\\ 0, & otherwise\end{cases}$$
(10)

where \(x_{i}\) is the value of the \(i\)-th dimension of the whale's position \(X\), and \(rand\) is a random number in \([0,1]\).
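A sketch of Formula (10): each continuous position component is thresholded against a fresh random number, so larger \(x_{i}\) values are more likely to select the corresponding gene:

```python
# Binary encoding of a continuous whale position (Formula (10)).
import numpy as np

def binarize(x):
    return (np.random.rand(len(x)) < x).astype(int)  # 1 = gene selected

mask = binarize(np.array([0.9, 0.1, 0.5]))  # e.g. array([1, 0, 1]); result is random
```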

Fitness function

The fitness function evaluates the quality of each feature subset. In this paper, the classification error of a \(KNN\) classifier is combined with the subset size to form the fitness function of the classification problem, as shown in Formula (11):

$$fitness=\alpha \left(1-{KNN}_{acc}\right)+\left(1-\alpha \right)\cdot \frac{\left|R\right|}{\left|C\right|}$$
(11)

where \(|R|\) is the length of the selected feature subset and \(|C|\) is the total number of features. \({KNN}_{acc}\) is the classification accuracy using the \(KNN\) classifier, and \(\alpha \) is the weight coefficient. In this paper, we set \(\alpha =0.99\).
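A sketch of Formula (11) with \(\alpha =0.99\). The paper does not specify how \(KNN_{acc}\) is computed inside the fitness evaluation, so cross-validated accuracy with cv=5 is an assumption here:

```python
# Fitness of a binary gene mask (Formula (11)): weighted KNN error plus subset-size penalty.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X, y, alpha=0.99):
    if mask.sum() == 0:                     # an empty subset cannot be classified
        return 1.0
    acc = cross_val_score(KNeighborsClassifier(),
                          X[:, mask.astype(bool)], y, cv=5).mean()
    return alpha * (1 - acc) + (1 - alpha) * mask.sum() / len(mask)
```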

VEW

In this paper, we propose a three-stage gene selection method, VEW, which combines a variance filter, ERT, and the WOA. In the first stage, we use the variance filter to screen genes, retaining the feature genes whose variance exceeds the threshold. In the second stage, we use ERT to calculate an importance score for each gene and further screen the genes by eliminating those with a score of zero. Finally, we use the WOA to obtain the optimal subset of feature genes. The pseudocode of VEW is shown in Algorithm 1, and Figure 2 shows the gene selection process of the VEW algorithm.

We also discuss the time complexity of VEW, which is dominated by two stages. The time complexity of ERT in the second stage is \(O(M\times (mn\log n))\), where \(M\) is the number of decision trees, \(n\) is the number of genes per sample, and \(m\) is the number of feature genes. In the third stage, the time complexity of the WOA is \(O(N\times T\times D)\), where \(N\) is the whale population size, \(T\) is the maximum number of iterations, and \(D\) is the problem dimension, i.e., the number of genes retained after the second stage. The first two stages involve only simple filtering and sorting of the gene set and are therefore fast. Because \(O(N\times T\times D)\gg O(M\times (mn\log n))\), the running time of the algorithm is concentrated in the third stage.

Algorithm 1. Pseudocode of VEW.

Figure 2. VEW workflow.
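An end-to-end sketch of the three VEW stages, reusing the woa_update, binarize, and fitness sketches above (assumed to be in scope). The population size and iteration count are illustrative, not the paper's exact settings:

```python
# Three-stage VEW pipeline: variance filter -> ERT screening -> WOA search.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import VarianceThreshold

def vew(X, y, n_whales=20, max_iter=50):
    # Stage 1: variance filter removes low-variance genes.
    vt = VarianceThreshold(threshold=0.05)
    X1 = vt.fit_transform(X)
    idx1 = vt.get_support(indices=True)
    # Stage 2: ERT importance ranking drops zero-importance genes.
    ert = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X1, y)
    keep = np.flatnonzero(ert.feature_importances_ > 0)
    X2, idx2 = X1[:, keep], idx1[keep]
    # Stage 3: WOA searches the reduced space for the optimal gene subset.
    P = np.random.rand(n_whales, X2.shape[1])           # continuous positions
    best_pos, best_fit = P[0].copy(), np.inf
    for t in range(max_iter):
        for i in range(n_whales):
            f = fitness(binarize(P[i]), X2, y)
            if f < best_fit:
                best_fit, best_pos = f, P[i].copy()
        P = woa_update(P, best_pos, t, max_iter)
    return idx2[binarize(best_pos).astype(bool)]        # indices of selected genes
```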

Results

Data and parameter setting

To evaluate the performance of each algorithm, seven microarray gene expression profile datasets are used in this paper. All datasets come from public websites: http://csse.szu.edu.cn/staff/zhuzx/Datasets.Html29 and https://github.com/Pengeace/MGRFE-GaRFE25. Table S1 in the supplementary material provides a detailed overview of the seven microarray datasets, including the numbers of samples, genes, and classes. In these datasets, the number of genes ranges from 7129 to 54,675, and the number of samples is less than 300. The datasets comprise acute lymphoblastic leukemia type L3 (ALL3), gastric 1 (Gas1), type 1 diabetes (T1D), myeloma (Mye), ovarian cancer (Ova), leukemia (Leuk), and mixed-lineage leukemia (MLL). Only the MLL dataset has three classes; the others are binary. The class distribution in most datasets is imbalanced.

In data preprocessing, we fill in missing values with the mean and map the data values to \([0,1]\) using min-max normalization. All experimental results in this paper are generated on a PC equipped with a Core i7-8750 CPU (2.20 GHz) and 16 GB of memory. All algorithms are implemented in Python using two public packages, scikit-learn and scikit-feature.

We use three different external classifiers to evaluate the performance of each algorithm: the decision tree (DT), support vector machine (SVM), and logistic regression (LR). The classification performance of each algorithm is recorded after tenfold cross-validation with each standard classifier. Tenfold cross-validation randomly divides the dataset into 10 parts; in each fold, nine parts are used for training and the remaining part for testing.

We compare the VEW method with 11 different methods as well as methods from the literature. The 11 methods are the t-test (T), Wilcoxon test (W), variance filter-univariate feature selection (VU), extremely randomized trees-univariate feature selection (EU), variance filter-extremely randomized trees (VE), variance filter-bat algorithm (VB), extremely randomized trees-bat algorithm (EB), variance filter-firefly algorithm (VF), extremely randomized trees-firefly algorithm (EF), variance filter-whale optimization algorithm (VW), and extremely randomized trees-whale optimization algorithm (EW). Table S2 in the supplementary material lists the specific parameter values of each algorithm and classifier. All experiments are run independently 10 times, and the average values are reported. Seven evaluation criteria reflect the performance of each algorithm: the number of selected genes, classification accuracy (Acc), precision (Pre), recall (Recall), F1-score (F1), standard deviation (SD), and running time. The formulas for the four main evaluation criteria are as follows:

$$Acc=\frac{TN+TP}{P+N}$$
(12)
$$Precision=\frac{TP}{TP+FP}$$
(13)
$$Recall=\frac{TP}{TP+FN}$$
(14)
$$F1=2\cdot \frac{Precision\cdot Recall}{Precision+Recall}$$
(15)

Here, \(P\) is the number of positive samples and \(N\) is the number of negative samples. True positive (TP): the true class of the sample is positive, and the model prediction is also positive. True negative (TN): the true class is negative, and the prediction is also negative. False positive (FP): the true class is negative, but the prediction is positive. False negative (FN): the true class is positive, but the prediction is negative. Because precision, recall, and F1 are defined per class, we assign the same weight to each class and report their macro-averaged values.
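A sketch of the evaluation criteria in Formulas (12)-(15) using scikit-learn, with macro averaging (equal weight per class) as described above:

```python
# Compute the four main evaluation criteria from true and predicted labels.
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

def evaluate(y_true, y_pred):
    return {
        "Acc": accuracy_score(y_true, y_pred),
        "Pre": precision_score(y_true, y_pred, average="macro"),
        "Recall": recall_score(y_true, y_pred, average="macro"),
        "F1": f1_score(y_true, y_pred, average="macro"),
    }
```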

Comparison of performance

We comprehensively compare the VEW method with T, W, VU, EU, VE, VB, EB, VF, EF, VW, and EW. The best values for each dataset are highlighted in bold. Tables 1, 2 and 3 show the values of the four evaluation criteria for each algorithm on the three classifiers. As Table 1 shows, on DT the VEW method has clear advantages over the other methods: it wins 7, 6, 7, and 6 times on Acc, Pre, Recall, and F1, respectively. Its average Acc over the seven datasets reaches 86.47%, significantly better than the other methods, with 100% Acc on the ovarian dataset. As shown in Table 2, on the SVM the VEW method wins 6 times on each of the four evaluation criteria, with 100% Acc on the ovarian and leukemia datasets and an average Acc of 89.00%. As shown in Table 3, on LR the VEW method again wins 6 times on each of the four evaluation criteria and reaches 100% Acc on both the leukemia and ovarian datasets; its average Acc is significantly higher than that of the other methods, reaching a maximum of 89.58%. In summary, compared with the other methods, VEW has clear advantages in Acc, Pre, Recall, and F1, especially on DT, and achieves the highest average Acc on LR. This confirms that the hybrid method proposed in this paper effectively improves performance on every index.

Table S3 in the supplementary material lists the number of genes selected by each algorithm on the seven datasets. The average number of genes selected by the VEW method is the lowest. VW and EW select fewer genes than VEW on three of the datasets, but combined with the other indicators, we find that this advantage comes at the expense of other performance metrics. Moreover, on most datasets, the number of genes selected by the VEW method is only 1/4 to 1/250 of that of the other comparative methods. These experiments show that VEW combines the advantages of its constituent methods and selects fewer feature genes without sacrificing performance.

To further verify the performance advantages of the VEW method, we compare it with other advanced algorithms from the literature. Table 4 lists the comparison results, where "\" indicates missing experimental data. The results in Table 4 show that, compared with other advanced algorithms, the VEW method is also competitive in Acc.

Table 1 Performance comparison of algorithms on classifier DT.
Table 2 Performance comparison of algorithms on classifier SVM.
Table 3 Performance comparison of algorithms on classifier LR.
Table 4 Comparison between the VEW and other advanced methods in Acc.

Comparison of running time

We analyse the running time of all algorithms; Table 5 lists the average running time of each algorithm on each dataset. The results show that the EF method has the longest average running time and the VE method the shortest. The running time of VEW is less than that of T, W, VB, EB, VF, EF, VW, and EW and greater than that of VU, EU, and VE. Given the earlier analysis, the other comparison methods are clearly inferior to VEW in Acc, the number of selected genes, and the other criteria. This shows that the hybrid design of VEW improves those metrics while keeping the overall running time short.

Table 5 Comparison of the running time (unit: 10 s) between the VEW and other methods.

Biological inferences

Due to the randomness of the VEW method, multiple runs may produce results with the same performance but different feature genes. We adopt the following principles to resolve this: (1) results with high Acc across multiple classifiers are preferred; (2) when Acc is the same, the smaller feature gene subset is preferred; (3) when both Acc and the number of feature genes are the same, the gene subset selected with the highest frequency is chosen. Table 6 lists, for each dataset, the size of the optimal gene subset, the probe/UniProt IDs, and the average Acc on the different classifiers selected by VEW after 10 independent runs. To test the effectiveness of VEW in selecting cancer-related biomarkers, we perform biological inference on the best gene subsets (partial genes) selected in three of the datasets. Tables S4–S6 in the supplementary material list the probe/UniProt ID, gene name, and functional description corresponding to the best gene subset selected by VEW on these three datasets.

Table 6 Optimal subset of genes selected by the VEW.

Forgione et al. found that KMT2A is associated with ALL and that KMT2A rearrangement is a driver of highly pathogenic leukemia30. FASN is the only human lipogenic enzyme capable of de novo fatty acid synthesis and is highly expressed in cancer cells; reducing FASN expression can sensitize ALL cells to differentiation therapy31. Vojta et al. determined that MGAT5B is widely associated with a variety of cancer types, including gastric cancer, and may have potential value for disease prognosis32. Rosenblum et al. found that DPP7 plays an important role in regulating peptide hormone signalling and can serve as an emerging target for a variety of cancers, including myeloma33. ITGAX is closely related to the treatment of multiple cancers, but its correlation with myeloma needs further study34. Gao et al. found that MUC1 is a potential target for developing drugs for myeloma patients and that MUC1-based cancer vaccines can effectively prevent cancer progression and metastasis35. Similarly, PA2G4 plays an important role in the progression and spread of myeloma and can serve as a potential new therapeutic target36. These results show the validity of VEW in biological inference and the practicability of the proposed method. Among all the evaluation criteria, Acc is the most important, so we also test the performance of the VEW method when \(\alpha\) takes different values. As shown in Table S7, the algorithm performs best when \(\alpha =0.99\); therefore, we set \(\alpha =0.99\).

Conclusion

The purpose of VEW is to select effective feature genes from high-dimensional gene expression data. Unlike other similar methods, VEW is a three-stage hybrid method that combines its three constituent methods well: we quickly screen feature genes over a large range with a variance filter and ERT and then accurately screen a small range with the WOA. This improves performance and reduces time consumption. The results in Tables S4–S6 show that our method selects important tumour-related genes in multiple datasets, and the findings of other researchers verify, from a medical perspective, the effectiveness and practicability of the genes selected by VEW. The results in Tables 1, 2, 3, 4 and 5 show that VEW significantly improves performance while reducing running time. The number of genes selected by VEW on every dataset is no more than 10; Acc reaches 100% on the ovarian and leukemia datasets, and the average Acc over multiple datasets reaches 89.58%. Compared with other advanced algorithms, VEW has clear advantages in the number of selected genes, Acc, Precision, Recall, F1, and running time.

As shown in Table S7, we also test the performance of the VEW method on different datasets when \(\alpha\) takes different values, which justifies our choice of \(\alpha\). Because the variance filter is simple and efficient, we use it first to filter out redundant genes; using ERT in the second stage further narrows the scope of gene screening, increases the randomness of the screening process, and helps avoid local optima. The results of the basic WOA in the third stage also show that this design significantly improves the overall performance. We believe that adding ERT to the three-stage hybrid algorithm is the key reason for the performance improvement: ERT increases the randomness of the overall algorithm and further ranks and filters the gene subset, which also increases the screening accuracy of the WOA. However, the method proposed in this paper also has limitations. For example, the basic WOA suffers from low accuracy, slow convergence, and a tendency to become trapped in local optima. In addition, the filter used in the first stage must be chosen carefully to improve the performance of the overall algorithm. In future research, we plan to select better filter and wrapper methods and combine them with ERT to form new three-stage hybrid algorithms with improved overall performance.