Introduction

Due to the growth of high-dimensional data and the limited number of samples, the "big P, small n" paradigm has become a major challenge in biomedical data mining1,2. In microarray profile datasets in particular, the number of genes is much larger than the number of samples, yet only a few feature genes are closely related to cancer3,4. Feature selection can remove irrelevant and redundant genes, improve the classification and diagnosis of cancer, and help improve its treatment5,6. According to their interaction with classifiers, feature selection methods can be divided into four categories: filter, embedded, wrapper, and hybrid methods7,8,9,10. Filter methods rank genes by the relevance of individual genes or by their ability to distinguish the target categories11,12. Embedded methods perform feature selection as part of model training, quickly selecting an optimal feature gene subset13,14. Wrapper methods usually combine a classification model with a heuristic search algorithm and select the optimal feature subset according to classification performance15,16,17. Although wrapper methods are less computationally efficient than filter methods, their classification performance is usually better18.

Hybrid methods generally combine a filter method with a wrapper method19,20. First, the filter method quickly removes irrelevant features on a large scale, reducing the candidate set; then the wrapper method selects the optimal feature gene subset. Hybrid methods can thus combine the computational efficiency of filters with the high classification performance of wrappers21. For example, Su et al. combined the K-S test with CFS and compared it with four advanced methods; the results show that the hybrid method is effective8. Pashaei22, Deng23, and Pirgazi24 also adopted hybrid filter-wrapper methods and achieved good results on many public cancer datasets. In recent years, an increasing number of researchers have considered hybrid filter-wrapper methods to select features from gene expression data25.

This paper presents a three-stage hybrid feature selection method, VEW, which combines the filter and wrapper approaches. In the first stage, we use a variance filter to remove genes that do not meet a variance threshold. In the second stage, we use the extremely randomized trees (ERT) algorithm to rank the importance of the genes retained in the first stage and further reduce the feature gene subset. In the third stage, we feed the resulting gene subset into the whale optimization algorithm (WOA) to obtain the optimal feature gene subset. Analysis and comparison of the experimental results verify that VEW has clear advantages in feature gene selection performance, the number of selected genes, and computation time. The main finding of this paper is that a three-stage hybrid algorithm combining filter and wrapper methods yields a significant performance improvement and is easy to implement.

The rest of this paper is organized as follows. First, we review related work and the principles of the variance filter, ERT, and WOA, and introduce the hybrid VEW algorithm in detail. In the Results section, based on 7 published cancer gene expression datasets, we compare the VEW method with 11 related feature selection algorithms as well as other advanced feature selection algorithms from the literature. Finally, we summarize the experimental results and directions for future work.

Methods

Variance filter

The variance filter is a simple filter method that can quickly remove low-variance genes with poor discriminative power. Michal Marczyk removed redundant feature genes from high-throughput data with an adaptive variance filter, which effectively improved cancer classification performance26. In this paper, we set the variance threshold to 0.05 to quickly screen feature genes over a large range.
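A minimal sketch of this first stage using scikit-learn's VarianceThreshold, assuming the expression matrix has samples in rows and genes in columns (the placeholder data and shapes are illustrative, not the paper's):

```python
# First-stage variance filter: drop genes whose variance is at or below 0.05.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.random.rand(100, 5000)            # placeholder expression matrix (samples x genes)
selector = VarianceThreshold(threshold=0.05)
X_filtered = selector.fit_transform(X)   # keeps genes with variance > 0.05
kept_genes = selector.get_support(indices=True)
print(X_filtered.shape, len(kept_genes))
```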

ERT

ERT is similar to a random forest in that it is a machine-learning algorithm composed of multiple decision trees. Unlike a random forest, ERT uses the entire training set to build each decision tree and splits nodes at randomly chosen cut points. Liang et al.27 identified promoters and their strength through feature selection with ERT.
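A sketch of the second-stage importance ranking with scikit-learn's ExtraTreesClassifier; the placeholder data and the tree count are illustrative, not the paper's exact settings:

```python
# Second-stage ERT screening: rank genes by importance, drop zero-importance genes.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

X_filtered = np.random.rand(100, 800)    # placeholder: genes kept by the variance filter
y = np.random.randint(0, 2, 100)         # placeholder class labels

# By default (bootstrap=False), each tree is trained on the full training set,
# and split thresholds are drawn at random, matching the ERT description above.
ert = ExtraTreesClassifier(n_estimators=100, random_state=0)
ert.fit(X_filtered, y)
kept = np.flatnonzero(ert.feature_importances_ > 0)
X_reduced = X_filtered[:, kept]
```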

WOA

Mirjalili (2016) proposed a swarm intelligence optimization algorithm based on the predatory behaviour of humpback whales: the WOA28. The WOA performs optimization by simulating the hunting behaviour of humpback whales in nature, including group search, encirclement, pursuit, and attack of prey. The WOA is divided into an exploration phase and an exploitation phase. In the exploration phase, whales search randomly for prey. In the exploitation phase, whales adopt two hunting modes: shrinking encirclement and the spiral bubble net. Figure 1 shows the workflow of the WOA. In the exploitation phase, whales hunt in the direction of the current optimal position. In shrinking encirclement mode, the optimal position in the whale population is taken as the prey, and the other individuals in the population tighten their encirclement and approach it. The position update is shown in Formulas (1) and (2):

$$D=\left|C\cdot {X}_{q,t}-{X}_{i,t}\right|$$
(1)
$${X}_{i,t+1}={X}_{q,t}-A\cdot D$$
(2)

where \({X}_{q,t}\) is the current optimal solution, \({X}_{i,t}\) is the current whale individual, and \(D\) is the distance between the current whale individual and the current optimal solution.

Figure 1. WOA workflow.

Here, \(A\) is the convergence factor and \(C\) is the disturbance factor; their calculation formulas are shown in (3) and (4):

$$A=2a\cdot {rand}_{1}-a$$
(3)
$$C=2\cdot {rand}_{2}$$
(4)

where \({rand}_{1}\) and \({rand}_{2}\) are random numbers in \([0,1]\). The coefficient \(a=2-2t/T\) decreases linearly from 2 to 0, where \(t\) is the current iteration number and \(T\) is the maximum iteration number.

In spiral bubble-net mode, the whale calculates its distance to the prey and then moves along a spiral path, spitting out bubbles to corral the prey. The spiral position update is shown in (5) and (6):

$${D}^{\prime}=\left|{X}_{q,t}-{X}_{i,t}\right|$$
(5)
$${X}_{i,t+1}={X}_{q,t}+{D}^{\prime}\cdot {e}^{bl}\cos\left(2\pi l\right)$$
(6)

where \(D^{\prime}=|X_{q,t}-X_{i,t}|\) is the distance between the individual whale and the prey, \(b\) is a constant defining the shape of the spiral, and \(l\) is a random number in \([-1,1]\). The whale encircles its prey while spiralling inward. The algorithm chooses between these two modes through the random variable \(p\) and updates the whale position as shown in Formula (7):

$${X}_{i,t+1}=\begin{cases}{X}_{q,t}+{D}^{\prime}\cdot {e}^{bl}\cos\left(2\pi l\right), & p\ge 0.5\\ {X}_{q,t}-A\cdot D, & p<0.5\end{cases}$$
(7)

In the exploration phase, humpback whales do not know the location of the prey and instead select a random individual in the population as the target of the search (in the standard WOA, this occurs when \(|A|\ge 1\)). The random search position update is shown in Formulas (8) and (9):

$$D=\left|{C\cdot X}_{rand,t}-{X}_{i,t}\right|$$
(8)
$${X}_{i,t+1}={X}_{rand,t}-A\cdot D$$
(9)

where \({X}_{rand,t}\) is the position of the randomly selected whale and \(D\) is the distance from the current whale to the randomly selected whale.
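A compact sketch of one WOA position update implementing Formulas (1)-(9). In the original WOA, \(A\) and \(C\) are drawn per dimension; here they are scalars for brevity, and positions are clipped to \([0,1]\) to match the encoding rule described below. The spiral constant default is illustrative:

```python
# One WOA iteration over a population X of shape (N, D); X_best is the current
# best position, t the iteration index, T the maximum number of iterations.
import numpy as np

def woa_update(X, X_best, t, T, b=1.0):
    N, D = X.shape
    a = 2 - 2 * t / T                       # a decreases linearly from 2 to 0
    X_new = np.empty_like(X)
    for i in range(N):
        A = 2 * a * np.random.rand() - a    # Formula (3)
        C = 2 * np.random.rand()            # Formula (4)
        p = np.random.rand()
        if p >= 0.5:                        # spiral bubble net, Formulas (5)-(6)
            l = np.random.uniform(-1, 1)
            D_prime = np.abs(X_best - X[i])
            X_new[i] = X_best + D_prime * np.exp(b * l) * np.cos(2 * np.pi * l)
        elif abs(A) < 1:                    # shrinking encirclement, Formulas (1)-(2)
            D_enc = np.abs(C * X_best - X[i])
            X_new[i] = X_best - A * D_enc
        else:                               # random exploration, Formulas (8)-(9)
            X_rand = X[np.random.randint(N)]
            D_rand = np.abs(C * X_rand - X[i])
            X_new[i] = X_rand - A * D_rand
    return np.clip(X_new, 0, 1)             # keep positions in [0, 1] for encoding
```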

Coding rules

We set the whale individual position as \(X=\{x_{1},\cdots ,x_{n}\}\), \(x_{i}\in [0,1]\), and convert \(X\) to the binary position \(X^{\prime}=\{x^{\prime}_{1},\cdots ,x^{\prime}_{n}\}\), \(x^{\prime}_{i}\in \{0,1\}\), of length \(n\). Here, \(x^{\prime}_{i}=1\) indicates that the \(i\)-th feature is selected, and \(x^{\prime}_{i}=0\) indicates that it is not. The WOA adopts the binary encoding shown in Formula (10):

$${x}^{\prime}_{i}=\begin{cases}1, & if\ rand<{x}_{i}\\ 0, & otherwise\end{cases}$$
(10)

where \(x_{i}\) is the value of the \(i\)-th dimension of the whale's position \(X\), and \(rand\) is a random number in \([0,1]\).
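A sketch of Formula (10): each continuous position component is thresholded against a fresh random number, so larger \(x_{i}\) values are more likely to select the corresponding gene:

```python
# Binary encoding of a continuous whale position (Formula (10)).
import numpy as np

def binarize(x):
    return (np.random.rand(len(x)) < x).astype(int)  # 1 = gene selected

mask = binarize(np.array([0.9, 0.1, 0.5]))  # e.g. array([1, 0, 1]); result is random
```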

Fitness function

The fitness function evaluates the quality of each feature subset. In this paper, the classification error of a \(KNN\) classifier is combined with the subset size to form the fitness function of the classification problem, as shown in Formula (11):

$$fitness=\alpha \left(1-{KNN}_{acc}\right)+\left(1-\alpha \right)\cdot \frac{\left|R\right|}{\left|C\right|}$$
(11)

where \(|R|\) is the length of the selected feature subset and \(|C|\) is the total number of features. \({KNN}_{acc}\) is the classification accuracy using the \(KNN\) classifier, and \(\alpha \) is the weight coefficient. In this paper, we set \(\alpha =0.99\).
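A sketch of Formula (11) with \(\alpha =0.99\). The paper does not specify how \(KNN_{acc}\) is computed inside the fitness evaluation, so cross-validated accuracy with cv=5 is an assumption here:

```python
# Fitness of a binary gene mask (Formula (11)): weighted KNN error plus subset-size penalty.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X, y, alpha=0.99):
    if mask.sum() == 0:                     # an empty subset cannot be classified
        return 1.0
    acc = cross_val_score(KNeighborsClassifier(),
                          X[:, mask.astype(bool)], y, cv=5).mean()
    return alpha * (1 - acc) + (1 - alpha) * mask.sum() / len(mask)
```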

VEW

In this paper, we propose a three-stage gene selection method, VEW, which combines a variance filter, ERT, and the WOA. In the first stage, we use the variance filter to screen genes, retaining the feature genes whose variance exceeds the threshold. In the second stage, we use ERT to calculate an importance score for each gene and further screen the genes by eliminating those with a score of zero. Finally, we use the WOA to obtain the optimal subset of feature genes. The pseudocode of VEW is shown in Algorithm 1, and Figure 2 shows the gene selection process of the VEW algorithm.

We also discuss the time complexity of VEW, which is dominated by two stages. The time complexity of ERT in the second stage is \(O(M\times (mn\log n))\), where \(M\) is the number of decision trees, \(n\) is the number of genes per sample, and \(m\) is the number of feature genes. In the third stage, the time complexity of the WOA is \(O(N\times T\times D)\), where \(N\) is the whale population size, \(T\) is the maximum number of iterations, and \(D\) is the problem dimension, i.e., the number of genes retained after the second stage. The first two stages involve only simple filtering and sorting of the gene set and are therefore fast. Because \(O(N\times T\times D)\gg O(M\times (mn\log n))\), the running time of the algorithm is concentrated in the third stage.

Algorithm 1. Pseudocode of VEW.

Figure 2. VEW workflow.
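An end-to-end sketch of the three VEW stages, reusing the woa_update, binarize, and fitness sketches above (assumed to be in scope). The population size and iteration count are illustrative, not the paper's exact settings:

```python
# Three-stage VEW pipeline: variance filter -> ERT screening -> WOA search.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import VarianceThreshold

def vew(X, y, n_whales=20, max_iter=50):
    # Stage 1: variance filter removes low-variance genes.
    vt = VarianceThreshold(threshold=0.05)
    X1 = vt.fit_transform(X)
    idx1 = vt.get_support(indices=True)
    # Stage 2: ERT importance ranking drops zero-importance genes.
    ert = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X1, y)
    keep = np.flatnonzero(ert.feature_importances_ > 0)
    X2, idx2 = X1[:, keep], idx1[keep]
    # Stage 3: WOA searches the reduced space for the optimal gene subset.
    P = np.random.rand(n_whales, X2.shape[1])           # continuous positions
    best_pos, best_fit = P[0].copy(), np.inf
    for t in range(max_iter):
        for i in range(n_whales):
            f = fitness(binarize(P[i]), X2, y)
            if f < best_fit:
                best_fit, best_pos = f, P[i].copy()
        P = woa_update(P, best_pos, t, max_iter)
    return idx2[binarize(best_pos).astype(bool)]        # indices of selected genes
```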

Results

Data and parameter setting

To evaluate the performance of each algorithm, seven microarray gene expression profile datasets are used in this paper. All datasets come from public websites: http://csse.szu.edu.cn/staff/zhuzx/Datasets.Html29 and https://github.com/Pengeace/MGRFE-GaRFE25. Table S1 in the supplementary material provides a detailed overview of the seven microarray datasets, including the numbers of samples, genes, and classes. In these datasets, the number of genes ranges from 7129 to 54,675, and the number of samples is less than 300. The datasets comprise acute lymphoblastic leukemia type L3 (ALL3), gastric 1 (Gas1), type 1 diabetes (T1D), myeloma (Mye), ovarian cancer (Ova), leukemia (Leuk), and mixed-lineage leukemia (MLL). Only the MLL dataset has three classes; the others are binary. The class distribution in most datasets is imbalanced.

In data preprocessing, we fill in missing values with the mean and map the data values to \([0,1]\) using min-max normalization. All experimental results in this paper are generated on a PC equipped with a Core i7-8750 CPU (2.20 GHz) and 16 GB of memory. All algorithms are implemented in Python using two public packages, scikit-learn and scikit-feature.

We use three different external classifiers to evaluate the performance of each algorithm: the decision tree (DT), support vector machine (SVM), and logistic regression (LR). The classification performance of each algorithm is recorded after tenfold cross-validation with each standard classifier. Tenfold cross-validation randomly divides the dataset into 10 parts; in each fold, nine parts are used for training and the remaining part for testing.

We compare the VEW method with 11 different methods as well as methods from the literature. The 11 methods are the t-test (T), Wilcoxon test (W), variance filter-univariate feature selection (VU), extremely randomized trees-univariate feature selection (EU), variance filter-extremely randomized trees (VE), variance filter-bat algorithm (VB), extremely randomized trees-bat algorithm (EB), variance filter-firefly algorithm (VF), extremely randomized trees-firefly algorithm (EF), variance filter-whale optimization algorithm (VW), and extremely randomized trees-whale optimization algorithm (EW). Table S2 in the supplementary material lists the specific parameter values of each algorithm and classifier. All experiments are run independently 10 times, and the average values are reported. Seven evaluation criteria reflect the performance of each algorithm: the number of selected genes, classification accuracy (Acc), precision (Pre), recall (Recall), F1-score (F1), standard deviation (SD), and running time. The formulas for the four main evaluation criteria are as follows:

$$Acc=\frac{TN+TP}{P+N}$$
(12)
$$Precision=\frac{TP}{TP+FP}$$
(13)
$$Recall=\frac{TP}{TP+FN}$$
(14)
$$F1=2\cdot \frac{Precision\cdot Recall}{Precision+Recall}$$
(15)

Here, \(P\) is the number of positive samples and \(N\) is the number of negative samples. True positive (TP): the true class of the sample is positive, and the model prediction is also positive. True negative (TN): the true class is negative, and the prediction is also negative. False positive (FP): the true class is negative, but the prediction is positive. False negative (FN): the true class is positive, but the prediction is negative. Because precision, recall, and F1 are defined per class, we assign the same weight to each class and report their macro-averaged values.
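A sketch of the evaluation criteria in Formulas (12)-(15) using scikit-learn, with macro averaging (equal weight per class) as described above:

```python
# Compute the four main evaluation criteria from true and predicted labels.
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

def evaluate(y_true, y_pred):
    return {
        "Acc": accuracy_score(y_true, y_pred),
        "Pre": precision_score(y_true, y_pred, average="macro"),
        "Recall": recall_score(y_true, y_pred, average="macro"),
        "F1": f1_score(y_true, y_pred, average="macro"),
    }
```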

Comparison of performance

We comprehensively compare the VEW method with T, W, VU, EU, VE, VB, EB, VF, EF, VW, and EW. The best values for each dataset are highlighted in bold. Tables 1, 2 and 3 show the values of the four evaluation criteria for each algorithm on the three classifiers. As Table 1 shows, on DT the VEW method has clear advantages over the other methods: it wins 7, 6, 7, and 6 times on Acc, Pre, Recall, and F1, respectively. Its average Acc over the seven datasets reaches 86.47%, significantly better than the other methods, with 100% Acc on the ovarian dataset. As shown in Table 2, on the SVM the VEW method wins 6 times on each of the four evaluation criteria, with 100% Acc on the ovarian and leukemia datasets and an average Acc of 89.00%. As shown in Table 3, on LR the VEW method again wins 6 times on each of the four evaluation criteria and reaches 100% Acc on both the leukemia and ovarian datasets; its average Acc is significantly higher than that of the other methods, reaching a maximum of 89.58%. In summary, compared with the other methods, VEW has clear advantages in Acc, Pre, Recall, and F1, especially on DT, and achieves the highest average Acc on LR. This confirms that the hybrid method proposed in this paper effectively improves performance on every index.

Table S3 in the supplementary material lists the number of genes selected by each algorithm on the seven datasets. The average number of genes selected by the VEW method is the lowest. VW and EW select fewer genes than VEW on three of the datasets, but combined with the other indicators, we find that this advantage comes at the expense of other performance metrics. Moreover, on most datasets, the number of genes selected by the VEW method is only 1/4 to 1/250 of that of the other comparative methods. These experiments show that VEW combines the advantages of its constituent methods and selects fewer feature genes without sacrificing performance.

To further verify the performance advantages of the VEW method, we compare it with other advanced algorithms from the literature. Table 4 lists the comparison results, where "\" indicates missing experimental data. The results in Table 4 show that, compared with other advanced algorithms, the VEW method is also competitive in Acc.

Table 1 Performance comparison of algorithms on classifier DT.
Table 2 Performance comparison of algorithms on classifier SVM.
Table 3 Performance comparison of algorithms on classifier LR.
Table 4 Comparison between the VEW and other advanced methods in Acc.

Comparison of running time

We analyse the running time of all algorithms; Table 5 lists the average running time of each algorithm on each dataset. The results show that the EF method has the longest average running time and the VE method the shortest. The running time of VEW is less than that of T, W, VB, EB, VF, EF, VW, and EW and greater than that of VU, EU, and VE. Given the earlier analysis, the other comparison methods are clearly inferior to VEW in Acc, the number of selected genes, and the other criteria. This shows that the hybrid design of VEW improves those metrics while keeping the overall running time short.

Table 5 Comparison of the running time (unit: 10 s) between the VEW and other methods.

Biological inferences

Due to the randomness of the VEW method, multiple runs may produce results with the same performance but different feature genes. We adopt the following principles to resolve this: (1) results with high Acc across multiple classifiers are preferred; (2) when Acc is the same, the smaller feature gene subset is preferred; (3) when both Acc and the number of feature genes are the same, the gene subset selected with the highest frequency is chosen. Table 6 lists, for each dataset, the size of the optimal gene subset, the probe/UniProt IDs, and the average Acc on the different classifiers selected by VEW after 10 independent runs. To test the effectiveness of VEW in selecting cancer-related biomarkers, we perform biological inference on the best gene subsets (partial genes) selected in three of the datasets. Tables S4–S6 in the supplementary material list the probe/UniProt ID, gene name, and functional description corresponding to the best gene subset selected by VEW on these three datasets.

Table 6 Optimal subset of genes selected by the VEW.

Forgione et al. found that KMT2A is associated with ALL and that KMT2A rearrangement is a driver of highly pathogenic leukemia30. FASN is the only human lipogenic enzyme capable of de novo fatty acid synthesis and is highly expressed in cancer cells; reducing FASN expression can sensitize ALL cells to differentiation therapy31. Vojta et al. determined that MGAT5B is widely associated with a variety of cancer types, including gastric cancer, and may have potential value for disease prognosis32. Rosenblum et al. found that DPP7 plays an important role in regulating peptide hormone signalling and can serve as an emerging target for a variety of cancers, including myeloma33. ITGAX is closely related to the treatment of multiple cancers, but its correlation with myeloma needs further study34. Gao et al. found that MUC1 is a potential target for developing drugs for myeloma patients and that MUC1-based cancer vaccines can effectively prevent cancer progression and metastasis35. Similarly, PA2G4 plays an important role in the progression and spread of myeloma and can serve as a potential new therapeutic target36. These results show the validity of VEW in biological inference and the practicability of the proposed method. Among all the evaluation criteria, Acc is the most important, so we also test the performance of the VEW method when \(\alpha\) takes different values. As shown in Table S7, the algorithm performs best when \(\alpha =0.99\); therefore, we set \(\alpha =0.99\).

Conclusion

The purpose of VEW is to select effective feature genes from high-dimensional gene expression data. Unlike other similar methods, VEW is a three-stage hybrid method that combines its three constituent methods well: we quickly screen feature genes over a large range with a variance filter and ERT and then accurately screen a small range with the WOA. This improves performance and reduces time consumption. The results in Tables S4–S6 show that our method selects important tumour-related genes in multiple datasets, and the findings of other researchers verify, from a medical perspective, the effectiveness and practicability of the genes selected by VEW. The results in Tables 1, 2, 3, 4 and 5 show that VEW significantly improves performance while reducing running time. The number of genes selected by VEW on every dataset is no more than 10; Acc reaches 100% on the ovarian and leukemia datasets, and the average Acc over multiple datasets reaches 89.58%. Compared with other advanced algorithms, VEW has clear advantages in the number of selected genes, Acc, Precision, Recall, F1, and running time.

As shown in Table S7, we also test the performance of the VEW method on different datasets when \(\alpha\) takes different values, which justifies our choice of \(\alpha\). Because the variance filter is simple and efficient, we use it first to filter out redundant genes; using ERT in the second stage further narrows the scope of gene screening, increases the randomness of the screening process, and helps avoid local optima. The results of the basic WOA in the third stage also show that this design significantly improves the overall performance. We believe that adding ERT to the three-stage hybrid algorithm is the key reason for the performance improvement: ERT increases the randomness of the overall algorithm and further ranks and filters the gene subset, which also increases the screening accuracy of the WOA. However, the method proposed in this paper also has limitations. For example, the basic WOA suffers from low accuracy, slow convergence, and a tendency to become trapped in local optima. In addition, the filter used in the first stage must be chosen carefully to improve the performance of the overall algorithm. In future research, we plan to select better filter and wrapper methods and combine them with ERT to form new three-stage hybrid algorithms with improved overall performance.