Abstract

The microarray cancer data obtained by DNA microarray technology play an important role in cancer prevention, diagnosis, and treatment. However, predicting different types of tumors is a challenging task, since the sample size of microarray data is often small while the dimensionality is very high. Gene selection is an effective means of mitigating the curse of dimensionality and can boost the classification accuracy of microarray data. However, many previous gene selection methods focus on model design but neglect the correlation between different genes. In this paper, we introduce a novel unsupervised gene selection method that takes the gene correlation into consideration, named gene correlation guided gene selection (G3CS). Specifically, we calculate the covariance of different gene dimension pairs and embed it into our unsupervised gene selection model to regularize the gene selection coefficient matrix. In such a manner, redundant genes can be effectively excluded. In addition, we utilize a matrix factorization term to exploit the cluster structure of the original microarray data to assist the learning process. We design an iterative updating algorithm with a convergence guarantee to solve the resultant optimization problem. Experiments on six publicly available microarray datasets are conducted to validate the efficacy of our proposed method.

1. Introduction

During cell division and growth, abnormal changes often happen to genes, which results in various cancers. With the rapid development of biomedical technologies [1], DNA microarray technology has emerged, and large amounts of microarray data can be obtained for cancer prevention, diagnosis, and treatment [2–12]. For various microarray data, classifying the different types of tumors is an important but challenging task due to the high dimensionality and small number of samples [13–15]: a small number of data samples with a large number of genes can easily result in the "curse of dimensionality" and overfitting problems for data processing and learning models. When the dimension of the samples is too high, the distance between any two samples becomes an unreliable measure of similarity. Therefore, the classification task for this kind of data is often challenging. However, existing biological experiments have verified that only a very small proportion of genes contribute significantly to biological processes and disease indication. Directly processing the original high-dimensional microarray data not only degrades the classification performance but also imposes an extra computational burden on the hardware. Therefore, it is necessary to select a subset of discriminative genes from high-dimensional microarray data to serve subsequent tasks [16–25]. If we treat each gene as a feature dimension of the microarray data, gene selection is similar to the feature selection task in the machine learning and data mining community [26–37]. In fact, many feature selection methods can be used directly for gene selection. Therefore, mathematical gene selection methods can also be grouped into three classes, i.e., filter methods, wrapper methods, and embedded methods.

Filter methods often measure the importance of different genes in a straightforward manner based on certain criteria such as the $t$-test [38, 39], $F$-score [40], signal-to-noise ratio (SNR) [41], Laplacian score [42], mutual information [43], and information gain [44]. In [41], Golub et al. first used the SNR function to evaluate the weights of the genes. Many traditional feature selection methods such as ReliefF [45] and MRMR [46] have also been combined and used for gene selection [47]. Since filter methods only depend on the intrinsic properties of the original data [48], a good ranking criterion function is very important.

As to wrapper methods, various classification algorithms are often used as the fitness evaluation to determine the subset of genes, and the selected genes can in turn enhance the classification performance [2, 49–56]. In general, wrapper methods can obtain better results than filter methods, but at a much higher computational cost. Many evolutionary algorithms such as the genetic algorithm (GA), differential evolution (DE), ant colony optimization (ACO), and simulated annealing are commonly used as wrapper methods for gene selection [57, 58].

For embedded methods, the geometric structure and intrinsic properties of the data are exploited to construct gene selection models. Among this kind of method, mathematical regularization terms with specific physical meanings, such as representativeness and sparsity, are commonly used assumptions. Typical models include self-representation [32, 33, 59–62], low-rank representation [63, 64], and matrix factorization [65–67]. Based on these basic models, many variants have been proposed, such as Laplacian graph regularized low-rank representation [63]. Considering the robustness to outliers, Wang et al. [66] proposed a robust $\ell_{2,1}$-norm regularized characteristic gene selection method. In [68], Guo et al. proposed to identify disease-associated genes by utilizing an ensemble consensus-guided unsupervised feature selection method. In an unsupervised setting, the major prior information that can be used is the intrinsic local geometric structure of the data. Therefore, embedded methods that use this prior information can achieve good performance on various microarray datasets and have attracted more and more attention.

Although many computational methods have been proposed for gene selection and have achieved great success, most of them focus on the relations between data samples while the correlation between different genes is ignored. For a given microarray data matrix, the expression values of different genes are interrelated. Therefore, we propose to calculate the correlation of gene pairs to regularize the gene selection model, and we name the resulting method gene correlation guided gene selection (G3CS). In detail, in order to exclude redundant genes, the covariance of different gene dimension pairs is calculated and embedded into our unsupervised gene selection model to regularize the gene selection coefficient matrix. In addition, we utilize a matrix factorization model that can capture the cluster structure of the original data to assist the learning process. We design an iterative updating algorithm to solve the resultant problem. Finally, experiments on six publicly available real microarray datasets are conducted to demonstrate that the proposed G3CS consistently performs better than other state-of-the-art computational gene selection methods in terms of microarray data classification. In Figure 1, we give a brief flowchart of our proposed G3CS model.

2. Related Works

In this section, we introduce some gene selection works that are most related to our proposed method. Before that, we first present some notations that will be used in the following sections. Throughout this paper, matrices and vectors are denoted by boldface capital letters and boldface lowercase letters, respectively. Given an $m \times n$ matrix $\mathbf{A}$, $a_{ij}$ represents its $(i, j)$-th element, and $\mathbf{a}_{i:}$ and $\mathbf{a}_{:j}$ denote its $i$-th row and $j$-th column, respectively. $\mathbf{A}^T$ is the transpose of $\mathbf{A}$. If $\mathbf{A}$ is square, $\mathrm{Tr}(\mathbf{A})$ is the trace of $\mathbf{A}$. $\mathbf{I}_n$ denotes an identity matrix of size $n \times n$. $\mathbf{1}$ is a vector with all elements equal to 1. $\|\mathbf{A}\|_{2,1} = \sum_{i=1}^{m}\sqrt{\sum_{j=1}^{n} a_{ij}^2}$ denotes the $\ell_{2,1}$-norm of matrix $\mathbf{A}$, which is used to constrain the row sparsity of $\mathbf{A}$. $\|\mathbf{A}\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$ is the well-known Frobenius norm of $\mathbf{A}$.
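
To make the notation concrete, the following minimal NumPy sketch (our illustration, not code from the paper) computes the two matrix norms defined above:

```python
import numpy as np

def l21_norm(A):
    # ||A||_{2,1}: sum of the l2-norms of the rows of A; rows with
    # small norm correspond to gene dimensions that can be discarded.
    return np.sum(np.linalg.norm(A, axis=1))

def frobenius_norm(A):
    # ||A||_F: square root of the sum of all squared entries.
    return np.linalg.norm(A, "fro")

A = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
print(l21_norm(A))         # 5.0 + 0.0 + 1.0 = 6.0
print(frobenius_norm(A))   # sqrt(9 + 16 + 1) = sqrt(26)
```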

Since our proposed G3CS belongs to the embedded method, we give a brief review about some related embedded methods.

2.1. GRSL-GS

In [20], Tang et al. proposed a manifold regularized subspace learning model for gene selection, in which the original high-dimensional microarray data are projected into a lower-dimensional subspace, and the original genes are then constrained to be well represented by the selected gene subset. In order to capture the local manifold structure of the original data, a Laplacian graph regularization term is imposed on the low-dimensional data space. Finally, the learned projection matrix can be regarded as an importance indicator of the different genes. Specifically, the mathematical model of GRSL-GS can be formulated as follows:

$$\min_{\mathbf{W},\mathbf{A}}\ \|\mathbf{X}-\mathbf{A}\mathbf{W}^T\mathbf{X}\|_F^2 + \lambda\,\mathrm{Tr}\big(\mathbf{W}^T\mathbf{X}\mathbf{L}\mathbf{X}^T\mathbf{W}\big), \tag{1}$$

where $\mathbf{W}$ denotes the projection matrix, $\mathbf{A}$ represents the data reconstruction coefficient matrix, and $\mathbf{L}$ is the Laplacian matrix calculated from the original data. $\lambda$ is a hyperparameter that balances the two terms. The first term in Eq. (1) constrains that the original microarray data can be reconstructed from the projected lower-dimensional gene dictionary, and the second term is the graph Laplacian regularization term used to preserve the intrinsic local manifold structure of the original data samples. Although GRSL-GS captures the local structure information, it does not exploit the gene correlation.

2.2. AHEDL

Considering that the graph Laplacian in GRSL-GS can only capture pairwise sample relationships, Zheng et al. [22] introduced a computational gene selection model via adaptive hypergraph embedded dictionary learning (AHEDL). Similar to GRSL-GS, AHEDL also learns a dictionary from the original high-dimensional microarray data, and the learned dictionary is then used to represent the original data through a reconstruction coefficient matrix. The difference between the dictionary learning in GRSL-GS and AHEDL is that GRSL-GS uses a projection process, while AHEDL directly utilizes a traditional dictionary learning model. The $\ell_{2,1}$-norm is imposed on the coefficient matrix for selecting discriminative genes.

In addition, the hypergraph is also learned in an adaptive manner. In a nutshell, AHEDL can be formulated as follows:

$$\min_{\mathbf{D},\mathbf{A},\mathbf{L}_h}\ \|\mathbf{X}-\mathbf{D}\mathbf{A}\|_F^2 + \lambda_1\|\mathbf{A}\|_{2,1} + \lambda_2\,\mathrm{Tr}\big(\mathbf{A}\mathbf{L}_h\mathbf{A}^T\big), \tag{2}$$

where $\mathbf{D}$ is the learned dictionary, $\mathbf{A}$ is the representation coefficient matrix, $\mathbf{L}_h$ is the adaptively learned hypergraph Laplacian, and $\lambda_1$ and $\lambda_2$ are balancing hyperparameters.

As can be seen from Eq. (2), AHEDL integrates adaptive hypergraph learning, dictionary learning, and gene selection into a uniform framework. The dictionary matrix $\mathbf{D}$, the representation coefficient matrix $\mathbf{A}$, and the hypergraph can constrain each other during the optimization process to reach their optima. Since $\mathbf{A}$ can be regarded as the new representation of $\mathbf{X}$ in the dictionary space, the row sparsity imposed on $\mathbf{A}$ by the $\ell_{2,1}$-norm can be used to measure the importance of the gene dimensions in the learned dictionary space.

3. Proposed Method

Given a microarray data matrix $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$, which contains $n$ data samples with $d$ different genes, gene selection aims to select a gene subset that contains only a small number of genes for subsequent tasks. Without sample label information, we should exploit the intrinsic structure of the data as much as possible. In this work, we deploy the traditional regression model as the basic architecture to formulate G3CS, which can be represented as follows:

$$\min_{\mathbf{P},\mathbf{F}}\ \|\mathbf{X}^T\mathbf{P}-\mathbf{F}\|_F^2 + \beta\|\mathbf{P}\|_{2,1}, \tag{3}$$

where $\mathbf{P} \in \mathbb{R}^{d \times c}$ is a projection matrix that projects the original data into the label space $\mathbf{F} \in \mathbb{R}^{n \times c}$, whose $i$-th row $\mathbf{f}_{i:}$ is the cluster indicator vector corresponding to $\mathbf{x}_i$. In order to measure the importance of different genes, we impose the $\ell_{2,1}$-norm on $\mathbf{P}$ to constrain that important genes contribute more during the projection process. In the machine learning and data mining community, matrix factorization of a target matrix often shows remarkable performance [67, 69]. In our G3CS model, we also decompose the target matrix into two components, i.e., $\mathbf{F}$ and $\mathbf{Z}$. As a result, Eq. (3) can be rewritten in the following form with appropriate constraints:

$$\min_{\mathbf{P},\mathbf{F},\mathbf{Z}}\ \|\mathbf{X}^T\mathbf{P}-\mathbf{F}\mathbf{Z}\|_F^2 + \beta\|\mathbf{P}\|_{2,1}, \quad \text{s.t. } \mathbf{F}^T\mathbf{F}=\mathbf{I}_c,\ \mathbf{F}\ge 0,\ \mathbf{Z}\mathbf{Z}^T=\mathbf{I}_c,\ \mathbf{Z}\ge 0, \tag{4}$$

where $\mathbf{F}^T\mathbf{F}=\mathbf{I}_c$ constrains each column of $\mathbf{F}$ to be independent of the others, and $\mathbf{F}\ge 0$ is a relaxation constraint that, together with the orthogonality, makes each row of $\mathbf{F}$ have only one nonzero element; analogous constraints are imposed on $\mathbf{Z}$. The constraints in Eq. (4) make the model conduct orthogonal clustering, which works well for unsupervised feature selection [70]. However, minimizing Eq. (4) directly for gene selection neglects the gene correlation information, which is important in biomedical processes. In this work, we embed the gene correlation information into G3CS. It is well known that, in probability theory and statistics, a covariance matrix is a square matrix giving the covariance between each pair of elements of a given random vector. In this work, we use the covariance to calculate the correlation of different gene pairs; then, we can get a symmetric positive semidefinite covariance matrix $\mathbf{C} \in \mathbb{R}^{d \times d}$. The $(i, j)$-th entry of the covariance matrix can be calculated as follows:

$$c_{ij} = \frac{1}{n}\sum_{k=1}^{n}\big(x_{ik}-\mu_i\big)\big(x_{jk}-\mu_j\big), \tag{5}$$

where $\boldsymbol{\mu} = [\mu_1, \mu_2, \ldots, \mu_d]^T$ is the gene average vector, which is calculated as follows:

$$\boldsymbol{\mu} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k = \frac{1}{n}\mathbf{X}\mathbf{1}. \tag{6}$$

However, the diagonal elements in $\mathbf{C}$ only reflect the relationship between a gene dimension and itself, which makes no sense in our model. Therefore, we adjust $\mathbf{C}$ to get a new correlation matrix $\mathbf{M}$ by the following equation:

$$m_{ij} = \begin{cases} c_{ij}, & i \neq j, \\ 0, & i = j. \end{cases} \tag{7}$$

In such a manner, the $i$-th row $\mathbf{m}_{i:}$ of $\mathbf{M}$ represents the correlation between the $i$-th gene dimension and all other gene dimensions. Then, $\mathbf{M}$ can be embedded into Eq. (4) to emphasize the independence of the selected gene dimensions from the perspective of data information. Therefore, we have

$$\min_{\mathbf{P},\mathbf{F},\mathbf{Z}}\ \|\mathbf{X}^T\mathbf{P}-\mathbf{F}\mathbf{Z}\|_F^2 + \alpha\,\mathrm{Tr}\big(\mathbf{P}^T\mathbf{M}\mathbf{P}\big) + \beta\|\mathbf{P}\|_{2,1}, \quad \text{s.t. } \mathbf{F}^T\mathbf{F}=\mathbf{I}_c,\ \mathbf{F}\ge 0,\ \mathbf{Z}\mathbf{Z}^T=\mathbf{I}_c,\ \mathbf{Z}\ge 0, \tag{8}$$

where the term $\mathrm{Tr}(\mathbf{P}^T\mathbf{M}\mathbf{P})$ penalizes assigning large projection weights to strongly correlated gene pairs at the same time, so that redundant genes tend to be excluded.
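
As an illustration of Eqs. (5)–(7), the following NumPy sketch (our own reading of the text, with $\mathbf{X}$ stored genes-by-samples as above) computes the covariance matrix $\mathbf{C}$ and the zero-diagonal correlation matrix $\mathbf{M}$:

```python
import numpy as np

def gene_correlation(X):
    """X: (d, n) matrix, d genes measured over n samples.
    Returns the covariance matrix C (Eq. (5)) and the adjusted
    correlation matrix M (Eq. (7)) whose diagonal is zeroed."""
    n = X.shape[1]
    mu = X.mean(axis=1, keepdims=True)     # gene average vector, Eq. (6)
    Xc = X - mu                            # center every gene dimension
    C = (Xc @ Xc.T) / n                    # covariance of all gene pairs
    M = C - np.diag(np.diag(C))            # drop gene-vs-itself entries
    return C, M

# toy data: 100 genes, 20 samples
C, M = gene_correlation(np.random.default_rng(0).normal(size=(100, 20)))
print(C.shape, np.allclose(np.diag(M), 0.0))   # (100, 100) True
```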

In addition, the local geometric structure information of original data should be preserved as much as possible in the learned new space . By using the Gaussian kernel function, we can get a similarity matrix from original data by the following equation:

where $\mathcal{N}_k(\mathbf{x}_i)$ represents the set of $k$ nearest neighbors of $\mathbf{x}_i$, and $\sigma$ is a width parameter; $k$ and $\sigma$ are set to 5 and 0.5, respectively, in our experiments. In our G3CS model, we require that if two data samples are close to each other in the original space, their cluster indicator vectors in the new space should also be close. This constraint can be formulated in the following form:

$$\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} s_{ij}\,\|\mathbf{f}_{i:}-\mathbf{f}_{j:}\|_2^2 = \mathrm{Tr}\big(\mathbf{F}^T\mathbf{L}\mathbf{F}\big), \tag{10}$$

where $\mathbf{L} = \mathbf{D}_S - \mathbf{S}$ is the Laplacian matrix corresponding to $\mathbf{S}$, and $\mathbf{D}_S$ is a diagonal degree matrix with $(\mathbf{D}_S)_{ii} = \sum_{j} s_{ij}$. Finally, we get the mathematical formulation of our G3CS model as follows:

$$\min_{\mathbf{P},\mathbf{F},\mathbf{Z}}\ \|\mathbf{X}^T\mathbf{P}-\mathbf{F}\mathbf{Z}\|_F^2 + \alpha\,\mathrm{Tr}\big(\mathbf{P}^T\mathbf{M}\mathbf{P}\big) + \beta\|\mathbf{P}\|_{2,1} + \gamma\,\mathrm{Tr}\big(\mathbf{F}^T\mathbf{L}\mathbf{F}\big), \quad \text{s.t. } \mathbf{F}^T\mathbf{F}=\mathbf{I}_c,\ \mathbf{F}\ge 0,\ \mathbf{Z}\mathbf{Z}^T=\mathbf{I}_c,\ \mathbf{Z}\ge 0, \tag{11}$$

where $\alpha$, $\beta$, and $\gamma$ are three hyperparameters that balance the different regularization terms. In summary, Eq. (11) integrates regression, matrix factorization, gene correlation, and exploitation of the local data structure into a unified framework. The gene correlation term regularizes the model to exclude redundant gene dimensions.
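
The graph construction in Eqs. (9) and (10) can be sketched as follows; this is a plain NumPy version written by us under the settings stated above ($k = 5$, $\sigma = 0.5$), not the authors' implementation:

```python
import numpy as np

def knn_gaussian_graph(X, k=5, sigma=0.5):
    """X: (d, n) data matrix, columns are samples.
    Returns the symmetric kNN similarity S (Eq. (9)) and the
    graph Laplacian L = D_S - S used in Eq. (10)."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)                       # squared sample norms
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    dist2 = np.maximum(dist2, 0.0)                    # guard tiny negative values
    np.fill_diagonal(dist2, np.inf)                   # a sample is not its own neighbor
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist2[i])[:k]               # k nearest neighbors of sample i
        S[i, nbrs] = np.exp(-dist2[i, nbrs] / (2.0 * sigma ** 2))
    S = np.maximum(S, S.T)                            # i~j if either is a neighbor of the other
    L = np.diag(S.sum(axis=1)) - S                    # unnormalized graph Laplacian
    return S, L
```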

4. Optimization Algorithm

There are three variables in Eq. (11) that need to be optimized, and we cannot obtain closed-form solutions for all of them simultaneously. Therefore, we design an algorithm that updates these variables iteratively: each time, we update one variable while fixing the others.

4.1. Optimize $\mathbf{P}$

When the other variables are fixed, solving for $\mathbf{P}$ is equivalent to the following problem:

$$\min_{\mathbf{P}}\ \|\mathbf{X}^T\mathbf{P}-\mathbf{F}\mathbf{Z}\|_F^2 + \alpha\,\mathrm{Tr}\big(\mathbf{P}^T\mathbf{M}\mathbf{P}\big) + \beta\|\mathbf{P}\|_{2,1}. \tag{12}$$

By taking the derivative of Eq. (12) with respect to $\mathbf{P}$ and setting it to zero, we have

$$\mathbf{X}\mathbf{X}^T\mathbf{P} - \mathbf{X}\mathbf{F}\mathbf{Z} + \alpha\mathbf{M}\mathbf{P} + \beta\mathbf{D}\mathbf{P} = \mathbf{0}. \tag{13}$$

Then, we have the closed-form solution of $\mathbf{P}$ as follows:

$$\mathbf{P} = \big(\mathbf{X}\mathbf{X}^T + \alpha\mathbf{M} + \beta\mathbf{D}\big)^{-1}\mathbf{X}\mathbf{F}\mathbf{Z}, \tag{14}$$

where $\mathbf{D}$ is a diagonal matrix with $d_{ii} = \frac{1}{2\|\mathbf{p}_{i:}\|_2 + \varepsilon}$, and $\varepsilon$ is a small constant that avoids division by zero. At each iteration, $\mathbf{P}$ and $\mathbf{D}$ can be updated alternately.
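
The update in Eq. (14), together with the reweighting of $\mathbf{D}$, is essentially one iteratively reweighted least squares step. A minimal sketch of ours (the small `eps` guards the division, playing the role of the constant $\varepsilon$ in Algorithm 1):

```python
import numpy as np

def update_P(X, F, Z, M, P_prev, alpha, beta, eps=1e-7):
    """One closed-form update of P (Eq. (14)).
    X: (d, n) genes-by-samples, F: (n, c), Z: (c, c), M: (d, d)."""
    row_norms = np.linalg.norm(P_prev, axis=1)      # ||p_i||_2 for each gene row
    D = np.diag(1.0 / (2.0 * row_norms + eps))      # diagonal reweighting matrix D
    A = X @ X.T + alpha * M + beta * D              # (d, d) system matrix
    return np.linalg.solve(A, X @ (F @ Z))          # solves A P = X F Z
```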

4.2. Optimize $\mathbf{F}$

When fixing the other variables, the optimization problem for $\mathbf{F}$ is equivalent to the following equation:

$$\min_{\mathbf{F}}\ \|\mathbf{X}^T\mathbf{P}-\mathbf{F}\mathbf{Z}\|_F^2 + \gamma\,\mathrm{Tr}\big(\mathbf{F}^T\mathbf{L}\mathbf{F}\big), \quad \text{s.t. } \mathbf{F}^T\mathbf{F}=\mathbf{I}_c,\ \mathbf{F}\ge 0. \tag{15}$$

By expanding the Frobenius norm and dropping the term $\|\mathbf{X}^T\mathbf{P}\|_F^2$, which is constant with respect to $\mathbf{F}$, Eq. (15) is equivalent to

$$\min_{\mathbf{F}}\ \mathrm{Tr}\big(\mathbf{F}\mathbf{Z}\mathbf{Z}^T\mathbf{F}^T\big) - 2\,\mathrm{Tr}\big(\mathbf{F}^T\mathbf{X}^T\mathbf{P}\mathbf{Z}^T\big) + \gamma\,\mathrm{Tr}\big(\mathbf{F}^T\mathbf{L}\mathbf{F}\big), \quad \text{s.t. } \mathbf{F}^T\mathbf{F}=\mathbf{I}_c,\ \mathbf{F}\ge 0. \tag{16}$$

Since $\mathbf{Z}$ is an orthogonal matrix, i.e., $\mathbf{Z}\mathbf{Z}^T=\mathbf{I}_c$, adding the constant matrix term $\mathrm{Tr}(\mathbf{B}\mathbf{B}^T)$ back into a Frobenius norm yields

$$\min_{\mathbf{F}}\ \|\mathbf{F}-\mathbf{B}\|_F^2 + \gamma\,\mathrm{Tr}\big(\mathbf{F}^T\mathbf{L}\mathbf{F}\big), \quad \text{s.t. } \mathbf{F}^T\mathbf{F}=\mathbf{I}_c,\ \mathbf{F}\ge 0, \tag{17}$$

where $\mathbf{B} = \mathbf{X}^T\mathbf{P}\mathbf{Z}^T$. In order to ensure the orthogonal constraint on $\mathbf{F}$, we add a large positive constant $\lambda$ and the optimization problem can be converted to

$$\min_{\mathbf{F}\ge 0}\ \|\mathbf{F}-\mathbf{B}\|_F^2 + \gamma\,\mathrm{Tr}\big(\mathbf{F}^T\mathbf{L}\mathbf{F}\big) + \frac{\lambda}{2}\big\|\mathbf{F}^T\mathbf{F}-\mathbf{I}_c\big\|_F^2. \tag{18}$$

By setting the derivative of Eq. (18) with respect to $\mathbf{F}$ to 0, we have

$$\mathbf{F}-\mathbf{B}+\gamma\mathbf{L}\mathbf{F}+\lambda\mathbf{F}\big(\mathbf{F}^T\mathbf{F}-\mathbf{I}_c\big)=\mathbf{0}; \tag{19}$$

then, writing $\mathbf{L}=\mathbf{D}_S-\mathbf{S}$ with $\mathbf{D}_S$ the degree matrix of $\mathbf{S}$, $\mathbf{F}$ can be updated by the following multiplicative rule in each iteration:

$$f_{ij} \leftarrow f_{ij}\,\frac{\big(\mathbf{B}+\gamma\,\mathbf{S}\mathbf{F}+\lambda\mathbf{F}\big)_{ij}}{\big(\mathbf{F}+\gamma\,\mathbf{D}_S\mathbf{F}+\lambda\,\mathbf{F}\mathbf{F}^T\mathbf{F}\big)_{ij}}, \tag{20}$$

which keeps $\mathbf{F}$ nonnegative throughout the iterations.
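
A sketch of the multiplicative rule in Eq. (20); here we split $\mathbf{L} = \mathbf{D}_S - \mathbf{S}$ and the possibly signed matrix $\mathbf{B}$ into nonnegative parts so the update keeps $\mathbf{F}$ nonnegative (the splitting of $\mathbf{B}$ is our own safeguard, and `lam` is an assumed value for the large constant $\lambda$):

```python
import numpy as np

def update_F(X, P, Z, S, F, gamma, lam=1e6, eps=1e-12):
    """One multiplicative update of F (Eq. (20)); lam plays the
    role of the large orthogonality constant lambda."""
    B = X.T @ P @ Z.T                               # B = X^T P Z^T
    D_S = np.diag(S.sum(axis=1))                    # degree matrix of S
    numer = np.maximum(B, 0) + gamma * (S @ F) + lam * F
    denom = np.maximum(-B, 0) + F + gamma * (D_S @ F) + lam * (F @ F.T @ F) + eps
    return F * (numer / denom)                      # entrywise, keeps F >= 0
```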

4.3. Optimize $\mathbf{Z}$

When fixing the other variables, the optimization problem for $\mathbf{Z}$ is equivalent to the following equation:

$$\min_{\mathbf{Z}}\ \|\mathbf{X}^T\mathbf{P}-\mathbf{F}\mathbf{Z}\|_F^2, \quad \text{s.t. } \mathbf{Z}\mathbf{Z}^T=\mathbf{I}_c,\ \mathbf{Z}\ge 0. \tag{21}$$

We add a penalty term with a large positive weight $\eta$ for the constraint $\mathbf{Z}\mathbf{Z}^T=\mathbf{I}_c$ and a Lagrange multiplier matrix $\boldsymbol{\Phi}$ for the constraint $\mathbf{Z}\ge 0$. Then, the Lagrange function for Eq. (21) can be written as follows:

$$\mathcal{L}(\mathbf{Z}) = \|\mathbf{X}^T\mathbf{P}-\mathbf{F}\mathbf{Z}\|_F^2 + \frac{\eta}{2}\big\|\mathbf{Z}\mathbf{Z}^T-\mathbf{I}_c\big\|_F^2 - \mathrm{Tr}\big(\boldsymbol{\Phi}\mathbf{Z}^T\big). \tag{22}$$

By setting the derivative of Eq. (22) with respect to $\mathbf{Z}$ to 0, we have

$$2\mathbf{F}^T\big(\mathbf{F}\mathbf{Z}-\mathbf{X}^T\mathbf{P}\big) + 2\eta\big(\mathbf{Z}\mathbf{Z}^T-\mathbf{I}_c\big)\mathbf{Z} - \boldsymbol{\Phi} = \mathbf{0}. \tag{23}$$

According to the Karush-Kuhn-Tucker condition $\phi_{ij}\,z_{ij} = 0$, we have the following multiplicative update rule:

$$z_{ij} \leftarrow z_{ij}\,\frac{\big(\mathbf{F}^T\mathbf{X}^T\mathbf{P}+\eta\mathbf{Z}\big)_{ij}}{\big(\mathbf{F}^T\mathbf{F}\mathbf{Z}+\eta\,\mathbf{Z}\mathbf{Z}^T\mathbf{Z}\big)_{ij}}. \tag{24}$$
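
Likewise, a sketch of the KKT-derived rule in Eq. (24); `eta` stands for the orthogonality penalty weight $\eta$, and splitting the signed term is again our own safeguard:

```python
import numpy as np

def update_Z(X, P, F, Z, eta=1e6, eps=1e-12):
    """One multiplicative update of Z (Eq. (24)), keeping Z >= 0."""
    T = F.T @ (X.T @ P)                             # F^T X^T P, shape (c, c)
    numer = np.maximum(T, 0) + eta * Z
    denom = np.maximum(-T, 0) + F.T @ F @ Z + eta * (Z @ Z.T @ Z) + eps
    return Z * (numer / denom)
```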

After solving the resultant optimization problem described by Eq. (11), we can measure the importance of each gene dimension by calculating the $\ell_2$-norm of the corresponding row of $\mathbf{P}$. We summarize the optimization procedure of the G3CS model in Algorithm 1.

Input: Microarray data matrix $\mathbf{X}$, parameters $\alpha$, $\beta$, and $\gamma$, and a small constant $\varepsilon = 10^{-7}$.
Initialize: $\mathbf{M}$, $\mathbf{S}$, $\mathbf{F}$, and $\mathbf{Z}$.
While not converged do
1. Update $\mathbf{P}$ via Eq. (14);
2. Update $\mathbf{F}$ via Eq. (20);
3. Update $\mathbf{Z}$ by solving Eq. (24);
4. Check the convergence condition: stop when the change of the objective value of Eq. (11) between two consecutive iterations is smaller than $\varepsilon$.
End while
Output: $\mathbf{P}$.
Gene selection: Sort the $\ell_2$-norms of the rows of $\mathbf{P}$ in descending order and select the largest $K$ values. The gene dimensions corresponding to the largest $K$ values are selected to form the gene subset.
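
Putting the pieces together, Algorithm 1 can be sketched as the loop below. It reuses the helper functions sketched in Sections 3 and 4 (`gene_correlation`, `knn_gaussian_graph`, `update_P`, `update_F`, `update_Z`); the random nonnegative initialization of $\mathbf{F}$ and $\mathbf{Z}$ is a hypothetical choice of ours, since the paper does not spell out its initialization:

```python
import numpy as np

def g3cs(X, c, alpha, beta, gamma, eps=1e-7, max_iter=200):
    """Sketch of Algorithm 1. X: (d, n) genes-by-samples matrix.
    Returns P, whose row l2-norms rank the d genes."""
    d, n = X.shape
    rng = np.random.default_rng(0)
    _, M = gene_correlation(X)              # gene correlation matrix, Eq. (7)
    S, L = knn_gaussian_graph(X)            # sample graph, Eqs. (9)-(10)
    F = np.abs(rng.normal(size=(n, c)))     # hypothetical nonnegative init
    Z = np.abs(rng.normal(size=(c, c)))
    P = rng.normal(size=(d, c))
    prev_obj = np.inf
    for _ in range(max_iter):
        P = update_P(X, F, Z, M, P, alpha, beta)    # step 1, Eq. (14)
        F = update_F(X, P, Z, S, F, gamma)          # step 2, Eq. (20)
        Z = update_Z(X, P, F, Z)                    # step 3, Eq. (24)
        obj = (np.linalg.norm(X.T @ P - F @ Z, "fro") ** 2
               + alpha * np.trace(P.T @ M @ P)
               + beta * np.sum(np.linalg.norm(P, axis=1))
               + gamma * np.trace(F.T @ L @ F))     # objective of Eq. (11)
        if abs(prev_obj - obj) < eps:               # step 4: convergence check
            break
        prev_obj = obj
    return P

def select_genes(P, K):
    # rank genes by the l2-norm of the corresponding row of P
    return np.argsort(-np.linalg.norm(P, axis=1))[:K]
```

A natural choice for `c` here would be the number of tumor classes in the dataset at hand.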

The proposed algorithm converges well as the number of iterations increases. In our experiments, we stop the optimization process when the change of the objective function value between two consecutive iterations is very small, and this yields good results.

5. Experimental Results

In this section, extensive experiments are conducted on several real microarray datasets to validate the efficacy of the proposed G3CS. In order to demonstrate that the gene subset selected by G3CS can obtain better classification results, we use three kinds of classification algorithms, including Support Vector Machine (SVM), Random Forest (RF), and $k$-nearest neighbor (KNN), to evaluate the gene subsets selected by the different gene selection methods.

5.1. Microarray Datasets

Six publicly available microarray datasets are used in our experiments: colon cancer (colon) [71], B-cell chronic lymphocytic leukemia (CLL SUB 111), breast, lung, tumors-11, and global cancer map (GCM). (CLL SUB 111 and lung can be downloaded from http://featureselection.asu.edu/datasets.php; breast and GCM can be downloaded from http://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi; tumors-11 can be downloaded from http://datam.i2r.a-star.edu.sg/datasets/krbd/index.html.) These datasets are used to test the performance of the proposed G3CS and the gene selection methods used for comparison. They were collected for the diagnosis of different kinds of cancers, such as colon cancer, lung cancer, Ewing's family of tumors, non-Hodgkin lymphoma, rhabdomyosarcoma, and prostate cancer. For instance, CLL SUB 111 contains high-density oligonucleotide arrays which can be used to identify molecular correlates of genetically and clinically distinct subgroups of B-cell chronic lymphocytic leukemia (B-CLL). Lung is a dataset used to determine whether global biological differences underlie common pathological features of prostate cancer and to identify genes that might anticipate the clinical behaviour of this disease.

It should be noted that all six datasets are typical high-dimensional gene datasets: in each dataset, the number of genes is much larger than the number of samples, which poses a challenge for many practical tasks. In Table 1, we give a brief description of these datasets.

5.2. Experimental Settings

In the proposed G3CS, three parameters need to be adjusted, i.e., $\alpha$, $\beta$, and $\gamma$. In our experiments, we tuned their values by a grid search. In addition, since the optimal number of selected genes is also unknown, we set different numbers of selected genes for each dataset, and the final best results obtained under the optimal parameter setting were reported. In our experiments, the number of selected genes was tuned over $\{10, 20, 30, 40, 50\}$ for each dataset. For each gene subset, the three abovementioned basic classification methods were used to classify the microarray data in order to test the discriminative power of the selected genes. In order to validate the efficacy of the proposed G3CS, we compare it with six other state-of-the-art gene selection methods:

(i) $t$-test [72], a traditional filter-based gene selection method that relies on statistical hypothesis testing;
(ii) RLR [73], which is based on the linear discriminant analysis criterion; the class centroid is estimated to define both the between-class separability and the within-class compactness;
(iii) WLMGS (Weight Local Modularity based Gene Selection) [74], which uses the weight local modularity of a weighted sample graph to evaluate the discriminative power of a gene subset;
(iv) LNNFW [75], which uses the $k$-nearest neighbors rule to minimize the within-class distances and maximize the between-class distances;
(v) GRSL-GS [20], which is based on subspace learning and manifold regularization;
(vi) AHEDL [22], which is based on dictionary learning theory with adaptive hypergraph learning and $\ell_{2,1}$ regularization.

As to WLMGS and GRSL-GS, we set the number of nearest neighbors for constructing the sample graph to 5. The kernel width used in the Gaussian kernel function and the other regularization parameters in GRSL-GS and RLR were tuned with 5-fold cross-validation (CV). For the remaining parameters of the other methods, we used the settings recommended in the corresponding references. We ran all implementations on a desktop computer with an Intel Core i5-4200M 2.5 GHz CPU and 8 GB RAM.
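
As an illustration of this evaluation protocol, here is a short scikit-learn sketch (ours; the expression matrix, labels, and the gene ranking produced by a selection method are assumed to be given):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def evaluate_subset(X, y, selected, n_folds=5):
    """X: (n_samples, n_genes) expression matrix (note: samples in rows here),
    y: class labels, selected: indices of the chosen genes."""
    clf = SVC(kernel="linear")
    scores = cross_val_score(clf, X[:, selected], y, cv=n_folds)
    return scores.mean()

# e.g., sweep the subset size from 10 to 50 genes as in Section 5.4:
# for K in (10, 20, 30, 40, 50):
#     acc = evaluate_subset(X, y, ranking[:K])
```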

5.3. Experimental Comparison of Different Methods

In order to verify the superiority of the proposed G3CS, we compare it with the other six state-of-the-art gene selection methods on the different datasets. For each dataset, we obtain 5 different gene subsets, with the number of selected genes varying from 10 to 50. For each gene subset, the three classifiers and 5-fold CV are used for classification performance evaluation, and we report the average accuracy over the 5 CV folds in Table 2. We mark the best results in bold font for clear comparison. As can be seen from the results, the proposed G3CS consistently outperforms the other methods in terms of average classification accuracy, which demonstrates that G3CS can effectively select more discriminative genes from the original high-dimensional microarray data for the classification task.

5.4. Classification Accuracy with Different Numbers of Selected Genes

Since the optimal number of selected genes for each dataset is hard to determine, we investigate the classification performance of the different methods on each dataset with different numbers of selected genes. We plot the classification accuracy curves of the different methods on the different datasets with varied numbers of selected genes in Figures 2–7. For each method and each dataset, we plot the average classification accuracy over the 5 CV folds obtained by the SVM classifier. As can be seen from Figures 2–7, the proposed G3CS performs steadily better than the other methods as the number of selected genes changes. Even with a small number of selected genes, our method selects more discriminative genes than the other methods, which validates that the gene subsets selected by G3CS can better serve the classification of microarray data.

6. Discussion and Conclusions

In this work, we presented a novel gene selection method that takes the gene correlation into consideration, named gene correlation guided gene selection (G3CS). In detail, we capture the correlation of different gene dimension pairs by calculating the covariance matrix from the perspective of the gene dimensions and embed it into the proposed model to regularize the learning of the gene selection coefficient matrix. In such a manner, redundant genes can be effectively excluded to reduce the redundancy of the selected genes. In addition, a matrix factorization term is utilized to exploit the cluster structure of the original microarray data to assist the learning process. We designed an iterative updating algorithm to solve the resultant optimization problem. Experiments on six publicly available microarray datasets were conducted to validate the efficacy of the proposed method. With varied numbers of selected genes, the proposed method consistently outperforms the other compared methods in terms of classification accuracy.

Data Availability

The datasets used in this work are publicly available at: http://featureselection.asu.edu/datasets.php, http://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

We would like to thank Dr. Zheng for providing their Matlab code for generating the comparison results of this paper.