Abstract

The microarray cancer data obtained by DNA microarray technology play an important role in cancer prevention, diagnosis, and treatment. However, predicting different types of tumors is a challenging task, since the sample size of microarray data is often small while the dimensionality is very high. Gene selection is an effective means of mitigating the curse of dimensionality and can boost the classification accuracy of microarray data. However, many previous gene selection methods focus on model design but neglect the correlation between different genes. In this paper, we introduce a novel unsupervised gene selection method that takes the gene correlation into consideration, named gene correlation guided gene selection (G3CS). Specifically, we calculate the covariance of different gene dimension pairs and embed it into our unsupervised gene selection model to regularize the gene selection coefficient matrix. In such a manner, redundant genes can be effectively excluded. In addition, we utilize a matrix factorization term to exploit the cluster structure of the original microarray data to assist the learning process. We design an iterative updating algorithm with a convergence guarantee to solve the resultant optimization problem. Experiments on six publicly available microarray datasets are conducted to validate the efficacy of our proposed method.

1. Introduction

During cell division and growth, abnormal changes often happen to genes, which results in various cancers. With the rapid development of biomedical technologies [1], DNA microarray technology has emerged, and large amounts of microarray data can be obtained for cancer prevention, diagnosis, and treatment [2–12]. For various microarray data, classifying the different types of tumors is an important but challenging task due to the high dimensionality and small number of samples [13–15]: a small number of data samples with a large number of genes can easily result in the "curse of dimensionality" and overfitting problems for data processing and learning models. When the dimension of the samples is too high, the distance between any two samples becomes an unreliable measure of similarity. Therefore, the classification task for this kind of data is often challenging. However, existing biological experiments have verified that only a very small proportion of genes contribute significantly to biological processes and disease indication. Directly processing the original high-dimensional microarray data not only degrades the classification performance but also imposes an extra computational burden on the hardware. Therefore, it is necessary to select a subset of discriminative genes from high-dimensional microarray data to serve subsequent tasks [16–25]. If we treat each gene as a feature dimension of the microarray data, gene selection is similar to the feature selection task in the machine learning and data mining community [26–37]. In fact, many feature selection methods can be used directly for gene selection. Therefore, mathematical gene selection methods can also be grouped into three classes, i.e., filter methods, wrapper methods, and embedded methods.

Filter methods often measure the importance of different genes in a straightforward manner based on certain criteria such as the $t$-test [38, 39], $F$-score [40], signal-to-noise ratio (SNR) [41], Laplacian score [42], mutual information [43], and information gain [44]. In [41], Golub et al. first used the SNR function to evaluate the weights of the genes. Many traditional feature selection methods such as ReliefF [45] and MRMR [46] have also been combined and used for gene selection [47]. Since filter methods only depend on the intrinsic properties of the original data [48], a good ranking criterion function is very important.

As to wrapper methods, various classification algorithms are often used as the fitness evaluation to determine the subset of genes, and the selected genes can in turn enhance the classification performance [2, 49–56]. In general, wrapper methods can obtain better results than filter methods, but at a much higher computational cost. Many evolutionary algorithms such as the genetic algorithm (GA), differential evolution (DE), ant colony optimization (ACO), and simulated annealing are commonly used as wrapper methods for gene selection [57, 58].

For embedded methods, the geometric structure and intrinsic properties of the data are exploited to construct gene selection models. Among this kind of method, mathematical regularization terms with specific physical meanings, such as representativeness and sparsity, are commonly used assumptions. Typical models include self-representation [32, 33, 59–62], low-rank representation [63, 64], and matrix factorization [65–67]. Based on these basic models, many variants have been proposed, such as Laplacian graph regularized low-rank representation [63]. Considering the robustness to outliers, Wang et al. [66] proposed a robust $\ell_{2,1}$-norm regularized characteristic gene selection method. In [68], Guo et al. proposed to identify disease-associated genes by utilizing an ensemble consensus-guided unsupervised feature selection method. In an unsupervised setting, the major prior information that can be used is the intrinsic local geometric structure of the data. Therefore, embedded methods that use this prior information can achieve good performance on various microarray datasets and have attracted more and more attention.

Although many computational methods have been proposed for gene selection and have achieved great success, most of them focus on the relations between data samples while the correlation between different genes is ignored. For a given microarray data matrix, the expression values of different genes are interrelated. Therefore, we propose to calculate the correlation of gene pairs to regularize the gene selection model, and we name the resulting method gene correlation guided gene selection (G3CS). In detail, in order to exclude redundant genes, the covariance of different gene dimension pairs is calculated and embedded into our unsupervised gene selection model to regularize the gene selection coefficient matrix. In addition, we utilize a matrix factorization model that can capture the cluster structure of the original data to assist the learning process. We design an iterative updating algorithm to solve the resultant problem. Finally, experiments on six publicly available real microarray datasets are conducted to demonstrate that the proposed G3CS consistently performs better than other state-of-the-art computational gene selection methods in terms of microarray data classification. In Figure 1, we give a brief flowchart of our proposed G3CS model.

2. Related Works

In this section, we introduce some gene selection works that are most related to our proposed method. Before that, we first present some notations that will be used in the following sections. Throughout this paper, matrices and vectors are denoted by boldface capital letters and boldface lowercase letters, respectively. Given an $m \times n$ matrix $\mathbf{A}$, $a_{ij}$ represents its $(i, j)$-th element, and $\mathbf{a}_{i:}$ and $\mathbf{a}_{:j}$ denote its $i$-th row and $j$-th column, respectively. $\mathbf{A}^T$ is the transpose of $\mathbf{A}$. If $\mathbf{A}$ is square, $\mathrm{Tr}(\mathbf{A})$ is the trace of $\mathbf{A}$. $\mathbf{I}_n$ denotes an identity matrix of size $n \times n$. $\mathbf{1}$ is a vector with all elements equal to 1. $\|\mathbf{A}\|_{2,1} = \sum_{i=1}^{m}\sqrt{\sum_{j=1}^{n} a_{ij}^2}$ denotes the $\ell_{2,1}$-norm of matrix $\mathbf{A}$, which is used to constrain the row sparsity of $\mathbf{A}$. $\|\mathbf{A}\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$ is the well-known Frobenius norm of $\mathbf{A}$.
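
To make the notation concrete, the following minimal NumPy sketch (our illustration, not code from the paper) computes the two matrix norms defined above:

```python
import numpy as np

def l21_norm(A):
    # ||A||_{2,1}: sum of the l2-norms of the rows of A; rows with
    # small norm correspond to gene dimensions that can be discarded.
    return np.sum(np.linalg.norm(A, axis=1))

def frobenius_norm(A):
    # ||A||_F: square root of the sum of all squared entries.
    return np.linalg.norm(A, "fro")

A = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
print(l21_norm(A))         # 5.0 + 0.0 + 1.0 = 6.0
print(frobenius_norm(A))   # sqrt(9 + 16 + 1) = sqrt(26)
```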

Since our proposed G3CS belongs to the embedded method, we give a brief review about some related embedded methods.

2.1. GRSL-GS

In [20], Tang et al. proposed a manifold regularized subspace learning model for gene selection, in which the original high-dimensional microarray data are projected into a lower-dimensional subspace, and the original genes are then constrained to be well represented by the selected gene subset. In order to capture the local manifold structure of the original data, a Laplacian graph regularization term is imposed on the low-dimensional data space. Finally, the learned projection matrix can be regarded as an importance indicator of the different genes. Specifically, the mathematical model of GRSL-GS can be formulated as follows:

$$\min_{\mathbf{W},\mathbf{A}}\ \|\mathbf{X}-\mathbf{A}\mathbf{W}^T\mathbf{X}\|_F^2 + \lambda\,\mathrm{Tr}\big(\mathbf{W}^T\mathbf{X}\mathbf{L}\mathbf{X}^T\mathbf{W}\big), \tag{1}$$

where $\mathbf{W}$ denotes the projection matrix, $\mathbf{A}$ represents the data reconstruction coefficient matrix, and $\mathbf{L}$ is the Laplacian matrix calculated from the original data. $\lambda$ is a hyperparameter that balances the two terms. The first term in Eq. (1) constrains that the original microarray data can be reconstructed from the projected lower-dimensional gene dictionary, and the second term is the graph Laplacian regularization term used to preserve the intrinsic local manifold structure of the original data samples. Although GRSL-GS captures the local structure information, it does not exploit the gene correlation.

2.2. AHEDL

Considering that the graph Laplacian in GRSL-GS can only capture pairwise sample relationships, Zheng et al. [22] introduced a computational gene selection model via adaptive hypergraph embedded dictionary learning (AHEDL). Similar to GRSL-GS, AHEDL also learns a dictionary from the original high-dimensional microarray data, and the learned dictionary is then used to represent the original data through a reconstruction coefficient matrix. The difference between the dictionary learning in GRSL-GS and AHEDL is that GRSL-GS uses a projection process, while AHEDL directly utilizes a traditional dictionary learning model. The $\ell_{2,1}$-norm is imposed on the coefficient matrix for selecting discriminative genes.

In addition, the hypergraph is also learned in an adaptive manner. In a nutshell, AHEDL can be formulated as follows:

$$\min_{\mathbf{D},\mathbf{A},\mathbf{L}_h}\ \|\mathbf{X}-\mathbf{D}\mathbf{A}\|_F^2 + \lambda_1\|\mathbf{A}\|_{2,1} + \lambda_2\,\mathrm{Tr}\big(\mathbf{A}\mathbf{L}_h\mathbf{A}^T\big), \tag{2}$$

where $\mathbf{D}$ is the learned dictionary, $\mathbf{A}$ is the representation coefficient matrix, $\mathbf{L}_h$ is the adaptively learned hypergraph Laplacian, and $\lambda_1$ and $\lambda_2$ are balancing hyperparameters.

As can be seen from Eq. (2), AHEDL integrates adaptive hypergraph learning, dictionary learning, and gene selection into a uniform framework. The dictionary matrix $\mathbf{D}$, the representation coefficient matrix $\mathbf{A}$, and the hypergraph can constrain each other during the optimization process to reach their optima. Since $\mathbf{A}$ can be regarded as the new representation of $\mathbf{X}$ in the dictionary space, the row sparsity imposed on $\mathbf{A}$ by the $\ell_{2,1}$-norm can be used to measure the importance of the gene dimensions in the learned dictionary space.

3. Proposed Method

Given a microarray data matrix $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$, which contains $n$ data samples with $d$ different genes, gene selection aims to select a gene subset that contains only a small number of genes for subsequent tasks. Without sample label information, we should exploit the intrinsic structure of the data as much as possible. In this work, we deploy the traditional regression model as the basic architecture to formulate G3CS, which can be represented as follows:

$$\min_{\mathbf{P},\mathbf{F}}\ \|\mathbf{X}^T\mathbf{P}-\mathbf{F}\|_F^2 + \beta\|\mathbf{P}\|_{2,1}, \tag{3}$$

where $\mathbf{P} \in \mathbb{R}^{d \times c}$ is a projection matrix that projects the original data into the label space $\mathbf{F} \in \mathbb{R}^{n \times c}$, whose $i$-th row $\mathbf{f}_{i:}$ is the cluster indicator vector corresponding to $\mathbf{x}_i$. In order to measure the importance of different genes, we impose the $\ell_{2,1}$-norm on $\mathbf{P}$ to constrain that important genes contribute more during the projection process. In the machine learning and data mining community, matrix factorization of a target matrix often shows remarkable performance [67, 69]. In our G3CS model, we also decompose the target matrix into two components, i.e., $\mathbf{F}$ and $\mathbf{Z}$. As a result, Eq. (3) can be rewritten in the following form with appropriate constraints:

$$\min_{\mathbf{P},\mathbf{F},\mathbf{Z}}\ \|\mathbf{X}^T\mathbf{P}-\mathbf{F}\mathbf{Z}\|_F^2 + \beta\|\mathbf{P}\|_{2,1}, \quad \text{s.t. } \mathbf{F}^T\mathbf{F}=\mathbf{I}_c,\ \mathbf{F}\ge 0,\ \mathbf{Z}\mathbf{Z}^T=\mathbf{I}_c,\ \mathbf{Z}\ge 0, \tag{4}$$

where $\mathbf{F}^T\mathbf{F}=\mathbf{I}_c$ constrains each column of $\mathbf{F}$ to be independent of the others, and $\mathbf{F}\ge 0$ is a relaxation constraint that, together with the orthogonality, makes each row of $\mathbf{F}$ have only one nonzero element; analogous constraints are imposed on $\mathbf{Z}$. The constraints in Eq. (4) make the model conduct orthogonal clustering, which works well for unsupervised feature selection [70]. However, minimizing Eq. (4) directly for gene selection neglects the gene correlation information, which is important in biomedical processes. In this work, we embed the gene correlation information into G3CS. It is well known that, in probability theory and statistics, a covariance matrix is a square matrix giving the covariance between each pair of elements of a given random vector. In this work, we use the covariance to calculate the correlation of different gene pairs; then, we can get a symmetric positive semidefinite covariance matrix $\mathbf{C} \in \mathbb{R}^{d \times d}$. The $(i, j)$-th entry of the covariance matrix can be calculated as follows:

$$c_{ij} = \frac{1}{n}\sum_{k=1}^{n}\big(x_{ik}-\mu_i\big)\big(x_{jk}-\mu_j\big), \tag{5}$$

where $\boldsymbol{\mu} = [\mu_1, \mu_2, \ldots, \mu_d]^T$ is the gene average vector, which is calculated as follows:

$$\boldsymbol{\mu} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k = \frac{1}{n}\mathbf{X}\mathbf{1}. \tag{6}$$

However, the diagonal elements in $\mathbf{C}$ only reflect the relationship between a gene dimension and itself, which makes no sense in our model. Therefore, we adjust $\mathbf{C}$ to get a new correlation matrix $\mathbf{M}$ by the following equation:

$$m_{ij} = \begin{cases} c_{ij}, & i \neq j, \\ 0, & i = j. \end{cases} \tag{7}$$

In such a manner, the $i$-th row $\mathbf{m}_{i:}$ of $\mathbf{M}$ represents the correlation between the $i$-th gene dimension and all other gene dimensions. Then, $\mathbf{M}$ can be embedded into Eq. (4) to emphasize the independence of the selected gene dimensions from the perspective of data information. Therefore, we have

$$\min_{\mathbf{P},\mathbf{F},\mathbf{Z}}\ \|\mathbf{X}^T\mathbf{P}-\mathbf{F}\mathbf{Z}\|_F^2 + \alpha\,\mathrm{Tr}\big(\mathbf{P}^T\mathbf{M}\mathbf{P}\big) + \beta\|\mathbf{P}\|_{2,1}, \quad \text{s.t. } \mathbf{F}^T\mathbf{F}=\mathbf{I}_c,\ \mathbf{F}\ge 0,\ \mathbf{Z}\mathbf{Z}^T=\mathbf{I}_c,\ \mathbf{Z}\ge 0, \tag{8}$$

where the term $\mathrm{Tr}(\mathbf{P}^T\mathbf{M}\mathbf{P})$ penalizes assigning large projection weights to strongly correlated gene pairs at the same time, so that redundant genes tend to be excluded.
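
As an illustration of Eqs. (5)–(7), the following NumPy sketch (our own reading of the text, with $\mathbf{X}$ stored genes-by-samples as above) computes the covariance matrix $\mathbf{C}$ and the zero-diagonal correlation matrix $\mathbf{M}$:

```python
import numpy as np

def gene_correlation(X):
    """X: (d, n) matrix, d genes measured over n samples.
    Returns the covariance matrix C (Eq. (5)) and the adjusted
    correlation matrix M (Eq. (7)) whose diagonal is zeroed."""
    n = X.shape[1]
    mu = X.mean(axis=1, keepdims=True)     # gene average vector, Eq. (6)
    Xc = X - mu                            # center every gene dimension
    C = (Xc @ Xc.T) / n                    # covariance of all gene pairs
    M = C - np.diag(np.diag(C))            # drop gene-vs-itself entries
    return C, M

# toy data: 100 genes, 20 samples
C, M = gene_correlation(np.random.default_rng(0).normal(size=(100, 20)))
print(C.shape, np.allclose(np.diag(M), 0.0))   # (100, 100) True
```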

In addition, the local geometric structure information of original data should be preserved as much as possible in the learned new space . By using the Gaussian kernel function, we can get a similarity matrix from original data by the following equation:

where $\mathcal{N}_k(\mathbf{x}_i)$ represents the set of $k$ nearest neighbors of $\mathbf{x}_i$, and $\sigma$ is a width parameter; $k$ and $\sigma$ are set to 5 and 0.5, respectively, in our experiments. In our G3CS model, we require that if two data samples are close to each other in the original space, their cluster indicator vectors in the new space should also be close. This constraint can be formulated in the following form:

$$\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} s_{ij}\,\|\mathbf{f}_{i:}-\mathbf{f}_{j:}\|_2^2 = \mathrm{Tr}\big(\mathbf{F}^T\mathbf{L}\mathbf{F}\big), \tag{10}$$

where $\mathbf{L} = \mathbf{D}_S - \mathbf{S}$ is the Laplacian matrix corresponding to $\mathbf{S}$, and $\mathbf{D}_S$ is a diagonal degree matrix with $(\mathbf{D}_S)_{ii} = \sum_{j} s_{ij}$. Finally, we get the mathematical formulation of our G3CS model as follows:

$$\min_{\mathbf{P},\mathbf{F},\mathbf{Z}}\ \|\mathbf{X}^T\mathbf{P}-\mathbf{F}\mathbf{Z}\|_F^2 + \alpha\,\mathrm{Tr}\big(\mathbf{P}^T\mathbf{M}\mathbf{P}\big) + \beta\|\mathbf{P}\|_{2,1} + \gamma\,\mathrm{Tr}\big(\mathbf{F}^T\mathbf{L}\mathbf{F}\big), \quad \text{s.t. } \mathbf{F}^T\mathbf{F}=\mathbf{I}_c,\ \mathbf{F}\ge 0,\ \mathbf{Z}\mathbf{Z}^T=\mathbf{I}_c,\ \mathbf{Z}\ge 0, \tag{11}$$

where $\alpha$, $\beta$, and $\gamma$ are three hyperparameters that balance the different regularization terms. In summary, Eq. (11) integrates regression, matrix factorization, gene correlation, and exploitation of the local data structure into a unified framework. The gene correlation term regularizes the model to exclude redundant gene dimensions.
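
The graph construction in Eqs. (9) and (10) can be sketched as follows; this is a plain NumPy version written by us under the settings stated above ($k = 5$, $\sigma = 0.5$), not the authors' implementation:

```python
import numpy as np

def knn_gaussian_graph(X, k=5, sigma=0.5):
    """X: (d, n) data matrix, columns are samples.
    Returns the symmetric kNN similarity S (Eq. (9)) and the
    graph Laplacian L = D_S - S used in Eq. (10)."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)                       # squared sample norms
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    dist2 = np.maximum(dist2, 0.0)                    # guard tiny negative values
    np.fill_diagonal(dist2, np.inf)                   # a sample is not its own neighbor
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist2[i])[:k]               # k nearest neighbors of sample i
        S[i, nbrs] = np.exp(-dist2[i, nbrs] / (2.0 * sigma ** 2))
    S = np.maximum(S, S.T)                            # i~j if either is a neighbor of the other
    L = np.diag(S.sum(axis=1)) - S                    # unnormalized graph Laplacian
    return S, L
```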

4. Optimization Algorithm

There are three variables in Eq. (11) that need to be optimized, and we cannot obtain closed-form solutions for all of them simultaneously. Therefore, we design an algorithm that updates these variables iteratively: each time, we update one variable while fixing the others.

4.1. Optimize $\mathbf{P}$

When the other variables are fixed, solving for $\mathbf{P}$ is equivalent to the following problem:

$$\min_{\mathbf{P}}\ \|\mathbf{X}^T\mathbf{P}-\mathbf{F}\mathbf{Z}\|_F^2 + \alpha\,\mathrm{Tr}\big(\mathbf{P}^T\mathbf{M}\mathbf{P}\big) + \beta\|\mathbf{P}\|_{2,1}. \tag{12}$$

By taking the derivative of Eq. (12) with respect to $\mathbf{P}$ and setting it to zero, we have

$$\mathbf{X}\mathbf{X}^T\mathbf{P} - \mathbf{X}\mathbf{F}\mathbf{Z} + \alpha\mathbf{M}\mathbf{P} + \beta\mathbf{D}\mathbf{P} = \mathbf{0}. \tag{13}$$

Then, we have the closed-form solution of $\mathbf{P}$ as follows:

$$\mathbf{P} = \big(\mathbf{X}\mathbf{X}^T + \alpha\mathbf{M} + \beta\mathbf{D}\big)^{-1}\mathbf{X}\mathbf{F}\mathbf{Z}, \tag{14}$$

where $\mathbf{D}$ is a diagonal matrix with $d_{ii} = \frac{1}{2\|\mathbf{p}_{i:}\|_2 + \varepsilon}$, and $\varepsilon$ is a small constant that avoids division by zero. At each iteration, $\mathbf{P}$ and $\mathbf{D}$ can be updated alternately.
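
The update in Eq. (14), together with the reweighting of $\mathbf{D}$, is essentially one iteratively reweighted least squares step. A minimal sketch of ours (the small `eps` guards the division, playing the role of the constant $\varepsilon$ in Algorithm 1):

```python
import numpy as np

def update_P(X, F, Z, M, P_prev, alpha, beta, eps=1e-7):
    """One closed-form update of P (Eq. (14)).
    X: (d, n) genes-by-samples, F: (n, c), Z: (c, c), M: (d, d)."""
    row_norms = np.linalg.norm(P_prev, axis=1)      # ||p_i||_2 for each gene row
    D = np.diag(1.0 / (2.0 * row_norms + eps))      # diagonal reweighting matrix D
    A = X @ X.T + alpha * M + beta * D              # (d, d) system matrix
    return np.linalg.solve(A, X @ (F @ Z))          # solves A P = X F Z
```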

4.2. Optimize $\mathbf{F}$

When fixing the other variables, the optimization problem for $\mathbf{F}$ is equivalent to the following equation:

$$\min_{\mathbf{F}}\ \|\mathbf{X}^T\mathbf{P}-\mathbf{F}\mathbf{Z}\|_F^2 + \gamma\,\mathrm{Tr}\big(\mathbf{F}^T\mathbf{L}\mathbf{F}\big), \quad \text{s.t. } \mathbf{F}^T\mathbf{F}=\mathbf{I}_c,\ \mathbf{F}\ge 0. \tag{15}$$

By expanding the Frobenius norm and dropping the term $\|\mathbf{X}^T\mathbf{P}\|_F^2$, which is constant with respect to $\mathbf{F}$, Eq. (15) is equivalent to

$$\min_{\mathbf{F}}\ \mathrm{Tr}\big(\mathbf{F}\mathbf{Z}\mathbf{Z}^T\mathbf{F}^T\big) - 2\,\mathrm{Tr}\big(\mathbf{F}^T\mathbf{X}^T\mathbf{P}\mathbf{Z}^T\big) + \gamma\,\mathrm{Tr}\big(\mathbf{F}^T\mathbf{L}\mathbf{F}\big), \quad \text{s.t. } \mathbf{F}^T\mathbf{F}=\mathbf{I}_c,\ \mathbf{F}\ge 0. \tag{16}$$

Since $\mathbf{Z}$ is an orthogonal matrix, i.e., $\mathbf{Z}\mathbf{Z}^T=\mathbf{I}_c$, adding the constant matrix term $\mathrm{Tr}(\mathbf{B}\mathbf{B}^T)$ back into a Frobenius norm yields

$$\min_{\mathbf{F}}\ \|\mathbf{F}-\mathbf{B}\|_F^2 + \gamma\,\mathrm{Tr}\big(\mathbf{F}^T\mathbf{L}\mathbf{F}\big), \quad \text{s.t. } \mathbf{F}^T\mathbf{F}=\mathbf{I}_c,\ \mathbf{F}\ge 0, \tag{17}$$

where $\mathbf{B} = \mathbf{X}^T\mathbf{P}\mathbf{Z}^T$. In order to ensure the orthogonal constraint on $\mathbf{F}$, we add a large positive constant $\lambda$ and the optimization problem can be converted to

$$\min_{\mathbf{F}\ge 0}\ \|\mathbf{F}-\mathbf{B}\|_F^2 + \gamma\,\mathrm{Tr}\big(\mathbf{F}^T\mathbf{L}\mathbf{F}\big) + \frac{\lambda}{2}\big\|\mathbf{F}^T\mathbf{F}-\mathbf{I}_c\big\|_F^2. \tag{18}$$

By setting the derivative of Eq. (18) with respect to $\mathbf{F}$ to 0, we have

$$\mathbf{F}-\mathbf{B}+\gamma\mathbf{L}\mathbf{F}+\lambda\mathbf{F}\big(\mathbf{F}^T\mathbf{F}-\mathbf{I}_c\big)=\mathbf{0}; \tag{19}$$

then, writing $\mathbf{L}=\mathbf{D}_S-\mathbf{S}$ with $\mathbf{D}_S$ the degree matrix of $\mathbf{S}$, $\mathbf{F}$ can be updated by the following multiplicative rule in each iteration:

$$f_{ij} \leftarrow f_{ij}\,\frac{\big(\mathbf{B}+\gamma\,\mathbf{S}\mathbf{F}+\lambda\mathbf{F}\big)_{ij}}{\big(\mathbf{F}+\gamma\,\mathbf{D}_S\mathbf{F}+\lambda\,\mathbf{F}\mathbf{F}^T\mathbf{F}\big)_{ij}}, \tag{20}$$

which keeps $\mathbf{F}$ nonnegative throughout the iterations.
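
A sketch of the multiplicative rule in Eq. (20); here we split $\mathbf{L} = \mathbf{D}_S - \mathbf{S}$ and the possibly signed matrix $\mathbf{B}$ into nonnegative parts so the update keeps $\mathbf{F}$ nonnegative (the splitting of $\mathbf{B}$ is our own safeguard, and `lam` is an assumed value for the large constant $\lambda$):

```python
import numpy as np

def update_F(X, P, Z, S, F, gamma, lam=1e6, eps=1e-12):
    """One multiplicative update of F (Eq. (20)); lam plays the
    role of the large orthogonality constant lambda."""
    B = X.T @ P @ Z.T                               # B = X^T P Z^T
    D_S = np.diag(S.sum(axis=1))                    # degree matrix of S
    numer = np.maximum(B, 0) + gamma * (S @ F) + lam * F
    denom = np.maximum(-B, 0) + F + gamma * (D_S @ F) + lam * (F @ F.T @ F) + eps
    return F * (numer / denom)                      # entrywise, keeps F >= 0
```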

4.3. Optimize $\mathbf{Z}$

When fixing the other variables, the optimization problem for $\mathbf{Z}$ is equivalent to the following equation:

$$\min_{\mathbf{Z}}\ \|\mathbf{X}^T\mathbf{P}-\mathbf{F}\mathbf{Z}\|_F^2, \quad \text{s.t. } \mathbf{Z}\mathbf{Z}^T=\mathbf{I}_c,\ \mathbf{Z}\ge 0. \tag{21}$$

We add a penalty term with a large positive weight $\eta$ for the constraint $\mathbf{Z}\mathbf{Z}^T=\mathbf{I}_c$ and a Lagrange multiplier matrix $\boldsymbol{\Phi}$ for the constraint $\mathbf{Z}\ge 0$. Then, the Lagrange function for Eq. (21) can be written as follows:

$$\mathcal{L}(\mathbf{Z}) = \|\mathbf{X}^T\mathbf{P}-\mathbf{F}\mathbf{Z}\|_F^2 + \frac{\eta}{2}\big\|\mathbf{Z}\mathbf{Z}^T-\mathbf{I}_c\big\|_F^2 - \mathrm{Tr}\big(\boldsymbol{\Phi}\mathbf{Z}^T\big). \tag{22}$$

By setting the derivative of Eq. (22) with respect to $\mathbf{Z}$ to 0, we have

$$2\mathbf{F}^T\big(\mathbf{F}\mathbf{Z}-\mathbf{X}^T\mathbf{P}\big) + 2\eta\big(\mathbf{Z}\mathbf{Z}^T-\mathbf{I}_c\big)\mathbf{Z} - \boldsymbol{\Phi} = \mathbf{0}. \tag{23}$$

According to the Karush-Kuhn-Tucker condition $\phi_{ij}\,z_{ij} = 0$, we have the following multiplicative update rule:

$$z_{ij} \leftarrow z_{ij}\,\frac{\big(\mathbf{F}^T\mathbf{X}^T\mathbf{P}+\eta\mathbf{Z}\big)_{ij}}{\big(\mathbf{F}^T\mathbf{F}\mathbf{Z}+\eta\,\mathbf{Z}\mathbf{Z}^T\mathbf{Z}\big)_{ij}}. \tag{24}$$
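
Likewise, a sketch of the KKT-derived rule in Eq. (24); `eta` stands for the orthogonality penalty weight $\eta$, and splitting the signed term is again our own safeguard:

```python
import numpy as np

def update_Z(X, P, F, Z, eta=1e6, eps=1e-12):
    """One multiplicative update of Z (Eq. (24)), keeping Z >= 0."""
    T = F.T @ (X.T @ P)                             # F^T X^T P, shape (c, c)
    numer = np.maximum(T, 0) + eta * Z
    denom = np.maximum(-T, 0) + F.T @ F @ Z + eta * (Z @ Z.T @ Z) + eps
    return Z * (numer / denom)
```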

After solving the resultant optimization problem described by Eq. (11), we can measure the importance of each gene dimension by calculating the $\ell_2$-norm of the corresponding row of $\mathbf{P}$. We summarize the optimization procedure of the G3CS model in Algorithm 1.

Input: Microarray data matrix $\mathbf{X}$, parameters $\alpha$, $\beta$, and $\gamma$, and a small constant $\varepsilon = 10^{-7}$.
Initialize: $\mathbf{M}$, $\mathbf{S}$, $\mathbf{F}$, and $\mathbf{Z}$.
While not converged do
1. Update $\mathbf{P}$ via Eq. (14);
2. Update $\mathbf{F}$ via Eq. (20);
3. Update $\mathbf{Z}$ by solving Eq. (24);
4. Check the convergence condition: stop when the change of the objective value of Eq. (11) between two consecutive iterations is smaller than $\varepsilon$.
End while
Output: $\mathbf{P}$.
Gene selection: Sort the $\ell_2$-norms of the rows of $\mathbf{P}$ in descending order and select the largest $K$ values. The gene dimensions corresponding to the largest $K$ values are selected to form the gene subset.
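
Putting the pieces together, Algorithm 1 can be sketched as the loop below. It reuses the helper functions sketched in Sections 3 and 4 (`gene_correlation`, `knn_gaussian_graph`, `update_P`, `update_F`, `update_Z`); the random nonnegative initialization of $\mathbf{F}$ and $\mathbf{Z}$ is a hypothetical choice of ours, since the paper does not spell out its initialization:

```python
import numpy as np

def g3cs(X, c, alpha, beta, gamma, eps=1e-7, max_iter=200):
    """Sketch of Algorithm 1. X: (d, n) genes-by-samples matrix.
    Returns P, whose row l2-norms rank the d genes."""
    d, n = X.shape
    rng = np.random.default_rng(0)
    _, M = gene_correlation(X)              # gene correlation matrix, Eq. (7)
    S, L = knn_gaussian_graph(X)            # sample graph, Eqs. (9)-(10)
    F = np.abs(rng.normal(size=(n, c)))     # hypothetical nonnegative init
    Z = np.abs(rng.normal(size=(c, c)))
    P = rng.normal(size=(d, c))
    prev_obj = np.inf
    for _ in range(max_iter):
        P = update_P(X, F, Z, M, P, alpha, beta)    # step 1, Eq. (14)
        F = update_F(X, P, Z, S, F, gamma)          # step 2, Eq. (20)
        Z = update_Z(X, P, F, Z)                    # step 3, Eq. (24)
        obj = (np.linalg.norm(X.T @ P - F @ Z, "fro") ** 2
               + alpha * np.trace(P.T @ M @ P)
               + beta * np.sum(np.linalg.norm(P, axis=1))
               + gamma * np.trace(F.T @ L @ F))     # objective of Eq. (11)
        if abs(prev_obj - obj) < eps:               # step 4: convergence check
            break
        prev_obj = obj
    return P

def select_genes(P, K):
    # rank genes by the l2-norm of the corresponding row of P
    return np.argsort(-np.linalg.norm(P, axis=1))[:K]
```

A natural choice for `c` here would be the number of tumor classes in the dataset at hand.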

The proposed algorithm converges well as the number of iterations increases. In our experiments, we stop the optimization process when the change of the objective function value between two consecutive iterations is very small, and this yields good results.

5. Experimental Results

In this section, extensive experiments are conducted on several real microarray datasets to validate the efficacy of the proposed G3CS. In order to demonstrate that the gene subset selected by G3CS can obtain better classification results, we use three kinds of classification algorithms, including Support Vector Machine (SVM), Random Forest (RF), and $k$-nearest neighbor (KNN), to evaluate the gene subsets selected by the different gene selection methods.

5.1. Microarray Datasets

Six publicly available microarray datasets are used in our experiments: colon cancer (colon) [71], B-cell chronic lymphocytic leukemia (CLL SUB 111), breast, lung, tumors-11, and global cancer map (GCM). (CLL SUB 111 and lung can be downloaded from http://featureselection.asu.edu/datasets.php; breast and GCM can be downloaded from http://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi; tumors-11 can be downloaded from http://datam.i2r.a-star.edu.sg/datasets/krbd/index.html.) These datasets are used to test the performance of the proposed G3CS and the gene selection methods used for comparison. They were collected for the diagnosis of different kinds of cancers, such as colon cancer, lung cancer, Ewing's family of tumors, non-Hodgkin lymphoma, rhabdomyosarcoma, and prostate cancer. For instance, CLL SUB 111 contains high-density oligonucleotide arrays which can be used to identify molecular correlates of genetically and clinically distinct subgroups of B-cell chronic lymphocytic leukemia (B-CLL). Lung is a dataset used to determine whether global biological differences underlie common pathological features of prostate cancer and to identify genes that might anticipate the clinical behaviour of this disease.

It should be noted that all six datasets are typical high-dimensional gene datasets: in each dataset, the number of genes is much larger than the number of samples, which poses a challenge for many practical tasks. In Table 1, we give a brief description of these datasets.

5.2. Experimental Settings

In the proposed G3CS, three parameters need to be adjusted, i.e., $\alpha$, $\beta$, and $\gamma$. In our experiments, we tuned their values by a grid search. In addition, since the optimal number of selected genes is also unknown, we set different numbers of selected genes for each dataset, and the final best results obtained under the optimal parameter setting were reported. In our experiments, the number of selected genes was tuned over $\{10, 20, 30, 40, 50\}$ for each dataset. For each gene subset, the three abovementioned basic classification methods were used to classify the microarray data in order to test the discriminative power of the selected genes. In order to validate the efficacy of the proposed G3CS, we compare it with six other state-of-the-art gene selection methods:

(i) $t$-test [72], a traditional filter-based gene selection method that relies on statistical hypothesis testing;
(ii) RLR [73], which is based on the linear discriminant analysis criterion; the class centroid is estimated to define both the between-class separability and the within-class compactness;
(iii) WLMGS (Weight Local Modularity based Gene Selection) [74], which uses the weight local modularity of a weighted sample graph to evaluate the discriminative power of a gene subset;
(iv) LNNFW [75], which uses the $k$-nearest neighbors rule to minimize the within-class distances and maximize the between-class distances;
(v) GRSL-GS [20], which is based on subspace learning and manifold regularization;
(vi) AHEDL [22], which is based on dictionary learning theory with adaptive hypergraph learning and $\ell_{2,1}$ regularization.

As to WLMGS and GRSL-GS, we set the number of nearest neighbors for constructing the sample graph to 5. The kernel width used in the Gaussian kernel function and the other regularization parameters in GRSL-GS and RLR were tuned with 5-fold cross-validation (CV). For the remaining parameters of the other methods, we used the settings recommended in the corresponding references. We ran all implementations on a desktop computer with an Intel Core i5-4200M 2.5 GHz CPU and 8 GB RAM.
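
As an illustration of this evaluation protocol, here is a short scikit-learn sketch (ours; the expression matrix, labels, and the gene ranking produced by a selection method are assumed to be given):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def evaluate_subset(X, y, selected, n_folds=5):
    """X: (n_samples, n_genes) expression matrix (note: samples in rows here),
    y: class labels, selected: indices of the chosen genes."""
    clf = SVC(kernel="linear")
    scores = cross_val_score(clf, X[:, selected], y, cv=n_folds)
    return scores.mean()

# e.g., sweep the subset size from 10 to 50 genes as in Section 5.4:
# for K in (10, 20, 30, 40, 50):
#     acc = evaluate_subset(X, y, ranking[:K])
```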

5.3. Experimental Comparison of Different Methods

In order to verify the superiority of the proposed G3CS, we compare it with the other six state-of-the-art gene selection methods on the different datasets. For each dataset, we obtain 5 different gene subsets, with the number of selected genes varying from 10 to 50. For each gene subset, the three classifiers and 5-fold CV are used for classification performance evaluation, and we report the average accuracy over the 5 CV folds in Table 2. We mark the best results in bold font for clear comparison. As can be seen from the results, the proposed G3CS consistently outperforms the other methods in terms of average classification accuracy, which demonstrates that G3CS can effectively select more discriminative genes from the original high-dimensional microarray data for the classification task.

5.4. Classification Accuracy with Different Numbers of Selected Genes

Since the optimal number of selected genes for each dataset is hard to determine, we investigate the classification performance of the different methods on each dataset with different numbers of selected genes. We plot the classification accuracy curves of the different methods on the different datasets with varied numbers of selected genes in Figures 2–7. For each method and each dataset, we plot the average classification accuracy over the 5 CV folds obtained by the SVM classifier. As can be seen from Figures 2–7, the proposed G3CS performs steadily better than the other methods as the number of selected genes changes. Even with a small number of selected genes, our method selects more discriminative genes than the other methods, which validates that the gene subsets selected by G3CS can better serve the classification of microarray data.

6. Discussion and Conclusions

In this work, we presented a novel gene selection method that takes the gene correlation into consideration, named gene correlation guided gene selection (G3CS). In detail, we capture the correlation of different gene dimension pairs by calculating the covariance matrix from the perspective of the gene dimensions and embed it into the proposed model to regularize the learning of the gene selection coefficient matrix. In such a manner, redundant genes can be effectively excluded to reduce the redundancy of the selected genes. In addition, a matrix factorization term is utilized to exploit the cluster structure of the original microarray data to assist the learning process. We designed an iterative updating algorithm to solve the resultant optimization problem. Experiments on six publicly available microarray datasets were conducted to validate the efficacy of the proposed method. With varied numbers of selected genes, the proposed method consistently outperforms the other compared methods in terms of classification accuracy.

Data Availability

The datasets used in this work are publicly available at: http://featureselection.asu.edu/datasets.php, http://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

We would like to thank Dr. Zheng for providing their Matlab code for generating the comparison results of this paper.