Abstract

Identifying cancer-associated mutations (driver mutations) is critical for understanding how changes in the cancer genome activate oncogenes or inactivate tumor suppressor genes. Many proposed approaches use supervised machine learning to predict driver mutations from features drawn from public databases. However, it is often unclear which features are actually important for driver mutation prediction. In this study, we propose a novel feature selection method (called DX) applied to a set of 126 candidate features. To obtain the best performance, the rotation forest algorithm was adopted for the experiments. On a training dataset collected from the COSMIC and Swiss-Prot databases, we obtained high prediction performance, with 88.03% accuracy, 93.9% precision, and 81.35% recall when the 11 top-ranked features were used. Comparisons with various other techniques on the TP53, EGFR, and Cosmic2plus datasets show the generality of our method.

1. Introduction

Recent developments in large-scale sequencing of cancer genomes have revealed hundreds or thousands of mutations of various types [1], such as DNA sequence alterations including point mutations, nucleotide mutations, and genomic rearrangements [2]. Although many somatic mutations have been discovered, only a small fraction of mutations (about <1%) promote cancer progression; these occur in driver genes that drive tumor evolution, whereas the majority of mutations are likely to be "passengers" with no effect on tumor cell selection [3–5]. Many methods have been used to explore the mechanisms underlying different mutations. For example, Purohit et al. [6] studied drug resistance through docking and binding analysis and found that the S315T mutation has a high docking score: it decreases the flexibility of the binding residues and makes them rigid by altering conformational changes, which in turn hampers INH activity. Lamin A/C proteins are major components of a thin proteinaceous filamentous meshwork, and the structural and functional consequences of the R482W mutation cause FPLD [7]. Both the structure and the relationships of mutant proteins have also been studied, for example, the cancer-associated E17K mutation [8], the SH2-containing protein (NSP3) and Crk-associated substrate (p130Cas) [9], TMC114 [10, 11], PncA of Mycobacterium tuberculosis [12], and the KIT receptor [13]. Among these mutation analyses, missense mutations, point mutations that change a codon so that a different amino acid is encoded, have been widely noted [14, 15]. Various data-driven methods are therefore used to identify which missense mutations are drivers and which are passengers [16].

So far, several approaches have been developed to identify driver mutations, and they can be roughly divided into two categories. The first class is based on biological differences, under the hypothesis that driver genes are mutated at a higher frequency than genes carrying only passenger mutations [1, 17–19]. Parmigiani et al. developed a software package (CancerMutationAnalysis, Bioconductor) to identify driver mutations at the gene level; the software can estimate the passenger mutation rate. Carter et al. proposed a novel method for estimating the passenger mutation rate from three aspects: the number of nonsilent somatic single-base variants, the removal of known driver mutations, and the frequency of nonsilent somatic single-base variants across 24 categories [20]. Zhang et al. [17] computed the Mahalanobis distance of a gene from known cancer genes using four features: gene size, background nonsynonymous mutation rate, somatically acquired events, and the rate of these events in carriers. MutSig tools are also used to compute a score for each gene in a tumor. The second class trains a classifier on features related to the missense mutations themselves using machine learning algorithms and then applies the model to a test dataset. Several groups have proposed methods of this kind to distinguish driver mutations from the much larger number of passenger mutations [15, 20–30]. They differ in the learning algorithms and, especially, in the feature spaces used for prediction.

Recently, Tan et al. [30] proposed a novel feature extraction scheme for driver mutation identification. They assembled 126 features relating to physicochemical properties of amino acids (AARC), substitution scoring matrices (SSM) from the AAIndex database [31], 2-gram features derived from the protein sequence (PSS), and annotated features (AF) from other databases; they then used the DX score to rank the 126 features and finally selected 70 features according to the accuracy of a support vector machine (SVM). This work shows how efficient features can be selected for driver mutation recognition.

In this study, inspired by Tan et al.'s method, we developed a novel method, DX-RF (the rotation forest (RF) algorithm combined with DX feature selection), to discriminate driver mutations from candidate passenger mutations. To make use of as much information as possible, we adopt the same four kinds of features used by Tan et al. The DX scoring scheme was employed to evaluate how well each feature discriminates driver mutations. Our experiments achieve 87.97% average accuracy with the DX-RF method when the 11 top-ranked features are combined. We also tested the classifier on independent datasets and obtained higher accuracy than previous methods.

2. Materials and Methods

2.1. Data Collection

The driver-passenger mutation dataset is retrieved from Tan et al. [30]. It consists of cancer-associated variants (driver mutations) collected from the COSMIC database and neutral polymorphisms (passenger mutations) collected from the Swiss-Prot Variant Pages (humsavar.txt), keeping only records of type "Polymorphism." From this dataset, a training set of 4193 driver mutations and 4193 passenger mutations was constructed. The test data contain three disjoint driver mutation sets (EGFR, TP53, and Cosmic2plus) and a passenger mutation set collected from humsavar.txt after removing entries that appear in the training set. In this study, driver mutations are labeled as the positive class and passenger mutations as the negative class.

2.2. Feature Extraction

The candidate features were collected from Tan et al.'s paper and fall into four types: AARC features (physicochemical properties of amino acids), SSM features (substitution scoring matrices from AAIndex), PSS features produced according to Wu et al. [32] and Wang et al. [33] using the 2-gram and 6-letter methods, and annotated features collected from several databases, including the UniProt KnowledgeBase, the Swiss-Prot Variant Pages, and the COSMIC database. Among the annotated features there are 14 binary categorical features, which may be unavailable for some mutations.

2.2.1. Feature Coding

Machine learning-based techniques such as the support vector machine (SVM) and rotation forest (RF) require a fixed number of numeric inputs for training, so the features must be encoded numerically before training. The AARC feature value for a missense mutation is defined as
$$F_i(x) = P_i(m) - P_i(w),$$
where $x$ denotes the sample, $w$ the wild-type residue, $m$ the mutant residue, and $P_i$ the $i$th AARC property value. The SSM feature value for a missense mutation is the corresponding element of the substitution scoring matrix. The 2-gram method extracts every pair of consecutive amino acid residues in a protein sequence and counts the number of occurrences of each residue pair, producing a 400-dimensional vector per protein sequence; the DX score is then used to rank these 2-gram features and the 30 top-ranked ones are retained for prediction. The 6-letter method classifies the 20 amino acids into six groups according to their physicochemical properties [34]. Table 1 shows the six groups.
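As an illustration of this coding step, the following Python sketch computes a single AARC value as the property difference between the mutant and wild-type residues and builds the 400-dimensional 2-gram count vector. The hydrophobicity table, the function names, and the sign convention of the difference are illustrative assumptions, not the exact encoding used in this study.

from itertools import product

# Hypothetical AAIndex-style property (Kyte-Doolittle hydrophobicity),
# standing in for one of the 15 AARC properties.
HYDROPHOBICITY = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

def aarc_feature(wild_res, mut_res, prop=HYDROPHOBICITY):
    # AARC value: change in one physicochemical property caused by the substitution
    # (mutant minus wild type; the sign convention is an assumption).
    return prop[mut_res] - prop[wild_res]

AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'

def two_gram_counts(sequence):
    # 400-dimensional 2-gram vector: occurrence counts of every consecutive residue pair.
    pairs = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    return [counts[p] for p in pairs]

print(aarc_feature('R', 'W'))              # property change for an R->W substitution
print(sum(two_gram_counts('MEEPQSDPSV')))  # 9 overlapping residue pairs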

The 6-letter method first re-expresses a protein sequence in the 6-letter alphabet and then encodes the reduced sequence with the 2-gram method, adding 6 × 6 = 36 dimensions. Thus the PSS representation of a missense mutation is a 436-dimensional vector. To reduce the loss of information, the linear correlation coefficient (LCC) between two 436-dimensional vectors $X$ and $Y$ is computed as
$$\mathrm{LCC}(X,Y)=\frac{\sum_{i=1}^{436}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sqrt{\sum_{i=1}^{436}\left(x_{i}-\bar{x}\right)^{2}}\sqrt{\sum_{i=1}^{436}\left(y_{i}-\bar{y}\right)^{2}}},$$
where $x_i$ is the $i$th 2-gram feature value of a vector and $\bar{x}$ is the mean of its 2-gram feature values (and similarly for $y_i$ and $\bar{y}$). In total we obtain 31 PSS features. The annotated features were collected from different databases, including the UniProt KnowledgeBase, Swiss-Prot, and COSMIC; 29 such features were used in this study.
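A minimal sketch of the 6-letter reduction and the LCC computation follows. The group assignment shown here is only illustrative (the actual groups are those of Table 1), and the two vectors passed to lcc are assumed to be the 436-dimensional 2-gram vectors being compared.

import math

# Illustrative 6-letter grouping of the 20 amino acids (the real groups are in Table 1).
SIX_LETTER = {
    'A': '1', 'G': '1', 'V': '1',
    'I': '2', 'L': '2', 'F': '2', 'P': '2',
    'Y': '3', 'M': '3', 'T': '3', 'S': '3',
    'H': '4', 'N': '4', 'Q': '4', 'W': '4',
    'R': '5', 'K': '5',
    'D': '6', 'E': '6', 'C': '6',
}

def six_letter_encode(sequence):
    # Re-express a protein sequence in the reduced 6-letter alphabet.
    return ''.join(SIX_LETTER.get(res, '') for res in sequence)

def lcc(x, y):
    # Pearson linear correlation coefficient between two equal-length feature vectors.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y) if var_x and var_y else 0.0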

2.2.2. The Feature Space

For each missense mutation in the dataset there are 126 features: 15 AARC features, 51 SSM features, 31 PSS features, and 29 annotated features. In total, 15 + 51 + 31 + 29 = 126 features were obtained for each missense mutation.

2.3. Feature Selection Method

In many pattern recognition applications, feature selection is very important. Here we use two methods: the DX score [33] and minimum redundancy maximal relevance (mRMR) [35]. The authors of the DX method adopted it to pick out the most relevant 2-gram features; intuitively, the DX score measures a feature's discriminative power in the general case. Following [36], the DX score of a feature is defined as
$$\mathrm{DX}=\frac{\left(\mathrm{average\_pos}-\mathrm{average\_neg}\right)^{2}}{\mathrm{var\_pos}+\mathrm{var\_neg}},$$
where average_pos and average_neg denote the mean value of the feature in the positive (driver) and negative (passenger) samples of the training set, and var_pos and var_neg denote the corresponding variances.

The mRMR method selects good features according to the maximal statistical dependency criterion based on mutual information. A smaller index of a feature indicates a better trade-off between maximal relevance to the target and minimal redundancy with the selected features. The mutual information of two random variables $x$ and $y$ is defined as
$$I(x;y)=\iint p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy,$$
where $p(x)$, $p(y)$, and $p(x,y)$ are the probability density functions. Max-Relevance seeks a feature subset $S$ satisfying
$$\max D(S,c),\quad D=\frac{1}{|S|}\sum_{f_{i}\in S} I\left(f_{i};c\right),$$
and the Min-Redundancy condition is added to select mutually exclusive features,
$$\min R(S),\quad R=\frac{1}{|S|^{2}}\sum_{f_{i},f_{j}\in S} I\left(f_{i};f_{j}\right),$$
where $f_i$ denotes a feature, $S$ a subset of the whole feature set, and $c$ the target class. The mRMR evaluation combines the two criteria and uses an incremental search for optimal features; given a feature set with $N$ features, it loops $N$ rounds and returns a ranked feature list.
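To make the two rankings concrete, here is a small Python sketch of the DX score (written as the ratio given above) and of a greedy mRMR ranking using the standard "relevance minus redundancy" criterion of Peng et al. It is an independent illustration rather than the authors' implementation, and it assumes the feature matrix has been discretized so that mutual information between features can be estimated by counting.

import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def dx_scores(X, y):
    # DX score per feature: squared difference of class means over the summed class variances.
    pos, neg = X[y == 1], X[y == 0]
    num = (pos.mean(axis=0) - neg.mean(axis=0)) ** 2
    den = pos.var(axis=0) + neg.var(axis=0) + 1e-12
    return num / den

def mrmr_rank(X, y, n_select):
    # Greedy mRMR: at each step pick the feature maximising relevance to the class
    # minus its mean redundancy with the features already selected.
    # Note: mutual_info_score treats columns as discrete labels, so continuous
    # features should be binned before calling this function.
    relevance = mutual_info_classif(X, y, random_state=0)   # estimated I(f_i; c)
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_select:
        best_j, best_score = None, -np.inf
        for j in remaining:
            redundancy = (np.mean([mutual_info_score(X[:, j], X[:, k]) for k in selected])
                          if selected else 0.0)
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Usage: rank all 126 features by DX, then keep the top-ranked ones.
# dx_rank = np.argsort(dx_scores(X, y))[::-1]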

2.4. Model Construction

The classification model for identifying driver mutations is based on rotation forest (RF) [37], and the Weka software [38] was used to implement the classification. The final training set comprises 4193 driver mutations and 4193 passenger mutations. In statistical prediction, the subsampling test and the jackknife test are the two commonly used cross-validation methods. The jackknife test is considered more objective and has been widely adopted to validate the power of various classifiers, but it takes much longer to perform. Considering the large number of samples used in this study, 5-fold cross-validation is therefore used to evaluate the importance of the features on the training set. This process is repeated five times, and the average accuracy is used to evaluate the features.

An RF model was constructed on the training set with default parameters. To find good features for identifying driver mutations, 126 nested training sets were built with the incremental feature selection (IFS) approach [39, 40], based on the ranked feature lists obtained by the DX method and the mRMR method, respectively. Each of the 126 training sets was evaluated with 5-fold cross-validation, and this process was repeated five times, so 126 × 5 × 5 models were trained for each ranking method. Five measures, precision, recall, accuracy, F-measure, and the Matthews correlation coefficient (MCC), were employed to assess the performance of the combined features on the training set:
$$\mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}},\qquad \mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}},\qquad \mathrm{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}},$$
$$F\text{-measure}=\frac{2\times\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}},\qquad \mathrm{MCC}=\frac{\mathrm{TP}\times\mathrm{TN}-\mathrm{FP}\times\mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}},$$
where TP denotes true driver mutations, TN true passenger mutations, FP falsely predicted driver mutations, and FN falsely predicted passenger mutations.
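The IFS loop with repeated 5-fold cross-validation can be sketched as below. scikit-learn has no rotation forest implementation, so RandomForestClassifier is used purely as a stand-in for the Weka classifier, and ranked_idx is assumed to be a DX- or mRMR-ranked list of feature column indices.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

def ifs_curve(X, y, ranked_idx, n_repeats=5):
    # For each top-k feature subset, run 5-fold CV repeated n_repeats times and
    # record the mean of the five evaluation measures used in the paper.
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=n_repeats, random_state=0)
    scoring = ['accuracy', 'precision', 'recall', 'f1', 'matthews_corrcoef']
    curve = []
    for k in range(1, len(ranked_idx) + 1):
        clf = RandomForestClassifier(n_estimators=100, random_state=0)  # stand-in classifier
        res = cross_validate(clf, X[:, ranked_idx[:k]], y, cv=cv, scoring=scoring)
        curve.append({m: res['test_' + m].mean() for m in scoring})
    return curve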

3. Results and Discussion

3.1. Optimization of the Feature Space

To obtain the best feature space for driver mutation prediction, two classifiers combining RF with the DX and mRMR feature selection methods were constructed, called DX-RF and mRMR-RF, respectively. Supplemental Material S1 (in the Supplementary Material available online at http://dx.doi.org/10.1155/2014/905951) contains two results produced with the mRMR software: one table is the maximum-relevance result, which ranks the 126 features by their relevance to the sample class; the other is the mRMR feature table, which lists the 126 features ranked according to the mRMR criterion. In the mRMR feature table, features nearer the front are more important for driver mutation prediction. After ranking, IFS was adopted to select the optimal feature set: during the IFS procedure, features were added one at a time, from higher to lower rank in the mRMR table. Supplemental Material S2 gives the corresponding result for the DX method. After the features were ranked, 126 individual predictors corresponding to the 126 feature subsets were trained using mRMR-RF and DX-RF. The average results of the 126 predictors under 5-fold cross-validation for the two classifiers are given in Supplemental Material S3, and the feature selection process is illustrated in Figure 1. From Figure 1 it can be seen that the DX-RF predictor achieved its highest accuracy, 87.97%, with the 11 top-ranked features, and the mRMR-RF predictor reached a similar peak of 88.18% accuracy with the 76 top-ranked features. For comparison with Tan et al., DX-SVMLight and DX-LibSVM were run with Tan et al.'s 70 top-ranked features. DX-SVMLight obtained 83.04% accuracy, about 4.93% and 5.14% lower than DX-RF and mRMR-RF, respectively; DX-LibSVM obtained 83.97% accuracy, about 4% and 4.21% lower than DX-RF and mRMR-RF, respectively. The DX-RF classifier thus performs almost the same as mRMR-RF (88.18% with the 76 top-ranked features) while using only 11 features. We therefore selected the 11 top-ranked features with the rotation forest algorithm to build the model for driver mutation prediction. Supplemental Material S4 contains two tables: the 11 top-ranked features of DX-RF and all 126 features used by Tan et al. [30] in their study.
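As a small follow-up to the IFS sketch above, the hypothetical helper below (not from the paper) locates the smallest top-k subset whose mean accuracy is within a tolerance of the best point on the curve, which is how a compact optimum such as the 11-feature DX-RF subset would be read off a curve like Figure 1.

def best_subset_size(curve, tolerance=0.0):
    # curve: list of per-k metric dictionaries returned by ifs_curve().
    accuracies = [point['accuracy'] for point in curve]
    best = max(accuracies)
    for k, acc in enumerate(accuracies, start=1):
        if acc >= best - tolerance:
            return k, acc   # smallest k reaching (near-)peak accuracy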

3.2. Feature Analysis

We investigated the distribution of the optimal features selected by DX-RF, mRMR-RF, and Tan et al.'s method. As Figure 2 shows, 0, 6, and 1 features were derived from amino acid residue change features (AARC); 0, 12, and 40 from substitution scoring matrix features (SSM); 7, 31, and 21 from protein sequence-specific features (PSS); and 4, 27, and 8 from annotated features (AF) for DX-RF, mRMR-RF, and Tan et al., respectively.

3.3. Comparison of the Prediction Performance on the Train Dataset

After the optimal feature subset was confirmed, an experiment was performed to evaluate whether the DX-RF method outperforms the other methods. Using the DX and mRMR rankings, the 5-fold cross-validation experiments on the training set were run again, and the process was repeated 10 times. Table 2 shows the average results of the DX-RF and mRMR-RF methods. From Table 2, the performance of DX-RF is almost the same as that of mRMR-RF; however, DX-RF needs only 11 features, while mRMR-RF needs 76.

3.4. Comparison of the Prediction Performance with Different Methods on the Independent Set

To determine whether the 11 top-ranked features contribute to the prediction of driver mutations, we tested the independent sets with DX-RF and Tan et al.'s method, constructing four classifiers: DX-SVMLight, DX-LibSVM, DX-RF, and mRMR-RF. Table 3 shows the results on the three datasets TP53 + neutral, EGFR + neutral, and Cosmic2plus + neutral. All four classifiers identify every TP53 and EGFR driver mutation (recall: 100%). On the Cosmic2plus dataset, DX-SVMLight identifies 940 driver mutations, DX-LibSVM 963, mRMR-RF 902, and DX-RF 892; however, DX-RF achieves higher precision than DX-LibSVM (59.91% versus 51.83%) and almost the same precision as DX-SVMLight. DX-RF also correctly predicts 3942 passenger mutations, more than DX-SVMLight (3888), DX-LibSVM (3644), and mRMR-RF (3919).

False positives should be avoided as far as possible. In this experiment, DX-SVMLight (651 false driver mutations), DX-LibSVM (895), and mRMR-RF (620) all produced a high number of false positives, whereas DX-RF produced only 597. Table 4 gives detailed information for the four classifiers on the three datasets. From Tables 3 and 4, we conclude that DX-RF is more reliable than DX-SVMLight, DX-LibSVM, and mRMR-RF on the three independent sets.

4. Conclusion

In this study, we propose a novel feature selection scheme for identifying driver mutations. The model was constructed from the optimal feature set with rotation forest. The 5-fold cross-validation experiments on the training set achieve high prediction performance, with 93.9% precision and 81.35% recall when the 11 top-ranked features are used. On the independent sets of missense mutations, DX-RF achieved accuracies of 89.28%, 87.18%, and 85.53% on TP53, EGFR, and Cosmic2plus, respectively, higher than the other methods.

Although our method achieved the best performance here, further improvements are both needed and possible. In future work, on the one hand, we will exploit more correlated features to describe the differences between driver and passenger mutations; on the other hand, faster algorithms will be considered for driver mutation prediction.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported by the National Youth Fund of China (Grant no. 61203290), by the Doctoral Start-Up Funds of Anhui University (Grant no. 33190078), and by the Outstanding Young Backbone Teachers Training (Grant no. 02303301).

Supplementary Materials

Supplementary Material S1 gives the detailed score information for mRMR. Supplementary Material S2 lists the score of each feature based on DX feature selection. Supplementary Material S3 shows the performance of predicting driver mutations using mRMR-RF and DX-RF. Supplementary Material S4 shows the 11 top-ranked features and all 126 features.
