Identifying the Types of Ion Channel-Targeted Conotoxins by Incorporating New Properties of Residues into Pseudo Amino Acid Composition

Wu, Yun; Zheng, Yufei; Tang, Hua

doi:https://doi.org/10.1155/2016/3981478

BioMed Research International

Research Article | Open Access

Volume 2016 | Article ID 3981478 | https://doi.org/10.1155/2016/3981478

Academic Editor: Ren-Zhi Cao

Received 13 Jul 2016

Accepted 31 Jul 2016

Published 18 Aug 2016

Abstract

Conotoxins are neurotoxins that interact specifically with potassium, sodium, and calcium channels. They have become potential drug candidates for treating conditions such as chronic pain, epilepsy, and cardiovascular diseases. Correctly identifying the types of ion channel-targeted conotoxins will therefore provide important clues for understanding their function and finding potential drugs. Based on this consideration, we developed a new computational method to rapidly and accurately predict the types of ion channel-targeted conotoxins. Three new kinds of residue properties were incorporated into pseudo amino acid composition to formulate conotoxin samples. A support vector machine was used as the classifier, and a feature selection technique based on the F-score was used to optimize the features. Jackknife cross-validation gave an overall accuracy of 94.6%, which is higher than other published results. Hence the current method may play a complementary role to existing methods for recognizing the types of ion channel-targeted conotoxins.

1. Introduction

The marine cone snail secretes venom for predation and defense. A key component of the venom is conotoxin, a disulfide-rich neurotoxic peptide 10–30 residues in length. The high diversity of conotoxin sequences makes them difficult to study systematically. It has been reported that over 100,000 conotoxins exist in approximately 700 species of cone snails [1]. Conotoxins can target G protein-coupled receptors (GPCRs) [2], nicotinic acetylcholine receptors, and neurotensin receptors. In particular, they interact with ion channels with extremely high specificity and affinity [3]. Thus, they have been regarded as important drug candidates for treating chronic pain, epilepsy, spasticity, and cardiovascular diseases [4, 5].

As more and more conotoxins are discovered, investigating their function through biochemical experiments becomes increasingly difficult because wet-lab experiments are costly and slow. Computational prediction of conotoxin function offers a convenient way to analyze conotoxins systematically. In 2006, Mondal et al. combined support vector machine (SVM) with pseudo amino acid composition (PseAAC) to predict the superfamily of conotoxins [6]. Subsequently, Lin and Li developed a method called increment of diversity (ID) to describe dipeptide sequences and used quadratic discriminant (QD) analysis to predict the superfamily and family of conotoxins [7]. Zaki et al. combined sequence alignment, which was also used by Zou et al. [8], with amino acid composition to predict the superfamily of conotoxins using SVM [9]. They later proposed the SVM-Freescore method to improve accuracy [10]. Recently, Yin et al. developed a method called dHKNN to predict the superfamily of conotoxins and achieved an overall accuracy of 90.3% by using hidden Markov model to select the best features [11, 12]. Lisacek et al. used profile hidden Markov models (pHMMs) and position-specific scoring matrices (PSSMs) to improve the accuracy of conotoxin superfamily prediction [13–15].

Although the methods and results mentioned above offer some guidance for studying conotoxins, they provide little information for predicting conotoxin function. For example, two conotoxins (delta-conotoxin-like Ac6.1 and omega-conotoxin-like Ai6.2) belong to the same superfamily, yet they target different ion channels [16]. Thus, it is necessary to develop new bioinformatics tools to identify the function of conotoxins. In 2007, Saha and Raghava proposed a method based on SVM and PSI-BLAST to predict the function of neurotoxins [17]. Soli et al. developed a statistical model to predict the activity of scorpion toxins by using motifs and secondary structure information [18]. Recently, Yuan et al. developed a feature selection technique based on the binomial distribution to predict the types of ion channel-targeted conotoxins by using a radial basis function network [19]. Subsequently, they improved the accuracy by using SVM with optimal dipeptide composition [20]. However, the prediction accuracy can still be improved.

Thus, the present study aimed to develop a new method that improves the prediction quality for conotoxin types. We incorporated three new kinds of residue properties into PseAAC to formulate conotoxin samples and then used SVM to perform the classification. After feature selection, the accuracy in jackknife cross-validation improved markedly. In the following sections, we describe the model construction in detail.

2. Materials and Methods

2.1. Benchmark Dataset

The benchmark dataset, extracted from UniProt [21], was constructed by Lin’s group [19, 20]. The dataset is reliable and objective because (i) conotoxins with ambiguous annotations have been excluded, (ii) the function of every conotoxin in the benchmark dataset has been experimentally confirmed, and (iii) highly similar sequences (cutoff = 80%) have been pruned with the CD-HIT program. The benchmark dataset contains 112 mature conotoxin peptide sequences: 24 potassium ion channel-targeted conotoxins (K-conotoxins), 43 sodium ion channel-targeted conotoxins (Na-conotoxins), and 45 calcium ion channel-targeted conotoxins (Ca-conotoxins). All calculations and model construction in the following sections are based on this dataset.

2.2. Feature Extraction

A key point in protein prediction is how to extract important information from peptide sequences. In past studies, amino acid composition has been widely used for this purpose. To account for the correlation between residues, dipeptide composition was introduced into prediction models. Chou proposed a very popular descriptor called PseAAC, which describes not only the amino acid composition but also the correlation of the physicochemical properties of residues [22]. Furthermore, several web servers and stand-alone tools have recently been proposed to generate different modes of pseudo components, such as PseKNC [23], PseKNC-General [24], Pse-in-One [25], repRNA [26], and repDNA [27]. In this study, we proposed three new kinds of properties: rigidity, flexibility, and irreplaceability. The flexibility and rigidity of residues correlate with protein structure and function. The irreplaceability of residues can reflect the evolution of life. The values of the three properties for the 20 residues [28] are listed in Table 1. In the following, we describe how to formulate conotoxins with PseAAC [22].

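Consider a conotoxin $P = R_1 R_2 \cdots R_L$, where $R_1$, $R_2$, and $R_L$ denote the 1st, 2nd, and $L$th residue of the conotoxin sample $P$; it can be defined by a $(400 + 3\lambda)$-dimensional vector as shown by

$$P = [x_1, x_2, \ldots, x_{400}, x_{401}, \ldots, x_{400+3\lambda}]^{T}, \tag{1}$$

where

$$x_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{400} f_i + \omega \sum_{j=1}^{3\lambda} \theta_j}, & 1 \le u \le 400, \\[2ex] \dfrac{\omega\,\theta_{u-400}}{\sum_{i=1}^{400} f_i + \omega \sum_{j=1}^{3\lambda} \theta_j}, & 400 < u \le 400 + 3\lambda, \end{cases} \tag{2}$$

where $f_u$ is the normalized frequency of the 400 dipeptides in conotoxin $P$ and can be defined as

$$f_u = \frac{n_u}{L - 1}, \tag{3}$$

where $n_u$ denotes the number of occurrences of the $u$th dipeptide in conotoxin $P$.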

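In (2), $\omega$ is the weight factor for the sequence order effect. $\theta_j$ with $j = 3(k-1) + \xi$ is the $k$-tier sequence correlation factor for the $\xi$th property, computed by the following formula:

$$\theta_{3(k-1)+\xi} = \frac{1}{L-k} \sum_{i=1}^{L-k} H^{\xi}_{i,i+k}, \quad k = 1, 2, \ldots, \lambda, \tag{4}$$

where $H^{\xi}_{i,j}$ ($\xi = 1, 2, 3$ denotes rigidity, flexibility, and irreplaceability) is called the correlation function and can be given by

$$H^{\xi}_{i,j} = h^{\xi}(R_i)\, h^{\xi}(R_j), \tag{5}$$

where $h^{\xi}(R_i)$ is the $\xi$th kind of physicochemical value of the amino acid $R_i$. The values should be converted to standard type by

$$h^{\xi}(R_i) = \frac{h_0^{\xi}(R_i) - \frac{1}{20}\sum_{k=1}^{20} h_0^{\xi}(R_k)}{\sqrt{\frac{1}{20}\sum_{u=1}^{20}\left[h_0^{\xi}(R_u) - \frac{1}{20}\sum_{k=1}^{20} h_0^{\xi}(R_k)\right]^2}}, \tag{6}$$

where $h_0^{\xi}(R_i)$ is the original physicochemical value of the $i$th amino acid.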
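The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the vector concatenates the 400 dipeptide frequencies with one correlation factor per property and per tier, and that each property is standardized over the 20 residues with the population standard deviation. The function names, the peptide string, and the three property tables in `TOY_PROPERTIES` are hypothetical placeholders; the real rigidity, flexibility, and irreplaceability values are those of Table 1.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 dipeptides


def standardize(raw):
    """Rescale per-residue values to zero mean and unit population SD."""
    mean = sum(raw.values()) / len(raw)
    sd = (sum((v - mean) ** 2 for v in raw.values()) / len(raw)) ** 0.5
    return {aa: (v - mean) / sd for aa, v in raw.items()}


def pseaac(seq, properties, lam, omega):
    """Return the (400 + len(properties) * lam)-dimensional vector of seq."""
    if lam >= len(seq):
        raise ValueError("lam must be smaller than the sequence length")
    # Dipeptide frequencies: counts divided by the L - 1 overlapping dipeptides.
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    freqs = [counts[d] / (len(seq) - 1) for d in DIPEPTIDES]
    # Correlation factors: for each tier k, one averaged product per property.
    scaled = [standardize(p) for p in properties]
    thetas = []
    for k in range(1, lam + 1):
        for h in scaled:
            pairs = [h[seq[i]] * h[seq[i + k]] for i in range(len(seq) - k)]
            thetas.append(sum(pairs) / len(pairs))
    denom = sum(freqs) + omega * sum(thetas)
    return [f / denom for f in freqs] + [omega * t / denom for t in thetas]


# Hypothetical property tables (placeholders, NOT the values of Table 1).
TOY_PROPERTIES = [
    {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)},
    {aa: float((i * 7) % 20) for i, aa in enumerate(AMINO_ACIDS)},
    {aa: float((i * 3) % 20) for i, aa in enumerate(AMINO_ACIDS)},
]
# Arbitrary peptide string and parameter values, chosen only for illustration.
vector = pseaac("ACDKGCRSTWYCNPQC", TOY_PROPERTIES, lam=6, omega=0.01)
```

Because the correlation function is a product of standardized values, individual correlation factors can be negative; the sketch does not guard against a zero or negative denominator, which a production implementation would need to check.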

To find the feature subset that produces the maximum accuracy, we performed feature selection with the $F$-score, which can be defined as

$$F(j) = \frac{\sum_{i=1}^{3} \left(\bar{x}_j^{(i)} - \bar{x}_j\right)^2}{\sum_{i=1}^{3} \frac{1}{n_i - 1} \sum_{k=1}^{n_i} \left(x_{k,j}^{(i)} - \bar{x}_j^{(i)}\right)^2}, \tag{7}$$

where $\bar{x}_j$ and $\bar{x}_j^{(i)}$ are the average values of the $j$th feature in the whole dataset and in the $i$th dataset; $x_{k,j}^{(i)}$ is the value of the $j$th feature of the $k$th conotoxin in the $i$th dataset; and $n_i$ is the number of conotoxins in the $i$th dataset. The larger the $F$ value is, the better the predictive capability of the $j$th feature. We used the Python script fselect.py downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/ to calculate the $F$-score.
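As an illustration of how such a score ranks features, the following sketch computes a multi-class F-score as the ratio of between-class spread to within-class sample variance. It is a simplified stand-in written for this example and is not the fselect.py script; the function name, the use of the sample variance (division by the class size minus one), and the toy data are assumptions.

```python
def f_scores(samples, labels):
    """Return one F-score per feature column for a multi-class dataset."""
    classes = sorted(set(labels))
    scores = []
    for j in range(len(samples[0])):
        column = [row[j] for row in samples]
        overall_mean = sum(column) / len(column)
        between, within = 0.0, 0.0
        for c in classes:
            values = [row[j] for row, y in zip(samples, labels) if y == c]
            class_mean = sum(values) / len(values)
            between += (class_mean - overall_mean) ** 2
            # Sample variance inside the class (each class needs >= 2 members).
            within += sum((v - class_mean) ** 2 for v in values) / (len(values) - 1)
        if within > 0:
            scores.append(between / within)
        else:
            # Zero within-class variance: constant features score 0.
            scores.append(0.0 if between == 0 else float("inf"))
    return scores


# Toy data: feature 0 separates the two classes well, feature 1 does not.
toy_samples = [[0.0, 1.0], [0.2, 3.0], [1.0, 2.0], [1.2, 4.0]]
toy_labels = ["K", "K", "Na", "Na"]
toy_scores = f_scores(toy_samples, toy_labels)
# Feature indices sorted from the largest to the smallest F-score.
ranking = sorted(range(len(toy_scores)), key=toy_scores.__getitem__, reverse=True)
```

In the toy data the first feature has tight, well-separated class means, so it is ranked ahead of the second feature, whose class ranges overlap.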

2.3. Support Vector Machine

SVM is a very popular machine learning method that is well suited to small-sample classification [29–31] and regression [32, 33]. Its basic idea is to map the original samples into a high-dimensional space and search that space for the best hyperplane separating the different classes of samples. In this study, the LibSVM software package was used to implement SVM. The radial basis function (RBF) usually exhibits excellent performance in nonlinear classification [34]; thus, the RBF kernel function was used in the current work. We used a grid search to find the best values of the regularization parameter $C$ and the kernel parameter $\gamma$ via jackknife cross-validation. The search spaces for $C$ and $\gamma$ were traversed with steps of 2⁻¹ and 2, respectively.
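To make the two tuned quantities concrete, the sketch below evaluates an RBF kernel and enumerates a grid of candidate (regularization, kernel width) pairs on a base-2 exponent scale. The helper names and the exponent ranges are assumptions chosen for illustration only; they are not the ranges used in this study, and the sketch does not call LibSVM.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2) between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def log2_grid(begin, end, step):
    """Powers of two from 2**begin to 2**end, moving the exponent by step."""
    n = int((end - begin) / step)
    return [2.0 ** (begin + i * step) for i in range(n + 1)]


# Illustrative exponent ranges (hypothetical, not taken from the paper).
C_VALUES = log2_grid(-5, 15, 2)       # 2^-5, 2^-3, ..., 2^15
GAMMA_VALUES = log2_grid(3, -15, -2)  # 2^3, 2^1, ..., 2^-15
GRID = [(c, g) for c in C_VALUES for g in GAMMA_VALUES]
```

Each pair in `GRID` would then be scored by cross-validated accuracy, and the pair with the highest score retained.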

2.4. The Evaluation of Model Performance

We used jackknife cross-validation to evaluate the performance of the proposed method. Three metrics, namely, sensitivity (Sn), overall accuracy (OA), and average accuracy (AA) as defined in [19, 20], were used to quantitatively estimate the accuracy of the model:

$$Sn_i = \frac{TP_i}{N_i}, \qquad OA = \frac{\sum_{i=1}^{3} TP_i}{\sum_{i=1}^{3} N_i}, \qquad AA = \frac{1}{3} \sum_{i=1}^{3} Sn_i, \tag{8}$$

where $N_i$ is the total number of conotoxins of the $i$th type and $TP_i$ denotes the number of conotoxins of the $i$th type that were correctly recognized.
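A jackknife (leave-one-out) evaluation and the three metrics can be sketched as follows. This is a minimal illustration under stated assumptions: `nearest_centroid` is a deliberately simple stand-in classifier rather than an SVM, all function names are hypothetical, and the one-dimensional toy data are invented for the example.

```python
def jackknife_predictions(samples, labels, fit_predict):
    """Predict each sample with a model trained on all the other samples."""
    predicted = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predicted.append(fit_predict(train_x, train_y, samples[i]))
    return predicted


def evaluate(labels, predicted):
    """Return per-class sensitivity, overall accuracy, and average accuracy."""
    classes = sorted(set(labels))
    total = {c: labels.count(c) for c in classes}
    correct = {c: 0 for c in classes}
    for y, p in zip(labels, predicted):
        if y == p:
            correct[y] += 1
    sn = {c: correct[c] / total[c] for c in classes}
    oa = sum(correct.values()) / len(labels)
    aa = sum(sn.values()) / len(classes)
    return sn, oa, aa


def nearest_centroid(train_x, train_y, query):
    """Stand-in classifier (not an SVM): label of the closest class mean."""
    best_label, best_dist = None, float("inf")
    for c in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == c]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(query, centroid))
        if dist < best_dist:
            best_label, best_dist = c, dist
    return best_label


# Invented one-dimensional toy data with three classes.
toy_x = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2], [2.0], [2.1], [1.3]]
toy_y = ["Ca", "Ca", "Ca", "K", "K", "K", "Na", "Na", "Na"]
toy_pred = jackknife_predictions(toy_x, toy_y, nearest_centroid)
toy_sn, toy_oa, toy_aa = evaluate(toy_y, toy_pred)
```

In the toy run the last Na sample lies closer to the K centroid once it is held out, so it is the only misclassified sample; passing a different `fit_predict` callable would evaluate another classifier under the same protocol.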

3. Results and Discussion

As we can see from (2), the results of the proposed method depend on two parameters, $\lambda$ and $\omega$, where $\lambda$ represents the long-range sequence order effect and $\omega$ is the weight factor that balances the local and global effects. Generally speaking, the greater $\lambda$ is, the more global sequence order information the vector contains. However, if $\lambda$ is too large, the feature vector becomes so high-dimensional that the curse of dimensionality sets in. Therefore, we searched for the optimal values of the two parameters over a bounded region of $\lambda$ and $\omega$ values (9).

Every combination of $\lambda$ and $\omega$ in (9) had to be considered to find the optimal parameter combination. This was a routine but tedious process of optimizing the model via a 2-dimensional grid search. We used jackknife cross-validation to carry out the parameter optimization. At the combination of $\lambda$ and $\omega$ for which the accuracy reaches its maximum, the model contains 418 features (400 dipeptide frequencies plus 18 correlation factors), which is still so many that high-dimensionality and overfitting problems may appear.

Therefore, we must select the key features, those that produce the maximum accuracy, from the 418 components. In principle, the best feature subset could be obtained by investigating every combination of features. However, examining all possible combinations is time-consuming and beyond the computational capability of most computers. For this reason, we used the $F$-score defined in (7) to perform feature selection. First, all 418 features were ranked by their $F$-scores from largest to smallest. Second, SVM was used to classify the three types of samples using only the feature with the maximum $F$-score, and the accuracy was calculated. Third, a new feature subset was produced by adding the feature with the second highest $F$-score to the former subset. We repeated this process, adding one feature at a time, until all 418 features had been included and the accuracy of every subset had been calculated.
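The three steps above amount to an incremental search over prefixes of the ranked feature list, which can be sketched as below. The function name and the toy scoring function are hypothetical; in the setting described here, `evaluate_subset` would be the jackknife accuracy of an SVM trained on the given features.

```python
def incremental_feature_selection(ranked_features, evaluate_subset):
    """Grow the feature set in ranked order and keep the best-scoring prefix."""
    best_size, best_accuracy, curve = 0, float("-inf"), []
    for size in range(1, len(ranked_features) + 1):
        accuracy = evaluate_subset(ranked_features[:size])
        curve.append(accuracy)
        if accuracy > best_accuracy:  # strict '>' keeps the smallest subset on ties
            best_size, best_accuracy = size, accuracy
    return ranked_features[:best_size], best_accuracy, curve


def toy_accuracy(subset):
    """Hypothetical score that peaks when exactly three features are used."""
    return 1.0 - 0.1 * abs(len(subset) - 3)


# Feature indices already sorted by decreasing F-score (invented ranking).
toy_ranking = [4, 0, 2, 1, 3]
best_subset, best_acc, acc_curve = incremental_feature_selection(toy_ranking, toy_accuracy)
```

With n ranked features this evaluates only n nested subsets instead of all 2^n − 1 possible combinations, which is what makes the search tractable; `acc_curve` holds the accuracy at each subset size and can be plotted against feature dimension.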

We plotted the accuracy against the feature dimension in Figure 1 and found that the maximum accuracy of 94.6% was reached when the 180 best features were used. The detailed results were recorded in Table 1. Other published results were also listed in Table 2. The Sn values of our method for Na- and Ca-conotoxins are 95.3% and 95.6%, respectively, which are higher than those of the RBF network-based method [19]. The Sn values of our method for K- and Ca-conotoxins are 91.7% and 95.6%, respectively, which are higher than those of iCTX-Type [20]. In summary, our proposed method outperforms the other published methods on these measures.
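The reported percentages can be cross-checked against the class sizes of the benchmark dataset (24 K-, 43 Na-, and 45 Ca-conotoxins). The counts of correctly recognized samples used below (22, 41, and 43) are not stated in the text; they are inferred here as the only integers consistent with the reported sensitivities, so the calculation is a consistency check rather than a restatement of the authors' tables.

```latex
\begin{align*}
Sn_{\mathrm{K}}  &= \frac{22}{24} \approx 91.7\%, &
Sn_{\mathrm{Na}} &= \frac{41}{43} \approx 95.3\%, &
Sn_{\mathrm{Ca}} &= \frac{43}{45} \approx 95.6\%, \\
OA &= \frac{22 + 41 + 43}{24 + 43 + 45} = \frac{106}{112} \approx 94.6\%, &
AA &= \frac{91.7\% + 95.3\% + 95.6\%}{3} \approx 94.2\%.
\end{align*}
```

Under these inferred counts, six of the 112 conotoxins are misclassified in the jackknife test, and the overall accuracy agrees with the reported 94.6%.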

4. Conclusion

In this paper, we designed a new method based on three new kinds of residue properties to predict three types of ion channel-targeted conotoxins. Applying a feature selection technique markedly improved the prediction accuracy. Comparison with published methods demonstrated the advantage of our method. The residue properties used in this paper can also be applied to other protein classification problems. In the future, we will construct a free web server based on the proposed method for the convenience of experimental scientists.

Competing Interests

The authors declare that they have no competing financial interests.

Acknowledgments

This work was supported by the Applied Basic Research Program of Sichuan Province (LZ-LY-45) and the Scientific Research Foundation of the Education Department of Sichuan Province (11ZB122).

References

[1] N. L. Daly and D. J. Craik, “Structural studies of conotoxins,” IUBMB Life, vol. 61, no. 2, pp. 144–150, 2009.
[2] Z. Liao, Y. Ju, and Q. Zou, “Prediction of G protein-coupled receptors with SVM-prot features and random forest,” Scientifica, vol. 2016, Article ID 8309253, 10 pages, 2016.
[3] H. Terlau and B. M. Olivera, “Conus venoms: a rich source of novel ion channel-targeted peptides,” Physiological Reviews, vol. 84, no. 1, pp. 41–68, 2004.
[4] T. S. Han, R. W. Teichert, B. M. Olivera, and G. Bulaj, “Conus venoms—a rich source of peptide-based therapeutics,” Current Pharmaceutical Design, vol. 14, no. 24, pp. 2462–2479, 2008.
[5] M. R. Watters, “Tropical marine neurotoxins: venoms to drugs,” Seminars in Neurology, vol. 25, no. 3, pp. 278–289, 2005.
[6] S. Mondal, R. Bhavna, R. M. Babu, and S. Ramakumar, “Pseudo amino acid composition and multi-class support vector machines approach for conotoxin superfamily classification,” Journal of Theoretical Biology, vol. 243, no. 2, pp. 252–260, 2006.
[7] H. Lin and Q.-Z. Li, “Predicting conotoxin superfamily and family by using pseudo amino acid composition and modified Mahalanobis discriminant,” Biochemical and Biophysical Research Communications, vol. 354, no. 2, pp. 548–551, 2007.
[8] Q. Zou, Q. Hu, M. Guo, and G. Wang, “HAlign: fast multiple similar DNA/RNA sequence alignment based on the centre star strategy,” Bioinformatics, vol. 31, no. 15, pp. 2475–2481, 2015.
[9] N. Zaki, F. Sibai, and P. Campbell, “Conotoxin protein classification using pairwise comparison and amino acid composition,” in Proceedings of the 13th Annual Genetic and Evolutionary Computation Conference (GECCO '11), pp. 323–330, ACM, Dublin, Ireland, July 2011.
[10] N. Zaki, S. Wolfsheimer, G. Nuel, and S. Khuri, “Conotoxin protein classification using free scores of words and support vector machines,” BMC Bioinformatics, vol. 12, article 217, 2011.
[11] Y.-X. Fan, J. Song, X. Kong, and H.-B. Shen, “PredCSf: an integrated feature-based approach for predicting conotoxin superfamily,” Protein and Peptide Letters, vol. 18, no. 3, pp. 261–267, 2011.
[12] J.-B. Yin, Y.-X. Fan, and H.-B. Shen, “Conotoxin superfamily prediction using diffusion maps dimensionality reduction and subspace classifier,” Current Protein and Peptide Science, vol. 12, no. 6, pp. 580–588, 2011.
[13] D. Koua, A. Brauer, S. Laht et al., “ConoDictor: a tool for prediction of conopeptide superfamilies,” Nucleic Acids Research, vol. 40, no. 1, pp. W238–W241, 2012.
[14] D. Koua, S. Laht, L. Kaplinski et al., “Position-specific scoring matrix and hidden Markov model complement each other for the prediction of conopeptide superfamilies,” Biochimica et Biophysica Acta (BBA)—Proteins and Proteomics, vol. 1834, no. 4, pp. 717–724, 2013.
[15] S. Laht, D. Koua, L. Kaplinski, F. Lisacek, R. Stöcklin, and M. Remm, “Identification and classification of conopeptides using profile Hidden Markov Models,” Biochimica et Biophysica Acta (BBA)—Proteins and Proteomics, vol. 1824, no. 3, pp. 488–492, 2012.
[16] K. H. Gowd, K. K. Dewan, P. Iengar, K. S. Krishnan, and P. Balaram, “Probing peptide libraries from Conus achatinus using mass spectrometry and cDNA sequencing: identification of δ and ω-conotoxins,” Journal of Mass Spectrometry, vol. 43, no. 6, pp. 791–805, 2008.
[17] S. Saha and G. P. S. Raghava, “Prediction of neurotoxins based on their function and source,” In Silico Biology, vol. 7, no. 4-5, pp. 369–387, 2007.
[18] R. Soli, B. Kaabi, M. Barhoumi, M. El-Ayeb, and N. Srairi-Abid, “Bioinformatic characterizations and prediction of K⁺ and Na⁺ ion channels effector toxins,” BMC Pharmacology, vol. 9, article 4, 2009.
[19] L.-F. Yuan, C. Ding, S.-H. Guo, H. Ding, W. Chen, and H. Lin, “Prediction of the types of ion channel-targeted conotoxins based on radial basis function network,” Toxicology in Vitro, vol. 27, no. 2, pp. 852–856, 2013.
[20] H. Ding, E.-Z. Deng, L.-F. Yuan et al., “iCTX-Type: a sequence-based predictor for identifying the types of conotoxins in targeting ion channels,” BioMed Research International, vol. 2014, Article ID 286419, 10 pages, 2014.
[21] M. Magrane and UniProt Consortium, “UniProt Knowledgebase: a hub of integrated protein data,” Database, vol. 2011, Article ID bar009, 2011.
[22] K.-C. Chou, “Prediction of protein cellular attributes using pseudo-amino acid composition,” Proteins: Structure, Function and Genetics, vol. 43, no. 3, pp. 246–255, 2001.
[23] W. Chen, T.-Y. Lei, D.-C. Jin, H. Lin, and K.-C. Chou, “PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition,” Analytical Biochemistry, vol. 456, no. 1, pp. 53–60, 2014.
[24] W. Chen, X. Zhang, J. Brooker, H. Lin, L. Zhang, and K.-C. Chou, “PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions,” Bioinformatics, vol. 31, no. 1, pp. 119–120, 2015.
[25] B. Liu, F. Liu, X. Wang, J. Chen, L. Fang, and K.-C. Chou, “Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences,” Nucleic Acids Research, vol. 43, no. W1, pp. W65–W71, 2015.
[26] B. Liu, F. Liu, L. Fang, X. Wang, and K.-C. Chou, “repRNA: a web server for generating various feature vectors of RNA sequences,” Molecular Genetics and Genomics, vol. 291, no. 1, pp. 473–481, 2016.
[27] B. Liu, F. Liu, L. Fang, X. Wang, and K.-C. Chou, “repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects,” Bioinformatics, vol. 31, no. 8, pp. 1307–1309, 2015.
[28] H. Tang, W. Chen, and H. Lin, “Identification of immunoglobulins using Chou's pseudo amino acid composition with feature selection technique,” Molecular BioSystems, vol. 12, no. 4, pp. 1269–1275, 2016.
[29] P.-P. Zhu, W.-C. Li, Z.-J. Zhong et al., “Predicting the subcellular localization of mycobacterial proteins by incorporating the optimal tripeptides into the general form of pseudo amino acid composition,” Molecular BioSystems, vol. 11, no. 2, pp. 558–563, 2015.
[30] R. Wang, Y. Xu, and B. Liu, “Recombination spot identification based on gapped k-mers,” Scientific Reports, vol. 6, Article ID 23934, 2016.
[31] D. Li, Y. Ju, and Q. Zou, “Protein folds prediction with hierarchical structured SVM,” Current Proteomics, vol. 13, no. 2, pp. 79–85, 2016.
[32] R. Cao, Z. Wang, and J. Cheng, “Designing and evaluating the MULTICOM protein local and global model quality prediction methods in the CASP10 experiment,” BMC Structural Biology, vol. 14, no. 1, article 13, 2014.
[33] R. Cao, Z. Wang, Y. Wang, and J. Cheng, “SMOQ: a tool for predicting the absolute residue-specific quality of a single protein model with support vector machines,” BMC Bioinformatics, vol. 15, no. 1, article 120, 2014.
[34] J. Chen, X. Wang, and B. Liu, “iMiRNA-SSF: improving the identification of microRNA precursors by combining negative sets with different distributions,” Scientific Reports, vol. 6, article 19062, 2016.

Copyright

Copyright © 2016 Yun Wu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
