Improved feature-based prediction of SNPs in human cytochrome P450 enzymes

Interdisciplinary Sciences: Computational Life Sciences

Abstract

Single nucleotide polymorphisms (SNPs) are the most common form of mutation in the human cytochrome P450 enzyme family, and they can lead to different drug responses or specific diseases in individual patients. Here, using machine learning, we explore an effective set of sequence-based features to improve the prediction of SNPs with support vector machine algorithms. The features are derived from the target residues and their flanking protein sequences, and include amino acid type, sequence composition, physicochemical properties, position-specific scoring matrix, phylogenetic entropy and the number of possible codons of the target residue. To deal with the imbalanced data, in which non-SNPs form the majority and SNPs the minority, a preprocessing strategy based on fuzzy set theory was applied to the datasets. Our final model achieves 93.8% sensitivity, 88.8% specificity, 91.3% accuracy and an AUC of 0.971, which is significantly higher than previous DNA sequence-based or protein sequence-based methods. Furthermore, our study indicates the roles of individual features in the prediction of SNPs. The most important features are the amino acid type, the number of available codons, the position-specific scoring matrix and phylogenetic entropy. The improved model is a promising tool for SNP prediction and may assist research on genome mutation and personalized prescriptions.
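To make some of the quantities named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It illustrates three of them: the number of codons encoding a target residue (taken from the standard genetic code), a flanking sequence window around a target position, and the sensitivity, specificity and accuracy measures used to report performance. The function names, the padding character `X` and the window width are hypothetical choices. The entropy function assumes that phylogenetic entropy can be approximated by the Shannon entropy of a multiple-alignment column; the abstract does not give the exact definition used in the paper.

```python
import math
from collections import Counter

# Number of codons encoding each amino acid in the standard genetic code
# (61 sense codons in total).
CODON_COUNT = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def flanking_window(sequence, index, half_width, pad="X"):
    """Return the target residue with half_width neighbours on each side.

    Positions beyond the sequence ends are filled with the pad character
    (an assumption made here for illustration).
    """
    padded = pad * half_width + sequence + pad * half_width
    return padded[index:index + 2 * half_width + 1]


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column, ignoring gaps."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


if __name__ == "__main__":
    seq = "MKTAYIAK"  # toy sequence, not a real P450 fragment
    print(flanking_window(seq, 3, 2))   # window centred on residue 'A'
    print(CODON_COUNT[seq[3]])          # codons available for 'A'
    print(column_entropy("AACC"))       # two equally frequent residues
    print(classification_metrics(tp=9, fn=1, tn=8, fp=2))
```

A fully conserved alignment column gives an entropy of zero, while a column split evenly between two residues gives one bit, so lower values indicate stronger conservation at that position.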

Author information

Corresponding authors

Correspondence to Yi Xiong or Yong-Hong Zhang.

Additional information

Zhuo-Yu Zhang is a summer student from No. 3 High School of Wuhan, and Quan Guo is a summer student from Xiamen University.

About this article

Cite this article

Li, L., Xiong, Y., Zhang, ZY. et al. Improved feature-based prediction of SNPs in human cytochrome P450 enzymes. Interdiscip Sci Comput Life Sci 7, 65–77 (2015). https://doi.org/10.1007/s12539-014-0257-2

  • DOI: https://doi.org/10.1007/s12539-014-0257-2
