Improved feature-based prediction of SNPs in human cytochrome P450 enzymes

Li, Li; Xiong, Yi; Zhang, Zhuo-Yu; Guo, Quan; Xu, Qin; Liow, Hien-Haw; Zhang, Yong-Hong; Wei, Dong-Qing

doi:10.1007/s12539-014-0257-2

Published: 21 March 2015

Volume 7, pages 65–77, (2015)

Interdisciplinary Sciences: Computational Life Sciences

Li Li¹, Yi Xiong¹, Zhuo-Yu Zhang¹, Quan Guo¹, Qin Xu¹, Hien-Haw Liow²,³, Yong-Hong Zhang¹,⁴ & Dong-Qing Wei¹

Abstract

Single nucleotide polymorphisms (SNPs) are the most common form of mutation in the human cytochrome P450 enzyme family and can lead to different drug responses or specific diseases in individual patients. Here, using machine learning, we explore an effective set of sequence-based features to improve SNP prediction with support vector machine algorithms. The features are derived from the target residues and their flanking protein sequences, and include amino acid type, sequence composition, physicochemical properties, position-specific scoring matrix, phylogenetic entropy and the number of possible codons of the target residue. To deal with the imbalanced data, in which non-SNPs form the majority and SNPs the minority, a preprocessing strategy based on fuzzy set theory was applied to the datasets. Our final model achieves 93.8% sensitivity, 88.8% specificity, 91.3% accuracy and an AUC of 0.971, which is significantly higher than previous DNA sequence-based or protein sequence-based methods. Our study also suggests the roles of individual features in SNP prediction: the most important features are the amino acid type, the number of available codons, the position-specific scoring matrix and phylogenetic entropy. The improved model is a promising tool for SNP prediction and can assist research on genome mutation and personalized prescription.
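As an illustration only, the following Python sketch shows how three of the feature types named in the abstract (amino acid type over a residue window, number of possible codons, and entropy of an alignment column) could be computed, together with the sensitivity, specificity and accuracy measures reported above. It is not the authors' pipeline: the window size, the padding character `X`, the feature order and all function names are assumptions made for this example, and the physicochemical, PSSM and fuzzy-set preprocessing steps are not reproduced.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Number of codons encoding each amino acid in the standard genetic code.
CODON_COUNT = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def window(seq, pos, flank=2, pad="X"):
    """Target residue at 0-based `pos` plus `flank` residues on each side.

    Positions beyond the sequence ends are filled with `pad` (assumed here).
    """
    padded = pad * flank + seq + pad * flank
    return padded[pos:pos + 2 * flank + 1]


def one_hot(residue):
    """20-dimensional indicator vector; the pad character maps to all zeros."""
    return [1 if residue == aa else 0 for aa in AMINO_ACIDS]


def column_entropy(column):
    """Shannon entropy in bits of one alignment column, ignoring gaps ('-')."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def residue_features(seq, pos, column, flank=2):
    """Concatenate window one-hot codes, codon count and column entropy."""
    feats = []
    for residue in window(seq, pos, flank):
        feats.extend(one_hot(residue))
    feats.append(CODON_COUNT[seq[pos]])
    feats.append(column_entropy(column))
    return feats


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }
```

A feature vector built this way could then be passed to any support vector machine implementation. The choice of kernel and parameters is not stated in the abstract and is left open here.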

Author information

Authors and Affiliations

State Key Laboratory of Microbial Metabolism and College of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
Li Li, Yi Xiong, Zhuo-Yu Zhang, Quan Guo, Qin Xu, Yong-Hong Zhang & Dong-Qing Wei
Center for Genome Sciences and Systems Biology, Washington University in St. Louis, St. Louis, MO, 63130, USA
Hien-Haw Liow
Department of Mathematics, Washington University in St. Louis, St. Louis, MO, 63130, USA
Hien-Haw Liow
Medicine Engineering Research Center and School of Pharmacy, Chongqing Medical University, Chongqing, 400016, China
Yong-Hong Zhang

Corresponding authors

Correspondence to Yi Xiong or Yong-Hong Zhang.

Additional information

Zhuo-Yu Zhang is a summer student from No. 3 High School of Wuhan, and Quan Guo is a summer student from Xiamen University.

About this article

Cite this article

Li, L., Xiong, Y., Zhang, ZY. et al. Improved feature-based prediction of SNPs in human cytochrome P450 enzymes. Interdiscip Sci Comput Life Sci 7, 65–77 (2015). https://doi.org/10.1007/s12539-014-0257-2

Received: 25 December 2014
Revised: 02 March 2015
Accepted: 06 March 2015
Published: 21 March 2015
Issue Date: March 2015
DOI: https://doi.org/10.1007/s12539-014-0257-2
