Pattern recognition approach to classifying CYP 2C19 isoform

Bartosz Krawczyk

doi:10.2478/s11536-011-0120-3

Open Access. Published by De Gruyter Open Access, November 24, 2011

From the journal Open Medicine

Abstract

This paper presents a pattern recognition approach to quantitative structure-property relationship (QSPR) classification for the CYP2C19 isoform. QSPR is the correlative computer modelling of the properties of chemical molecules and is widely used in cheminformatics and the pharmaceutical industry. Predicting whether a particular chemical will be metabolized by CYP2C19 is of primary importance to the pharmaceutical industry, but the task poses certain challenges. First, the analyzed data are characterized by significant biological noise. Second, the training set is unbalanced, with objects from the negative class outnumbering the positives four to one. The presented solution addresses both problems and additionally incorporates thorough feature selection to improve the stability of the results. Strong emphasis is placed on outlier detection and proper model validation to achieve the best predictive power.

Keywords: Pattern recognition; Machine learning; Medical informatics; Chemoinformatics; Unbalanced training set; Feature selection
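
The class imbalance described in the abstract can be illustrated with a minimal sketch. It is not the method used in the paper, which this page does not specify; it only shows one common baseline, random oversampling of the minority class, on toy data with a 4:1 ratio. The function name `oversample_minority` and the toy arrays `X` and `y` are hypothetical and introduced here for illustration only.

```python
import random
from collections import Counter


def oversample_minority(features, labels, seed=0):
    """Randomly duplicate rows of smaller classes until every class
    has as many rows as the largest one (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority_size = max(counts.values())
    out_features = list(features)
    out_labels = list(labels)
    for cls, count in counts.items():
        # Rows belonging to this class in the original data.
        rows = [x for x, y in zip(features, labels) if y == cls]
        # Draw duplicates with replacement to close the gap.
        for _ in range(majority_size - count):
            out_features.append(rng.choice(rows))
            out_labels.append(cls)
    return out_features, out_labels


# Toy data (an assumption): 8 negatives and 2 positives, i.e. a 4:1 ratio.
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_bal, y_bal = oversample_minority(X, y)
print(Counter(y_bal))
```

When a step like this is combined with cross-validation, it should be applied to the training portion of each split only. Otherwise duplicated minority objects can appear in both the training and the test portion and inflate the estimated predictive power.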

Published Online: 2011-11-24

Published in Print: 2012-02-01

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.
