ABSTRACT
The classification of cancer patients into risk classes is a very active field of research with direct clinical applications. We recently compared several machine learning methods on the well-known 70-gene signature dataset. In that study, genetic programming showed promising results, outperforming all the other techniques. Nevertheless, the study was preliminary, mainly because the validation dataset was preprocessed and all its features were binarized so that logical operators could be used as the functional nodes of genetic programming. While this choice allowed a simple interpretation of the solutions from the biological viewpoint, binarizing the data was limiting, since it amounts to a sizable loss of information. The goal of this paper is to overcome this limitation by using the 70-gene signature dataset with real-valued expression data. The results we present show that genetic programming, when it uses the number of incorrectly classified instances as its fitness function, is not able to outperform the other machine learning methods. However, when a weighted average of false positives and false negatives is used to calculate fitness values, genetic programming achieves performance comparable with the other methods in minimizing incorrectly classified instances, and it outperforms all the other methods in minimizing false negatives, which is one of the main goals in breast cancer clinical applications. In this case too, the solutions returned by genetic programming are simple, easy to understand, and use a rather limited subset of the available features.
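The two fitness functions contrasted in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the label encoding (1 = poor prognosis, 0 = good prognosis) and the weight `fn_weight = 0.75` are all assumptions chosen for illustration, since the abstract does not state the weights actually used. Lower fitness is taken to be better.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int]:
    """Return (false_positives, false_negatives).

    Assumed encoding: 1 = poor prognosis (positive), 0 = good prognosis.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp, fn


def error_count_fitness(y_true: Sequence[int], y_pred: Sequence[int]) -> int:
    """Fitness as the number of incorrectly classified instances."""
    fp, fn = confusion_counts(y_true, y_pred)
    return fp + fn


def weighted_fitness(y_true: Sequence[int], y_pred: Sequence[int],
                     fn_weight: float = 0.75) -> float:
    """Fitness as a weighted average of false negatives and false positives.

    fn_weight is a hypothetical value; a weight above 0.5 penalizes
    false negatives more heavily than false positives.
    """
    if not 0.0 <= fn_weight <= 1.0:
        raise ValueError("fn_weight must lie in [0, 1]")
    fp, fn = confusion_counts(y_true, y_pred)
    return fn_weight * fn + (1.0 - fn_weight) * fp


# Two hypothetical classifiers that each make exactly one mistake.
y_true = [1, 1, 0, 0, 1, 0]
pred_one_fn = [0, 1, 0, 0, 1, 0]  # misses one poor-prognosis patient
pred_one_fp = [1, 1, 1, 0, 1, 0]  # flags one good-prognosis patient

print(error_count_fitness(y_true, pred_one_fn), error_count_fitness(y_true, pred_one_fp))
print(weighted_fitness(y_true, pred_one_fn), weighted_fitness(y_true, pred_one_fp))
```

Under the plain error count the two hypothetical classifiers are indistinguishable, whereas the weighted version ranks the one with the false negative as worse. This is the kind of selection pressure toward fewer false negatives that the abstract describes.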