Abstract
Hierarchical feature selection is a new research area in machine learning and data mining that performs feature selection by exploiting dependency relationships among hierarchically structured features. This paper evaluates four hierarchical feature selection methods, i.e., HIP, MR, SHSEL and GTD, used together with four types of lazy learning-based classifiers, i.e., Naïve Bayes, Tree Augmented Naïve Bayes, Bayesian Network Augmented Naïve Bayes and k-Nearest Neighbors classifiers. These four hierarchical feature selection methods are compared with each other and with a well-known “flat” feature selection method, i.e., Correlation-based Feature Selection. The adopted bioinformatics datasets consist of aging-related genes used as instances and Gene Ontology terms used as hierarchical features. The experimental results reveal that the HIP (Select Hierarchical Information Preserving Features) method performs best overall, both in predictive accuracy and in robustness to data whose class distribution is substantially imbalanced. This paper also reports a list of the Gene Ontology terms that were most often selected by the HIP method.
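As a rough illustration of what "exploiting dependency relationships among hierarchically structured features" can mean, the sketch below removes, for a single instance, every binary feature whose value is already implied by another feature in an is-a hierarchy. It assumes that a feature with value 1 implies value 1 for all its ancestors, and that a feature with value 0 implies value 0 for all its descendants. The toy hierarchy, the term names and the function names are hypothetical, and the code is not claimed to reproduce HIP, MR, SHSEL or GTD as evaluated in the paper.

```python
from typing import Dict, Set

# Hypothetical toy hierarchy: each term maps to its direct parents (is-a links).
PARENTS: Dict[str, Set[str]] = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every term reachable by following parent links upward."""
    seen: Set[str] = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def descendants(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every term that has `term` among its ancestors."""
    return {t for t in parents if term in ancestors(t, parents)}


def select_non_redundant(instance: Dict[str, int],
                         parents: Dict[str, Set[str]]) -> Set[str]:
    """Keep only features whose value is not implied by another feature.

    Assumption: value 1 propagates to ancestors, value 0 to descendants.
    """
    redundant: Set[str] = set()
    for term, value in instance.items():
        if value == 1:
            redundant |= ancestors(term, parents)    # implied to be 1
        else:
            redundant |= descendants(term, parents)  # implied to be 0
    return set(instance) - redundant


if __name__ == "__main__":
    example = {"A": 1, "B": 1, "C": 0, "D": 0}
    print(sorted(select_non_redundant(example, PARENTS)))
```

Because the selection depends on the feature values of the instance being classified, a procedure of this kind has to run at classification time, which is one reason it pairs naturally with lazy learning-based classifiers.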
Acknowledgements
We thank Dr. João Pedro de Magalhães for his valuable general advice on the biology of aging for this project. We also thank Pablo Silva for providing an implementation of the SHSEL method. We also acknowledge the support of concurrency researchers at the University of Kent for access to the ‘CoSMoS’ cluster, funded by EPSRC Grants EP/E049419/1 and EP/E0535/1.
Cite this article
Wan, C., Freitas, A.A. An empirical evaluation of hierarchical feature selection methods for classification in bioinformatics datasets with gene ontology-based features. Artif Intell Rev 50, 201–240 (2018). https://doi.org/10.1007/s10462-017-9541-y
DOI: https://doi.org/10.1007/s10462-017-9541-y