Abstract
Identifying stable and precise biomarkers is a key challenge in precision medicine. A promising approach in this direction is exploring omics data, such as transcriptome data generated by microarrays, to discover candidate biomarkers. This, however, involves the fundamental issue of finding the most discriminative features in high-dimensional datasets. We propose a homogeneous ensemble feature selection (EFS) method to extract candidate biomarkers of breast cancer from microarray datasets. Ensemble diversity is introduced by bootstrap sampling and by the integration of seven microarray studies. As a baseline method, we used random-effects model meta-analysis, a state-of-the-art approach in the integrative analysis of microarrays for biomarker discovery. We compared five feature selection (FS) methods as base selectors and four algorithms as base classifiers. Our results showed that the variance FS method is the most stable among the tested methods regardless of the classifier, and that stability is higher within datasets than across datasets, indicating high sample heterogeneity among studies. The predictive performance of the top 20 genes selected with both approaches was evaluated on six independent microarray studies, and in four of these we observed superior performance of our EFS approach compared to meta-analysis. EFS recall was as high as 85%, and the median F1-scores surpassed 80% for most of our experiments. We conclude that homogeneous EFS is a promising methodology for candidate biomarker identification, demonstrating stability and predictive performance as satisfactory as those of the statistical reference method.
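To make the idea of a homogeneous ensemble concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each study is a samples-by-genes NumPy array with genes aligned across studies, that variance is the base selector, and that bootstrap rankings are aggregated by mean rank; the aggregation rule, the function names, and the parameter defaults are illustrative assumptions that the abstract does not specify.

```python
import numpy as np

def variance_ranking(X):
    """Rank features by variance; rank 0 is the highest-variance feature."""
    order = np.argsort(-X.var(axis=0), kind="stable")
    ranks = np.empty_like(order)
    ranks[order] = np.arange(order.size)
    return ranks

def ensemble_feature_selection(datasets, n_bootstraps=50, top_k=20, seed=0):
    """Homogeneous EFS sketch: one base selector, many bootstrap samples.

    datasets: list of (samples x features) arrays sharing the same feature order.
    Aggregation by mean rank is an assumption made for this illustration.
    """
    rng = np.random.default_rng(seed)
    rankings = []
    for X in datasets:
        n_samples = X.shape[0]
        for _ in range(n_bootstraps):
            # Draw a bootstrap sample (with replacement) from this study.
            idx = rng.integers(0, n_samples, size=n_samples)
            rankings.append(variance_ranking(X[idx]))
    # Combine all rankings (within and across studies) into one consensus.
    mean_rank = np.mean(rankings, axis=0)
    return np.argsort(mean_rank, kind="stable")[:top_k]
```

Calling `ensemble_feature_selection(list_of_study_matrices, top_k=20)` would return the indices of 20 consensus features. Any other base selector could be swapped in for `variance_ranking`, provided it returns one rank per feature.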
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, and Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq).
Acknowledgments
The authors thank Rodrigo Haas Bueno for his help with microarray data preprocessing.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Trevizan, B., Recamonde-Mendoza, M. (2021). Ensemble Feature Selection Compares to Meta-analysis for Breast Cancer Biomarker Identification from Microarray Data. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2021. ICCSA 2021. Lecture Notes in Computer Science, vol 12949. Springer, Cham. https://doi.org/10.1007/978-3-030-86653-2_12
DOI: https://doi.org/10.1007/978-3-030-86653-2_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86652-5
Online ISBN: 978-3-030-86653-2
eBook Packages: Computer Science, Computer Science (R0)