Classifying Microarray Gene Expression Cancer Data Using Statistical Feature Selection and Machine Learning Methods

  • Conference paper
Congress on Intelligent Systems

Abstract

Objective: Breast cancer microarray data is a repository of thousands of gene expression values of differing strength for each cancer cell. It is necessary to detect the genes that are responsible for cancer growth. The proposed work aims to identify a statistical test for extracting differentially expressed genes from microarray gene expression data, and a suitable classifier for distinguishing diseased from control cases. Method: Cancerous genes are identified from the p-values of six statistical tests, namely the Welch test, analysis of variance (ANOVA) test, Wilcoxon signed rank sum test, Kruskal–Wallis test, linear model for microarray (LIMMA), and F-test. The identified cancer genes are then used to classify cancer patients with seven classifiers, namely linear discriminant analysis (LDA), K-nearest neighbor (KNN), Naïve Bayesian, linear support vector machine, support vector machine with radial basis function, C5.0, and C5.0 with boosting. Performance is evaluated using accuracy, sensitivity, and specificity. Result: A microarray breast cancer dataset of 32 cancer patients and 28 non-cancer patients is used in the experiment. The microarray contains 25,575 genes for each patient. The maximum classification accuracy of 100% is obtained when the LIMMA test is used to extract differentially expressed cancer genes and KNN is used for classification.
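The two-stage pipeline described in the abstract (per-gene statistical test, then a classifier scored by accuracy, sensitivity, and specificity) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: it uses only the Welch test and KNN, ranks genes by the absolute Welch t statistic as a stand-in for ordering by p-value, and runs on synthetic data. The synthetic dataset (100 genes, 10 of them shifted in the cancer group), the helper names, the choice of 10 selected genes and 3 neighbors, and the use of leave-one-out evaluation are all assumptions; only the group sizes of 32 and 28 come from the abstract.

```python
import math
import random


def mean_var(values):
    """Return the sample mean and unbiased sample variance."""
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / (len(values) - 1)


def welch_t(a, b):
    """Welch's t statistic (unequal variances) for two samples."""
    mean_a, var_a = mean_var(a)
    mean_b, var_b = mean_var(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (mean_a - mean_b) / se if se > 0 else 0.0


def select_genes(X, y, k):
    """Rank genes by |Welch t| between the two classes; return top-k column indices."""
    scores = []
    for g in range(len(X[0])):
        cancer = [row[g] for row, label in zip(X, y) if label == 1]
        control = [row[g] for row, label in zip(X, y) if label == 0]
        scores.append((abs(welch_t(cancer, control)), g))
    return [g for _, g in sorted(scores, reverse=True)[:k]]


def knn_predict(X_train, y_train, x, genes, n_neighbors=3):
    """Majority vote among the nearest training samples (Euclidean distance)."""
    query = [x[g] for g in genes]
    dists = sorted(
        (math.dist([row[g] for g in genes], query), label)
        for row, label in zip(X_train, y_train)
    )
    votes = sum(label for _, label in dists[:n_neighbors])
    return 1 if 2 * votes > n_neighbors else 0


def evaluate(X, y, k=10, n_neighbors=3):
    """Leave-one-out evaluation; genes are re-selected on each training fold."""
    tp = tn = fp = fn = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = select_genes(X_tr, y_tr, k)
        pred = knn_predict(X_tr, y_tr, X[i], genes, n_neighbors)
        if y[i] == 1:
            tp, fn = tp + (pred == 1), fn + (pred == 0)
        else:
            tn, fp = tn + (pred == 0), fp + (pred == 1)
    return {
        "accuracy": (tp + tn) / len(X),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Synthetic stand-in data (assumption): 32 cancer (1) and 28 control (0) samples.
random.seed(0)
n_genes, n_informative = 100, 10
y = [1] * 32 + [0] * 28
X = [
    [random.gauss(3.0 if (label == 1 and g < n_informative) else 0.0, 1.0)
     for g in range(n_genes)]
    for label in y
]

result = evaluate(X, y)
print(result)
```

Selecting genes inside each leave-one-out fold keeps the held-out sample from influencing which genes are chosen, which would otherwise inflate the reported accuracy. Whether the paper follows this protocol is not stated in the abstract.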

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Alagukumar, S., Kathirvalavakumar, T. (2022). Classifying Microarray Gene Expression Cancer Data Using Statistical Feature Selection and Machine Learning Methods. In: Saraswat, M., Sharma, H., Balachandran, K., Kim, J.H., Bansal, J.C. (eds) Congress on Intelligent Systems. Lecture Notes on Data Engineering and Communications Technologies, vol 114. Springer, Singapore. https://doi.org/10.1007/978-981-16-9416-5_5
