Colon cancer diagnosis and staging classification based on machine learning and bioinformatics analysis

https://doi.org/10.1016/j.compbiomed.2022.105409

Highlights

  • Machine learning analysis is combined with bioinformatics analysis to screen for markers.

  • Random forests perform best in both colon cancer diagnosis and colon cancer staging.

  • The accuracy of colon cancer diagnosis is greater than 98%.

  • The Lasso machine learning method identifies two genes associated with colon cancer prognosis.

  • Bioinformatic analysis identifies six genes associated with colon cancer prognosis, distributed across five protein groups.

Abstract

Advanced metastasis makes colon cancer more difficult to treat. Identifying markers of colon cancer makes it possible to diagnose the stage of the cancer in time and to improve prognosis through timely treatment. This paper uses gene expression profiling data from The Cancer Genome Atlas (TCGA) for the diagnosis of colon cancer and its staging. We first selected the gene modules most strongly correlated with cancer by Weighted Gene Co-expression Network Analysis (WGCNA), extracted characteristic genes from the differential expression results using the least absolute shrinkage and selection operator (Lasso) algorithm, and performed survival analysis. The genes in the modules were then combined with the Lasso-extracted feature genes to diagnose colon cancer versus healthy controls using random forest (RF), support vector machine (SVM) and decision tree models, and colon cancer staging was diagnosed using the differentially expressed genes for each stage. Protein-Protein Interaction (PPI) networks were then constructed for 289 genes to identify clusters of aggregated proteins for survival analysis. The RF model gave the best results. In the cross-validated diagnosis of colon cancer versus the control group, it reached an average accuracy of 99.81%, an F1 value of 0.9968, a precision of 99.88% and a recall of 99.5%. In the diagnosis of colon cancer stages I, II, III and IV, it reached an average accuracy of 91.5%, an F1 value of 0.7679, a precision of 86.94% and a recall of 73.04%. Eight genes associated with colon cancer prognosis were identified: GCNT2, GLDN, SULT1B1, UGT2B15, PTGDR2, GPR15, BMP5 and CPT2.

Introduction

Colorectal cancer (CRC) is the third most common cause of cancer death for both men and women in the United States, and the second most common when men and women are combined [1]. According to the American Society of Clinical Oncology, an estimated 149,500 adults in the United States will be diagnosed with colorectal cancer in 2021. This includes 104,270 new cases of colon cancer and 45,230 new cases of rectal cancer, so colon cancer accounts for about 70% of colorectal cancers [2]. An important test for CRC screening is the fecal occult blood test (FOBT), which includes the guaiac FOBT and the fecal immunochemical test [3]. Colonoscopy is considered the gold standard for CRC screening, with the advantages of high sensitivity, high specificity and direct visualization, and it plays an important role in managing cancer and precancerous lesions (biopsy and removal of polyps) [4]. Imaging methods are also used, specifically endorectal ultrasonography (USG), abdominal ultrasonography (USG), computed tomography (CT) and nuclear magnetic resonance (NMR). However, these methods are only effective in the case of severe focal lesions [5].

In recent years, tumor markers have been widely used in the field of cancer diagnosis and treatment. The ideal tumor marker should be highly specific for tumor screening, diagnosis, efficacy and prognosis assessment, and recurrence detection, and should be able to detect microscopic lesions and quantitatively reflect tumor load [6]. Cancer staging describes how far the cancer has spread in the body. It helps determine the severity of the cancer and how best to treat it, and is also used by doctors in survival statistics [7]. According to the American Joint Committee on Cancer Tumor-Lymph Node-Metastasis (TNM) system, most cancers have five distinct stages: Stage 0, I, II, III and IV [8]. The stage indicates the location and size of the cancer, how much it has grown into nearby tissues, and whether it has spread to nearby lymph nodes or other parts of the body, and it influences the presence of markers of cancer spread [9]. Among patients with the worst grade of colon cancer who are aged between 18 and 65 years, the 5-year survival rate is 91% if the disease is diagnosed at stage I, when it is survivable with appropriate treatment, while the 5-year survival rates at stages II, III and IV are about 82%, 66% and 10%, respectively [10]. Colon cancer is one of the most common and deadly of the many cancers, and detecting the disease in its early stages greatly increases a patient's chances of survival [11].

Machine learning can detect hard-to-identify patterns in large, noisy or complex datasets. This capability is particularly well suited to data analysis in the medical field, especially applications involving complex proteomic and genomic expression data, and it has often been used in recent years for cancer diagnosis and detection [12]. Machine learning algorithms that have been widely used in the medical field include SVM [13], [14], [15], random forests [16] and decision trees [17,18]. For example, Xie et al. used spectral data to build SVM models for rapid and noninvasive screening of keratitis with an average accuracy of 100% [19], and Chen Fangfang et al. used SVM and DT models for rapid detection of glioma with model accuracy around 90% [20], which further indicates that machine learning is well suited to medical disease diagnosis. The support vector machine (SVM) is a supervised learning model for classification and regression problems and can solve both linear and nonlinear problems. A random forest classifier is an ensemble of decision trees, each built from a randomly selected subset of the training set, that aggregates the votes of the individual trees to determine the final class of the test object. Decision trees (DT) are popular machine learning models for classification and regression tasks.
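The comparison of the three classifier families described above can be sketched as follows. This is a minimal illustration on synthetic data, not the pipeline used in this study: the scikit-learn estimators, the synthetic matrix from `make_classification`, the hyperparameters and the choice of 5 folds are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an expression matrix: 200 samples x 50 "genes".
# (Assumption: real TCGA data would be loaded and normalized here instead.)
X, y = make_classification(
    n_samples=200, n_features=50, n_informative=10, random_state=0
)

models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    # SVMs are sensitive to feature scale, so features are standardized first.
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "DT": DecisionTreeClassifier(random_state=0),
}

# Mean accuracy over 5 stratified folds (5 is an illustrative choice).
mean_accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, acc in mean_accuracy.items():
    print(f"{name}: {acc:.3f}")
```

Wrapping the SVM in a pipeline keeps the scaling step inside each cross-validation fold, so the held-out fold does not leak into the fitted scaler.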

Gene network-based screening of predictive cancer biomarkers has yielded good results in the biomedical field. For example, the subtype-specific network biomarker approach constructed by Sheikh Jubair et al. shows high predictive performance in identifying the survivability of breast cancer patients [21]. Shiyan Li et al. developed a model to assess the prognosis of cervical cancer patients using weighted gene co-expression network analysis (WGCNA) combined with the LASSO approach and demonstrated that the model is valid and stable [22]. In this study, we first performed differential expression analysis between colon cancer and healthy controls, used WGCNA on the healthy and cancer samples to obtain gene modules associated with cancer, and combined these with the features extracted by the LASSO algorithm to diagnose colon cancer. WGCNA has been applied to the analysis of various cancers [23,24], such as bladder cancer [25], breast cancer [26] and lung cancer [27], and can help identify the underlying mechanisms involved in specific biological processes as well as explore candidate biomarkers. The LASSO feature selection technique has been used in many applications in the biological field. Lasso is a well-known feature selection method that uses an L1-type penalty, which adds a constraint on the sum of the absolute values of the feature coefficients to ensure both global optimality and computational efficiency [28]. A study of feature selection methods for microbial and microbiome data presented at the 2021 IEEE International Conference on Bioinformatics and Biomedicine found that LASSO consistently outperformed other methods in several key classification metrics, most notably AUC, and that the LASSO framework can generate more meaningful feature selections than similar feature selection methods [29].
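The effect of the L1 penalty described above can be illustrated with a small sketch. It assumes scikit-learn's `Lasso` estimator, a synthetic matrix in which only three of thirty features influence the response, and an arbitrary penalty strength `alpha=0.1`; the continuous response is a simplification for illustration and is not the setup used in this study.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_genes = 100, 30

# Synthetic "expression" matrix (assumption: standard normal features).
X = rng.normal(size=(n_samples, n_genes))

# Only the first three features carry signal; the rest are noise.
true_coef = np.zeros(n_genes)
true_coef[[0, 1, 2]] = [3.0, -2.0, 1.5]
y = X @ true_coef + 0.1 * rng.normal(size=n_samples)

# scikit-learn's Lasso minimizes
#   (1 / (2 * n_samples)) * ||y - Xw||_2^2 + alpha * ||w||_1
# The L1 term drives the coefficients of uninformative features to exactly 0.
model = Lasso(alpha=0.1).fit(X, y)

# Features with non-zero coefficients are the ones "selected" by the Lasso.
selected = np.flatnonzero(model.coef_)
print("selected feature indices:", selected)
```

Larger values of `alpha` select fewer features; in practice the penalty strength is usually chosen by cross-validation rather than fixed by hand.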
In addition, the Lasso algorithm has been applied successfully in the field of cancer: Neha Shree Maurya et al. used LASSO and other methods to extract signature genes and discovered TMEM236, a novel biomarker for the diagnosis of colorectal cancer [30]. Secondly, differential expression analysis was performed for the staging groups. The four stages of colon cancer were divided into two groups in three ways: the first three stages versus stage IV, that is, early stage cancer versus advanced or metastatic cancer [31]; stage I versus the last three stages; and stages I, II versus stages III, IV, according to whether the cancer had spread to nearby lymph nodes or elsewhere. Finally, colon cancer and colon cancer staging were classified using SVM, random forest and decision tree models, and the final feature genes used for classification were screened for prognostic genes after Protein-Protein Interaction (PPI) network analysis. The models built on our extracted features achieved good results for the diagnosis of colon cancer and of its early and late stages, and the screened prognostic genes provide further ideas for the treatment of colon cancer.
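The three two-group splits of stages I to IV used for the staging analysis (I, II versus III, IV; I to III versus IV; I versus II to IV) can be written down explicitly. The sketch below is only an illustration of how such binary labels might be derived; the names `SPLITS` and `binarize` and the toy stage list are hypothetical and do not come from the study.

```python
# Each split is defined by the set of stages placed in the "earlier" group.
SPLITS = {
    "I-II vs III-IV": {"I", "II"},
    "I-III vs IV": {"I", "II", "III"},
    "I vs II-IV": {"I"},
}


def binarize(stages, early_group):
    """Return 0 for stages in early_group and 1 for all other stages."""
    return [0 if stage in early_group else 1 for stage in stages]


# Toy stage annotations for six hypothetical samples.
stages = ["I", "II", "III", "IV", "II", "IV"]

labels = {name: binarize(stages, early) for name, early in SPLITS.items()}

for name, y in labels.items():
    print(name, y)
```

Each label vector can then serve as the target for a separate differential expression analysis or binary classifier.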

Experimental method

The methods used in this experiment are illustrated in the workflow diagram (Fig. 1).

Materials and data

Colon cancer (COAD) gene expression data were obtained from TCGA (https://portal.gdc.cancer.gov/repository). Transcriptomic gene expression data from the TCGA-COAD project on the TCGA official website were selected and 521 samples were downloaded, including 480 tumor tissue samples and 41 corresponding control tissue samples. In this trial, clinical data of 459 colon cancer patients were downloaded

Identification of differentially expressed genes

The differential expression analysis of colon carcinoma tissues versus control tissues yielded 3500 differentially expressed genes, shown as a heat map (Fig. 2A) and a volcano plot (Fig. 2B). Differential expression analysis between colon cancer I, II and III, IV cancer tissues and healthy samples yielded 134 differentially expressed genes, 151 differentially expressed genes between the first three stages and stage IV samples, and 189 differentially expressed genes between stage I and last

Discussion

Colon cancer is considered the third leading cause of death in women and the second leading cause of death in men worldwide. Colon cancer responds poorly to treatment after carcinogenesis and metastasis. Early diagnosis improves patient survival compared with diagnosis at a later stage, and with appropriate treatment lower morbidity and better survival can be achieved [34]. Many European countries mainly perform interval colonoscopy or fecal occult blood

Conclusion

In the diagnosis of colon cancer, the RF model in this experiment reached an average accuracy of 99.81%, a precision (PR) of 99.88%, a recall of 99.50%, an F1 value of 99.68%, and a specificity and ROC_auc of 100%. In distinguishing the first three stages of colon cancer from stage IV, the RF model achieved an average accuracy of 91.52%, PR of 86.94%, recall of 73.04%, F1 value of 76.79%, specificity of 98.16%, and ROC_auc of 82.37%. By survival analysis of LASSO-extracted genes with PPI-screened genes,
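For reference, the reported metrics relate to the confusion matrix of a binary classifier as sketched below. The function name `binary_metrics` and the toy counts are hypothetical, and the sketch assumes that PR denotes precision; it is not code from this study.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion counts.

    tp, fp, fn, tn: true positives, false positives, false negatives and
    true negatives, with the disease class treated as positive.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)      # share of predicted positives that are correct
    recall = tp / (tp + fn)         # share of actual positives that are found
    specificity = tn / (tn + fp)    # share of actual negatives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Toy counts for illustration only (not results from the paper).
m = binary_metrics(tp=8, fp=2, fn=2, tn=88)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With imbalanced classes, accuracy and specificity can stay high while recall drops, which is consistent with the gap between the 91.52% accuracy and the 73.04% recall reported for the staging task.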

Data availability statement

Publicly available datasets were analyzed in this study. This data can be found here: TCGA (https://portal.gdc.cancer.gov/repository).

Authors’ contributions

Study design: YS, ChenCheng, WG, XT, YL. Data acquisition: YS, WG, XT, RG. Statistical analysis: YS, ChenCheng, WG, XT, ChenChen, DJ, RG. Manuscript preparation: YS, ChenCheng, WG, XT, ChenChen, TL, YL. Manuscript editing: YS, ChenCheng, XT, YL, RG. Manuscript review: YS, ChenCheng, WG, XT, ChenChen, DJ, TL, YL, RG.

Declaration of competing interest

The authors declare that they have no conflicting financial interests.

Acknowledgment

This work was supported by the Clinical Research Center of Breast Tumor and Thyroid Tumor in Xinjiang Autonomous Region, the Special Project of Tianshan Innovation Team in Xinjiang Uygur Autonomous Region (2020D14031) and the Tianshan Youth Project in Xinjiang Uygur Autonomous Region (2019Q043).

References (37)

  • R.L. Siegel et al.

    Colorectal cancer statistics, 2020

    CA A Cancer J. Clin.

    (2020)
  • Statistics | Cancer.Net. [(accessed on 27 November 2021)]; Available online:...
  • J.N. Li et al.

    Fecal occult blood test in colorectal cancer screening

    Journal of digestive diseases

    (2019)
  • H. Goyal et al.

    Scope of artificial intelligence in screening and diagnosis of colorectal cancer

    J. Clin. Med.

    (2020)
  • L. Krakowczyk et al.

    Epigenetic modification of gene expression in colorectal carcinogenesis

    Wspólczesna Onkol.

    (2007)
  • A. Horwich et al.

    Circulating tumor markers

  • Staged | Cancer.Net. [(accessed on 27 November 2021)]; Available...
  • Stage, T., Stage, N., & Stage, M. Carcinoma in Situ Corresponds to the TNM Classification. Laryngeal Cancer: Stages....