A semi-supervised deep learning method based on stacked sparse auto-encoder for cancer prediction using RNA-seq data

https://doi.org/10.1016/j.cmpb.2018.10.004

Highlights

  • A deep learning model, the stacked sparse auto-encoder based model, is proposed for cancer prediction.

  • The deep learning model, with pre-training and sparsity, outperforms classical models.

  • Important and abstract features are extracted.

  • Prediction results are presented on lung, stomach and breast cancer data.

Abstract

Background and objective: Cancer has become a complex health problem due to its high mortality. Over the past few decades, with the rapid development of high-throughput sequencing technology and the application of various machine learning methods, remarkable progress in cancer research has been made based on gene expression data. At the same time, a growing amount of high-dimensional data, such as RNA-seq data, has been generated. This calls for superior machine learning methods that can handle massive data effectively in order to support accurate treatment decisions.

Methods: In this paper, we present a semi-supervised deep learning strategy, the stacked sparse auto-encoder (SSAE) based classification, for cancer prediction using RNA-seq data. The proposed SSAE based method employs the greedy layer-wise pre-training and a sparsity penalty term to help capture and extract important information from the high-dimensional data and then classify the samples.
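The abstract does not give the form of the sparsity penalty. A minimal sketch of one widely used choice, the KL-divergence penalty on mean hidden activations, is shown below. The function name `kl_sparsity_penalty`, the target activation `rho=0.05` and the toy activation arrays are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def kl_sparsity_penalty(hidden, rho=0.05, eps=1e-8):
    """Sum over hidden units of KL(rho || rho_hat_j).

    hidden: array of shape (n_samples, n_hidden) with activations in (0, 1).
    rho:    target mean activation per unit (assumed value, not from the paper).
    """
    # Mean activation of each hidden unit over the batch, clipped away from 0 and 1.
    rho_hat = np.clip(hidden.mean(axis=0), eps, 1 - eps)
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return kl.sum()

# Toy activations: one batch already at the target sparsity, one far from it.
sparse_acts = np.full((10, 4), 0.05)
dense_acts = np.full((10, 4), 0.5)

penalty_sparse = kl_sparsity_penalty(sparse_acts)
penalty_dense = kl_sparsity_penalty(dense_acts)
```

The penalty vanishes when every unit's mean activation equals the target and grows as units become more active, which is what pushes the learned code toward sparsity when the term is added to the reconstruction loss.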

Results: We tested the proposed SSAE model on three public RNA-seq data sets covering three types of cancer and compared its prediction performance with several commonly used classification methods. The results indicate that our approach outperforms the other methods on all three cancer data sets across various metrics.

Conclusions: The proposed SSAE based semi-supervised deep learning model shows a promising ability to process high-dimensional gene expression data and proves effective and accurate for cancer prediction.

Introduction

Cancer prediction and diagnosis is a challenging subject that has drawn worldwide concern due to the high morbidity and mortality of cancer [1], [2], [3]. Accurate prediction at an early stage is crucial to effective treatment and has the potential to improve cancer outcomes [4]. Machine learning plays an increasingly important role in cancer diagnosis, and accurate cancer prediction with machine learning methods has become one of the most urgent and challenging tasks for researchers [5]. Ubaidillah et al. [6] applied support vector machine (SVM) and neural network (NN) models on the BUPA Liver Disorders data set and showed that the SVM classifier yielded better performance than NN for liver cancer classification. Statnikov et al. [7] compared the random forest (RF) and SVM methods on twenty-two gene expression data sets. The results illustrated that, using the full set of genes, SVMs outperformed RFs on fifteen data sets, RFs outperformed SVMs on four data sets and the two methods performed the same on three data sets. Similar results were obtained using the selected genes. However, in most research on cancer prediction, only labeled data is considered, while much unlabeled data is ignored. Label information is often difficult to come by because labeling is expensive, cumbersome and error-prone [8]. There are three common types of machine learning methods: supervised learning, unsupervised learning, and semi-supervised learning. In supervised learning, the labeled training data is mapped to the desired output. In unsupervised learning, training data without labels is used to discover patterns or groups of similar samples. In semi-supervised learning, both labeled and unlabeled data are used to build an accurate model [1]. Given that most commonly used machine learning methods cannot utilize unlabeled information, semi-supervised learning has become a good choice. Shi and Zhang [4] evaluated a semi-supervised method, the low density separation (LDS) method, and compared it with SVM on five colorectal cancer data sets. The results showed the great potential of semi-supervised learning algorithms for the cancer prediction problem.

With the development of high-throughput sequencing technology, a large amount of gene expression data has been produced. The analysis of gene expression data has shown great potential in cancer prediction. It is noteworthy that gene expression quantified by RNA-seq technology from cell-free circulating RNA (cfRNA) samples has been used as a sensitive and useful indicator for cancer diagnosis and prognosis, and has been increasingly applied in the clinic in recent years [9], [10]. However, the analysis of RNA-seq data is challenging because of the high dimension and redundancy of the data [11]. It is difficult and time-consuming for traditional simplex classification algorithms to deal with the original large data sets because of the mismatch between the large number of genes and the small number of samples. Specifically, high dimensionality can incur significant computational cost, redundancy among the many genes can decrease predictive accuracy, and the mismatch between the number of features and the sample size may reduce computational accuracy and generalization ability and easily trigger over-fitting. Thus, dimensionality reduction and feature extraction of gene expression data are important for the follow-up analysis for cancer prediction [12], [13], [14]. Osareh and Shadgar [15] used a wrapper method based on sequential forward selection to reduce high-dimensional features and combined it with three supervised learning algorithms, SVMs, k-nearest neighbors (kNNs) and probabilistic neural networks (PNNs), to distinguish benign from malignant tumors. The results showed that feature selection was necessary for the best performance. Zheng et al. [16] applied an unsupervised learning algorithm, K-means, to implement feature extraction on a breast cancer data set, through which they not only maintained the accuracy of the prediction but also greatly improved the speed of operation. Liu et al. [17] employed a popular feature extraction method, principal component analysis (PCA), and improved it by imposing an integrative sparse penalty. Applied to breast and pancreatic cancer gene expression data sets, the improved PCA algorithm showed satisfactory performance on both. However, the commonly used dimension reduction algorithms mentioned above have shallow architectures, which limits their ability to learn high-level important information automatically from the input data. Compared to traditional machine learning methods, deep learning is considered a major breakthrough in machine learning, owing to its deep architecture that can explore the intricate nonlinear structure in the data and transform the original data into high-level representations [18]. In recent years, deep learning has been successfully applied in various areas, such as image processing [19], [20], natural language processing [21], and biomedicine [22]. As one of the techniques of deep learning, auto-encoders (AEs), with the appealing attributes of exploiting unlabeled data and learning high-level features [23], [24], have shown very encouraging performance in feature extraction tasks since the first deep AE was proposed by Hinton et al. in [25]. Fakoor et al. [26] employed PCA and AEs with an SVM model on thirteen gene expression data sets for cancer detection. The results revealed that AEs outperformed PCA on the majority of the data sets due to their ability to automatically discover intricate relationships behind the data. Xu et al. [27] used a stacked sparse auto-encoder based framework on square patches from breast cancer histopathology images, which outperformed PCA and sparse auto-encoders.
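The mismatch between the number of genes and the number of samples can be made concrete with a small PCA example. The sketch below uses synthetic Gaussian data as a stand-in for an expression matrix; the sizes (20 samples, 1000 genes) and all variable names are assumptions for illustration only. It shows that a linear method such as PCA can recover at most one fewer informative component than there are samples, however many genes are measured.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for an expression matrix: few samples, many genes.
n_samples, n_genes = 20, 1000
X = rng.normal(size=(n_samples, n_genes))

# PCA via SVD of the column-centered matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Count singular values that are not numerically zero.
n_informative = int((s > 1e-10 * s[0]).sum())

# Project the samples onto the informative principal components.
scores = Xc @ Vt[:n_informative].T
```

Centering removes one degree of freedom, so the 1000-dimensional data collapses to a 19-dimensional score matrix here.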

In this paper, we develop a stacked sparse auto-encoder (SSAE) based semi-supervised deep learning model for cancer prediction. In this approach, an unsupervised feature extraction procedure and a supervised classification algorithm are combined so as to utilize both unlabeled and labeled information. We compare the proposed model with three commonly used supervised classification methods, SVMs, RFs and NNs, and a semi-supervised classification method, an AE based classifier. The data sets used to evaluate the performance of the models are three public RNA-seq data sets derived from lung tissues, stomach tissues, and breast tissues, respectively. The results demonstrate that the proposed SSAE based semi-supervised model takes full advantage of the information in the data and achieves more accurate predictions than the other methods on all three data sets, thus showing clear advantages in the critical and difficult cancer prediction problem.

Section snippets

Stacked sparse auto-encoder (SSAE) based deep learning model

A flowchart of the training and testing processes of the proposed SSAE based semi-supervised classification method for cancer prediction is shown in Fig. 1. The semi-supervised classification model consists of the unsupervised feature extraction stage and the supervised classification stage, so as to address both unlabeled and labeled data to extract more valuable information and make better predictions.
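The two-stage structure described above can be sketched in plain NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the layer sizes (16 and 8), the hyperparameters, the KL-divergence form of the sparsity penalty, the logistic regression classifier, the synthetic data and every function name are hypothetical, and any joint fine-tuning of the whole network is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sparse_ae(X, n_hidden, rho=0.1, beta=0.5, lr=0.3, epochs=200):
    """Train one sparse auto-encoder layer by batch gradient descent.

    Loss: squared reconstruction error / (2n) + beta * KL sparsity penalty.
    Returns only the encoder weights and bias.
    """
    n, d = X.shape
    W1, b1 = rng.normal(0, 0.1, (d, n_hidden)), np.zeros(n_hidden)
    W2, b2 = rng.normal(0, 0.1, (n_hidden, d)), np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)             # hidden code
        R = sigmoid(H @ W2 + b2)             # reconstruction of X
        rho_hat = np.clip(H.mean(axis=0), 1e-6, 1 - 1e-6)
        dZ2 = (R - X) * R * (1 - R) / n      # gradient at decoder pre-activation
        d_sparse = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat)) / n
        dZ1 = (dZ2 @ W2.T + d_sparse) * H * (1 - H)
        W2 -= lr * (H.T @ dZ2); b2 -= lr * dZ2.sum(axis=0)
        W1 -= lr * (X.T @ dZ1); b1 -= lr * dZ1.sum(axis=0)
    return W1, b1

def pretrain_stack(X, hidden_sizes):
    """Greedy layer-wise pre-training: each layer is trained on the codes of the previous one."""
    layers, A = [], X
    for h in hidden_sizes:
        W, b = train_sparse_ae(A, h)
        layers.append((W, b))
        A = sigmoid(A @ W + b)
    return layers

def encode(X, layers):
    for W, b in layers:
        X = sigmoid(X @ W + b)
    return X

def train_logreg(F, y, lr=0.5, epochs=500):
    """Binary logistic regression on the extracted features (assumed classifier)."""
    w, b = np.zeros(F.shape[1]), 0.0
    for _ in range(epochs):
        g = (sigmoid(F @ w + b) - y) / len(y)
        w -= lr * (F.T @ g); b -= lr * g.sum()
    return w, b

def make_data(n):
    """Synthetic two-class data in [0, 0.9]; the first 10 features carry the class signal."""
    y = rng.integers(0, 2, n)
    X = rng.random((n, 50)) * 0.5
    X[:, :10] += 0.4 * y[:, None]
    return X, y

X_unlabeled, _ = make_data(200)          # labels discarded: unlabeled pool
X_labeled, y_labeled = make_data(40)     # small labeled set

# Stage 1: unsupervised feature extraction on all samples, labeled or not.
layers = pretrain_stack(np.vstack([X_unlabeled, X_labeled]), hidden_sizes=[16, 8])

# Stage 2: supervised classification on the labeled samples only.
features = encode(X_labeled, layers)
w, b = train_logreg(features, y_labeled)
pred = (sigmoid(features @ w + b) >= 0.5).astype(int)
```

The point of the design is that the encoder weights are learned from every available sample, while the labels are needed only for the comparatively small classifier trained on the extracted features.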

Data collection and differential expression analysis

We conducted experiments to evaluate the proposed method on three RNA-seq data sets of three types of cancers, Lung Adenocarcinoma (LUAD), Stomach Adenocarcinoma (STAD) and Breast Invasive Carcinoma (BRCA). We obtained the gene expression data from the TCGA project web page [30]. These data sets were collected from various subjects of different genders, ages, clinical conditions and phases of cancers. The tumor tissues from patients were not treated with prior chemotherapy or radiotherapy, as

Discussion

The experimental results show that the proposed SSAE based semi-supervised deep learning model outperforms several commonly used methods on cancer prediction. In practice, unlabeled data is plentiful but labeled data is very limited. The information provided by the limited labeled data may not be sufficient for model prediction. If we can utilize the much larger amount of unlabeled data in our model, it may greatly improve the prediction

Conclusions

In this paper, we proposed a stacked sparse auto-encoder (SSAE) based semi-supervised classification model for cancer prediction. Specifically, we analyzed gene expression data obtained from three kinds of tissues, lung, stomach, and breast, and preprocessed the data through a differential expression analysis algorithm. The proposed SSAE based deep learning model is then applied. An unsupervised feature extraction algorithm based on the SSAE is first trained through the greedy layer-wise

Conflict of interests

The authors do not have financial and personal relationships with other people or organizations that could inappropriately influence (bias) their work.

Acknowledgments

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

References (34)

  • S.H.S.A. Ubaidillah et al.

    Classification of liver cancer using artificial neural network and support vector machine

    Proceedings of International Conference on Advance in Communication Network, and Computing

    (2014)
  • A. Statnikov et al.

    A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification

    BMC Bioinformatics

    (2008)
  • J.C. Ang et al.

    Semi-supervised SVM-based feature selection for cancer classification using microarray gene expression data

    International Conference on Current Approaches in Applied Artificial Intelligence

    (2015)
  • Y. Li et al.

    Circular RNA is enriched and stable in exosomes: a promising biomarker for cancer diagnosis

    Cell Res.

    (2015)
  • P. Danaee et al.

    A deep learning approach for cancer detection and relevant gene identification

    Pacific Symposium on Biocomputing

    (2016)
  • Z.M. Hira et al.

    A review of feature selection and feature extraction methods applied on microarray data

    Adv. Bioinform.

    (2015)
  • A. Osareh et al.

    Machine learning techniques to diagnose breast cancer

    International Symposium on Health Informatics and Bioinformatics IEEE

    (2010)