A semi-supervised deep learning method based on stacked sparse auto-encoder for cancer prediction using RNA-seq data

doi:10.1016/j.cmpb.2018.10.004

Computer Methods and Programs in Biomedicine

Volume 166, November 2018, Pages 99-105

Highlights

• A deep learning model, the stacked sparse auto-encoder based model, is proposed for cancer prediction.
• The deep learning model, with pre-training and sparsity, outperforms classical models.
• Important and abstract features are extracted.
• Prediction results are presented on lung, stomach and breast cancer data.

Abstract

Background and objective: Cancer has become a complex health problem due to its high mortality. Over the past few decades, with the rapid development of high-throughput sequencing technology and the application of various machine learning methods, remarkable progress in cancer research has been made based on gene expression data. At the same time, a growing amount of high-dimensional data, such as RNA-seq data, has been generated, which calls for superior machine learning methods that can handle massive data effectively in order to support accurate treatment decisions.

Methods: In this paper, we present a semi-supervised deep learning strategy, the stacked sparse auto-encoder (SSAE) based classification, for cancer prediction using RNA-seq data. The proposed SSAE based method employs greedy layer-wise pre-training and a sparsity penalty term to help capture and extract important information from the high-dimensional data and then classify the samples.

Results: We tested the proposed SSAE model on three public RNA-seq data sets covering three types of cancer and compared its prediction performance with several commonly-used classification methods. The results indicate that our approach outperforms the other methods on all three cancer data sets across various metrics.

Conclusions: The proposed SSAE based semi-supervised deep learning model shows a promising ability to process high-dimensional gene expression data and proves effective and accurate for cancer prediction.

Introduction

Cancer prediction and diagnosis is a challenging subject that has drawn worldwide concern due to the high morbidity and mortality of cancer [1], [2], [3]. Accurate prediction in the early stage is crucial to effective treatment and has the potential to improve cancer outcomes [4]. As its applications become pervasive, machine learning plays an increasingly important role in cancer diagnosis, and accurate cancer prediction with machine learning methods has become one of the most urgent and challenging tasks for researchers [5]. Ubaidillah et al. [6] applied support vector machine (SVM) and neural network (NN) models to the BUPA Liver Disorders data set and showed that the SVM classifier yielded better performance than the NN for liver cancer classification. Statnikov et al. [7] compared the random forest (RF) and SVM methods on twenty-two gene expression data sets. The results illustrated that, using the full set of genes, SVMs outperformed RFs on fifteen data sets, RFs outperformed SVMs on four data sets, and the two methods performed the same on three data sets. Similar results were obtained using the selected genes. However, in most research on cancer prediction, only labeled data is considered, while much unlabeled data is ignored. In fact, label information is often difficult to come by because labeling is expensive, cumbersome and error-prone [8]. There are three common types of machine learning methods: supervised learning, unsupervised learning, and semi-supervised learning. In supervised learning, the labeled training data is mapped to the desired output. In unsupervised learning, training data without labels is used to discover patterns or groups of similar samples. In semi-supervised learning, both labeled and unlabeled data are utilized to build an accurate model [1]. Given that most commonly-used machine learning methods cannot utilize unlabeled information, semi-supervised learning algorithms have become a good choice. Shi and Zhang [4] evaluated a semi-supervised method, the low density separation (LDS) method, and compared it with SVM on five colorectal cancer data sets. The results showed the great potential of semi-supervised learning algorithms for the cancer prediction problem.
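The distinction between the three learning settings can be made concrete with a small sketch. It is only an illustration under stated assumptions: the data are synthetic, and PCA plus a nearest-centroid rule are simple stand-ins chosen for brevity, not the methods evaluated in the works cited above. The names `fit_pca`, `fit_centroids` and `predict` are hypothetical helpers. The point is the data flow: the unsupervised stage sees every sample, while the supervised stage sees only the few labeled ones.

```python
import numpy as np

def fit_pca(X, k):
    """Unsupervised stage: learn a k-dimensional projection from ALL samples (labels unused)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k].T

def fit_centroids(Z, y):
    """Supervised stage: one centroid per class, computed from labeled samples only."""
    classes = np.unique(y)
    return classes, np.stack([Z[y == c].mean(axis=0) for c in classes])

def predict(Z, classes, centroids):
    """Assign each sample to the class of its nearest centroid."""
    dist = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]

rng = np.random.default_rng(0)
# Synthetic stand-in for expression data: 200 samples x 50 features, two classes
# that differ only in the first 5 features (an assumption made for illustration).
y_true = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 50)) + 2.0 * y_true[:, None] * (np.arange(50) < 5)

labeled = np.zeros(200, dtype=bool)
labeled[:20] = True                       # pretend only 10% of samples carry a label

mean, P = fit_pca(X, k=3)                 # uses labeled and unlabeled rows alike
Z = (X - mean) @ P
classes, centroids = fit_centroids(Z[labeled], y_true[labeled])
pred = predict(Z[~labeled], classes, centroids)
accuracy = (pred == y_true[~labeled]).mean()
print(f"accuracy on the unlabeled samples: {accuracy:.2f}")
```

A purely supervised variant would have to fit both stages on the 20 labeled rows alone; the semi-supervised arrangement lets the projection benefit from all 200 rows.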

With the development of high-throughput sequencing technology, a large amount of gene expression data has been produced. The analysis of gene expression data has shown great potential in cancer prediction. It is noteworthy that gene expression quantified by RNA-seq technology from cell free circulating RNA (cfRNA) samples has been used as a sensitive and useful indicator for cancer diagnosis and prognosis, and has been increasingly applied in the clinic in recent years [9], [10]. However, the analysis of RNA-seq data poses a challenge because of the high dimension and redundancy of the data [11]. It is difficult and time-consuming for traditional simplex classification algorithms to deal with the original large data sets because of the mismatch between the large number of genes and the small number of samples. To be specific, high dimensionality can incur significant computational cost, redundancy among the many genes can lead to a decrease in predictive accuracy, and the mismatch between the number of features and the sample size may reduce computational accuracy and generalization ability and easily trigger over-fitting. Thus, dimensionality reduction and feature extraction of gene expression data are important for the follow-up analysis for cancer prediction [12], [13], [14]. Osareh and Shadgar [15] used a wrapper method based on sequential forward selection to reduce high-dimensional features and combined it with three supervised learning algorithms, SVMs, k-nearest neighbors (kNNs) and probabilistic neural networks (PNNs), to classify between benign and malignant tumors. The results showed that feature selection was necessary for the best performance. Zheng et al. [16] applied an unsupervised learning algorithm, K-means, to implement feature extraction on a breast cancer data set, through which they not only maintained the accuracy of the prediction but also greatly improved the speed of operation. Liu et al. [17] employed a popular feature extraction method, principal component analysis (PCA), and improved the method by imposing an integrative sparse penalty. The analysis of the improved PCA algorithm on breast and pancreatic cancer gene expression data sets showed satisfactory performance in both cases. However, among the commonly-used dimension reduction algorithms mentioned above, the shallow architecture limits the ability to learn high-level important information automatically from the input data. Compared to traditional machine learning methods, deep learning is considered a major breakthrough in machine learning, owing to its deep architecture that can explore the intricate nonlinear structure in the data and transform the original data into high-level representations [18]. In recent years, deep learning has been successfully applied in various areas, such as image processing [19], [20], natural language processing [21], and biomedicine [22]. As one of the techniques of deep learning, auto-encoders (AEs), with the appealing attributes of exploiting unlabeled data and learning high-level features [23], [24], have shown very encouraging performance in feature extraction tasks since the first deep AE was proposed by Hinton et al. [25]. Fakoor et al. [26] employed PCA and AEs with an SVM model on thirteen gene expression data sets for cancer detection. The results revealed that AEs outperformed PCA on the majority of the data sets due to their ability to automatically discover intricate relationships behind the data. Xu et al. [27] used a stacked sparse auto-encoder based framework on square patches from breast cancer histopathology images, which outperformed PCA and sparse auto-encoders.

In this paper, we attempt to develop a stacked sparse auto-encoder (SSAE) based semi-supervised deep learning model for cancer prediction. In this approach, an unsupervised feature extraction procedure and a supervised classification algorithm are combined so as to utilize both unlabeled and labeled information. We compare the proposed model with three commonly-used supervised classification methods, SVMs, RFs and NNs, and with a semi-supervised classification method, an AE based classifier. The data sets used to evaluate the performance of the models are three public RNA-seq data sets derived from lung tissues, stomach tissues, and breast tissues, respectively. The results demonstrate that the proposed SSAE based semi-supervised model takes full advantage of the information in the data and obtains more accurate predictions than the other methods on all three data sets, thus showing great superiority in the critical and difficult cancer prediction problem.
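Comparing classifiers of this kind usually comes down to a handful of scores computed from predicted and true labels. The text here does not name the metrics used, so the following is only a sketch under the assumption that standard binary-classification scores (accuracy, precision, recall, F1) are of interest; `binary_metrics` is a hypothetical helper and the label vectors are made-up toy values, not results from the paper.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels coded as 1 (tumor) and 0 (normal)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))   # true positives
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))   # true negatives
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))   # false positives
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))   # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy labels for illustration only.
toy_true = [1, 1, 1, 0, 0, 1, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 1, 0, 1]
scores = binary_metrics(toy_true, toy_pred)
print(scores)
```

Reporting several scores side by side matters when tumor and normal samples are imbalanced, since accuracy alone can look high for a classifier that rarely predicts the minority class.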

Section snippets

Stacked sparse auto-encoder (SSAE) based deep learning model

A flowchart of the training and testing processes of the proposed SSAE based semi-supervised classification method for cancer prediction is shown in Fig. 1. The semi-supervised classification model consists of the unsupervised feature extraction stage and the supervised classification stage, so as to address both unlabeled and labeled data to extract more valuable information and make better predictions.
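The unsupervised stage can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the exact loss, activation and hyperparameters are not given in this excerpt, so the sketch assumes the commonly used formulation with sigmoid units, a squared reconstruction error and a KL-divergence sparsity penalty with target activation `rho` and weight `beta`. All function names and default values are hypothetical. Each layer is trained as a sparse auto-encoder on the codes produced by the layer before it, which is what greedy layer-wise pre-training refers to.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kl_sparsity(rho, rho_hat):
    """Sum over hidden units of KL(rho || rho_hat_j), the assumed sparsity penalty."""
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def train_sparse_ae(X, n_hidden, rho=0.05, beta=3.0, lr=0.1, epochs=200, seed=0):
    """Train one sparse auto-encoder layer with full-batch gradient descent.

    X is expected to be scaled to [0, 1]. Returns the encoder (W1, b1)
    and the loss recorded at every epoch.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)                  # hidden codes
        X_hat = sigmoid(H @ W2 + b2)              # reconstruction
        rho_hat = np.clip(H.mean(axis=0), 1e-8, 1 - 1e-8)  # mean activation per unit
        recon = 0.5 * np.mean(np.sum((X_hat - X) ** 2, axis=1))
        losses.append(recon + beta * kl_sparsity(rho, rho_hat))
        # Backpropagation of the reconstruction and sparsity terms.
        d_out = (X_hat - X) * X_hat * (1 - X_hat) / n
        d_sparse = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat)) / n
        d_hid = (d_out @ W2.T + d_sparse) * H * (1 - H)
        W2 -= lr * (H.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (X.T @ d_hid)
        b1 -= lr * d_hid.sum(axis=0)
    return (W1, b1), losses

def pretrain_stack(X, layer_sizes, **kwargs):
    """Greedy layer-wise pre-training: train each layer on the previous layer's codes."""
    encoders, layer_input = [], X
    for n_hidden in layer_sizes:
        (W, b), _ = train_sparse_ae(layer_input, n_hidden, **kwargs)
        encoders.append((W, b))
        layer_input = sigmoid(layer_input @ W + b)
    return encoders

def encode(X, encoders):
    """Pass samples through the stacked encoders to obtain the extracted features."""
    for W, b in encoders:
        X = sigmoid(X @ W + b)
    return X
```

In a two-stage model of the kind described above, `pretrain_stack` would be run on all available samples, labeled or not, and the output of `encode` for the labeled samples would then be passed to a supervised classifier. The classifier itself is left out of this sketch.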

Data collection and differential expression analysis

We conducted experiments to evaluate the proposed method on three RNA-seq data sets of three types of cancers, Lung Adenocarcinoma (LUAD), Stomach Adenocarcinoma (STAD) and Breast Invasive Carcinoma (BRCA). We obtained the gene expression data from the TCGA project web page [30]. These data sets were collected from various subjects of different genders, ages, clinical conditions and phases of cancers. The tumor tissues from patients were not treated with prior chemotherapy or radiotherapy, as

Discussion

According to the experimental results, we found that the proposed SSAE based semi-supervised deep learning model shows superior performance over several commonly-used methods on cancer prediction. In practice, unlabeled data is abundant while labeled data is very limited. The information provided by the limited labeled data may not be sufficient for model prediction. If we can utilize the much larger amount of unlabeled data in our model, it may greatly improve the prediction

Conclusions

In this paper, we proposed a stacked sparse auto-encoder (SSAE) based semi-supervised classification model for cancer prediction. Specifically, we analyzed gene expression data obtained from three kinds of tissues, lung, stomach, and breast, and preprocessed the data through a differential expression analysis algorithm. The proposed SSAE based deep learning model is then applied. An unsupervised feature extraction algorithm based on the SSAE is first trained through the greedy layer-wise

Conflict of interests

The authors do not have financial and personal relationships with other people or organizations that could inappropriately influence (bias) their work.

Acknowledgments

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

References (34)

K. Kourou et al., Machine learning applications in cancer prognosis and prediction, Comput. Struct. Biotechnol. J. (2015)
G. Cheng, Circulating miRNAs: roles in cancer diagnosis, prognosis and therapy, Adv. Drug Deliv. Rev. (2015)
M.F. Akay, Support vector machines combined with feature selection for breast cancer diagnosis, Expert Syst. Appl. (2009)
Y. Xiao et al., A deep learning-based multi-model ensemble method for cancer prediction, Comput. Methods Programs Biomed. (2018)
B. Zheng et al., Breast cancer diagnosis based on feature extraction using a hybrid of K-means and support vector machine algorithms, Expert Syst. Appl. (2014)
Z. Zhang et al., Automated feature learning for nonlinear process monitoring–an approach using stacked denoising autoencoder and k-nearest neighbor rule, J. Process. Control (2018)
A. Bashiri et al., Improving the prediction of survival in cancer patients by using machine learning techniques: experience of gene expression data: a narrative review, Iran. J. Public Health (2017)
M.S. Bal et al., Patterns of cancer: a study of 500 Punjabi patients, Asian Pacific J. Cancer Prev. (2015)
M. Shi et al., Semi-supervised learning improves gene expression-based prediction of cancer recurrence, Bioinformatics (2011)
J.A. Cruz et al., Applications of machine learning in cancer prediction and prognosis, Cancer Inform. (2006)
S.H.S.A. Ubaidillah et al., Classification of liver cancer using artificial neural network and support vector machine, Proceedings of International Conference on Advance in Communication Network, and Computing (2014)
A. Statnikov et al., A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification, BMC Bioinformatics (2008)
J.C. Ang et al., Semi-supervised SVM-based feature selection for cancer classification using microarray gene expression data, International Conference on Current Approaches in Applied Artificial Intelligence (2015)
Y. Li et al., Circular RNA is enriched and stable in exosomes: a promising biomarker for cancer diagnosis, Cell Res. (2015)
P. Danaee et al., A deep learning approach for cancer detection and relevant gene identification, Pacific Symposium on Biocomputing (2016)
Z.M. Hira et al., A review of feature selection and feature extraction methods applied on microarray data, Adv. Bioinform. (2015)
A. Osareh et al., Machine learning techniques to diagnose breast cancer, International Symposium on Health Informatics and Bioinformatics, IEEE (2010)

