A semi-supervised deep learning method based on stacked sparse auto-encoder for cancer prediction using RNA-seq data
Introduction
Cancer prediction and diagnosis remain challenging problems that have drawn worldwide concern because of the high morbidity and mortality of cancer [1], [2], [3]. Accurate prediction at an early stage is crucial to effective treatment and has the potential to improve cancer outcomes [4]. Machine learning plays an increasingly important role in cancer diagnosis, and improving the accuracy of machine learning methods for cancer prediction has become one of the most urgent and challenging tasks for researchers [5]. Ubaidillah et al. [6] applied support vector machine (SVM) and neural network (NN) models to the BUPA Liver Disorders data set and showed that the SVM classifier outperformed the NN for liver cancer classification. Statnikov et al. [7] compared the random forest (RF) and SVM methods on twenty-two gene expression data sets. Using the full set of genes, SVMs outperformed RFs on fifteen data sets, RFs outperformed SVMs on four data sets, and the two methods performed equally on three data sets. Similar results were obtained using the selected genes.
However, most research on cancer prediction considers only labeled data and ignores much of the available unlabeled data. Label information is often difficult to obtain because labeling is expensive, cumbersome and error-prone [8]. Machine learning methods fall into three common types: supervised learning, unsupervised learning, and semi-supervised learning. In supervised learning, labeled training data are mapped to the desired output. In unsupervised learning, training data without labels are used to discover patterns or groups of similar samples. In semi-supervised learning, both labeled and unlabeled data are used to build an accurate model [1]. Because most commonly used machine learning methods cannot exploit unlabeled data, semi-supervised learning is an attractive alternative. Shi and Zhang [4] evaluated a semi-supervised method, low density separation (LDS), and compared it with SVM on five colorectal cancer data sets. The results showed the great potential of semi-supervised learning algorithms for the cancer prediction problem.
With the development of high-throughput sequencing technology, a large amount of gene expression data has been produced, and its analysis has shown great potential in cancer prediction. Notably, gene expression quantified by RNA-seq from cell-free circulating RNA (cfRNA) samples has been used as a sensitive and useful indicator for cancer diagnosis and prognosis, and has been increasingly applied in the clinic in recent years [9], [10]. However, the analysis of RNA-seq data is challenging because of the high dimensionality and redundancy of the data [11]. It is difficult and time-consuming for traditional simple classification algorithms to handle the original large data sets because of the mismatch between the large number of genes and the small number of samples. Specifically, high dimensionality incurs significant computational cost, redundancy among the many genes can decrease predictive accuracy, and the mismatch between the number of features and the sample size can reduce accuracy and generalization ability and easily trigger over-fitting. Thus, dimensionality reduction and feature extraction of gene expression data are important for the follow-up analysis for cancer prediction [12], [13], [14]. Osareh and Shadgar [15] used a wrapper method based on sequential forward selection to reduce high-dimensional features and combined it with three supervised learning algorithms, SVMs, k-nearest neighbors (kNNs) and probabilistic neural networks (PNNs), to distinguish benign from malignant tumors. The results showed that feature selection was necessary for the best performance. Zheng et al. [16] applied an unsupervised learning algorithm, K-means, for feature extraction on a breast cancer data set, which preserved prediction accuracy while greatly improving computation speed. Liu et al. [17] employed a popular feature extraction method, principal component analysis (PCA), and improved it by imposing an integrative sparse penalty. The improved PCA algorithm showed satisfactory performance on both breast and pancreatic cancer gene expression data sets.
However, the commonly used dimension reduction algorithms mentioned above have shallow architectures, which limits their ability to learn high-level information automatically from the input data. Compared to traditional machine learning methods, deep learning is considered a major breakthrough, owing to its deep architecture that can explore the intricate nonlinear structure in the data and transform the original data into high-level representations [18]. In recent years, deep learning has been successfully applied in various areas, such as image processing [19], [20], natural language processing [21], and biomedicine [22]. Among deep learning techniques, auto-encoders (AEs), which can exploit unlabeled data and learn high-level features [23], [24], have shown very encouraging performance in feature extraction tasks since the first deep AE was proposed by Hinton et al. [25]. Fakoor et al. [26] employed PCA and AEs with an SVM model on thirteen gene expression data sets for cancer detection. The results revealed that AEs outperformed PCA on the majority of the data sets owing to their ability to automatically discover intricate relationships in the data. Xu et al. [27] applied a stacked sparse auto-encoder based framework to square patches from breast cancer histopathology images, which outperformed PCA and sparse auto-encoders.
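To make the idea of a sparse auto-encoder concrete, the block below writes out one commonly used training objective. This is a sketch under stated assumptions: the excerpt does not give the loss used in this paper, and the symbols (m samples, s hidden units, weight-decay coefficient \lambda, sparsity weight \beta, target activation \rho) are introduced here purely for illustration.

```latex
% Assumed, commonly used sparse auto-encoder objective (not taken from the paper).
\begin{align}
J(W, b) &= \frac{1}{2m} \sum_{i=1}^{m} \bigl\lVert \hat{x}^{(i)} - x^{(i)} \bigr\rVert_2^2
          + \frac{\lambda}{2} \sum_{l} \bigl\lVert W^{(l)} \bigr\rVert_F^2
          + \beta \sum_{j=1}^{s} \mathrm{KL}\bigl(\rho \,\Vert\, \hat{\rho}_j\bigr), \\
\mathrm{KL}\bigl(\rho \,\Vert\, \hat{\rho}_j\bigr) &= \rho \log \frac{\rho}{\hat{\rho}_j}
          + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j}, \qquad
\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} h_j\bigl(x^{(i)}\bigr).
\end{align}
```

The first term is the reconstruction error, the second is weight decay, and the third pushes the mean activation \hat{\rho}_j of each hidden unit h_j toward the small target \rho, so that only a few hidden units respond strongly to any given input.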
In this paper, we develop a stacked sparse auto-encoder (SSAE) based semi-supervised deep learning model for cancer prediction. The approach combines an unsupervised feature extraction procedure with a supervised classification algorithm so as to utilize both unlabeled and labeled information. We compare the proposed model with three commonly used supervised classification methods, SVMs, RFs and NNs, and with one semi-supervised classification method, an AE based classifier. The models are evaluated on three public RNA-seq data sets derived from lung, stomach, and breast tissues, respectively. The results demonstrate that the proposed SSAE based semi-supervised model makes effective use of the information in the data and achieves the most accurate predictions among the compared methods on all three data sets, showing its advantage in the critical and difficult cancer prediction problem.
Section snippets
Stacked sparse auto-encoder (SSAE) based deep learning model
A flowchart of the training and testing processes of the proposed SSAE based semi-supervised classification method for cancer prediction is shown in Fig. 1. The semi-supervised classification model consists of the unsupervised feature extraction stage and the supervised classification stage, so as to address both unlabeled and labeled data to extract more valuable information and make better predictions.
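The two-stage structure described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names (`train_sparse_ae`, `pretrain_stack`, `encode`, `train_logreg`), the sigmoid units, the KL sparsity penalty, the logistic-regression classifier, all hyperparameters, and the random synthetic data are assumptions made here, and weight decay and supervised fine-tuning are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sparse_ae(X, n_hidden, rho=0.05, beta=0.1, lr=0.5, epochs=200, seed=0):
    """Train one sparse auto-encoder layer with batch gradient descent.

    Assumes features are scaled to [0, 1]. Returns the encoder (W, b).
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    W1 = rng.normal(0, 0.1, (d, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)           # hidden activations
        X_hat = sigmoid(H @ W2 + b2)       # reconstruction of the input
        rho_hat = np.clip(H.mean(axis=0), 1e-6, 1 - 1e-6)
        # Gradient of the mean squared reconstruction error at the output.
        d_out = (X_hat - X) * X_hat * (1 - X_hat) / m
        # Gradient of beta * sum_j KL(rho || rho_hat_j) w.r.t. the activations.
        sparse_grad = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat)) / m
        d_hid = (d_out @ W2.T + sparse_grad) * H * (1 - H)
        W2 -= lr * H.T @ d_out; b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_hid; b1 -= lr * d_hid.sum(axis=0)
    return W1, b1

def pretrain_stack(X, layer_sizes, **kw):
    """Greedy layer-wise pre-training: each layer learns from the previous codes."""
    layers, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_sparse_ae(H, n_hidden, **kw)
        layers.append((W, b))
        H = sigmoid(H @ W + b)
    return layers

def encode(X, layers):
    """Map inputs through the stacked encoders to the top-level features."""
    H = X
    for W, b in layers:
        H = sigmoid(H @ W + b)
    return H

def train_logreg(F, y, lr=0.5, epochs=500):
    """Binary logistic regression on extracted features (assumed classifier)."""
    w, b = np.zeros(F.shape[1]), 0.0
    for _ in range(epochs):
        g = (sigmoid(F @ w + b) - y) / len(y)
        w -= lr * F.T @ g
        b -= lr * g.sum()
    return w, b

# Synthetic stand-in data: 60 samples, 20 features in [0, 1]; only 20 are labeled.
rng = np.random.default_rng(0)
X_all = rng.random((60, 20))
y_all = (X_all[:, 0] > 0.5).astype(float)
labeled = np.arange(20)

# Stage 1 (unsupervised): pre-train on all samples, labels are not used.
layers = pretrain_stack(X_all, [10, 5])
features = encode(X_all, layers)

# Stage 2 (supervised): fit the classifier on the labeled subset only.
w, b = train_logreg(features[labeled], y_all[labeled])
pred = (sigmoid(features @ w + b) > 0.5).astype(float)
```

The point of the split is that stage 1 sees every sample, labeled or not, while stage 2 touches only the labeled rows; that is how unlabeled data can influence the final classifier through the learned representation.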
Data collection and differential expression analysis
We conducted experiments to evaluate the proposed method on three RNA-seq data sets of three types of cancers, Lung Adenocarcinoma (LUAD), Stomach Adenocarcinoma (STAD) and Breast Invasive Carcinoma (BRCA). We obtained the gene expression data from the TCGA project web page [30]. These data sets were collected from various subjects of different genders, ages, clinical conditions and phases of cancers. The tumor tissues from patients were not treated with prior chemotherapy or radiotherapy, as
Discussion
The experimental results show that the proposed SSAE based semi-supervised deep learning model outperforms several commonly used methods on cancer prediction. In practice, unlabeled data are abundant while labeled data are very limited. The information provided by the limited labeled data may not be sufficient for model prediction. If we can utilize the much larger amount of unlabeled data in our model, it may greatly improve the prediction
Conclusions
In this paper, we proposed a stacked sparse auto-encoder (SSAE) based semi-supervised classification model for cancer prediction. Specifically, we analyzed gene expression data obtained from three kinds of tissues, lung, stomach, and breast, and preprocessed the data through a differential expression analysis algorithm. The proposed SSAE based deep learning model is then applied. An unsupervised feature extraction algorithm based on the SSAE is first trained through the greedy layer-wise
Conflict of interests
The authors do not have financial and personal relationships with other people or organizations that could inappropriately influence (bias) their work.
Acknowledgments
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
References (34)
- et al. Machine learning applications in cancer prognosis and prediction. Comput. Struct. Biotechnol. J. (2015)
- Circulating miRNAs: roles in cancer diagnosis, prognosis and therapy. Adv. Drug Deliv. Rev. (2015)
- Support vector machines combined with feature selection for breast cancer diagnosis. Expert Syst. Appl. (2009)
- et al. A deep learning-based multi-model ensemble method for cancer prediction. Comput. Methods Programs Biomed. (2018)
- et al. Breast cancer diagnosis based on feature extraction using a hybrid of K-means and support vector machine algorithms. Expert Syst. Appl. (2014)
- et al. Automated feature learning for nonlinear process monitoring–an approach using stacked denoising autoencoder and k-nearest neighbor rule. J. Process. Control (2018)
- et al. Improving the prediction of survival in cancer patients by using machine learning techniques: experience of gene expression data: a narrative review. Iran. J. Public Health (2017)
- et al. Patterns of cancer: a study of 500 Punjabi patients. Asian Pacific J. Cancer Prev. APJCP (2015)
- et al. Semi-supervised learning improves gene expression-based prediction of cancer recurrence. Bioinformatics (2011)
- et al. Applications of machine learning in cancer prediction and prognosis. Cancer Inform. (2006)
- Classification of liver cancer using artificial neural network and support vector machine. Proceedings of International Conference on Advance in Communication Network, and Computing
- A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics
- Semi-supervised SVM-based feature selection for cancer classification using microarray gene expression data. International Conference on Current Approaches in Applied Artificial Intelligence
- Circular RNA is enriched and stable in exosomes: a promising biomarker for cancer diagnosis. Cell Res.
- A deep learning approach for cancer detection and relevant gene identification. Pacific Symposium on Biocomputing
- A review of feature selection and feature extraction methods applied on microarray data. Adv. Bioinform.
- Machine learning techniques to diagnose breast cancer. International Symposium on Health Informatics and Bioinformatics, IEEE