On the use of QDE-SVM for gene feature selection and cell type classification from scRNA-seq data

Abstract

Cell type identification is one of the fundamental tasks in single-cell RNA sequencing (scRNA-seq) studies. It is a key step to facilitate downstream interpretations such as differential expression and trajectory inference. scRNA-seq data contains technical variations that could affect the interpretation of the cell types. Therefore, gene selection, also known as feature selection in data science, plays an important role in selecting informative genes for scRNA-seq cell type identification. Generally speaking, feature selection methods are categorized into filter-, wrapper-, and embedded-based approaches. From the existing literature, methods from filter- and embedded-based approaches are widely applied in scRNA-seq gene selection tasks. The wrapper-based approach, which gives promising results in other fields, has yet to be extensively utilized for selecting gene features from scRNA-seq data; in addition, most of the existing wrapper methods used in this field are clustering-based rather than classification-based. With a large number of annotated datasets available today, this study applied a classification-based approach as an alternative to the clustering-based wrapper method. In our work, a quantum-inspired differential evolution (QDE) wrapped with a classification method was introduced to select a subset of genes from twelve well-known scRNA-seq transcriptomic datasets to identify cell types. In particular, the QDE was combined with different machine-learning (ML) classifiers, namely logistic regression, decision tree, support vector machine (SVM) with linear and radial basis function kernels, and extreme learning machine. The linear SVM wrapped with QDE, namely QDE-SVM, was chosen based on the feature selection results from the experiment. QDE-SVM showed superior cell type classification performance compared with QDE wrapped with the other ML classifiers as well as the recent wrapper methods (i.e., FSCAM, SSD-LAHC, MA-HS, and BSF). QDE-SVM achieved an average accuracy of 0.9559, while the other wrapper methods achieved average accuracies in the range of 0.8292 to 0.8872.

Introduction

Single-cell RNA sequencing (scRNA-seq) generates the expression profile of transcripts for every single cell in a given population [1], and provides high-resolution insights for current biomedical studies. Unlike bulk RNA sequencing (RNA-seq) that provides the average expression of all cells, scRNA-seq treats cells individually to study the differences in each cell [2]. Questions as to which cells can be effectively targeted in studies such as cancer treatments or drug designs could be answered by analyzing the transcriptomic data from scRNA-seq [3, 4]. These analyses include cell type or cell state identification [5, 6], cell clustering [7], differential expression [8], spatial transcriptomics [9], and others [10, 11]. Identification of cell type, including cell state and cell cycle stage, is one of the fundamental tasks in scRNA-seq analyses. It is a key step in making sense of the data to facilitate downstream interpretations such as differential expression and trajectory inference [12, 13]. There are mainly two ways to identify cell types: classification and clustering [6]. Clustering is useful to identify novel and rare cell types. As scRNA-seq studies progress, many data are accumulated and tagged with cell types by referring to expert knowledge, and this leads to the introduction of cell type classification studies [14].

The advancements in single-cell sequencing technologies and protocols now enable millions of cells to be sequenced [15]. However, downstream analysis interpretations would not be accurate if technical variabilities, such as batch effects, and biological factors are left unaccounted for [16]. Therefore, gene selection plays a crucial role in selecting a smaller but relevant set of genes for carrying out informative scRNA-seq analyses. Gene selection, also known as feature selection in data science, is a typical task for identifying salient features from a noisy, high-dimensional scRNA-seq dataset. Generally speaking, feature selection methods are categorized into filter-, wrapper-, and embedded-based approaches [17]. The filter-based feature selection methods are generally fast as they rank and filter the features directly based on a metric quantifying data characteristics such as information, distance, or correlation [18–20]. The wrapper-based feature selection methods involve wrapping a feature subset selection algorithm (that searches for a good subset of features) around a learning algorithm (that evaluates the goodness of feature subsets) to find optimal and relevant features [21]. Embedded-based methods, on the other hand, integrate feature selection in the process of learning while avoiding long computational time [17, 18].

A number of gene selection algorithms and tools have been introduced in the literature for scRNA-seq cell type identification. Among the aforementioned three feature selection approaches, the filter-based approach is the most widely used [22] in finding a subset of useful gene features from scRNA-seq datasets. An example of the tool from this approach is scmap [23], which relies on dropouts, highly variable genes, and random selection to select genes for projecting cell types across datasets. Seurat [9] is another impactful filter-based method that selects highly variable genes to decode the spatial heterogeneity of the scRNA-seq gene expression data. CaSTLe [24] is a filter-based method that selects genes for cell type, cell state, or cell cycle labeling according to the mean expression of features, mutual information between features and class, and inter-feature correlation. scClassify [25] selects genes that discriminate among cell types based on differentially expressed genes, differential variable genes, differentially distributed genes, differentially proportioned genes, and bimodally distributed genes. COMET [26] is another tool designed for selecting marker genes that differentiate cells. It uses a hypergeometric test to evaluate the gene enrichment in a particular cell cluster. Other filter-based methods include concepts such as entropy [27], analysis of variance [28], or co-expression [29] for cell-type-specific gene selection.

Besides filter-based methods, a number of embedded-based gene selection methods are also available for scRNA-seq cell type identification. For example, RFCell [30] uses data permutation to generate negative samples, followed by a random forest to evaluate the importance of genes in cell type identification and biological interpretation. NS-forest [31] uses a random forest to select gene features. The features are further filtered by referring to a binary expression score. Sen Puliparambil et al. [32] introduced a method to select a set of discriminative genes using multiple penalization models. Other embedded-based gene selection methods incorporate models such as logistic regression (LR) [33], autoencoder [34], or deep learning [35].

In the literature, relatively few tools have been developed with the wrapper approach to perform scRNA-seq gene selection. Information gain ratio and genetic algorithm with dynamic crossover (IGRDCGA) [36] is one of the wrapper-based methods used for scRNA-seq gene selection. Information gain ratio is used as a measurement to eliminate irrelevant genes. A genetic algorithm with dynamic crossover wrapped with k-means clustering is then utilized to select genes and improve cell classification. Another wrapper-based tool is the single-cell feature selection method based on convex analysis of mixtures (FSCAM) [37], which uses convex analysis of mixtures (CAM) and the fruit fly optimization algorithm (FFOA) as a wrapper to search for genes enriched in specific cell types. FSCAM applies prefiltering of genes based on zero read counts, mean expression, and dropouts. The differentially expressed genes selected by FSCAM for cell type clustering have superior clustering performance among methods from the other categories. On the other hand, Feature Selection via Genetic Algorithm (FSGA) [38] is a method that uses a classification learning algorithm to evaluate the goodness of the selected features. A k-nearest neighbours (KNN) classifier wrapped by a genetic algorithm is used in this work. FSGA also emphasizes the biological relevancy of the selected features.

Despite superior performances of these methods, to the best of our knowledge, there is a lack of benchmarking studies utilizing wrapper-based gene selection methods in the literature. The wrapper-based methods have shown promising results in other fields [39–41]; however, the effectiveness of this feature selection category in selecting gene features from scRNA-seq datasets has yet to be extensively investigated. In fact, both the learning algorithm and the feature subset searching algorithm are necessary in selecting optimal gene sets that are useful to describe cell types. It is also worth noting that many existing wrappers resort to clustering as the learning component in the wrapper-based feature selection process [36, 37]. With efforts from the pioneers, Cell Ontology [42] is established to provide an updated list of cell types. This comprehensive ontology has been used as a reference to create a large number of labeled scRNA-seq datasets. The main interest of our work is to investigate the effectiveness of a classification-based wrapper method in selecting gene subsets from the labeled scRNA-seq datasets to classify cell types. In this regard, a quantum-inspired differential evolution (QDE) wrapped with a classification algorithm is utilized to select genes from the annotated scRNA-seq datasets.

The rest of the content in this paper is divided into several sections. The “Materials and Methods” section explains the datasets, gene selection methods, and experimental workflow applied in this study. The “Results” section compares the performance of different feature selection methods in the experimental study, while the “Discussion” section presents the analyses and findings from the study. Finally, the “Conclusion” section summarizes the work.

Materials and methods

Datasets

Twelve popular scRNA-seq datasets from the public portals are used in this study. These datasets comprise the records of single cells from various cell types or cell cycle stages. The twelve datasets selected for this study contain single cells from human and mouse samples. They can be divided into three broad categories, namely developmental, metabolic diseases, and connective tissues. Four datasets from human [43] and mouse [44–46] embryos at different development stages belong to the developmental category. Five datasets from healthy or type 2 diabetic pancreas tissues belong to the metabolic diseases category, of which four are from humans [47–50] and one is from mice [50]. Another three datasets from the brain cells (one from humans [51] and two from mice [52, 53]) belong to the connective tissues category. Each dataset is named based on the isolated tissue, organism, first author’s name, and year of dataset publication, separated by underscores. A summary of the dataset information is provided in Table 1.

Table 1. Summary of scRNA-seq datasets used in this study.

https://doi.org/10.1371/journal.pone.0292961.t001

In this study, minimal preprocessing is applied to the data by excluding cells with ambiguous labels such as “None/Other” in Pancreas_Human_Lawlor_2017, and “not applicable”, “unclassified endocrine cell”, “unclassified cell”, and “co-expression cell” in Pancreas_Human_Segerstolpe_2016. The absolute or normalized gene counts are retained as in the original dataset sources. No other normalization techniques have been applied to the data, so that the performance of the gene selection method is evaluated on the datasets as provided.

Classification-based QDE

In this study, a wrapper-based feature selection by a metaheuristic approach is employed, where QDE is utilized as a feature subset searching algorithm to select the genes that characterize cell types. A classification algorithm is used to quantify the goodness of these gene subsets in classifying cell types. QDE is a metaheuristic algorithm based on differential evolution (DE) that uses quantum computing concepts in the feature subset initialization process [54]. It has been applied in various domains, from numerical optimization to discrete optimization such as feature selection for biomedical and radar signaling classifications [55], as well as biomarker selection [56]. In these discrete optimizations, QDE selects features from low-dimensional datasets containing few features. A DE method is considered in this work because it is an accurate metaheuristic method [57]. In addition, quantum-based metaheuristic variants, including QDE, could search for optimal solutions at a high convergence rate [54, 58, 59]. In view of these advantages, QDE is applied in this study to search for optimal feature subsets from the high-dimensional scRNA-seq datasets.

The QDE feature selection process begins by initializing 30 feature subsets (candidate solutions), which is the same setting as mentioned in Srikrishna et al. [55]. Assume that a dataset consists of D gene features; each candidate solution is made up of D binary bits, where a state of 1 indicates that the corresponding feature is selected as a candidate feature, and a state of 0 indicates that it is not included in the solution. The state of each feature is determined by observing a quantum bit. This quantum-based initialization process is adopted from the work by Srikrishna et al. [55]. The first 30 candidate solutions are generated through a serial process of quantum-based initialization and observation to form an initial population.
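The initialization step can be illustrated with a minimal sketch. It assumes the common convention of quantum-inspired evolutionary algorithms, in which every quantum bit starts in an equal superposition (both amplitudes equal to 1/√2) and is observed as 1 with probability equal to the squared second amplitude; the exact scheme of Srikrishna et al. [55] may differ. The function names `init_qubits`, `observe`, and `init_population` are hypothetical.

```python
import math
import random


def init_qubits(num_features):
    """Create one (alpha, beta) amplitude pair per gene, in equal superposition.

    Assumption: alpha = beta = 1/sqrt(2), so alpha^2 + beta^2 = 1.
    """
    amp = 1.0 / math.sqrt(2.0)
    return [(amp, amp) for _ in range(num_features)]


def observe(qubits, rng):
    """Collapse each quantum bit to a binary state.

    A bit becomes 1 (gene selected) with probability beta^2, otherwise 0.
    """
    return [1 if rng.random() < beta ** 2 else 0 for _, beta in qubits]


def init_population(num_features, pop_size=30, seed=0):
    """Build the initial population of binary candidate solutions."""
    rng = random.Random(seed)
    return [observe(init_qubits(num_features), rng) for _ in range(pop_size)]


# Toy example: 30 candidate solutions over D = 20 genes.
population = init_population(num_features=20)
```

Under this assumption each gene has a probability of about 0.5 of being selected, so an initial candidate solution contains roughly half of the D genes.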

The fitness (goodness) of these 30 candidate solutions in representing cell types will be evaluated using a classifier. In this research, five machine learning (ML) classifiers are employed to determine the most suitable classification model for handling scRNA-seq data. Four of them are coded using the scikit-learn library [60], namely LR, decision tree (DT), support vector machine (SVM) with linear kernel, and SVM with radial basis function (RBF) kernel. The fifth classifier is the extreme learning machine (ELM), a fast neural network with a single hidden layer [61] that is available on GitHub [62]. The default hyperparameter settings of the five classifiers are listed in Table 2. Notably, the hyperparameter settings of the five classifiers are not fine-tuned; they are left at the default settings. The purpose is to make a fair performance comparison of QDE with these ML classifiers using twelve datasets. As most of the datasets are imbalanced, the F1-score, which is the harmonic mean of precision and recall and ranges from 0 to 1 [63], is used as one of the computing elements of the fitness score. This is to evaluate the goodness of a gene feature subset in classifying cell types. The fitness score for the classification-based QDE is based on two elements: (1) classification performance, i.e., the F1-score, and (2) the number of selected genes. This is to ensure the feature subsets discovered are able to classify cells accurately using a smaller number of gene features. The fitness score of a candidate feature set is defined by Eq (1), where F1 is the F1-score of the feature set, N is the number of selected genes, and D is the total number of genes before selection.
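To make the trade-off between the two elements concrete, the sketch below combines them in a simple weighted sum. This form is an assumption chosen purely for illustration and should not be read as the exact expression of Eq (1); the function name `fitness_score` and the weight `w` are hypothetical.

```python
def fitness_score(f1, n_selected, n_total, w=0.9):
    """Illustrative fitness combining accuracy and subset size.

    ASSUMPTION: a weighted sum, not necessarily the form of Eq (1).
    f1          -- F1-score of the candidate feature subset (0 to 1)
    n_selected  -- N, the number of selected genes
    n_total     -- D, the total number of genes before selection
    w           -- hypothetical weight on classification performance
    """
    if not 0 < n_total:
        raise ValueError("n_total must be positive")
    reduction = 1.0 - n_selected / n_total  # larger when fewer genes are kept
    return w * f1 + (1.0 - w) * reduction


# Two subsets with the same F1-score: the smaller one scores higher.
small = fitness_score(f1=0.9, n_selected=50, n_total=100)
large = fitness_score(f1=0.9, n_selected=90, n_total=100)
```

Any form with the same ingredients should reward a higher F1-score and a smaller N relative to D, which is the behaviour described above.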

Table 2. Hyperparameter settings applied to each ML classifier.

https://doi.org/10.1371/journal.pone.0292961.t002

Mutation and crossover of the population are performed before forming a new population. This process creates child solutions whose performance is then compared with that of their respective parent solutions. The mutation and crossover are performed at a rate of 0.8 out of 1 to explore for better candidate solutions in the large feature subspace while still exploiting the current solutions [64]. After generating the child population, both parent and child solutions will be considered in a selection process to form a new population for the next generation. The elitist selection strategy [55] is applied in this study. If the fitness score of the parent solution is lower than the elitism threshold, the corresponding child solution is selected for the next generation. However, if the fitness score of the child solution is lower than that of the parent solution, the parent solution remains in the population for the next generation. The elitism threshold is defined by Eq (2), where F1_i is obtained from the mean F1-score of the initial population with an addition of 0.1 to allow improvement in the classification performance. N_i is an adjustable variable for the ideal portion of features to be selected. It is set to 0.01 throughout this study, i.e., 1% of the total number of features, to determine the best-performing feature subset.
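The selection rule can be sketched as follows. This is one possible reading of the two conditions above, in which a child never replaces a fitter parent and a parent that already reaches the elitism threshold is retained. The threshold is passed in as a plain number rather than computed from Eq (2), and the name `select_for_next_generation` is hypothetical.

```python
def select_for_next_generation(parent, child, parent_fit, child_fit, threshold):
    """Pick the solution that survives into the next generation.

    ASSUMED reading of the elitist rule:
    1. A child that is worse than its parent never replaces it.
    2. Otherwise the child replaces a parent whose fitness is still
       below the elitism threshold.
    3. A parent at or above the threshold is kept as an elite.
    Returns the surviving solution and its fitness score.
    """
    if child_fit < parent_fit:
        return parent, parent_fit
    if parent_fit < threshold:
        return child, child_fit
    return parent, parent_fit


# Toy example with 4-gene binary solutions and a hypothetical threshold of 0.8.
parent, child = [1, 0, 1, 1], [1, 0, 0, 1]
survivor, survivor_fit = select_for_next_generation(
    parent, child, parent_fit=0.70, child_fit=0.75, threshold=0.80
)
```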

The process of feature subset evaluation, mutation, crossover, and selection is repeated until a specified number of generations is reached. Since all datasets have a large number of features, the QDE is executed for 100 iterations so that the evolution process does not run long enough to overfit the results. The flowchart of the classification-based QDE is depicted in Fig 1.

Fig 1. Flowchart of classification-based QDE and 5-fold cross-validation process.

The fitness evaluation step is done based on the F1-score from a classification process. Five ML classifiers (LR, DT, SVM with linear and RBF kernel, as well as ELM) are tested for fitness evaluation in this QDE model. The performance of the feature subsets selected using classification-based QDE is evaluated in a 5-fold cross-validation process.

https://doi.org/10.1371/journal.pone.0292961.g001

Experimental setup

The experiments are designed in two stages using a 5-fold cross-validation strategy: (1) in the first stage, the best classifier for wrapper QDE is determined; (2) in the second stage, the selected method from stage 1 is used to compare with the classification performance of the recent methods comprising different feature selection algorithms.

In the first stage, the experiment is conducted using QDE wrapped with five different classifiers using the hyperparameter settings listed in Table 2. Thus, five methods have been developed, namely QDE with LR (QDE-LR), QDE with DT (QDE-DT), QDE with SVM of linear kernel (QDE-SVM), QDE with SVM of RBF kernel (QDE-SVMrbf), and QDE with ELM (QDE-ELM). A 5-fold cross-validation is used in the experiment (Fig 1). Each dataset is divided into five portions using the function StratifiedKFold() from the scikit-learn library [60]. In each fold, one portion of the dataset is set as the test set. The remaining four portions are used in the feature selection process and are further divided into training and validation sets with a ratio of 80:20 using the function train_test_split() from the scikit-learn library [60]. The training set is used to train a group (or a population) of classifiers using different subsets of gene features (candidate solutions) provided by QDE, and the validation set is used to compute the fitness score of each candidate solution. The average accuracy, F1-score, number of selected features, and the total time taken for the five test sets on each method are recorded.
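The splitting scheme can be sketched with the two scikit-learn functions named above. The data here are synthetic, and the use of `shuffle=True`, `stratify=`, and fixed `random_state` values are assumptions made for reproducibility of the sketch rather than settings reported in the study.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for an expression matrix: 100 cells x 20 genes, 4 cell types.
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = np.repeat([0, 1, 2, 3], 25)

# Outer loop: five stratified portions, one held out as the test set per fold.
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
for rest_idx, test_idx in outer.split(X, y):
    X_rest, y_rest = X[rest_idx], y[rest_idx]

    # Inner split: the remaining four portions become training and
    # validation sets with a ratio of 80:20.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.2, stratify=y_rest, random_state=0
    )

    # Feature selection would use (X_train, y_train) to fit classifiers and
    # (X_val, y_val) to score candidate solutions; the test portion
    # X[test_idx] is only used for the final evaluation.
    fold_sizes.append((len(y_train), len(y_val), len(test_idx)))
```

With 100 cells, each fold yields 64 training, 16 validation, and 20 test cells, so the test cells never take part in the search for gene subsets.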

The second stage of the experiment is a comparison of classification performance between the best-performing QDE method chosen from the first stage and four recent methods with different feature selection algorithms [37, 65–67] that have also been developed by a wrapper approach. Unless otherwise specified, all hyperparameter settings of these recent methods follow those in [37, 65–67].

One of the recent methods is FSCAM [37]. FSCAM applies a metaheuristic search strategy (i.e., FFOA) to find the optimal feature set. However, instead of a classification approach, FSCAM utilizes a clustering method (i.e., CAM) to evaluate the goodness of a feature subset. Another difference between the proposed classification-based QDE and FSCAM is that the latter imposes data preprocessing. FSCAM filters genes of zero read counts, dropouts, and genes of extremely high or low mean expression levels before conducting feature selection. In FSCAM, genes are modeled into a convex set. Identification of the vertices for the convex set corresponds to the identification of differentially expressed genes from the scRNA-seq data. This process is aided by optimization using FFOA. Genes that are exclusively expressed in a cell cluster (cell types) are preferred. FSCAM was originally an unsupervised method used to identify cell types. In this study, the genes selected from FSCAM are evaluated for their accuracy in cell type classification using a linear SVM. In our work, the hyperparameters of the SVM are in the same settings as the linear SVM from the best QDE method (i.e., QDE-SVM).

Apart from FSCAM, another three classification-based wrapper methods, namely hill-climbing-based social ski driver (SSD-LAHC) algorithm [65], mayfly-harmony search (MA-HS) [67], and binary sailfish (BSF) optimizer [66], are also included in the benchmark study. Notably, these three methods are general metaheuristic wrapper-based methods that have yet to be applied in the domain of scRNA-seq gene selection. In [65–67], the classifiers wrapped by these algorithms are KNN classifiers. The number of iterations and potential solutions per population of SSD-LAHC, MA-HS, and BSF are set to the same settings as those of the classification-based QDE (i.e., 100 iterations and 30 solutions per population) to align the experimental setup in the benchmark study. All of the experiments have been conducted on a workstation with an Intel(R) Xeon(R) W-2195 CPU @ 2.30GHz and 64GB of RAM.

Results

Stage 1: Comparison of classification-based QDE methods

Table 3 shows the classification performance of gene feature subsets selected by QDE with five different classifiers in terms of accuracy and F1-score. It can be seen that the QDE models with DT, SVM of RBF kernel, and ELM as the classifier perform with lower scores as compared to QDE with LR and linear SVM.

Table 3. Accuracy and F1-score from different QDE methods.

https://doi.org/10.1371/journal.pone.0292961.t003

In terms of the time taken to complete the feature selection process (Table 4), QDE with DT and ELM are generally faster than the others (an average time of 10.41 hours and 7.55 hours respectively). QDE-SVM is slower at an average time of 29.85 hours, followed by QDE-LR with an average time of 38.00 hours. QDE-SVMrbf is the slowest with an average time of 86.70 hours. As a fast and single-layered neural network, ELM has the shortest time of execution as compared to the other classifiers. However, the gene feature subsets selected by QDE-ELM lead to the lowest accuracy (an average accuracy of 0.7280) and F1-score (an average F1-score of 0.6967) among all the methods, especially for the non-pancreatic datasets. SVM with RBF kernel could help QDE to select genes with similar accuracy and F1-scores as DT, but QDE-SVMrbf requires the longest time to complete the same number of iterations. QDE-LR and QDE-SVM give a similar performance in terms of classification scores (around 0.94 accuracy and F1-score from both methods). Nevertheless, considering the time of execution, QDE-SVM is a better method as it gives high classification accuracy with a shorter training time (an average time of 29.85 hours) than QDE-LR (an average time of 38.00 hours). As shown in Table 5, all methods reduce the number of features to around half of the original number, as they use the same QDE search scheme.

Table 4. Time taken (in hours) for feature selection in different QDE methods.

https://doi.org/10.1371/journal.pone.0292961.t004

Table 5. Number of selected features from different QDE methods.

https://doi.org/10.1371/journal.pone.0292961.t005

Before selecting the best QDE method, a statistical test was performed. Since the results are not normally distributed and are largely skewed from the mean value, a non-parametric statistical test was applied. The Friedman test [68] was chosen and performed on the results (accuracy, F1-score, number of features, and time taken) on the twelve datasets from the five different QDE methods. The null hypothesis is that all of the methods have statistically similar results. A significance level of α = 0.05 was used. The p-values of all tests listed in Table 6 are lower than α, so all null hypotheses are rejected; that is, at least one of the five methods is statistically different from the others in the aspects of classification performance, number of features, and time of execution.
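As an illustration of how the Friedman statistic is obtained, the sketch below ranks the methods within each dataset and compares their mean ranks. The scores are made-up toy values (four datasets, three methods), not results from this study, and the tie correction is omitted for brevity. The helper names are hypothetical; in practice a library routine such as `scipy.stats.friedmanchisquare` can be used, which also returns the p-value.

```python
def rank_with_ties(values):
    """Rank values in ascending order (1 = smallest), averaging tied ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
        for t in range(i, j + 1):
            ranks[order[t]] = average_rank
        i = j + 1
    return ranks


def friedman_statistic(table):
    """Friedman chi-square statistic for rows = datasets, columns = methods.

    No tie correction is applied in this simplified version.
    """
    n, k = len(table), len(table[0])
    ranked = [rank_with_ties(row) for row in table]
    mean_ranks = [sum(row[j] for row in ranked) / n for j in range(k)]
    sum_sq = sum(r * r for r in mean_ranks)
    return 12 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


# Toy accuracies (NOT from the study): 4 datasets x 3 methods.
toy_scores = [
    [0.90, 0.80, 0.70],
    [0.95, 0.85, 0.60],
    [0.92, 0.81, 0.75],
    [0.88, 0.83, 0.70],
]
statistic = friedman_statistic(toy_scores)
```

The statistic is compared against a chi-square distribution with k − 1 degrees of freedom, where k is the number of methods, to obtain the p-value.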

Table 6. Friedman tests for 5 different QDE methods (QDE-LR, QDE-DT, QDE-SVM, QDE-SVMrbf, and QDE-ELM) in terms of accuracy, F1-score, number of features, and time taken.

https://doi.org/10.1371/journal.pone.0292961.t006

To identify how the methods differ from each other, post hoc tests were conducted using Holm’s procedure [69] for pairwise comparison. QDE-SVM is selected as the control algorithm as it showed the most advantages as discussed earlier. The post hoc tests also aimed to validate whether QDE-SVM is a better method by showing statistically significant differences from the other methods. The null hypothesis is that QDE-SVM is statistically similar to a compared method.
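Holm’s step-down procedure itself is short enough to sketch. It takes the unadjusted p-values of the pairwise comparisons against the control, sorts them in ascending order, and compares the i-th smallest with α / (m − i + 1), stopping at the first comparison that is not rejected. The p-values below are invented for illustration and the function name `holm_reject` is hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Return, for each hypothesis, whether Holm's procedure rejects it.

    p_values -- unadjusted p-values, one per comparison with the control
    alpha    -- family-wise significance level
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        # The smallest p-value is tested at alpha/m, the next at alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break  # once one hypothesis is retained, all larger p-values are too
    return reject


# Invented p-values for four comparisons against a control method.
toy_p_values = [0.30, 0.001, 0.02, 0.04]
decisions = holm_reject(toy_p_values)
```

In this toy case only the comparison with p = 0.001 is rejected: the next smallest value, 0.02, exceeds 0.05 / 3, so the procedure stops there.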

The results in Table 7 show that QDE-SVM achieves similar accuracy and F1-score with QDE-LR, where both of them are significantly more accurate as compared to the rest. The number of features selected by QDE-SVM and QDE-LR are also statistically equal. It can be inferred that QDE-LR and QDE-SVM are slightly better at selecting important features, as both of them have a relatively lower average feature number (Table 5). QDE-SVM requires a similar time duration as taken by QDE-DT, QDE-SVMrbf, and QDE-ELM. These test results show that QDE-SVM can achieve as good classification results as QDE-LR in a shorter time. Thus, QDE-SVM is selected as the best method in stage 1 for further comparison with other wrapper-based feature selection methods.

Table 7. Post hoc tests among different QDE methods (QDE-SVM as the control algorithm) using Holm’s procedure.

https://doi.org/10.1371/journal.pone.0292961.t007

Stage 2: Comparison of wrapper-based gene selection methods

The method introduced in this study is further compared with recent wrapper-based methods, including a clustering-based wrapper method (i.e., FSCAM), and three classification-based wrapper methods (i.e., SSD-LAHC [65], MA-HS [67], and BSF [66]). As mentioned in the “Materials and Methods: Experimental setup” section, the performances of features selected by FSCAM on test sets were determined using a linear SVM. This is to ensure the performance of FSCAM, originally an unsupervised cell type identification method, is comparable with other classification-based methods by using comparable metrics (i.e., accuracy and F1-score). In our work, the three classification-based wrappers were used with the original classifier (i.e., KNN) as in [65–67].

Table 8 shows the comparison of accuracy and F1-score between QDE-SVM and the recent methods. For the classification performance of the selected features on cell type identification, QDE-SVM has higher average scores as compared to the other methods (the average of 0.9456 and 0.9429 for accuracy and F1-score respectively). This phenomenon is also observed in a boxplot in Fig 2. The gene features selected by FSCAM achieve the lowest cell type classification performance as compared to the other wrapper methods (the average of 0.8292 and 0.8258 for accuracy and F1-score respectively). On the other hand, SSD-LAHC, MA-HS, and BSF perform with a moderate classification performance within a range of average accuracy between 0.8793 and 0.8872, and a range of average F1-score between 0.8679 and 0.8752.

Fig 2. Boxplot of cell type classification accuracy and F1-score of all datasets using gene features selected by different wrapper methods.

https://doi.org/10.1371/journal.pone.0292961.g002

Table 8. Accuracy and F1-score from different wrapper methods.

https://doi.org/10.1371/journal.pone.0292961.t008

In terms of the number of selected gene features, QDE-SVM, SSD-LAHC, and BSF have obtained nearly thirteen thousand gene features on average (Table 9). They might not be favorable for use in applications such as probe design in spatial transcriptomics, wherein a much smaller set of gene features is required [35]. On the other hand, the average number of features obtained by MA-HS (an average of 4478.12 genes) is between those of FSCAM, QDE-SVM, and SSD-LAHC. The number of features selected by FSCAM is the smallest among all, with an average of only 531.40 genes. In short, FSCAM is a better candidate than the rest in obtaining a small number of genes for classifying cell types.

Table 9. Number of selected features from different wrapper methods.

https://doi.org/10.1371/journal.pone.0292961.t009

On the other hand, as shown in Table 10, the time taken by FSCAM is much shorter for small datasets. However, when compared with QDE-SVM, such advantage of FSCAM is not shown when processing the datasets with large sample sizes such as Pancreas_Xin_2016, Pancreas_Segerstolpe_2016, Pancreas_Human_Baron_2016, Pancreas_Mouse_Baron_2016, and Brain_Mouse_Zeisel_2015. The time needed for SSD-LAHC, MA-HS, and BSF is much longer (on average from 169.65 to 295.15 hours) when compared to QDE-SVM (an average time of 29.85 hours) and FSCAM (an average time of 43.18 hours). Overall, QDE-SVM has the shortest time of execution on average.

Table 10. Time taken (in hours) for feature selection using different wrapper methods.

https://doi.org/10.1371/journal.pone.0292961.t010

Statistical tests on results from different wrapper methods were conducted using the Friedman test, followed by post hoc tests using Holm’s procedure to examine their significance. As in stage 1, an α value of 0.05 was used as the significance level to test the null hypothesis that all methods give statistically similar results. According to the results in Table 11, the null hypotheses are rejected, with p-values lower than α, for the tests on accuracy, F1-score, number of features, and time taken.

Table 11. Friedman tests for 5 different wrapper-based methods (QDE-SVM, FSCAM, SSD-LAHC, MA-HS, and BSF) in terms of accuracy, F1-score, number of features, and time taken.

https://doi.org/10.1371/journal.pone.0292961.t011

In the post hoc tests, QDE-SVM shows significant differences in cell type classification accuracy and F1-score among all other methods (Table 12). This implies that it can achieve better classification performance than the other methods. On the other hand, the number of gene features obtained by QDE-SVM is statistically similar to those of SSD-LAHC and BSF, and is significantly greater than those of FSCAM and MA-HS. The post hoc test results on execution times indicate that only QDE-SVM and FSCAM have a similar time of execution when processing all datasets. The other three wrapper methods utilize much longer execution times.

Table 12. Post hoc tests among different wrapper-based methods (QDE-SVM as the control algorithm) using Holm’s procedure.

https://doi.org/10.1371/journal.pone.0292961.t012

In summary, QDE-SVM is statistically more accurate than FSCAM, MA-HS, SSD-LAHC, and BSF in classifying cell types, at the expense of using a greater number of gene features than FSCAM and MA-HS. The number of gene features obtained by QDE-SVM is statistically the same as those of SSD-LAHC and BSF. Its execution time is comparable to that of FSCAM and much shorter than those of MA-HS, SSD-LAHC, and BSF. FSCAM applies pre-filtering of genes before running feature selection with FFOA. This explains why the number of features selected by FSCAM is much smaller than that of all the other methods, which do not filter any genes in advance.

Discussion

To further validate and analyze the effectiveness of the gene features selected by the proposed method, the best solution (gene feature subset) was extracted from the five-fold candidate solutions of each method. For each method, the gene subset with the highest fitness score (as defined in Eq (1)) was identified as the best solution. The overlap between the best solution of the proposed method and those of the other four methods was then examined for further insights. The percentage of overlap was calculated using the Jaccard score [70]. Table 13 shows the number and percentage of overlapping genes between the gene subsets from QDE-SVM and the other wrapper-based methods. Note that few genes from QDE-SVM overlap with those of the other wrapper methods (ranging from 0.32% to 45.11%). This shows that the genes selected by the five methods are different, a phenomenon that has also been observed in other gene selection studies [71, 72]. The number of overlapping genes between QDE-SVM and FSCAM is smaller than that between QDE-SVM and the other three methods. The reason is that FSCAM selects the smallest number of genes, and not many of these genes are also selected by QDE-SVM. On the other hand, the genes that are selected by QDE-SVM but not by the other methods might be the key genes contributing to the higher classification accuracy of QDE-SVM.

Table 13. Number and percentage of overlapping genes between QDE-SVM and other wrapper-based methods.

https://doi.org/10.1371/journal.pone.0292961.t013
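For reference, the Jaccard score of two gene subsets is the size of their intersection divided by the size of their union. The following minimal sketch computes the overlap count and the percentage for two subsets; the gene identifiers are placeholders and do not come from any of the datasets.

```python
def jaccard_score(genes_a, genes_b):
    """Jaccard score |A ∩ B| / |A ∪ B| of two gene collections (0.0 if both are empty)."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Placeholder gene identifiers for two hypothetical best solutions.
subset_one = ["GENE_A", "GENE_B", "GENE_C"]
subset_two = ["GENE_B", "GENE_C", "GENE_D"]

n_overlap = len(set(subset_one) & set(subset_two))
percentage = 100 * jaccard_score(subset_one, subset_two)
print(n_overlap, percentage)
```

Because the union grows with the larger subset, a very small subset compared against a very large one can only reach a low Jaccard score even when most of its genes are shared.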

As the genes selected by the five methods are different, the biological significance of the selected gene subsets was examined further through a gene enrichment analysis. Gene Ontology (GO) enrichment analysis was performed at http://geneontology.org/ to validate the biological significance of each selected gene subset. The test and correction methods used for the enrichment analyses were Fisher’s exact test and the false discovery rate (FDR), respectively. For each gene subset, the top 15 enriched GO terms were determined from terms of the third level and above in the ontology, as well as terms with high gene ratios and low FDR. For each category of datasets (developmental, metabolic diseases, and connective tissues), a dataset with a moderate number of genes was chosen as the representative dataset for analysis (Embryo_Mouse_Biase_2014 to represent the embryo development datasets, Pancreas_Human_Segerstolpe_2016 to represent the pancreas tissue and metabolic disease datasets, and Brain_Human_Darmanis_2015 to represent the connective tissue datasets).
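The statistics behind such an over-representation analysis can be sketched as follows. This is an illustration only and does not reproduce the web tool's implementation: it assumes a one-sided Fisher’s exact test (the hypergeometric upper tail) per GO term and assumes that the FDR correction is the Benjamini-Hochberg procedure. All counts and function names are hypothetical.

```python
from math import comb

def fisher_overrepresentation_p(n_background, n_term, n_subset, n_hit):
    """One-sided Fisher's exact p-value for over-representation.

    Probability that at least n_hit of the n_subset selected genes are annotated
    with a term that covers n_term of the n_background genes.
    """
    upper = min(n_term, n_subset)
    total = comb(n_background, n_subset)
    tail = sum(comb(n_term, i) * comb(n_background - n_term, n_subset - i)
               for i in range(n_hit, upper + 1))
    return tail / total

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical counts: 10 background genes, 3 annotated with the term,
# 3 genes selected, all 3 of them annotated.
p_term = fisher_overrepresentation_p(10, 3, 3, 3)
fdr = benjamini_hochberg([0.01, 0.04, 0.03])
print(p_term, fdr)
```

A small gene subset limits how many genes can map to any term, which is one reason enrichment results for very small subsets are hard to interpret.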

Fig 3A–3C show the GO enrichment results of the gene subsets from QDE-SVM for the datasets Embryo_Mouse_Biase_2014, Pancreas_Human_Segerstolpe_2016, and Brain_Human_Darmanis_2015, respectively. Gene subsets from FSCAM were also included in the enrichment analysis for reference, as FSCAM is also a wrapper-based method introduced for scRNA-seq gene selection. The horizontal axis shows the feature selection methods, while the vertical axis shows the enriched GO terms. The size of each data point represents the ratio of genes in the gene subset that match the genes involved in a GO term: the larger the point, the more genes match the term. The color intensity of the data points represents −log10(FDR), where a lighter color indicates a lower FDR.

Fig 3. GO enrichment results of gene subsets from QDE-SVM and FSCAM for the datasets (A) Embryo_Mouse_Biase_2014, (B) Pancreas_Human_Segerstolpe_2016, and (C) Brain_Human_Darmanis_2015.

https://doi.org/10.1371/journal.pone.0292961.g003

For the embryo development dataset (Embryo_Mouse_Biase_2014), the gene subset selected by QDE-SVM is enriched in development-related terms such as Golgi vesicle transport, which is essential for embryo development [73]. Signaling pathways are also found to be important in embryogenesis for the secretion of essential proteins such as growth factors [74, 75]. Other terms such as mRNA processing, translation, embryonic morphogenesis, and cellular component disassembly are related to the experimental setup of Embryo_Mouse_Biase_2014 as well (Fig 3A). For the gene subset from FSCAM, very few genes are mapped to the four enriched GO terms, most probably because only a small number of genes were selected. It is unlikely that one could obtain any valuable biological insights from this result.

Related GO terms can also be seen in the Pancreas_Human_Segerstolpe_2016 gene subsets (Fig 3B). For the gene subset from QDE-SVM, the enriched terms include intracellular organelles such as mitochondria, which are important for insulin regulation. Intracellular organelle stress is one of the potential research directions for type 2 diabetes (T2D) treatments [76]. In addition, the analysis shows that immunology-related terms are enriched in the gene subset. This can be related to the presence of immune cells in the pancreas as reported by Wu et al. [77]. For the gene subset from FSCAM, the genes are highly enriched in collagen-related terms, in most cases with all genes mapped. This might be due to the presence of a large amount of collagen in the pancreas extracellular matrix [77]. In addition, netrin-related terms are found in the gene subset from FSCAM, which could be related to pancreatic development [78]. Other terms, such as glomerular-related terms, may not be useful for the pancreatic dataset to our knowledge.

For the Brain_Human_Darmanis_2015 dataset, the enriched terms identified for the gene subset from QDE-SVM are mostly related to organelles (Fig 3C), which might reflect essential cell functions. There is also a small portion of immunology-related terms. T cells are found to be important for CNS neuroprotection [79], which might explain why T cell receptor complex and adaptive immune response terms appear for the cortex cells. The rest of the terms include cell differentiation, which is important for generating the various cells in the brain. For the gene subset from FSCAM, the gene ratio is relatively low because the subset contains fewer genes. However, the terms are also somewhat related to the dataset. For example, mitochondria and respiratory chain complex-related terms might be involved in the brain aging process and neurodegenerative diseases [80, 81]. Other terms, such as those for fibers and lysozymes, are also somewhat related to brain tissue [82].

GO enrichment analysis shows that the genes selected by QDE-SVM are biologically relevant. The top 15 terms differ from those of FSCAM, which is expected as few genes overlap. Both methods could be useful in discovering biologically relevant genes. Overall, this study provides a brief functional analysis of the selected gene features. More practical efforts are needed to validate the usefulness of the GO terms in downstream applications such as biomarker design.

QDE-SVM is potentially useful in various applications. Besides reducing the dimension of scRNA-seq data, the selected gene features can also be applied to future cell-type classification tasks in similar experimental settings [23]. As not all genes are useful for classification tasks, selecting genes that contribute to accurate classification helps to assign cell types correctly to newly sequenced single cells. In addition, the selected genes could facilitate the downstream identification of marker genes [7, 35]. The marker genes could be used to distinguish cell types, cell stages, certain diseases, or conditions. Reducing the number of potential marker genes using QDE-SVM eases laboratory experiments or tests.

The classification component in QDE-SVM leads to somewhat different results from those of the clustering-based method. Classification-based methods (i.e., QDE-SVM, SSD-LAHC, MA-HS, and BSF) generally select gene subsets that give more accurate classification than the clustering-based method (i.e., FSCAM). This is reasonable as they are supervised algorithms provided with labeled cells. However, the number of genes selected by QDE-SVM is only around half of the original number of genes, and might still need to be reduced further for feasible downstream applications. Also, when comparing different feature selection categories, an obvious limitation of the wrapper-based method is the time needed to conduct feature selection. Well-known non-wrapper-based methods such as Seurat [9] or scmap [23] fall under the filter-based feature selection category. Filter-based feature selection methods have the advantage of fast execution. This also explains why fewer works have been published on wrapper-based methods, as the computational time required increases with the number of iterations. Nevertheless, wrapper-based methods employ learning algorithms to assess the quality of features during the iterative search process, and this contributes to finding better (more accurate) feature subsets. Filter-based methods, which select gene features before any assessment with a learning algorithm, might not be accurate enough for downstream applications such as cell type classification [22]. Thus, the proposed wrapper-based method could still be useful with several improvements.
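The core idea of wrapper-based evaluation, scoring a candidate gene subset by the accuracy of a classifier trained on it, can be illustrated with a small self-contained sketch. Several assumptions apply: a nearest-centroid classifier with leave-one-out evaluation is swapped in for the linear SVM so that the example needs no external library, the score is plain accuracy and not the fitness function of Eq (1), and the expression values and labels are hypothetical toy data.

```python
from statistics import mean

def loo_accuracy(X, y, mask):
    """Leave-one-out accuracy of a nearest-centroid classifier on the masked genes.

    X[i][j] is the expression of gene j in cell i, y[i] is the cell type,
    and mask[j] is 1 if gene j is kept. An empty mask scores 0.
    """
    genes = [j for j, keep in enumerate(mask) if keep]
    if not genes:
        return 0.0
    correct = 0
    for i, cell in enumerate(X):
        best_label, best_dist = None, float("inf")
        for label in sorted(set(y)):
            # Centroid of this cell type, leaving the query cell out.
            rows = [X[r] for r in range(len(X)) if r != i and y[r] == label]
            if not rows:
                continue
            dist = sum((cell[j] - mean(row[j] for row in rows)) ** 2 for j in genes)
            if dist < best_dist:
                best_label, best_dist = label, dist
        correct += best_label == y[i]
    return correct / len(X)

# Hypothetical toy data: gene 0 separates the two cell types, gene 1 is noise.
X = [[0.0, 5.0], [0.2, 1.0], [0.1, 3.5], [1.0, 2.0], [1.1, 4.0], [0.9, 0.5]]
y = ["alpha", "alpha", "alpha", "beta", "beta", "beta"]
acc_informative = loo_accuracy(X, y, [1, 0])
acc_noise = loo_accuracy(X, y, [0, 1])
print(acc_informative, acc_noise)
```

Every candidate mask requires retraining and re-evaluating the classifier, which is why the cost of a wrapper grows with the number of candidates and iterations, whereas a filter scores each gene once without a classifier.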

In the future, the first effort should be to reduce the number of genes while still preserving the superior classification performance of QDE-SVM. This can be done using threshold-based filtering steps as in FSCAM, or using other filtering methods based on information theory, distance, correlation, etc. Additional studies and experiments would be required to determine suitable filters. Another possible direction is to improve QDE using different mutation and crossover schemes [83] or different selection strategies [84, 85], so that it becomes more explorative when searching for subsets in the large feature space. Additionally, the performance of the selected gene features could be tested across datasets with similar experimental settings, such as the same tissue, disease, or sequencing platform and protocol. This is a key step towards the application stage, i.e., cell-type classification.

Conclusion

In conclusion, a classification-based wrapper method for scRNA-seq gene selection has been presented. A linear SVM wrapped with QDE was suggested in this work based on the feature selection results on twelve well-known scRNA-seq transcriptomic datasets. QDE-SVM has been tested and validated to select biologically relevant gene subsets with superior cell type classification performance when compared to QDE wrapped with other classifiers and to recent wrapper methods. However, QDE-SVM has a limitation when compared to a recent wrapper-based scRNA-seq gene selection method, FSCAM: the number of features selected by QDE-SVM could still be reduced to obtain a set of informative marker genes for effective downstream analyses. Nevertheless, given that QDE-SVM achieves higher accuracy in a time similar to that of FSCAM, QDE-SVM is suggested as a promising gene selection method for further exploration.

References

  1. Tang F, Barbacioru C, Wang Y, Nordman E, Lee C, Xu N, et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods. 2009 May;6(5):377–82. pmid:19349980
  2. Li X, Wang CY. From bulk, single-cell to spatial RNA sequencing. Int J Oral Sci. 2021 Nov 15;13(1):1–6.
  3. Zhang Y, Wang D, Peng M, Tang L, Ouyang J, Xiong F, et al. Single-cell RNA sequencing in cancer research. J Exp Clin Cancer Res. 2021 Mar 1;40(1):81. pmid:33648534
  4. Heath JR, Ribas A, Mischel PS. Single-cell analysis tools for drug discovery and development. Nat Rev Drug Discov. 2016 Mar;15(3):204–16. pmid:26669673
  5. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and abundance estimation from RNA-Seq reveals thousands of new transcripts and switching among isoforms. Nat Biotechnol. 2010 May;28(5):511–5.
  6. Wang Z, Ding H, Zou Q. Identifying cell types to interpret scRNA-seq data: how, why and more possibilities. Brief Funct Genomics. 2020 Jul 29;19(4):286–91. pmid:32232401
  7. Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods. 2017 May;14(5):483–6. pmid:28346451
  8. Soneson C, Robinson MD. Bias, robustness and scalability in single-cell differential expression analysis. Nat Methods. 2018 Apr;15(4):255–61. pmid:29481549
  9. Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of single-cell gene expression data. Nat Biotechnol. 2015 May;33(5):495–502. pmid:25867923
  10. Perkel JM. Single-cell sequencing made simple. Nature. 2017 Jul;547(7661):125–6. pmid:28682345
  11. Hwang B, Lee JH, Bang D. Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp Mol Med. 2018 Aug;50(8):1–14. pmid:30089861
  12. Lähnemann D, Köster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD, et al. Eleven grand challenges in single-cell data science. Genome Biol. 2020 Feb 7;21(1):31. pmid:32033589
  13. Rozenblatt-Rosen O, Stubbington MJT, Regev A, Teichmann SA. The Human Cell Atlas: from vision to reality. Nature. 2017 Oct;550(7677):451–3. pmid:29072289
  14. Abdelaal T, Michielsen L, Cats D, Hoogduin D, Mei H, Reinders MJT, et al. A comparison of automatic cell identification methods for single-cell RNA sequencing data. Genome Biol. 2019 Sep 9;20(1):194. pmid:31500660
  15. Zheng GXY, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017 Jan 16;8(1):14049. pmid:28091601
  16. Stegle O, Teichmann SA, Marioni JC. Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet. 2015 Mar;16(3):133–45. pmid:25628217
  17. Chandrashekar G, Sahin F. A survey on feature selection methods. Comput Electr Eng. 2014 Jan 1;40(1):16–28.
  18. Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007 Oct 1;23(19):2507–17. pmid:17720704
  19. Lazar C, Taminau J, Meganck S, Steenhoff D, Coletta A, Molter C, et al. A survey on filter techniques for feature selection in gene expression microarray analysis. IEEE/ACM Trans Comput Biol Bioinform. 2012 Jul;9(4):1106–19. pmid:22350210
  20. Tang J, Alelyani S, Liu H. Feature selection for classification: A review. In: Data Classification. CRC Press; 2014. p. 37–64.
  21. Kohavi R, John GH. Wrappers for feature subset selection. Artif Intell. 1997 Dec 1;97(1):273–324.
  22. Yang P, Huang H, Liu C. Feature selection revisited in the single-cell era. Genome Biol. 2021 Dec 1;22(1):321. pmid:34847932
  23. Kiselev VY, Yiu A, Hemberg M. scmap: projection of single-cell RNA-seq data across data sets. Nat Commun. 2018 May;15(5):359–62. pmid:29608555
  24. Lieberman Y, Rokach L, Shay T. CaSTLe–Classification of single cells by transfer learning: Harnessing the power of publicly available single cell RNA sequencing experiments to annotate new experiments. PLoS ONE. 2018 Oct 10;13(10):e0205499. pmid:30304022
  25. Lin Y, Cao Y, Kim HJ, Salim A, Speed TP, Lin DM, et al. scClassify: sample size estimation and multiscale classification of cells using single and multiple reference. Mol Syst Biol. 2020 Jun 22;16(6):e9389. pmid:32567229
  26. Delaney C, Schnell A, Cammarata LV, Yao-Smith A, Regev A, Kuchroo VK, et al. Combinatorial prediction of marker panels from single-cell transcriptomic data. Mol Syst Biol. 2019 Oct;15(10):e9005. pmid:31657111
  27. Lall S, Ghosh A, Ray S, Bandyopadhyay S. sc-REnF: An entropy guided robust feature selection for single-cell RNA-seq data. Brief Bioinform. 2022 Mar 1;23(2):bbab517. pmid:35037023
  28. Vans E, Patil A, Sharma A. FEATS: feature selection-based clustering of single-cell RNA-seq data. Brief Bioinform. 2021 Jul 1;22(4):bbaa306. pmid:33285568
  29. Wang F, Liang S, Kumar T, Navin N, Chen K. SCMarker: ab initio marker selection for single cell transcriptome profiling. PLOS Comput Biol. 2019 Oct 28;15(10):e1007445. pmid:31658262
  30. Zhao Y, Zhao FY, Lin CX, Deng C, Xu YP, Li HD. RFCell: a gene selection approach for scRNA-seq clustering based on permutation and random forest. Front Genet. 2021 Jul 27;12:665843. pmid:34386033
  31. Aevermann B, Zhang Y, Novotny M, Keshk M, Bakken T, Miller J, et al. A machine learning method for the discovery of minimum marker gene combinations for cell type identification from single-cell RNA sequencing. Genome Res. 2021 Oct 1;31(10):1767–80. pmid:34088715
  32. Sen Puliparambil B, Tomal JH, Yan Y. A novel algorithm for feature selection using penalized regression with applications to single-cell RNA sequencing data. Biology. 2022 Oct;11(10):1495. pmid:36290397
  33. Ntranos V, Yi L, Melsted P, Pachter L. A discriminative learning approach to differential expression analysis for single-cell RNA-seq. Nat Methods. 2019 Feb;16(2):163–6. pmid:30664774
  34. Bian C, Wang X, Su Y, Wang Y, Wong KC, Li X. scEFSC: Accurate single-cell RNA-seq data analysis via ensemble consensus clustering based on multiple feature selections. Comput Struct Biotechnol J. 2022 Jan 1;20:2181–97. pmid:35615016
  35. Nelson ME, Riva SG, Cvejic A. SMaSH: a scalable, general marker gene identification framework for single-cell RNA-sequencing. BMC Bioinformatics. 2022 Aug 8;23(1):328. pmid:35941549
  36. Feng J, Niu X, Zhang J, Wang JH. Gene selection and classification of scRNA-seq data combining information gain ratio and genetic algorithm with dynamic crossover. Wirel Commun Mob Comput. 2022 Jan 31;2022:e9639304.
  37. Wang Y, Gao J, Xuan C, Guan T, Wang Y, Zhou G, et al. FSCAM: CAM-based feature selection for clustering scRNA-seq. Interdiscip Sci Comput Life Sci. 2022 Jun 1;14(2):394–408. pmid:35028910
  38. Chatzilygeroudis KI, Vrahatis AG, Tasoulis SK, Vrahatis MN. Feature selection in single-cell RNA-seq data via a genetic algorithm. In: Simos DE, Pardalos PM, Kotsireas IS, editors. Learning and Intelligent Optimization. Cham: Springer International Publishing; 2021. p. 66–79. (Lecture Notes in Computer Science).
  39. Inza I, Larrañaga P, Blanco R, Cerrolaza AJ. Filter versus wrapper gene selection approaches in DNA microarray domains. Artif Intell Med. 2004 Jun;31(2):91–103. pmid:15219288
  40. Xue B, Zhang M, Browne WN. A comprehensive comparison on evolutionary feature selection approaches to classification. Int J Comput Intell Appl. 2015 Jun;14(02):1550008.
  41. Gan Y, Guan J, Zhou S. A comparison study on feature selection of DNA structural properties for promoter prediction. BMC Bioinformatics. 2012 Jan 7;13(1):4.
  42. Bard J, Rhee SY, Ashburner M. An ontology for cell types. Genome Biol. 2005 Jan 14;6(2):R21. pmid:15693950
  43. Yan L, Yang M, Guo H, Yang L, Wu J, Li R, et al. Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells. Nat Struct Mol Biol. 2013 Sep;20(9):1131–9. pmid:23934149
  44. Biase FH, Cao X, Zhong S. Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell RNA sequencing. Genome Res. 2014 Nov;24(11):1787–96. pmid:25096407
  45. Deng Q, Ramsköld D, Reinius B, Sandberg R. Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science. 2014 Jan 10;343(6167):193–6. pmid:24408435
  46. Goolam M, Scialdone A, Graham SJL, Macaulay IC, Jedrusik A, Hupalowska A, et al. Heterogeneity in Oct4 and Sox2 targets biases cell fate in 4-cell mouse embryos. Cell. 2016 Mar 24;165(1):61–74. pmid:27015307
  47. Lawlor N, George J, Bolisetty M, Kursawe R, Sun L, Sivakamasundari V, et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes. Genome Res. 2017 Feb;27(2):208–22. pmid:27864352
  48. Segerstolpe Å, Palasantza A, Eliasson P, Andersson EM, Andréasson AC, Sun X, et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 2016 Oct 11;24(4):593–607. pmid:27667667
  49. Xin Y, Kim J, Okamoto H, Ni M, Wei Y, Adler C, et al. RNA sequencing of single human islet cells reveals type 2 diabetes genes. Cell Metab. 2016 Oct 11;24(4):608–15. pmid:27667665
  50. Baron M, Veres A, Wolock SL, Faust AL, Gaujoux R, Vetere A, et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 2016 Oct 26;3(4):346–360.e4. pmid:27667365
  51. Darmanis S, Sloan SA, Zhang Y, Enge M, Caneda C, Shuer LM, et al. A survey of human brain transcriptome diversity at the single cell level. Proc Natl Acad Sci U S A. 2015 Jun 9;112(23):7285–90. pmid:26060301
  52. Zeisel A, Muñoz-Manchado AB, Codeluppi S, Lönnerberg P, La Manno G, Juréus A, et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science. 2015 Mar 6;347(6226):1138–42.
  53. Tasic B, Menon V, Nguyen TN, Kim TK, Jarsky T, Yao Z, et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat Neurosci. 2016 Feb;19(2):335–46. pmid:26727548
  54. Su H, Yang Y. Quantum-inspired differential evolution for binary optimization. In: 2008 Fourth International Conference on Natural Computation. 2008. p. 341–6.
  55. Srikrishna V, Ghosh R, Ravi V, Deb K. Elitist quantum-inspired differential evolution based wrapper for feature subset selection. In 2015. p. 113–24.
  56. Kamarudin MB, Ong CS, Tan SC. Quantum-inspired differential evolution algorithm in probiotics marker genes selection. In: 2022 10th International Conference on Information and Communication Technology (ICoICT). 2022. p. 413–7.
  57. Storn R, Price K. Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. J Glob Optim. 1997 Dec 1;11(4):341–59.
  58. Narayanan A, Moore M. Quantum-inspired genetic algorithms. In: Proceedings of IEEE International Conference on Evolutionary Computation. 1996. p. 61–6.
  59. Han KH, Kim JH. Quantum-inspired evolutionary algorithm for a class of combinatorial optimization. IEEE Trans Evol Comput. 2002 Dec;6(6):580–93.
  60. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in python. J Mach Learn Res. 2011 Nov 1;12:2825–30.
  61. Huang GB, Zhu QY, Siew CK. Extreme learning machine: Theory and applications. Neurocomputing. 2006 Dec 1;70(1):489–501.
  62. Lambert DC. Python-ELM [Internet]. 2021 [cited 2023 Mar 11]. Available from: https://github.com/dclambert/Python-ELM
  63. Sasaki Y. The truth of the F-measure. 2007; Available from: https://www.cs.odu.edu/~mukka/cs795sum11dm/Lecturenotes/Day3/F-measure-YS-26Oct07.pdf
  64. Eiben A, Schippers C. On evolutionary exploration and exploitation. Fundam Inf. 1998 Aug 1;35:35–50.
  65. Chatterjee B, Bhattacharyya T, Ghosh KK, Singh PK, Geem ZW, Sarkar R. Late acceptance hill climbing based social ski driver algorithm for feature selection. IEEE Access. 2020;8:75393–408.
  66. Ghosh KK, Ahmed S, Singh PK, Geem ZW, Sarkar R. Improved binary sailfish optimizer based on adaptive β-hill climbing for feature selection. IEEE Access. 2020;8:83548–60.
  67. Bhattacharyya T, Chatterjee B, Singh PK, Yoon JH, Geem ZW, Sarkar R. Mayfly in harmony: A new hybrid meta-heuristic feature selection algorithm. IEEE Access. 2020;8:195929–45.
  68. Friedman M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc. 1937 Dec 1;32(200):675–701.
  69. Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat. 1979;6(2):65–70.
  70. Jaccard P. The distribution of the flora in the alpine zone. New Phytol. 1912;11(2):37–50.
  71. M Ascensión A, Ibáñez-Solé O, Inza I, Izeta A, Araúzo-Bravo MJ. Triku: a feature selection method based on nearest neighbors for single-cell data. GigaScience. 2022 Jan 1;11:giac017. pmid:35277963
  72. Yip SH, Sham PC, Wang J. Evaluation of tools for highly variable gene discovery from single-cell RNA-seq data. Brief Bioinform. 2019 Jul 19;20(4):1583–9. pmid:29481632
  73. Zhong W. Golgi during development. Cold Spring Harb Perspect Biol. 2011 Sep;3(9):a005363. pmid:21768608
  74. Basson MA. Signaling in cell differentiation and morphogenesis. Cold Spring Harb Perspect Biol. 2012 Jun;4(6):a008151. pmid:22570373
  75. Komiya Y, Habas R. Wnt signal transduction pathways. Organogenesis. 2008;4(2):68–75. pmid:19279717
  76. Chang YC, Hee SW, Hsieh ML, Jeng YM, Chuang LM. The role of organelle stresses in diabetes mellitus and obesity: Implication for treatment. Anal Cell Pathol Amst. 2015;2015:972891. pmid:26613076
  77. Wu M, Lee MYY, Bahl V, Traum D, Schug J, Kusmartseva I, et al. Single-cell analysis of the human pancreas in type 2 diabetes using multi-spectral imaging mass cytometry. Cell Rep. 2021 Nov 2;37(5):109919. pmid:34731614
  78. Hebrok M, Reichardt LF. Brain meets pancreas: netrin, an axon guidance molecule, controls epithelial cell migration. Trends Cell Biol. 2004 Apr;14(4):153–5. pmid:15134068
  79. Evans FL, Dittmer M, de la Fuente AG, Fitzgerald DC. Protective and regenerative roles of T cells in central nervous system disorders. Front Immunol. 2019 Sep 12;10:2171. pmid:31572381
  80. Flønes IH, Ricken G, Klotz S, Lang A, Ströbel T, Dölle C, et al. Mitochondrial respiratory chain deficiency correlates with the severity of neuropathology in sporadic Creutzfeldt-Jakob disease. Acta Neuropathol Commun. 2020 Apr 16;8:50. pmid:32299489
  81. Ojaimi J, Masters CL, Opeskin K, McKelvie P, Byrne E. Mitochondrial respiratory chain activity in the human brain as a function of age. Mech Ageing Dev. 1999 Nov 2;111(1):39–47. pmid:10576606
  82. Sandin L, Bergkvist L, Nath S, Kielkopf C, Janefjord C, Helmfors L, et al. Beneficial effects of increased lysozyme levels in Alzheimer’s disease modelled in Drosophila melanogaster. Febs J. 2016 Oct;283(19):3508–22. pmid:27562772
  83. Georgioudakis M, Plevris V. A comparative study of differential evolution variants in constrained structural optimization. Front Built Environ. 2020;6:102.
  84. Bilal, Pant M, Zaheer H, Garcia-Hernandez L, Abraham A. Differential Evolution: A review of more than two decades of research. Eng Appl Artif Intell. 2020 Apr 1;90:103479.
  85. Blickle T, Thiele L. A comparison of selection schemes used in evolutionary algorithms. Evol Comput. 1996 Dec;4(4):361–94.