Abstract

Identification of differentially expressed (DE) genes with two or more conditions is an important task for discovery of few biomarker genes. Significance Analysis of Microarrays (SAM) is a popular statistical approach for identification of DE genes for both small- and large-sample cases. However, it is sensitive to outlying gene expressions and produces low power in presence of outliers. Therefore, in this paper, an attempt is made to robustify the SAM approach using the minimum β-divergence estimators instead of the maximum likelihood estimators of the parameters. We demonstrated the performance of the proposed method in a comparison of some other popular statistical methods such as ANOVA, SAM, LIMMA, KW, EBarrays, GaGa, and BRIDGE using both simulated and real gene expression datasets. We observe that all methods show good and almost equal performance in absence of outliers for the large-sample cases, while in the small-sample cases only three methods (SAM, LIMMA, and proposed) show almost equal and better performance than others with two or more conditions. However, in the presence of outliers, on an average, only the proposed method performs better than others for both small- and large-sample cases with each condition.

1. Introduction

Microarray experiments are usually conducted with expressions of huge number of genes () and a small number of experimental samples (). This unique data structure has been discovered as a completely new promising area for the researchers. At the same time, it provides a challenge to the researchers because of high dimensionality and its complexity with small sample size. Among this huge number of genes, discovery of few biomarker genes those are differentially expressed (DE) between two or more experimental conditions with multiple patterns is one of the main objectives of this experiments. These biomarker genes are important in the diagnosis of different types and subtypes of diseases for patient prognosis and treatment [13]. Nowadays, researchers are also interested in exploring the gene coexpression network or interaction of DE genes to predict the hub genes that are associated with different types and subtypes of cancer [4]. The most commonly used statistical tests for the discovery of DE genes between two or more conditions are t-test or ANOVA (F-test). However, both testing procedures sometimes produce misleading results to discover few biomarker genes, because both of them suffer from small-sample sizes and normality assumptions, and they do not share the information of all genes [5]. Therefore, a gene-specific t-statistic or F-statistic becomes large even for low differential expressions of genes between two or more conditions. Thus, the false discovery rate (FDR) may increase. Tusher et al. [6] introduced a popular statistical technique to detect the DE genes by assimilating a set of gene-specific t-tests. This approach is known as Significance Analysis of Microarrays (SAM). It controls the FDR by sharing information of all genes. It does not suffer from the small-sample sizes and normality assumptions. There is a close relation between SAM and FDR-based multiple testing correction approach of Benjamini and Hochberg [7]. Because of low expense and advancement of microarray technology, SAM has been extensively used in gene expression data analysis. For example, Li et al. [8] performed an integrative data analysis to identify the breast cancer subtype-specific biomarkers by combining copy number aberrations and miRNA-mRNA dual expression profiling data. Tarhini et al. [9] identified some immune-related genes that are associated with neoadjuvant ipilimumab clinical benefit. Wu et al. [10] detected some potential biomarkers for the diagnosis and prediction of preeclampsia. Ren et al. [11] identified some subtype-specific novel biomarkers using colon cancer gene expression data. However, SAM is very sensitive to outliers. It produces larger FDR and lower power in presence of outliers. Therefore, in this paper, an attempt is made to robustify the SAM [6] approach using the minimum β-divergence method [12]. Then we investigate the performance of the proposed method in a comparison of traditional SAM approach as well as some other popular methods such as ANOVA [13, 14], LIMMA [15], Kruskal-Wallis (KW) [16], empirical Bayes (EB) [1719], Gama-Gama model (GaGa) [20], and Bridge [21] using both simulated and real gene expression datasets. The paper is organized as follows: Section 2 contains the formulation of traditional SAM algorithm and proposed robust SAM algorithm with detailed description. Simulation and real microarray data analysis are carried out in Section 3. Finally, we end up with a conclusion.

2. Methods

Let be the th gene expression for the th samples . Also let denote the mean of the th gene for kth condition. Then we would like to test the hypothesis H0: versus H1: which implies that H0: versus is not true. A gene is said to be equally expressed (EE) if is accepted; otherwise it is DE. If denotes the sample mean of th gene for th condition and denotes the pooled within-class sample variance then the formula of two-sample t-test to test the above null-hypothesis is as follows:whereHere, ,The t-statistic given in (1) follows the t-distribution with degrees of freedom. As early mentioned this test statistic increases the FDR for small-sample cases. To overcome this problem Tusher et al. [6] proposed a modification of the t-statistic by adding a constant to the denominator, which is known as the test statistic of Significance Analysis of Microarrays (SAM) algorithm. This statistic is defined as follows:where is the percentile of the distribution of . For conditions, the modified t-statistic in (4) is defined in terms of Fisher’s linear discriminant, assuming samples consist of nonoverlapping subsets, such that the response parameter , , and is the number of expressions in . Then the scores and standard deviation in (4) are replaced by the following two equations, respectively:where . For the details about SAM procedure visit http://statweb.stanford.edu/~tibs/SAM/.

However, the test statistic given in (4) produces misleading results in presence of outliers, since and in (2) and (5) are sensitive to outliers. Therefore, in this paper, an attempt is made to robustify the test statistic, , in (4) by minimum -divergence method. The minimum -divergence estimators of the parameters are computed iteratively as follows: where . Herewhich is known as -weight function. This weight function was first introduced in [12] for the robustification of prewhitening procedure to improve the performance of independent component analysis (ICA) algorithms for blind source separation (BSS). It was generalized in [22] for the robust extraction of local principal components. Then it was used in [23] for the robustification of empirical Bayes approach [18] to identify DE genes between two conditions. It was also used in [24] to improve one-way ANOVA for the robust and efficient estimation of DE genes with multiple patterns. Note that if , the minimum β-divergence estimators reduce to classical maximum likelihood estimators (MLEs) . Since, in absence of outliers, the MLEs of Gaussian distribution are consistent and asymptotically efficient [25], therefore, in this paper, the MLEs are used in absence of outliers and in presence of outliers, the minimum β-divergence estimators are used for the estimation of in SAM approach. We can apply our proposed SAM approach in two ways. One way is to select the using the cross validation (CV) which was discussed in detail in [12, 22]. The CV approach produces in absence of outliers, while in presence of outliers. The minimum -divergence estimators in (6) with are equivalent to MLEs in (3) as mentioned previously. Thus the minimum -divergence estimators with an appropriate selection by CV produce both robust and efficient estimates for the parameters. However, in our current problem, it would be time-consuming, since CV approach needs to be applied for each gene of the entire genome in each condition separately to select the appropriate . To overcome this problem, in this paper we consider outlier detection approach based on -weight function with a fixed . The value of -weight function lies between 0 and 1. It produces larger weights with the usual gene expressions and smaller weights with the unusual/outlying gene expressions for a wide range of [12, 22]. By assigning low weights to outliers, the estimators become robust. The larger increases the robustness of estimators but decreases the efficiency, while the smaller increases the efficiency but decreases the robustness. Thus the value of controls the balance between the robustness and efficiency of the estimators. Therefore, in this paper, we fix for outlier detection using -weight function which was also used in [24]. A value () of gene expression is classified as usual or unusual based on this -weight function as follows:where we fix the cutoff value , since -weights lie between 0 and 1 and smaller weights occur with unusual/outlying gene expressions. Here is calculated by the following equation:which was also used in [22]. Here, is a smaller quantity; in our analysis we consider . After convergence of (6), we obtained the robust estimates of the parameters . Then we combine the MLEs and minimum -divergence estimators to estimate the parameters as follows:

3. Results and Discussion

To demonstrate the performance of the proposed method in a comparison of other popular methods (ANOVA, SAM, LIMMA, KW, EB, GAGA, and BRIDGE), we used both simulated and real microarray gene expression datasets. We used five R packages of other methods such as samr, limma, EBarrays, gaga, and bridge. The performance measures AUC and pAUC were computed for each of the methods using ROC package. All R packages are available in the comprehensive R archive network (cran) or bioconductor.

3.1. Performance Evaluation

In order to investigate the performance of the proposed method in a comparison of some other popular methods for binary class prediction such as DE or EE, we use different performance measures including the receiving operating characteristic (ROC) curve, area under the ROC curve (AUC), and partial AUC (pAUC) derived through the confusion matrix as shown in Table 1.

We compute different performance index based on the confusion matrix as follows:True positive rate (TPR) = , true negative rate (TNR) = .False positive rate (FPR) = , false negative rate (FNR) = .False discovery rate (FDR) = , misclassification error rate (MER) = .Power = 1 − FNR, where denotes the number of true positives and so on.A method is said to be good performer if it produces higher values of TPR, TNR, AUC, pAUC, and power and smaller values of FPR, FNR, FDR, and MER.

3.2. Data Generating Model

We used the following statistical model to generate simulated data with known characteristics:where is the th gene expression for the jth samples, is the mean of all expressions of th gene in the th condition, and is the random error term which follows ). The outlying datasets were generated by replacing at most 5% of the expression values from samples by outliers using where is an arbitrarily fixed value.

3.2.1. Simulation Study 1

To investigate the performance of the proposed robust SAM in a comparison with the traditional SAM approach, we generated a dataset using (11) with genes, and samples, 10 each in condition 1 and condition 2. There are 100 DE genes in this dataset (50 upregulated and 50 downregulated). Then we employed SAM and proposed method in this dataset to determine the DE genes. Figures 1(a) and 1(b) represent the Q-Q plots for this dataset using SAM and proposed method. In this figure, the number of genes above the band in the upper right (red color) and below the band in the bottom left (green color) indicates the number of upregulated and downregulated DE genes, respectively. We observed that both SAM and our proposed method identified 88 true DE genes with considering 12 false positives on an average. To evaluate the performance of these two methods in presence of outliers, we generate the outlying dataset using (12). We consider one outlier in each of 10% genes. Then we employed these methods in the outlying dataset to identify the DE genes. Figures 1(c) and 1(d) show the Q-Q plots using this outlying dataset. We can clearly see from Figure 1(c) that there are only 19 true DE genes identified by the SAM with considering 42 false positives on an average, whereas in Figure 1(d) the proposed method identified 86 true DE genes with considering 14 false positives on an average. The plots of smallest -weight for each of 1000 genes are displayed in Figures S1(a) and S1(b) in the supporting file (see Supplementary Material available online at https://doi.org/10.1155/2017/5310198) in absence and presence of outliers, respectively. Cleary we observe that, in absence of outliers, -weight function produces larger weights and in presence of outliers, it produces smaller weights (almost close to zero) for unusual/outlying gene expressions. The outlier genes are indicated in red color (see Figure S1(b)). So we may conclude that our proposed method performs better in both situations, in absence and presence of outliers.

3.2.2. Simulation Study 2

To investigate the performance of the proposed method in a comparison of the other seven popular methods as early mentioned for conditions, we performed 100 simulations to generate 100 datasets for both small- and large- sample cases using (11). We set the arbitrary values ( and for datasets generated from each simulation. Each dataset for each case represented the gene expression profiles of genes, with samples. The proportions of DE (pde) gene were set to 0.02, 0.04, 0.06, 0.08, and 0.1 for each of the 100 datasets. For these values, the theoretical numbers of DE genes are, respectively, 200, 400, 600, 800, and 1,000. To evaluate the performance of all the methods in presence of outliers, we generated 100 outlying datasets from each of the original datasets using (12). We consider one or two outliers in each of 10%, 20%, and 50% genes for each datasets. We computed average values of different performance measures such as TPR, TNR, FPR, FNR, FDR, AUC, and power based on 200, 400, 600, 800, and 1,000 estimated DE genes by eight methods (ANOVA, SAM, LIMMA, KW, EB, GAGA, BRIDGE, and proposed) for each of 100 datasets. Figure 2 and Figure S2 show the ROC curve based on 400 estimated DE genes by each of the methods, in absence and presence of one or two outliers in each of 10%, 20%, and 50% genes for both small- and large-sample cases, respectively. We observe that all the eight methods performed almost similarly in absence of outliers for both small- and large-sample cases, except ANOVA for small-sample case (see Figure 2(a)). But in presence of outliers, the proposed method outperforms the other seven methods for small-sample case (see Figures 2(b)2(d)). In this case BRIDGE also performs better. For large-sample case in presence of outliers, KW, BRIDGE, and proposed method perform well (see Figures S2(b)–S2(d)). Figure 3 represents the barplot of power estimated by different methods associated with varying proportions of DE gene in both absence and presence of outliers for small-sample case. Figures 4 and 5 show the boxplots of AUC values based on 100 simulated datasets by each of the methods in absence and presence of outliers for small- and large-sample cases, respectively. The panels (a), (b), (c), (d), and (e) in Figures 4 and 5 show the boxplots of AUC values using the five values for parameter pde, 0.02, 0.04, 0.06, 0.08, and 0.1, respectively, for small- and large-sample cases. Similar results are found from these boxplots and barplots for all pde values like ROC curve. Table 2 represents the average FDR estimated by eight methods based on 100 simulated datasets with pde 2%, 4%, 6%, 8%, and 10% in absence and presence of outliers for both small- and large-sample cases. In this table the results within the brackets (·), , and [·] indicate estimated FDR by different methods in presence of one or two outliers in each of 10%, 20%, and 50% genes, respectively. From this table we also draw similar interpretations like ROC curve, boxplots, and barplots. So we may conclude that on an average the proposed method performed well compared to the other seven methods with conditions both in absence and in presence of outliers.

3.2.3. Simulation Study 3

To demonstrate the performance of the proposed method in a comparison of other popular methods for conditions with multiple patterns, we generated 100 datasets for both small- and large- sample cases using (11). Each dataset was generated for and . Each dataset contains the expression values for genes with samples. The proportion of DE gene was fixed at 0.02 for each of the datasets. We generated the outlying gene expression datasets using (12) as before. We investigated the performance of the proposed method in a comparison of the other popular methods that are suitable for multiple-comparison tests (ANOVA, KW, SAM, and LIMMA). We first applied these methods to classify DE or EE genes and estimated different performance measures such as AUC, pAUC, MER, and FDR by these methods. Results obtained from these methods are presented in Table 3.

From this table we observe that, for small-sample case in absence of outliers, three methods (SAM, LIMMA, and proposed) exhibited better performance (AUC > 80%) than ANOVA and KW. KW performs worse in this case. But in presence of outliers, the proposed method outperforms other methods, producing more stable and consistent results (lower FDR and higher AUC, pAUC values). On the other hand, for large-sample case KW and proposed method performed well compared to the other methods (ANOVA, SAM, and LIMMA), in presence of outliers. Our proposed method exhibited slightly better performance than KW in this case, whereas in absence of outliers, they performed similarly. For both cases the Benjamini-Hochberg (BH) method was used to adjust the values for each of the methods. Figure S3 represents the boxplots of MER values estimated by these five methods based on 200 DE genes in absence and presence of outliers for small-sample case. This figure also supports the results of Table 3. To demonstrate the pattern-detection performance of these methods, we again generated 100 datasets using (11). This time we consider the gene expression profiles for genes with 300 DE genes for sample size of 3 in each condition. These 300 DE genes consist of four different patterns. These patterns are shown in Figure S4. Table 4 represents the performance of different methods for detection of up- and downregulated genes in pairs with conditions for small-sample case. In this table the values within the bracket represents the number of true up- and downregulated (UR, DR) DE genes in 6 pairs of conditions and the values within the brackets and represent the number of predicted (UR, DR) genes by five methods (ANOVA, KW, SAM, LIMMA, and proposed) and correctly detected number of (UR, DR) DE genes by these methods in absence and presence of one outlier in each of 10% genes. We observed that, in absence of outliers, three methods (SAM, LIMMA, and proposed) perform well for detecting of number of (UP, DR) genes in different pairs. KW performed badly in this case compared to ANOVA, whereas, in presence of outliers, the proposed method performed better than other methods (ANOVA, KW, SAM, and LIMMA). So, from this simulated study, we may conclude that the proposed method outperforms other methods in presence of outliers and in absence of outliers it keeps equal performance with other methods.

3.3. Real Microarray Data

To evaluate the performance of the proposed method in a comparison of the other seven methods as mentioned earlier, we used four microarray datasets. The first dataset is the Colon Cancer dataset [26] which consists of 22 control and 40 colon cancer samples. The second dataset is the Leukemia dataset [27]. The third dataset is the Platinum Spike dataset [28], which consists of 18 spike-in samples (9 controls versus 9 tests). The last dataset is the Breast Cancer dataset [29], which included 3226 genes measured on 22 breast cancer samples (7 sporadic, 7 BRCA-1, and 8 BRCA-2).

3.3.1. Colon Cancer Microarray Dataset

We used the colon cancer gene expression dataset. The dataset was downloaded from http://microarray.princeton.edu/oncology and was also used in the study [26]. The number of genes in this dataset is 2000. Figure 6(a) represents the Venn diagram of top 100 genes estimated by ANOVA, SAM, LIMMA, and proposed method. From this Venn diagram, we clearly observe that our proposed method shares more genes with other methods. We further compared the performance of proposed method with two robust methods (KW and BRIDGE) and SAM (see Figure 6(b)). This comparison also revealed that proposed method shares more genes with SAM than KW or BRIDGE methods. There were 57 genes detected as common by these four methods. The proposed method also shared 18 genes with the SAM and KW methods, which were not detected by the BRIDGE method.

3.3.2. Leukemia Microarray Dataset

This dataset was used in the study [27] and contains 7129 gene expressions for 72 leukemia samples in which 47 samples are acute lymphoblastic leukemia and 25 samples are acute myeloblastic leukemia. The results obtained from ANOVA, SAM, LIMMA, and proposed method based on top 100 estimated DE genes are presented in a Venn diagram in Figure 6(c). This figure shows that larger number of genes (86) is detected by these 4 methods. SAM and proposed method shared more genes () in this case. When comparing the proposed method with other robust methods (KW and BRIDGE) and SAM (see Figure 6(d)), we observed that 45 genes are common with these methods. The proposed method shares more genes with SAM than KW or BRIDGE methods. The proposed method also shared 20 genes with the SAM and KW methods, which were not detected by the BRIDGE method.

3.3.3. Platinum Spike Microarray Dataset

We downloaded Affymetrix CEL format files from the GEO website with accession number GSE21344 and we applied robust multiarray average (RMA) to obtain signals for probes. The designated FC associated with these probes is downloaded from http://www.biomedcentral.com/content/supplementary/1471-2105-11-285-s5.txt. After RMA preprocessing, normalization, and dropout of the MC and MF values, we obtained 18707 probes. In this dataset the valid 1944 DE genes are known with different FC values. Then we applied the eight methods (ANOVA, KW, SAM, LIMMA, EB, GAGA, BRIDGE, and proposed) in this dataset to estimate different performance measures TPR, TNR, FPR, FNR, FDR, and AUC. We also investigate performance of all the methods in presence of outliers. We consider one outlier with 20% of valid DE genes using (12). The results are summarized in Table A1 in supporting file. From this table, we observed that all the eight methods perform almost similar, in absence of outliers, whereas in presence of outliers, BRIDGE and proposed method performed better than the other six methods (ANOVA, KW, SAM, LIMMA, EB, and GAGA). To evaluate the performance of these methods in the small number of samples, we selected the small subsets of samples from control and test group of patients. To select the subsamples we repeatedly took sample size 3 from 9 control and 9 test group patients and calculated the value for each gene by these eight methods. This was repeated 100 times and the average of values was recorded. The results are summarized in Table 5. We clearly observed that, in absence of outliers, three methods (SAM, LIMMA, and proposed) perform well compared to the other five methods (ANOVA, KW, EB, GAGA, and BRIDGE). But in presence of outliers, the proposed method outperformed other methods. Figures S5(a) and S5(b) shows Venn diagram of top 1944 genes detected by SAM, LIMMA, and proposed method and ANOVA, KW, and proposed method, respectively, with 1944 known valid DE genes’ set, in absence of outliers. The Venn diagram in Figures S5(c) and S5(d) represents the top 1944 genes detected by the same methods, in presence of outliers. It is revealed from this Venn diagram that the proposed method performs better in both situations by sharing more genes with the valid 1944 DE genes. Figure 6(e) shows the M-A plot for this dataset. The red asterisk, blue triangle, and black circle are used for 312, 9, and 40 genes detected by proposed method, LIMMA, and SAM, respectively, that are common with valid 1944 DE genes’ set (see Figure S5(c)).

3.3.4. BRCA Microarray Dataset

This data comes from the breast cancer cDNA microarray experiment [29]. This dataset consists of 3226 genes from 22 breast cancer samples, which are also divided into three classes (7 sporadic, 7 BRCA-1, and 8 BRCA-2) according to their mutational status. Figure 7(a) represents the Venn diagram of top 100 genes identified by ANOVA, KW, SAM, LIMMA, and proposed method. From this Venn diagram we clearly observe that proposed method identified three genes that were not detected by the other methods (ANOVA, KW, SAM, and LIMMA). Then we explored the biological functions of these three genes (CTNNA1, NFKB1, and TM4SF1) using [30]. From this website we obtained GO (Gene Ontology), KEGG pathway, and disease association results. Using the GO database, we found that these genes are involved in biological processes and different molecular functions like negative regulation of programmed cell death, positive regulation of signal transduction, negative regulation of cell death, protein binding, protein complex, and molecular function (see S1 File (xls)). Figure S6 shows the Gene Ontology (GO) categories of these three () genes using directed acyclic graph (DAG). Using the KEGG database, we found that these genes are involved in cancer pathways with adjusted value = 0.0002 (see Table 6 and S2 File (.xls)). Table 7 represents the disease association results of these genes. In both tables the hypergeometric test is used to calculate the values and adjusted by Benjamini-Hochberg method for multiple testing corrections. Figure 7(b) represents the functional interactions (gene network) of these 3 genes were analyzed by GeneMANIA web server [31].

4. Conclusion

Differentially expressed (DE) genes identification to discover the disease biomarkers is one of the important tasks in microarray data analysis. Significance Analysis of Microarrays (SAM) is a popular statistical approach for identification of DE genes for both small- and large-sample cases. However, it is sensitive to outlying gene expressions and produces low power in presence of outliers. Therefore, in this paper, an attempt is made to robustify the SAM approach using the minimum -divergence method. We used MLEs, in absence of outliers with and in presence of outliers, the minimum -divergence estimators are used to calculate the SAM statistic. We revealed from the simulated and real gene expression data analysis with conditions that all the eight methods behave almost similar in absence of outliers, for both small- and large-sample cases. For large-sample case, in presence of outliers, three methods (KW, BRIDGE, and proposed) performed well compared to the other five methods (ANOVA, SAM, LIMMA, EB, and GAGA). However, the proposed method outperforms other seven methods for small-sample case, in presence of outliers. From the simulated dataset with conditions with multiple patterns, we observed that five methods (ANOVA, KW, SAM, LIMMA, and proposed) perform well in absence of outliers, both small- and large-sample cases, except KW for small-sample case, whereas in presence of outliers for large-sample case two methods (KW and proposed) perform well. However, the proposed method outperforms the other four methods (ANOVA, KW, SAM, and LIMMA) for small-sample case, in presence of outliers. Similar results were found from real gene expression data analysis. Therefore, we may conclude that, on an average, the proposed method performs better than the other methods.

Conflicts of Interest

The authors have no conflicts of interest to state.

Acknowledgments

This study is supported by HEQEP Subproject (CP-3603, W2, R3), Department of Statistics, University of Rajshahi, Rajshahi, Bangladesh.

Supplementary Materials

Figure S1: Plot of smallest β-weight for simulation study 1. (a) In absence of outliers. (b) One outlier with each of 10% genes. Where the smallest β-weight represents the minimum value of 20 β-weights for 20 samples for each gene. The outlier genes are indicated in red color. The gray line indicates the maximum value of cutoff, λ = 0.13 for outlying genes.

Figure S2: Performance evaluation using ROC curve produced by different methods for large-sample case (n1 = n2 = 25). (a) In absence of outliers. (b) One outlier with each of 10% genes. (c) One outlier with each of 20% genes. (d) One outlier with each of 50% genes.

Figure S3: Performance evaluation using boxplot of MER values estimated by five methods for small-sample case (n1 = n2 = n3 = n4 = 3). Boxplot of MER values in absence and presence of one outlier with each of 10%, 20%, and 50% genes for small-sample case (n1 = n2 = n3 = n4 = 3). The MER values were calculated by five methods (ANOVA, KW, SAM, LIMMA, and proposed) based on top 200 genes.

Figure S4: Four different patterns of DE genes for small-sample case (n1 = n2 = n3 = n4 = 3) using simulation study 3.

Figure S5: Comparison of the top 1944 selected genes by five methods with 1944 valid DE gene set for Platinum Spike dataset for small-sample case (n1 = n2 = 3). In absence of outliers, Venn diagram of top 1944 genes detected by (a) the SAM, LIMMA, and proposed method or by (b) the ANOVA, KW, and proposed method with 1944 valid DE genes’ set. In presence of one outlier in 20% of 1944 valid DE genes, Venn diagram of top 1944 genes detected by (c) the SAM, LIMMA, and proposed method or by (d) the ANOVA, KW, and proposed method with 1944 valid DE genes’ set.

Figure S6: Gene Ontology (GO) categories of three (3) genes for BRCA dataset. The directed acyclic graph (DAG) shows the GO categories of three (3) genes, detected by the proposed method only for Breast Cancer (BRCA) dataset. In the DAG tree, each box in the tree lists holds the name of the GO category, the number of genes in the category, and adjusted p value. The box with red categories indicates that they are enriched with adj. p value <0.05.

Table A1: Performance evaluation of different methods based on Spike gene expression dataset.

S1 File: GO pathways for 3 breast cancer DE genes using BRCA dataset. Gene Ontology analysis information (XLS).

S2 File: KEGG pathways for 3 breast cancer DE genes using BRCA dataset. KEGG analysis information (XLS).

  1. Supplementary Material
  2. Supplementary Material
  3. Supplementary Material
  4. Supplementary Material
  5. Supplementary Material
  6. Supplementary Material
  7. Supplementary Material
  8. Supplementary Material
  9. Supplementary Material