Functional enrichment analysis of gene expression data is a standard step following most transcriptome profiling experiments in biomedical research. It enables researchers to characterize the biological functions affected by the experiment based on a subset of relevant genes and study the drug and disease mechanisms at the molecular level 1. Gene sets Over-representation Analysis (ORA) and Gene Set Enrichment Analysis (GSEA) are two widely used tools to perform functional enrichment analysis 2–4. ORA tests whether a pre-selected gene set is over-represented in a list of differentially expressed genes than what would be expected by chance based on one or more statistical tests, such as the hypergeometric test 5. However, the significance threshold or certain criteria for defining a set of differentially expressed genes is arbitrary and may miss important information for genes at the boundary 6. To address this issue, GSEA has been proposed 7. By analyzing the enrichment of a gene set annotation at the top or bottom of a ranked list of genes, it finds pathways that have coordinated change in genes across any two phenotypes 8. Both methodologies rely on gene annotation databases, in which a great deal of work has been devoted to organizing genes and proteins into biological functions and pathways. Examples include Reactome, the Kyoto Encyclopedia of Genes and Genomes (KEGG), the Gene Ontology (GO) and the Molecular Signatures Database (MSigDB). KEGG was developed in 1995 9, and GO 10 was founded in 1998, between which year the first eukaryotic genomes were released to the public. Reactome 11 was created in 2004, while MSigDB 7 was introduced in 2005 as a resource for GSEA. Each of these gene annotation databases has certain advantages for particular studies, however, any of them can be used to perform functional enrichment analysis.
Although functional enrichment analysis is typically focused on determining the impact of experimental differences, sometimes researchers are interested in how transcriptomic variation is associated with health outcomes, such as survival 12,13. Additionally, there is often an interest in which biological functions are associated with the outcome, rather than individual gene 14. Even if a specific gene is the cause of poor disease outcomes, an intervention might be possible at multiple points in the functional pathway involving that specific gene. Alternatively, there may be small cumulative impacts on a biological pathway that result in poor outcomes. However, typical functional enrichment analyses are focused on identifying biological functions associated with experimental differences, not outcomes. Although these ideas can align, the focus on disease differences can lead to an analysis missing important results. For example, in the case of cancer, some of the pathways that differ between tumor and normal samples will likely be associated with survival 15. However, other pathways that influence survival differences among survivors may not differ much between tumor and normal samples. That is because these differences relate to population variation that might be key to survival and these differences are not driven by the disease 16. Furthermore, the analysis will not directly identify the association between pathways and survival outcomes, or its magnitude, and researchers often resort to tests of individual genes in some of their top pathways or other indirect methods. For example, several studies used GSEA to identify cancer-related pathways and then followed by survival analysis, but their methods do not directly incorporate survival information during pathway enrichment analysis, and hence their results do not reflect the degree of association between survival and biological functions17,18. Furthermore, the survival differences among those with the disease are not what drove the results of the GSEA analysis or survival analysis tests only among the treatment differences rather than case-only, which creates a blind spot in the results.
There have been a couple of attempts to directly link functional enrichment to clinical outcomes (e.g. survival status or survival time). For example, Woltmann, et al. 19 identified enriched pathways through different pathway enrichment analysis tools based on a genome-wide association study (GWAS) of short-term vs. long-term breast cancer survival in order to make a conclusion to a more general breast cancer population globally. However, their studies mainly focused on verifying breast cancer-related pathways by showing the similarities between the GWAS method and different enrichment analyses and did not suggest an approach to handle censored survival data. Furthermore, though Goeman et al. 20 developed a statistical test of the association between groups of genes and a clinical outcome based on the Cox proportional hazards model, their proposed method requires a set of pre-selected genes and does not consider coordinated changes of the genes in a pathway.
We suggest there is a simple solution to this challenge. Researchers can perform Survival-based Gene Set Enrichment Analysis (SGSEA) to understand how transcriptomic variation among patients can be used to identify biological functions associated with survival from the disease. Using this approach, researchers may discover unexpected patterns of gene activity, leading to new avenues of research. In this paper, we demonstrate how while GSEA can often provide insight into disease etiology, SGSEA can provide unique insights into disease outcomes. Although this method is already readily available to researchers, we believe one reason it is not widely used is that it is not included as a standard tool in most software and there has not been a paper to describe it. Therefore, we present the SGSEA R package and Shiny app that can be used to conveniently conduct these analyses.