ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Method Article
Revised

satuRn: Scalable analysis of differential transcript usage for bulk and single-cell RNA-sequencing applications

[version 2; peer review: 2 approved]
PUBLISHED 08 Aug 2022
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS

This article is included in the Bioinformatics gateway.

This article is included in the Bioconductor gateway.

This article is included in the RPackage gateway.

This article is included in the Cell & Molecular Biology gateway.

Abstract

Alternative splicing produces multiple functional transcripts from a single gene. Dysregulation of splicing is known to be associated with disease and as a hallmark of cancer. Existing tools for differential transcript usage (DTU) analysis either lack in performance, cannot account for complex experimental designs or do not scale to massive single-cell transcriptome sequencing (scRNA-seq) datasets. We introduce satuRn, a fast and flexible quasi-binomial generalized linear modelling framework that is on par with the best performing DTU methods from the bulk RNA-seq realm, while providing good false discovery rate control, addressing complex experimental designs, and scaling to scRNA-seq applications.

Keywords

RNA-seq, single-cell transcriptomics, splicing, differential transcript usage, statistical framework, satuRn

Revised Amendments from Version 1

We included three additional case studies to the manuscript: (i) a case study analysis with a more complex multifactor experimental design, (ii) a case study assessing differential usage of equivalence classes and (iii) an analysis that uses satuRn to perform a differential exon usage analysis. These analyses were included in the Results and Discussion sections of the manuscript and include the novel figures 10 and 11, table 3 and Extended data figures S27-29. In addition, we elaborated on the effects of data sparsity, the magnitude of the induced DTU signal and the different counting approaches on the performance of the different DTU tools. Novel findings were incorporated in the Results section and include figures S10-13, S16-19, S30-32, and table S1 in the Extended data. We also included a scalability assessment for the different methods on bulk RNA-seq data, which is discussed in the Results section and displayed in Extended data figures S20 and S21. Finally, we addressed all other minor suggestions by the reviewers.

See the authors' detailed response to the review by Alejandro Reyes
See the authors' detailed response to the review by Marek Cmero

Introduction

Studying differential expression (DE) is one of the key tasks in the downstream analysis of RNA-seq data. Typically, DE analyses identify expression changes on the gene level. However, the widespread adoption of expression quantification through pseudo-alignment,1,2 which enables fast and accurate quantification of expression at the transcript level, has effectively paved the way for transcript-level analyses. Here, we specifically address differential transcript usage (DTU) analysis, one type of transcript-level analysis that studies the change in relative usage of transcripts/isoforms within the same gene. DTU analysis holds great potential: previous research has shown that most multi-exon human genes are subject to alternative splicing and can thus produce a variety of functionally different isoforms from the same genomic locus.35 The dysregulation of this splicing process has been reported extensively as a cause for disease,69 including several neurological diseases such as frontotemporal dementia, Parkinsonism and spinal muscular atrophy, and is a well-known hallmark of cancer.10

In this context, full-length single-cell RNA-Seq (scRNA-seq) technologies such as Smart-Seq211 and Smart-Seq312 hold the promise to further increase the resolution of DTU analysis from bulk RNA-seq data towards the single-cell level, where differences in transcript usage are expected to occur naturally between cell types. However, only a few bespoke DTU methods have been developed for scRNA-seq data and they lack biological interpretation. Indeed, methods specifically developed for scRNA-seq data are either restricted to exon/event level13,14 analysis (e.g. pinpointing exons involved in splicing events), or they can only pinpoint DTU genes without unveiling the actual transcripts that are involved.15 Interestingly, many DTU methods for bulk RNA-seq do provide inference at the transcript level and their performance has already been extensively profiled in benchmark studies.1618 Based on a subset of the simulated RNA-seq dataset from Love et al.18 (see Methods), we show the performance of six DTU tools; DEXSeq,19 DoubleExpSeq,20 DRIMSeq,21 edgeR diffSplice,22 limma diffSplice23 and NBSplice24 (Figure 1A). DEXSeq and DoubleExpSeq have a higher performance than the other methods. In addition, we observe that most methods, and DRIMSeq in particular, fail to control the false discovery rate (FDR) at its nominal level, which is in line with previous reports.1618

9cfd0cc3-413e-4d1d-b592-27c5be40dcd8_figure1.gif

Figure 1. Performance and scalability evaluation of six differential transcript usage (DTU) methods.

A: Performance evaluation on the simulated bulk RNA-Seq dataset from Love et al.18 Each curve displays the performance of each method by evaluating the sensitivity (true positive rate, TPR) with respect to the false discovery rate (FDR). The three circles on each curve represent working points when the FDR level is set at nominal levels of 1%, 5% and 10%, respectively. The circles are filled if the empirical FDR is equal or below the imposed FDR threshold. DEXSeq and DoubleExpSeq clearly have the highest performances. Note that most methods, and DRIMSeq in particular, fail to control the FDR at its nominal level. B: Scalability with respect to the number of cells in a scRNA-Seq dataset. While all other methods scale linearly with an increasing number of cells, DEXSeq scales quadratically. As such, DEXSeq cannot be used for the analysis of large bulk and scRNA-Seq datasets. For all sample sizes, the number of transcripts in the datasets were set at 30.000. Note that NBSplice needed to be omitted from this analysis as it fails to converge on datasets with a large proportion of zero counts (see below). C: Scalability with respect to the number of transcripts in a scRNA-Seq dataset. While all other methods scale linearly with an increasing number of cells, BANDITS scales quadratically. Moreover, BANDITS failed to run on our system for datasets with 7.500 transcripts or more. As such, it had to be omitted from panels A and B. A performance and scalability evaluation of BANDITS on datasets with an (artificial) lower number of transcripts is provided as Extended data figures S1 and S3.25

In order to assess DTU in single-cell applications, however, these bulk RNA-seq DTU tools should scale to the large data volumes generated by full-length scRNA-seq platforms, which can profile the transcriptome of several thousands of cells2628 in a single experiment. In Figure 1B, we evaluate the required computational time in function of the number of sequenced libraries for a two-group DTU analysis for 30,000 transcripts on a subset of the scRNA-seq dataset from Chen et al.29 Despite of its good performance, the popular tool DEXSeq already required more than five hours to analyze two groups of 32 cells and clearly does not scale to large bulk nor scRNA-seq datasets.

In addition, DTU methods should allow for the analysis of datasets with large numbers of (unique) transcripts. The number of transcripts that are typically assessed depends on the coverage of the RNA-seq experiment and the adopted filtering criteria in the data analysis workflow. As the coverage of RNA-seq experiments has increased rapidly over the past few years and can be expected to continue expanding, scalability towards large numbers of transcripts will be essential to enable a transcriptome-wide view on the isoform usage changes. In Figure 1C, we perform a DTU analysis across a range of transcripts in a two-group comparison with 16 cells each, using the dataset from Chen et al. Here, we observed that the DTU tool BANDITS30 scales particularly poorly to large numbers of transcripts. More specifically, BANDITS did not complete the DTU analysis on the dataset with 7.500 transcripts within 137 hours on our system (see Methods); therefore, larger analyses were omitted. As such, BANDITS had to be omitted from the analyses shown in Figures 1A and 1B. For a performance and scalability evaluation of BANDITS on datasets with an (artificial) lower number of transcripts, we refer to Extended data figures S1 and S3.25

Besides scalability, several other issues arise when porting bulk RNA-seq DTU tools towards scRNA-seq applications. Indeed, modelling scRNA-seq data often requires multifactorial designs, for instance when comparing expression levels across multiple cell types between multiple treatment groups. Accounting for multiple covariates, however, is not implemented in BANDITS, NBSplice and DoubleExpSeq, jeopardizing their utility for (sc-)RNA-seq DTU analysis. Another issue arises with the large numbers of zero counts in scRNA-seq data, which seems to be particularly problematic for NBSplice that fails to converge if the gene-level count of any of the samples or cells is zero. As such, NBSplice could not be evaluated in Figures 1B and 1C.

Altogether, many of the existing DTU analysis tools are not well suited to analyze large bulk RNA-seq and full-length scRNA-seq datasets, leaving the great potential of differential splicing analysis for these data largely unexploited. In light of these shortcomings we developed satuRn, which is an acronym for Scalable Analysis of differential Transcript Usage for RNa-seq data, a novel method for DTU analysis that (i) is highly performant, (ii) provides a good control of the false discovery rate (FDR) (iii) scales seamlessly to the large data volumes of contemporary (sc-)RNA-seq datasets, (iv) allows for modelling complex experimental designs, (v) can deal with realistic proportions of zero counts and (vi) provides direct inference on the biologically relevant transcript level. In brief, satuRn adopts a quasi-binomial (QB) generalized linear model (GLM) framework. satuRn provides direct inference on DTU by modelling the relative usage of a transcript, in comparison to other transcripts from the same gene, between groups of interest. To stabilize the estimation of the overdispersion parameter of the QB model, we borrow strength across transcripts by building upon the empirical Bayes methodology as introduced by Smyth et al.23 In order to control the number of false positive findings, an empirical null distribution is used to obtain the p-values,31 which are corrected for multiple testing with the FDR method of Benjamini and Hochberg.32 Our method is implemented in an R package available at https://github.com/statOmics/satuRn and is distributed through the Bioconductor project.

Methods

satuRn model

As input, satuRn requires a matrix of transcript-level expression counts (or, analogously, a matrix of exon-level or equivalence class-level counts), which may be obtained either through pseudo-alignment using kallisto1 or Salmon,2 or by classical alignment-based tools followed by transcript-level quantification (e.g. STAR33,34 and RSEM35). Let Ygti denote the observed expression value for a given transcript t = 1, … , Tg of gene g = 1, ..., G in cell or sample i = 1, … , n. The total expression of gene g in sample i can then be expressed as

(1)
Yg.i=t=1TgYgti,

i.e., by taking the sum of expression values for all Tg transcripts belonging to gene g in sample i. The usage of transcript t in sample or cell i can then be estimated as

(2)
Ugti=YgtiYg.i.

Next, we adopt a quasi-binomial (QB) generalized linear modelling (GLM) strategy to model transcript usage. As opposed to canonical maximum likelihood models, this quasi-likelihood modelling strategy only requires the specification of the first two moments of the response distribution, i.e., the mean and the variance. We define the mean of the QB model as

(3)
E[Ugti|Xi,Yg.i]=πgtilog(πgti1πgti)=ηgtiηgti=XiTβgt

In this notation, πgti is the expected probability of observing transcript t within the pool of transcripts (1, … , Tg) belonging to gene g in sample i and, as such, corresponds to its expected usage for that sample. We model πgti using a logit link function, where βt is a p x 1 column vector of regression parameters modelling the association between the average usage and the covariates for transcript t. Finally, XiT is a row in the n x p design matrix X that corresponds with the covariate pattern of sample i, with p the number of parameters of the mean model, i.e., the length of vector βt.

The variance of the QB model can be described as

(4)
VarUgtiXi,Yg.i=πgti1πgtiYg.iϕgt

with Yg.iπgti1πgti the canonical variance of the binomial distribution and ϕgt a transcript-specific overdispersion parameter to describe additional variance in the data with respect to the binomial variance. We adopt the empirical Bayes procedure from Smyth et al.,23 as implemented in the squeezeVar function of the limma Bioconductor R package, to stabilize the estimates of ϕgt by borrowing information across transcripts, which is adopted in the default edgeR quasi-likelihood workflow for bulk RNA-seq data.22,36 Note that stabilizing the dispersion estimation is particularly useful in datasets with a small sample size.

Taken together, the quasi-binomial thus allows us to model the log-odds of drawing a particular transcript t from the pool of transcripts from the corresponding gene g across samples. The intercept also has an interpretation of a log-odds and the remaining mean model parameters are log-odds ratios, which may thus be interpreted in terms of differential transcript usage. We adopt t-tests that are computed based on the log-odds ratio estimates of the QB model and the posterior variance, as obtained from the empirical Bayes procedure. P-values are computed assuming a t-distribution under the null hypothesis with posterior degrees of freedom calculated as the sum of the residual degrees of freedom and the prior degrees of freedom from the empirical Bayes procedure.

For bulk analyses, the implementation of satuRn as described above provides a high performance and a good control of the FDR. However, for single-cell datasets we observed that our inference is too liberal (see Extended data25 figure S14), which could suggest that the theoretical null, the t-distribution, is no longer valid. Indeed, in large-scale inference settings, failure of the theoretical null distribution is often observed. Efron37 (Chapter 6) describes four reasons why the theoretical null distribution may fail; failed mathematical assumptions, correlation across features (transcript expression), correlation across subjects (samples or cells), and unobserved confounders in observational studies. To avoid these issues, Efron proposes to exploit the massive parallel data structure of omics datasets to empirically estimate the null distribution of the test statistics.37 To this end, Efron converts the test statistic to z-scores, which should follow a standard normal distribution under the theoretical null, and then proposes to approximate the empirical null distribution with a normal distribution with unknown mean (μ) and standard deviation (σ), which can be estimated by maximum likelihood on a subset of the test statistics near zero.

As such, we first convert the two-sided p-values to z-scores according to

(5)
zgt=Φ1pgt2signSgt,

with Φ the cumulative distribution function for the standard normal distribution, pgt the original two-sided p-value indicating the statistical significance of differential usage of transcript t from gene g between the conditions of interest, sign(Sgt) the sign of the t-test statistic Sgt and Zgt the resulting z-score. Next, we adopt the maximum likelihood procedure, implemented in the locfdr function of the locfdr R package from CRAN,38 to estimate the mean μ and standard deviation σof the empirical null distribution. Based on these estimates, we recompute the z-scores and corresponding p-values as follows

(7)
zgt=zgtμσ
(8)
pgt=2Φabszgt.

Finally, the resulting (empirical) p-values are corrected for multiple testing with the FDR method of Benjamini and Hochberg.32 As opposed to the original p-values that were calculated based on the theoretical null distribution for the t-statistics, we found that this procedure allows for a better FDR control in single-cell applications.

DTU tools literature

Table 1 provides a brief description of the DTU methods that were included in the performance benchmarks of this paper. For more details, we refer to the Extended data25 and the respective original publications. Note that all methods were run in R3.6.1 using default settings.

Table 1. Brief description of different DTU tools that were included in our performance benchmarks.

Columns 1 to 5 respectively display the name of each method, the package version used, a brief description of each method, the test statistic used for inference on differential transcript usage, and whether the method can handle complex designs, e.g., to incorporate additional covariates such as batch effects. All packages are available from Bioconductor 3.10. Acronyms; NB: negative binomial, GLM: generalized linear model, LRT: likelihood ratio test, DB: double binomial, DM: Dirichlet-multinomial.

MethodVersionBrief description of the modelling procedureTestComplex designs
DEXSeq191.32.0First a matrix C = [Cgti] is calculated, which defines how many reads Cgti map to any of the other transcripts of the same gene g as respective transcript t in cell i. Next, matrix C and the original expression matrix Y are concatenated as Y’ = [Y C]. Then, a NB GLM is fitted to each transcript of Y’. The design matrix X of the GLM defines a covariate pattern with (i) sample-level intercepts that account for the fact that Ygti and Cgti originate from the same sample i, and (ii) other covariates associated with the design of the experiment. DTU is parameterized as an interaction effect that evaluates if the log fold change between transcript t and all other transcripts in its corresponding gene differs between conditions of interest.LRTYes
Double-ExpSeq201.1Assumes a DB model for each transcript that models the log-odds of drawing a particular transcript t from the pool of transcripts in the corresponding gene g across samples. The intercept has an interpretation of a log-odds and the remaining mean model parameter(s) are log-odds ratios, which may thus be interpreted in terms of DTU.LRTNo
DRIMSeq211.14.0Assumes that the transcript-level expression counts follow a DM distribution. The quantity of interest is the change in proportion of each transcript within a gene between groups of samples or cells.LRTYes
Limma diffsplice233.42.2Assumes a linear model to model the log-transformed transcript-level counts, using weights to account for heteroskedasticity. Assesses DTU by comparing the log-fold change in expression of transcript t within gene g with the average log-fold change of all other transcripts of gene g.T-testYes
edgeR diffsplice223.28.1Assumes a NB GLM for each transcript and tests for DTU by comparing the obtained log-fold changes for each transcript within a gene with the log-fold change of the entire gene.LRTYes
NBSplice241.4.0Assumes a NB GLM for each gene. DTU between groups of interest can be tested using a LRT, where the full model contains an isoform-condition interaction term that is omitted in the null model.LRTYes
BANDITS301.2.3Adopts a Bayesian DM hierarchical model. Equivalence class counts are used as additional input to account for the quantification uncertainty arising from reads mapping to multiple transcripts. BANDITS tests the mean relative usage of each transcript within its corresponding gene across conditions.WaldNo

Filtering

We adopted two different strategies for filtering transcripts in each of the RNA-seq datasets in the performance benchmarks.

The first filtering strategy uses the filterByExpr function implemented in edgeR.39 This filtering strategy only retains transcripts that have at least an expression level of min.count counts-per-million (CPM, calculated as the number of read counts divided by the total number of reads in the dataset and multiplied by one million) in at least n samples or cells. In addition, the sum of the CPM of the transcript across all cells or samples must be at least min.total.count. For the bulk RNA-seq datasets, we use the default settings (min.count = 10, n = min(10, 0.7*ns) and min.total.count = 10), with ns the number of samples or cells in the smallest group. For the scRNA-seq datasets, the settings are adjusted to; min.count = 1 (as requiring a transcript to be expressed in all single-cells is a stringent criterium), n = 0.5*ns and min.total.count = 0. In addition, if only one transcript of a gene passes this filtering criterion, it is omitted from the analysis, as DTU analysis is meaningless when only one transcript is retained. As such, we specifically set the parameters to generate a very lenient filtering criterium.

The second filtering strategy uses the dmFilter function implemented in DRIMSeq.21 This filter is more stringent and specifically designed for DTU analysis. The filtering process can be thought of as proceeding in three steps. The first step requires the transcripts to have a count of at least 10 in at least ns samples. The second filtering step requires the transcript to make up at least 10% of the total count of its corresponding gene in at least ns samples or cells. The third filtering step removes all transcripts for which the corresponding gene has a count below 10 in any of the samples or cells in the dataset. Again, if only one transcript of a gene passes this filtering criterion, it is omitted from the analysis.

Abundance metrics

We additionally compared the results between two quantification strategies: counts and scaledTPM abundances. Counts are the transcript-level abundances estimated by kallisto1 or Salmon2, i.e., the number of reads mapping to each transcript. scaledTPM normalized abundances are obtained by first computing transcript-per-million (TPM) abundances by (1) dividing the transcript counts by the length of the corresponding transcript in kilobases, (2) dividing by a cell/sample specific scaling factor, which is computed by taking the sum across transcripts of all length-corrected abundances from (1) and (3) multiplying by one million. Finally, the TPM abundances are scaled per cell/sample, so that the sum of the TPM abundances in the cell/sample equals the original number of counts for that cell/sample.

Bulk simulation study

To evaluate the performance of the different DTU analysis methods, we first adopt three simulated bulk RNA-seq datasets from previous publications: the simulated dataset from Love et al.18 (dataset 1) and both the Drosophila melanogaster (dataset 2) and Homo sapiens (dataset 3) simulation studies from Van den Berge et al..40 All three datasets were generated based on parameter values obtained from real RNA-seq samples, to mimic real RNA-seq data as close as possible.

Notably, there is a subtle difference in how DTU is introduced between the two simulation frameworks. For dataset 1 from Love et al.,18 the origin of DTU is twofold: On the one hand, DTU was specifically introduced by swapping the transcript-per-million (TPM) abundances between two expressed isoforms. On the other hand, DTU was also obtained as a consequence of introducing DTE, where a single expressed isoform was induced to be differentially expressed at a certain log fold change, which leads to DTU if this transcript belongs to a gene expressing multiple isoforms. For datasets 2 and 3 from Van den Berge et al.,40 there is only one source of DTU. The number of differentially used transcripts within a gene was sampled ranging from a minimum of two up to a random number drawn from a binomial distribution with size equal to the number of transcripts and success probability 1/3, thus allowing for differential usage of more than two transcripts within the same gene. For the selected transcripts, the percentage of usage withing the corresponding gene is first computed on the TPM scale. Next, DTU was introduced by swapping these percentages between transcripts. Subsequently, the swapped percentages are multiplied by the original gene-level read count per kilobase (RPK) to obtain transcript-level RPK. Finally, TPM and counts are computed based on these transcript-level RPK. Additionally, dataset 1 uses Salmon2 (version 1.1.0) for estimating transcript-level abundances, whereas datasets 2 and 3 were quantified with kallisto1 (version 0.46.0).

Real bulk RNA-seq dataset evaluation

We evaluate the performance of the different DTU methods on real bulk RNA-seq data, by subsampling a homogeneous set of samples from the large bulk RNA-seq dataset available from the Genotype-Tissue Expression (GTEx) consortium41 release version 8. Nine datasets were generated non-parametrically. More specifically, we first selected samples from adrenal gland tissue that were extracted with the RNA extraction method “RNA Extraction from Paxgene-derived Lysate Plate Based”. From the remaining samples we subsampled nine datasets, comprising three repeats for each of three sample sizes; 5 versus 5, 20 versus 20 and 50 versus 50 samples. Next, DTU is artificially in 15% of the genes. The number of differentially used transcripts within a gene was determined with the strategy of Van den Berge et al.40 For the selected transcripts, the counts of transcripts within the same genes are swapped in one group of samples or cells, inducing DTU signal between the groups. The GTEx data was quantified with RSEM35 version 1.3.0.

Real single-cell RNA-seq datasets evaluation

We evaluate the performance of the different DTU methods on real scRNA-seq datasets. These scRNA-seq datasets were generated non-parametrically by subsampling a homogeneous set of cells from three real scRNA-seq datasets,26,29,42 after which DTU is artificially introduced as discussed above (see Methods, Real bulk RNA-seq dataset evaluation paragraph). The datasets from Chen et al.29 and Tasic et al.42 were quantified using Salmon v1.1.0, whereas the dataset from Darmanis et al.26 was quantified with Salmon v0.8.2 as obtained from Soneson et al.43

For the dataset of Chen et al.,29 which was used to construct Figures 4 and the Extended data Figure S7,25 we selected a homogeneous population of cells by considering only the EpiStem cells of female mice, resulting in a dataset of 120 cells. From this homogeneous population of cells, we then subsampled six datasets, comprising three repeats for each of two sample sizes: 20 versus 20 and 50 versus 50 cells. Next, DTU was artificially introduced by swapping transcript counts between groups of cells. Finally, we adopted either edgeR or DRIMSeq for filtering.

The other two scRNA-seq datasets were generated analogously. For the dataset of Tasic et al.,42 which was used to construct Figure S8 (Extended data25), we selected a homogeneous population of cells by considering only the Lamp5 cells in the anterior lateral motor cortex of mice without any eye conditions, resulting in a dataset of 897 cells. After introducing DTU by swapping transcript counts, we randomly subsampled 20, 75 or 200 cells from each group. For the dataset of Darmanis et al.,26 which was used to construct Extended data25 Figure S9, we selected the immune cells that clustered together in tSNE cluster 8 of the original publication, resulting in a dataset of 248 cells. After introducing DTU, we randomly subsampled 20, 50 or 100 cells from each group.

Case study differential gene expression analysis

We perform a differential gene expression (DGE) analysis on a subset of the Tasic single-cell dataset,42 i.e. between different the cell types originating from the ALM and VISp regions of the glutamatergic L5 IT subclass. We use the quasi-likelihood method of edgeR44 to model the gene expression profiles and additionally adopt the edgeR glmTreat function to test differential expression against a log2-fold change threshold (log2-fold change = 1). Statistical significance was evaluated at the 5% FDR level.

Performance assessment

We assess the performance of different DTU methods on a bulk simulation dataset with scatterplots of the true positive rate (TPR) versus the false discovery rate (FDR), according to the following definitions:

(9)
TPR=TPTP+FN
(10)
FDP=FPFP+TP
(11)
FDR=EFDP

where FN, FP and TP denote the numbers of false negatives, false positives and true positives, respectively. The FDR is the expected value of the false discovery proportion (FDP), which is the ratio of the number of false positives to the total number of positives. The FDR-TPR curves are constructed using the Bioconductor R package ICOBRA version 1.14.0.45

Scalability benchmark

The scalability benchmark was run on subsets of the Chen scRNA-seq dataset,29 which contains 617 cells in total. For the scalability benchmark with respect to the number of cells in the dataset, we randomly subsample a certain number of cells (8, 16, 32, 64, 128 or 256 cells per group) from the dataset (without introducing DTU or selecting specific homogeneous cell populations). Next, we filter this subsample using the edgeR-based filtering criterion. This was done to remove very lowly abundant transcripts, which may otherwise cause problems in the parameter estimation procedure. From the remaining transcripts, we randomly subsampled to a total of 30,000 transcripts before running the DTU analysis. To allow for a scalability benchmark of BANDITS, which scales poorly to the number of transcripts (Figure 5B), we also generated a dataset with only 1,000 transcripts (see Extended data figure S125). For the scalability benchmark with respect to the number of transcripts, we randomly sampled two groups of 16 cells from the dataset. After applying the edgeR-based filter, we sampled eight distinct numbers of transcripts: 1,000, 2,000, 5,000, 10,000, 15,000, 20,000, 25,000, 30,000 and 35,000 prior to the DTU analysis.

We additionally perform a scalability benchmark on the GTEx41 bulk RNA-seq dataset. Using the same strategy as described above for the scRNA-seq scalability benchmark, we here assess scalability with respect to the number of samples (8, 16, 32 or 64 samples per group) while keeping the number of transcripts fixed at 30,000 and scalability with respect to the number of transcripts (same range as above) while keeping the number of samples fixed at sixteen. Results are displayed in the Extended data25 figures S20 and S21.

All scalability benchmarks were run on a single core of a virtual machine with an Intel(R) Xeon(R) CPU E5-2420 v2 (2.20GHz, Speed: 2200 MHz) processor and 30GB RAM.

Results

We first evaluate the performance of our novel DTU method, satuRn, on publicly available simulated and real bulk RNA-seq data, as well as on real scRNA-seq data. In general, we found that the performance of satuRn was at least on par with the performances of the best tools from the literature. In addition, our method controls the FDR closer to the nominal level, on average. Second, we show that satuRn scales towards the large data volumes generated by contemporary bulk and single-cell RNA-seq experiments, allowing for a transcriptome-wide analysis of datasets consisting of several thousands of cells, in only a few minutes. Finally, we analyze a large full-length scRNA-seq case study dataset, where we obtain highly relevant biological results on isoform-level changes between cell types that would have remained obscured in a canonical differential gene expression (DGE) analysis.

Performance on simulated bulk RNA-seq datasets

To evaluate the performance of satuRn, we adopt three simulated bulk RNA-seq datasets from previous publications. Dataset 1 was obtained from Love et al.18 and contains two groups of 12 samples each, which we subsample without replacement to evaluate 3vs3, 6vs6 and 10vs10 two-group comparisons. Datasets 2 and 3 are the Drosophila melanogaster and Homo sapiens simulation studies from Van den Berge et al.40 and Soneson et al.,17 which both contain two groups of five samples each. In brief, all datasets were constructed by generating sequencing reads based on parameters that are estimated from real bulk RNA-seq data. DTU between groups of samples was artificially introduced in the data, prior to the quantification of expression using either Salmon2 (dataset 1) or kallisto1 (datasets 2 and 3). Notably, there are some methodological differences between the simulation framework of dataset 1 and that of datasets 2 and 3 with respect to the read generation and the simulation of DTU signal (see Methods). In terms of transcript filtering, we adopt two different strategies as implemented by edgeR44 and DRIMSeq,21 which correspond to a lenient and more stringent filtering, respectively (see Methods).

The result of the performance evaluation of satuRn with respect to other DTU methods on the three simulated bulk datasets is displayed in Figure 2. Figure 2A shows the average performance over three 6 versus 6 subsamples for dataset 1, after filtering with edgeR. Figures 2B and 2C display the performance on datasets 2 and 3 after edgeR filtering, respectively. In all three datasets, satuRn outperforms NBSplice, edgeR diffsplice and limma diffsplice. Intriguingly, the performance of DRIMSeq varies strongly between the three datasets. This discrepancy may be explained by the different strategies for generating reads and introducing DTU between dataset 1 on the one hand, and datasets 2 and 3 on the other hand (see Methods). We furthermore find the performance of satuRn is on par with the best performing tools from the literature, DEXSeq and DoubleExpSeq. In addition, both satuRn and DoubleExpSeq provide a stringent control of the FDR, while DEXSeq and DRIMSeq are often too liberal, as reported previously.18

9cfd0cc3-413e-4d1d-b592-27c5be40dcd8_figure2.gif

Figure 2. Performance evaluation of satuRn on three simulated bulk RNA-Seq datasets.

Each curve visualizes the performance of each method by displaying the sensitivity of the method (true positive rate, TPR) with respect to the false discovery rate (FDR). The three circles on each curve represent working points when the FDR level is set at nominal levels of 1%, 5% and 10%, respectively. The circles are filled if the empirical FDR is equal or below the imposed FDR threshold. The performance of satuRn is on par with the best tools from the literature, DEXSeq and DoubleExpSeq, for all datasets. In addition, our method consistently controls the FDR close to its imposed nominal FDR threshold.

We also evaluated the effects of sample size and different filtering criteria on the performance of the different DTU methods (see Extended data25 figures S2, S13). Neither sample size nor filtering criterion had a profound impact on the ranking of the performances of the different DTU methods; satuRn, DEXSeq and DoubleExpSeq remain the best performing methods overall. In addition, we studied the impact of using either transcript count estimates or normalized abundance estimates (scaledTPM, see Methods) as input data for the DTU algorithms. We observed a slightly higher performance in all datasets when providing count estimates, except for Dataset 1 from Love et al.18 Love et al.18 introduced DTU either by swapping TPM abundances or by introducing a fold change in transcript expression in one group of samples, again on TPM abundances. However, when back-transforming the swapped TPM abundances to count abundances, the transcript lengths were not swapped. Hence, the simulation strategy can induce differences in the total count for a gene, which in turn may alter the usage of all (target and off-target) transcripts in that gene when counts are used as an input, which gives rise to many false positives in off-target transcripts of DTU genes. When introducing DTU by swapping counts (real bulk and single-cell evaluations), we could expect spurious signal in analyses on the TPM scale for the same reason. However, the difference between the transcript count and TPM analyses was much less pronounced, which was also the case for the Van den Berge et al. simulation approach. Therefore, all performance evaluations in the body of this publication were generated with count estimates as input data, except for Figure 2, panel A. For a full overview on the effects of sample size, filtering criteria and data input type, we refer to Figures S2 and S13 of the Extended data.25

Performance on a real bulk RNA-seq dataset

While simulation studies are common for evaluating the performance of DE analysis methods, there is currently no consensus on the simulation strategy that best mimics real (sc)RNA-seq data. In addition, simulation frameworks typically generate data according to parametric assumptions on the data-generating mechanism, thus potentially favoring DE methods that adopt similar distributional assumptions in their statistical model.40 An alternative procedure is to non-parametrically modify a real dataset. Here, we obtained different subsamples from the large bulk RNA-seq dataset available from the Genotype-Tissue Expression (GTEx) consortium,41 generating nine datasets in total, i.e. three repeats for each of three sample sizes; 5 versus 5, 20 versus 20 and 50 versus 50 samples. We then artificially introduced DTU in these data by swapping transcript counts between groups of samples (see Methods for details). Again, we adopt two different filtering strategies as implemented by edgeR44 and DRIMSeq21 (see Methods).

The results of the performance evaluation of satuRn on the real bulk datasets upon edgeR filtering is displayed in Figure 3. In agreement with the results obtained from the simulated bulk RNA-seq study, we observe that the performance of satuRn is on par with DEXSeq and DoubleExpSeq. Again, satuRn provides a conservative FDR control. While the FDR control of DoubleExpSeq is good overall, it appears to become too liberal with increasing sample size. In this evaluation, DRIMSeq performs poorly, in contrast to simulated bulk RNA-seq datasets 2 and 3, but in line with the performance evaluation on the simulated bulk RNA-seq dataset 1. Note that DEXSeq, DRIMSeq and NBSplice were omitted from the analysis of the largest dataset (50 versus 50 samples), as these methods do not scale to such large datasets (Figure 1). Adopting the DRIMSeq-based filtering did not have a qualitative impact on the performance (Extended data Figure S625).

9cfd0cc3-413e-4d1d-b592-27c5be40dcd8_figure3.gif

Figure 3. Performance evaluation of satuRn on a real bulk RNA-Seq dataset.

Each curve visualizes the performance of each method by displaying the sensitivity of the method (true positive rate, TPR) with respect to the false discovery rate (FDR). The three circles on each curve represent working points when the FDR level is set at nominal levels of 1%, 5% and 10%, respectively. The circles are filled if the empirical FDR is equal or below the imposed FDR threshold. The performance of satuRn is on par with the best tools from the literature, DEXSeq and DoubleExpSeq. In addition, satuRn consistently controls the FDR close to its imposed nominal FDR threshold, while DoubleExpSeq becomes more liberal with increasing sample sizes. Note that DEXSeq, DRIMSeq and NBSplice were omitted from the larger comparison, as these methods do not scale to large datasets (Figure 1).

Performance on real single-cell data

Finally, we evaluate the performance of satuRn on single-cell RNA-seq data. As with the real bulk analysis, the single-cell datasets were generated by subsetting from three different real scRNA-seq datasets26,29,42 (see Methods). Again, we subsampled three repeats of different sample sizes, artificially introduced DTU with the swapping strategy and applied either the edgeR- or DRIMSeq-based filtering criterium (see Methods for details).

By subsampling the Chen et al.29 dataset, we generated three repeats of two sample sizes, i.e. 20 versus 20 and 50 versus 50 cells. The results of the performance evaluation of satuRn on this dataset upon edgeR filtering is displayed in Figure 4. The performance of satuRn is slightly better than that of the best tool from the literature, DoubleExpSeq. As compared to the evaluations on bulk data, we observe a performance drop for DEXSeq relative to satuRn and DoubleExpSeq. This, in combination with its poor scalability (Figure 1), greatly compromises the use of DEXSeq for the analysis of scRNA-seq data. satuRn again provides a stringent control of the FDR, while the inference of DoubleExpSeq is too liberal, again becoming more problematic for larger sample sizes. Adopting the DRIMSeq filter did not have a qualitative impact on the performances (Extended data figure S725). The results of the performance evaluations on the other two scRNA-seq datasets26,42 are in strong agreement with the results displayed here, with satuRn performing at least on par with DoubleExpSeq and satuRn additionally controlling the FDR around the nominal level (Extended data25 figures S8 and S9).

9cfd0cc3-413e-4d1d-b592-27c5be40dcd8_figure4.gif

Figure 4. Performance evaluation of satuRn on a real scRNA-Seq dataset.

Each curve visualizes the performance of each method by displaying the sensitivity of the method (true positive rate, TPR) with respect to the false discovery rate (FDR). The three circles on each curve represent working points when the FDR level is set at nominal levels of 1%, 5% and 10%, respectively. The circles are filled if the empirical FDR is equal or below the imposed FDR threshold. The performance of satuRn is on par with the best tools from the literature, DEXSeq and DoubleExpSeq. In addition, our method provides a stringent control of the FDR, while DoubleExpSeq becomes more liberal with increasing sample sizes (see also Extended data figure S725). Note that DEXSeq and DRIMSeq were omitted from the two largest comparisons, as these methods do not scale to large datasets (Figure 1). NBSplice was omitted from all comparisons, as it does not converge on datasets with many zeros, such as scRNA-Seq datasets.

Notably, we found that the theoretical null distribution of the test statistics from satuRn failed to provide good FDR control in single-cell analyses (Extended data figure S1425). To obtain proper p-values with satuRn in single-cell applications, we therefore estimate the null distribution of the test statistic empirically (see Methods, satuRn paragraph). Note that the performances in the bulk RNA-seq benchmarks are not affected by adopting the empirical null distribution, because no deviations of the theoretical null distribution occur in bulk data. However, the empirical null resulted in much improved FDR control in scRNA-seq datasets (Extended data25 figure S14). We therefore adopt the empirical null estimation as the default setting in satuRn. As such, all satuRn results in this publication are relying on the empirical null strategy. As a final remark, we likewise attempted to improve the FDR control of DoubleExpSeq. However, in all analyses with DoubleExpSeq we observed a large spike of p-values equal to 1, which poses a problem when estimating the empirical null distribution (Extended data25 figure S15). Therefore, this strategy could not be used to improve the FDR control of DoubleExpSeq.

In these benchmark single-cell datasets, DTU signal was introduced non-parametrically by swapping transcript counts (see Methods), thus introducing DTU signal of different magnitude in the different transcripts. In Extended data25 Figures S16-S19, we stratified the FDR-TPR curves on the difference in the observed average transcript usage between the two groups of cells. In all benchmark datasets, satuRn and DoubleExpSeq are more successful in detecting small differences as compared to the other methods. Indeed, while all methods succeed in detecting large differences in absolute transcript usage, the performance gain of satuRn and DoubleExpSeq in the results without stratification can be attributed to picking up DTU with smaller effect sizes.

Scalability benchmark

We performed a computational benchmark of satuRn to investigate its scalability with respect to the number of samples/cells and the number of transcripts in an RNA-seq dataset. All scalability benchmarks were run on a single core of a Linux machine with an Intel(R) Xeon(R) CPU E5-2420 v2 (2.20GHz, Speed: 2200 MHz) processor and 30GB RAM. The results are displayed in Figure 5.

9cfd0cc3-413e-4d1d-b592-27c5be40dcd8_figure5.gif

Figure 5. Scalability evaluation on scRNA-Seq data. A: Runtime with respect to the number of cells in a scRNA-Seq dataset.

Left panel: DRIMSeq and especially DEXSeq scale poorly with the number of cells in the dataset. Right panel: Detailed plot of the fastest methods. satuRn scales linearly with increasing numbers of cells, with a slope that is comparable to that of limma diffsplice. As such, satuRn can perform a DTU analysis on a dataset with two groups of 256 cells each and 30,000 transcripts in less than three minutes. For all sample sizes, the number of transcripts in the datasets were set at 30,000. Note that NBSplice was not included in this analysis as it fails to converge on datasets with a large proportion of zero counts. B: Runtime with respect to the number of transcripts in a scRNA-Seq dataset. Left panel: DEXSeq, DRIMSeq and especially BANDITS scale poorly to the number of transcripts in the dataset. Right panel: Detailed plot of the fastest methods. satuRn scales linearly with increasing numbers of transcripts, but with a steeper slope than edgeR diffsplice, DoubleExpSeq and limma diffsplice. The number of cells in the dataset was set fixed to two groups of 16 cells. All scalability benchmarks were run on a single core.

Figure 5A displays the scalability with respect to the number of cells in the dataset, (different subsets of the single-cell dataset of Chen et al.29), while keeping the number of transcripts in the dataset fixed at 30,000. From the left panel, it is clear that DRIMSeq and especially DEXSeq scale very poorly with the number of cells in the dataset, which was already shown in Figure 1B. In the right panel, we focus on the four remaining methods. satuRn scales linearly with increasing numbers of cells, with a slope comparable to limma diffsplice. As such, satuRn can perform a DTU analysis on a dataset with two groups of 256 cells each and 30,000 transcripts in less than three minutes. Note that BANDITS30 was not included in this benchmark, as it does not scale to datasets with this many transcripts. For a performance and scalability evaluation of BANDITS on datasets with a lower number of transcripts, we refer to Extended data figure S1.25 NBSplice was also omitted as it fails to converge on datasets with a large proportion of zero counts.

Figure 5B shows the scalability with respect to the number of transcripts in the dataset, while keeping the number of cells in the dataset fixed to two groups of 16 cells. As shown in Figure 1C, BANDITS, DEXSeq and DRIMSeq scale poorly to datasets with many transcripts. From the right panel, satuRn scales linearly with increasing numbers of transcripts, albeit with a steeper slope than edgeR diffsplice, DoubleExpSeq and limma diffsplice. Note, that the scalability of DTU analyses can be improved through parallelization, if this is allowed by the underlying algorithm. Parallel execution is possible in satuRn and in all methods from the literature that were discussed in this manuscript, except for DoubleExpSeq and NBSplice.

In Extended data Figure S20,25 we additionally evaluate the scalability of the DTU tools on different subsets of the bulk GTEx41 dataset. These results are in line with those of the single-cell scalability benchmark, with DEXSeq and DRIMSeq being markedly slower than the other methods. When directly comparing the scalability between the bulk and single-cell data on equally sized subsets (Extended data Figure S2125), we found limited differences in scalability between the data types, except for DEXSeq. DEXSeq scales considerably better on bulk data than on single-cell data, suggesting that the estimation of the GLM parameters is slower when the data are sparse. However, as the scalability profile of DEXSeq is quadratic with respect to the number of cells/samples in the data, it is still infeasible to adopt DEXSeq in datasets with many cells/samples.

Altogether, we find that while several methods for DTU analysis exist, none are optimally suited for analyzing single-cell datasets. DRIMSeq, NBSplice, edgeR diffsplice and limma diffsplice have a much lower overall performance in our benchmarks. DEXSeq does not scale to large datasets. Finally, DoubleExpSeq does not support experimental designs that require an analysis with multiple additive effects, e.g. randomized complete block designs and designs where batch-effect correction is required, which are essential for many practical scRNA-Seq analysis settings.46 In addition, it fails to control the FDR at the desired level, especially with increasing sample sizes. With satuRn, we overcome these issues and effectively unlock DTU analysis for single cell applications.

Case study

We use satuRn to perform a DTU analysis on a subset of the single-cell (SMART-seq211) RNA-seq dataset from Tasic et al.42 In addition, we analyze the same dataset with DoubleExpSeq and limma diffsplice, which are the only other DTU methods that scale to large scRNA-seq datasets and have a reasonable performance in our benchmarks. In the original publication, the authors studied differential gene expression between cell types originating from two areas at distant poles of the mouse neocortex; the primary visual cortical area (VISp), which processes sensory information with millisecond timescale dynamics4749 and the anterior lateral motor cortex (ALM), which displays slower dynamics related to short-term memory, deliberation, decision-making and planning.50,51 Based on marker genes, Tasic et al.42 assigned all of the 23,822 cells from the scRNA-seq dataset to one of three cell classes; glutamatergic (excitatory) neurons, GABAergic (inhibitory) neurons or non-neuronal cells. The authors then further classified the neuronal cells into several subclasses based on their dominant layer of dissection and projection patterns (through a retrograde labelling experiment). Finally, these subclasses are further classified into cell types based on the expression of specific marker genes.

DGE analysis with edgeR

In their original DGE analysis, Tasic et al.42 obtained the largest number of differentially expressed genes between the cell types originating from the ALM and VISp regions of the glutamatergic L5 IT subclass (2,739 cells in total), where L5 refers to layer-of-dissection 5 and IT refers to the intratelencephalic projection type. Here, we first perform a DGE analysis with an edgeR-based workflow (see Methods) on the same comparisons between L5 IT cell types that were assessed by Tasic et al. Table 2 shows the number of differentially expressed genes between the groups of interest in column 4.

DTU analysis with satuRn

Next, we perform a DTU analysis for the same cell types using satuRn. In column 5 of Table 2, we display the number of differentially used transcripts for each comparison. We also show the number of unique genes in which we find evidence for changes in usage of at least one transcript (column 6). While the number of differentially used transcripts is lower than the number of differentially expressed genes in each of the contrasts, we did identify differentially used transcripts in all contrasts of interest. Most interestingly, we observe that the overlap between the differentially expressed genes and the genes in which we found evidence for DTU is very limited (Table 2, column 7). This shows that the information obtained from our DTU analyses are orthogonal to the results from the canonical DGE analyses, which has been reported previously for simulated bulk data.18

Table 2. Number of differentially expressed genes and differentially used transcripts in eight comparisons between cell types.

The first three columns indicate the comparisons between ALM (column 2) and VISp (column 3) cell types, respectively. Column 4 indicates the number of differentially expressed genes as identified with an edgeR analysis. Column 5 displays the number of transcripts that satuRn flagged as differentially used. Column 6 shows the number of unique genes in which satuRn finds evidence of differential usage of at least one transcript. Column 7 displays the absolute number of genes that overlap between columns 4 and 6.

ComparisonCell type 1
(ALM)
Cell type 2
(VISp)
DGEDTU
Tx
DTU
Gene
Overlap
1Cpa6 Gpr88Batf320324151
2Cbln4 Fezf2Col27a128192533
3Cpa6 Gpr88Col6a1 Fezf2154750
4Gkn1 Pcdh19Col6a1 Fezf223133221
5Lypd1 Gpr88Hsd11b1 Endou331118694
6TncHsd11b1 Endou59519311210
7Tmem163 Dmrtb1Hsd11b1 Endou47190537
8Tmem163 Arhgap25Whrn Tox219763401

Gene set enrichment analysis

We perform a gene set enrichment analysis (GSEA52) on the three comparisons with most DE genes and most genes with evidence for DTU (comparisons 5, 6 and 7). Similar gene ontology categories are returned for the set of DGE genes and the set of DTU genes, with many of the enriched processes being biologically relevant in the context of this case study. Enriched gene sets include the gene ontology classes, synapse, neuron projection, synaptic signaling and cell projection organization. This shows that the complementary information brought by the DTU analysis is indeed biologically relevant. For an extensive overview of the GSEA of the set of DGE genes and genes with evidence of DTU in comparisons 5, 6 and 7, we refer to the Extended data.25

satuRn identifies biologically relevant DTU transcripts

To display the utility of satuRn for identifying and visualizing DTU transcripts in scRNA-seq datasets, we focus on comparison #6 of the DTU analysis and discuss the gene P2X Purinoceptor 4 or P2rx4 (Ensembl ID ENSMUSG00000029470), alsb gene which is part of a family of purinergic receptors that have been implicated in functions such as learning, memory and sleep. In the DGE analysis, no evidence for differential expression of P2rx4 was found at the gene level (FDR-adjusted p-value = 1). By contrast, in our DTU analysis the transcripts of P2xr4 displayed the highest statistical evidence for differential usage within the set of transcripts that could be assigned to the ontology class ‘neuron projection’.52 The mean usage of transcript ENSMUST00000081554 is estimated to be 28.9% in Tnc cells and 75.9% in Hsd11b1 Endou cells (FDR-adjusted p-value = 1.22E-13). For transcript ENSMUST00000195963, the transcript usage changes from 58.2% in Tnc cells and 16.6% in Hsd11b1 Endou cells (FDR-adjusted p-value = 1.79E-10). For the third transcript of P2rx4 that was assessed in our DTU analysis, ENSMUST00000132062, we found no statistical evidence for DTU (FDR-adjusted p-value = 0.534). In Figure 6, we show the output for the visualization of the transcript usages for P2rx4 as obtained from satuRn. Interestingly, the majority transcript in the Tnc cell type, ENSMUST00000195963, is not protein coding.54 By contrast, the majority transcript in the Hsd11b1 Endou cell type, ENSMUST00000081554, is coding for the P2X purinoceptor protein (UniProt ID Q9Z256). As such, the changes in transcript usage between both cell types represent actual biological differences in the functionality of the gene products, which may be relevant to the process of neuron projection. This functional difference would have remained obscured when only performing a canonical DGE analysis.

9cfd0cc3-413e-4d1d-b592-27c5be40dcd8_figure6.gif

Figure 6. Differential transcript usage in the P2rx4 gene.

Panels A-C show the usage of the different transcripts of the P2rx4 gene across cells of the Tnc and Hsd11b1 cell types. The size of each datapoint is weighted according to the total expression of the gene in that cell, i.e., the gene counts per cell. The yellow diamonds indicate the estimated mean usage of a transcript for each cell type, as estimated by satuRn. The cyan and dark green diamonds indicate mean and median gene expression levels per cell type, respectively. Panels A and B display the transcript usage across cells of the Tnc and Hsd11b1 Endou cell types of transcripts ENSMUST00000081554 and ENSMUST00000195963, respectively. The proportion of usage of the former transcript is clearly higher in Hsd11b1 Endou cells, while the latter transcripts is most abundant in Tnc cells. For the third transcript, ENSMUST00000132062 (panel C) there is no evidence for differential usage between both cell types. In addition, there is no evidence for differential expression of P2rx4 on the gene level (panel D). DTU and DGE significance levels are indicated in the figure headers.

Comparison to limma diffsplice

We also analyzed the case study dataset with limma diffsplice.23 When running limma diffsplice with default settings, a large number of DTU transcripts was returned (Extended data25 figure S22) and we observe that the p-values were shifted towards smaller values (Extended data25 figures S23 and S24). Therefore, we adopted the same empirical null strategy as implemented in satuRn to post-process the results. While this dramatically decreased the number of significant DTU transcripts, limma diffsplice still identified more transcripts (i.e., true or false positives) than our method. However, when we inspected the transcripts that were highly ranked in the top DTU list of limma diffsplice but lowly ranked in our top list, we found that most of these transcripts either originate from genes that are lowly expressed, or they are transcripts with a large fraction of zero counts (i.e., zero expression in a large percentage of cells). Limma diffsplice thus claims differential usage more often for transcripts that only contain little information for assessing DTU. This is depicted in Figure 7.

9cfd0cc3-413e-4d1d-b592-27c5be40dcd8_figure7.gif

Figure 7. Global comparison between differential transcript usage (DTU) transcripts uniquely identified by satuRn, uniquely identified by limma diffsplice or by both methods.

Left panel: Boxplots on the average gene-level count for the DTU genes identified by the respective methods. Transcripts uniquely identified by satuRn originate from genes that have a much higher gene-level count (averaged over cells) as compared to transcripts uniquely identified by limma diffsplice. Note that the y-axis is displayed on a log10 scale. Right panel: Violin plots indicating the fraction of cells in which the transcripts are expressed. Transcripts uniquely identified by satuRn are expressed, on average, in a much larger fraction of the cells. Conversely, transcripts identified as DTU uniquely by limma diffsplice often have no expression in a large fraction of the cells. The dark green diamond indicates the median fraction of cells in which the DTU transcripts are expressed.

This behavior can be expected. Limma diffsplice tests for DTU by comparing the log-fold change in expression of transcript t with the average log-fold change in the expression of all transcripts belonging to the same gene as transcript t. As such, limma diffsplice does not incorporate any information on the absolute gene expression levels. In contrast, our quasi-binomial GLM framework models the log-odds of drawing a particular transcript t from the pool of transcripts in the corresponding gene. As a consequence, transcripts belonging to lowly expressed genes are correctly considered less informative in satuRn and are thus less likely to be picked up. For example, in Figure 8A, we show that while our method estimates a mean usage of 7% in Tnc cells and 26% in Hsd11b1 Endou cells (indicated by the gold diamond), the transcript is not identified as differentially used, given the low abundance of the corresponding gene and the highly variable single-cell level observations.

9cfd0cc3-413e-4d1d-b592-27c5be40dcd8_figure8.gif

Figure 8. Three examples displaying differential transcript usage (DTU) transcripts that are uniquely identified by satuRn or limma diffsplice.

Each panel shows transcript usage across cells of the Tnc and Hsd11b1 cell types. The size of each datapoint is weighted according to the total expression of the corresponding gene in that cell, i.e., the total gene count per cell. The yellow diamonds indicate the estimated mean usage of a transcript for each cell type, as estimated by satuRn. The cyan diamonds indicate the mean transcript expression levels per cell type. The header of each panel indicates the FDR-adjusted p-value and the rank of the DTU finding in the top lists by limma diffsplice and satuRn analyses. Panel A: Transcript uniquely identified as differentially used by limma diffsplice. The DTU claim by limma is driven by the difference in mean transcript usage between cell types. Given the low abundance of the corresponding gene and the highly dispersed single-cell level observations, satuRn does not identify the transcript as differentially used. Panel B: Transcript uniquely identified as differentially used by satuRn. Even though the mean difference in transcript usage between cell types is estimated to be 3%, satuRn claims significance given that the difference is stably supported by many cells with high gene-level expression levels. Panel C: Transcript uniquely identified as differentially used by limma diffsplice. The DTU claim by limma is driven by the raw mean difference in transcript usage between cell types. In contrast, satuRn takes into account that the Hsd11b1 Endou cells expressing the transcript at 0% usage have low gene-level count. The size of the dots (which represent individual cells) is weighted according to the total expression of the gene in that cell, i.e., the total gene count per cell. The yellow diamonds indicate the estimated mean usage of a transcript for each cell type, as estimated by satuRn. The cyan diamonds indicate the raw mean transcript usage levels per cell type.

Conversely, by looking at the transcripts that were highly ranked in our DTU list but lowly ranked in the top list of limma, we observe that our model is more likely to capture small changes in transcript usage that are stable across all cells and belong to genes that are highly expressed. An example of such a transcript is shown in Figure 8B. satuRn estimates a mean usage of 3% in Tnc cells and 6% in Hsd11b1 Endou cells. While this is only a minor change in transcript usage, satuRn still identifies this transcript as differentially used because the gene is highly expressed and the small change in usage is supported by a large number of cells. In case such small differences in usage are not considered biologically meaningful, it is possible to set a threshold on the minimal desired difference. Finally, by not taking into account gene abundances, limma is more influenced by outlying observations that have a low gene-level abundance (Figure 8C). Indeed, DTU claims by limma are driven by differences in raw mean usages of transcripts. In Figure 8C, the raw mean usage of the transcript is 77% in Tnc cells and 45% in Hsd11b1 Endou cells, as indicated by the cyan diamonds. By contrast, the mean usage estimate by satuRn, which takes into account that the Hsd11b1 Endou cells expressing the transcript at 0% usage have low gene-level count, is 83% for Tnc cells and 75% for Hsd11b1 Endou cells, as indicated by the gold diamonds.

We therefore argue that, given the above observations, the transcripts identified by satuRn should be considered more reliable, as they generally originate from genes containing more information for assessing DTU.

Comparison to DoubleExpSeq

We additionally analyzed the dataset by Tasic et al. with DoubleExpSeq.20 DoubleExpSeq identified a large number of DTU transcripts in all eight comparisons between cell types, ranging from 335 to 4580 DTU transcripts (Extended data25 figure S22). This is consistent with our performance benchmarks, which already suggested that DoubleExpSeq becomes overly liberal in single-cell datasets with a large number of cells (Figure 4, Extended data25 Figures S7, S8 and S9). We therefore expect many of these transcripts to correspond to false positives. Furthermore, this is reflected in the pathological distribution of p-values obtained by DoubleExpSeq, where p-values tend to be small and therefore the analysis too liberal (Extended data25 figure S25). Furthermore, as discussed in the benchmark studies, we could not adopt the empirical null strategy to improve the FDR control of DoubleExpSeq. Again, a large number of p-values equal 1 poses a problem for estimating the empirical null distribution (Extended data25 figure S26).

While the results of DoubleExpSeq are likely to be overly liberal, the ranking of the transcripts (based on the p-values of the DTU analysis) might still be reasonable. In Figure 9 we observe a large overlap between the top 200 transcripts identified by satuRn in comparison #6 of the case study and the top 200 transcripts of DoubleExpSeq in that comparison. This overlap is considerably smaller with a limma diffsplice analysis.

9cfd0cc3-413e-4d1d-b592-27c5be40dcd8_figure9.gif

Figure 9. Venn diagram displaying the degree of overlap of the top 200 transcripts in comparison #6 of the case study in three DTU analysis tools.

We observe that in the set of the top 200 transcripts identified by satuRn, 149 transcripts overlap with the top 200 list from DoubleExpSeq. In the top 200 list of limma diffsplice, 108 transcripts are present that were not in the top lists of satuRn or DoubleExpSeq.

Finally, we note that while DoubleExpSeq could still be used in this case study given the simple factorial design (using a single factor to assign each cell to a cell type), DoubleExpSeq cannot be used in multifactorial designs, for instance to compare expression levels across multiple cell types between multiple samples or treatment groups.

Analyses with complex experimental designs

The analysis of scRNA-seq experiments often requires incorporating covariates in the statistical model, for instance to correct for batch effects or to account for subject-level baseline characteristics like age or body mass index. satuRn and most other DTU tools allow for the analysis of complex experimental designs with multiple categorical and continuous covariates (see Table 1). Exceptions are DoubleExpSeq and BANDITS that do not support modeling multiple covariates, which limits their utility for scRNA-seq data analysis and bulk RNA-seq experiments with complex designs. This may result in suboptimal data analysis. First, covariates may be able to explain a larger part of the variability in the data, which will increase the statistical power to pick up the effect of interest, as the explained variability would otherwise be incorporated in the error term. Second, the omission of covariates may result in biased parameter estimates. Third, the included covariates themselves could be key to further unravel the studied biological process.

To demonstrate satuRn’s ability to analyze datasets with complex experimental designs, we reanalyzed the single-cell study from Tasic et al.42 but now we include a main effect for cell type, gender, and the continuous covariate age, as well as an interaction effect between cell type and gender.

We first assess DTU between the Hsd11b1 Endou and Tnc cell types and compare the results with the original analysis ignoring gender and age effects. As expected, the additional covariates affect the ranking of transcripts in the resulting top list. For instance, transcript ENSMUST00000198199 dropped from the 8th to the 68th position after correcting for gender and age (empFDR = 1.89e-11 without covariates, empFDR = 2.26e-3 with covariates). Second, we assessed the interaction effect to prioritize transcripts for which the DTU between cell types differs according to gender. The differential transcript usage of ENSMUST00000145174 between Hsd11b1 Endou and Tnc cell types, for instance, is significantly different between female and male mice (empirical FDR = 0.0036). Indeed, for female mice there is evidence for DTU (empirical FDR = 0.01258), while no difference in usage is observed for male mice (empirical FDR = 0.9830, Figure 10).

9cfd0cc3-413e-4d1d-b592-27c5be40dcd8_figure10.gif

Figure 10. DTU between Hsd11b1_Endou cells and Tnc cells of transcript ENSMUST00000145174 is different between female and male mice.

The difference in the usage of transcript ENSMUST00000145174 between Hsd11b1 cells and Tnc cells is significantly different between female and male mice (empirical FDR = 0.0036). Left panel: In female mice, the usage the transcript is significantly higher in Hsd11b1_Endou cells than in Tnc cells (empirical FDR = 0.0126). Right panel: In male mice, no difference in usage is observed (empirical FDR = 0.9830).

Alternative input data types

Above, we have used satuRn for transcript-level differential usage analyses. However, there are two other differential usage analyses that we can perform at the sub gene level: differential usage of transcript compatibility counts,55 sometimes also referred to as equivalence class counts,56 and differential exon usage analyses. Below, we first repeat the case study on the single-cell dataset from Tasic et al.42 using a transcript compatibility count matrix as input to satuRn. Second, we perform a differential exon usage analysis on a subset of the “pasilla” bulk RNA-Seq dataset by Brooks et al.,57 which is the dataset that was used in the exon-level differential usage analysis of the DEXSeq publication.19

Case study on transcript compatibility counts

To obtain transcript-level abundance estimates, pseudoalignment tools like kallisto1 or Salmon2 rely on an expectation-maximization (EM) algorithm58 or similar for assigning reads to transcripts. This is a challenging task, given the largely overlapping RNA sequence of different gene isoforms. Therefore, it has been suggested to perform differential usage analyses on transcript compatibility counts (TCCs),55,59,60 sometimes referred to as equivalence class (EC) counts,56 instead. An equivalence class is the set of transcripts that a sequencing read is compatible with. As such, reads that are pseudo-aligned to the same set of transcripts are part of the same equivalence class. A TCC is the number of reads assigned to each of the different equivalence classes. Note, that TCCs can be obtained much faster than transcript-level counts, as the former does not require running the EM algorithm. Another important advantage of TCCs is that they can also be obtained from UMI-based scRNA-seq experiments, like the 10× Chromium protocol.

Here we perform a differential equivalence class usage (DECU) analysis on the single-cell dataset from Tasic et al.42 The DECU analysis is performed without correction for gender and age to allow for a comparison with our elaborate transcript-level analysis from the first case study analysis. We adopt the same filtering criteria as for the transcript-level analysis. However, ECs that correspond to more than one gene are additionally removed, as the relationship between ECs and genes needs to be determined unambiguously to infer within-gene EC usages.56,59 In our analysis, approximately 15% of the reads were discarded for this reason, which is in line with previous reports.61 After these steps, 26.141 ECs are retained, as compared to 14.420 transcripts in the transcript-level analysis. Next, the same comparisons of interest are tested. The number of differentially used ECs for each comparison are displayed in Table 3.

Table 3. Number of differentially used transcripts and equivalence classes in eight comparisons between cell types.

The first three columns indicate the comparisons between ALM (column 2) and VISp (column 3) cell types, respectively. Column 4 displays the number of transcripts that satuRn flagged as differentially used. Column 5 shows the number of unique genes in which satuRn finds evidence of differential usage of at least one transcript. Column 6 displays the number of equivalence classes that satuRn flagged as differentially used. Column 7 shows the number of unique genes in which satuRn finds evidence of differential usage of at least one equivalence class. Column 8 displays the number of genes that overlap between columns 5 and 7.

ComparisonCell type 1
(ALM)
Cell type 2
(VISp)
DTUDTU
gene
DECUDECU
gene
Overlap
gene
1Cpa6 Gpr88Batf324151288012
2Cbln4 Fezf2Col27a19253623223
3Cpa6 Gpr88Col6a1 Fezf27531195
4Gkn1 Pcdh19Col6a1 Fezf233221026216
5Lypd1 Gpr88Hsd11b1 Endou11869573222
6TncHsd11b1 Endou19311217210254
7Tmem163 Dmrtb1Hsd11b1 Endou905329015240
8Tmem163 Arhgap25Whrn Tox26340251910

For several comparisons, there is a clear discrepancy between the number of differentially used transcripts and equivalence classes (Table 3, columns 4 and 6) and in the number of unique genes in which evidence for differential usage was observed (Table 3, columns 5 and 7). The overlap between the genes found in the transcript-level and the EC-level analysis ranges from 53% in comparisons 6 and 8 to 100% in comparison 3.

For comparison 6 (53% overlap), we investigate the genes identified by both analyses. For some genes that are identified in both analyses, there is a strong agreement between the transcript-level and EC-level results. One example is gene Phosphodiesterase 4D or Pde4d (Ensembl ID ENSMUSG00000021699). As depicted in Figure 11, there is a one-to-one concordance in the usage profiles of the equivalence classes and the transcripts. Interestingly, the first equivalence class 52911|52918 has reads that are compatible to both transcripts ENSMUST00000134973 and ENSMUST00000138938. In the transcript-level analysis, transcript ENSMUST00000138938 was not retained after filtering, as most of these reads were assigned to transcript ENSMUST00000134973 by Salmon. Note that while the interpretation at the transcript level is more sensible from a biological perspective, it also assumes an appropriate attribution of reads to transcripts.

9cfd0cc3-413e-4d1d-b592-27c5be40dcd8_figure11.gif

Figure 11. Concordance between a differential usage analysis at the equivalence class level and the transcript level for gene Pde4d.

Top row: Equivalence class-level analysis. Three equivalence classes passed feature-level filtering. For equivalence class 52911|52918, strong evidence for differential usage between cell type Hsd11b1_Endou and Tnc was observed (empirical FDR = 3.970e-05). Bottom row: Transcript-level analysis. Three transcripts passed feature-level filtering. Each transcript corresponds to the respective equivalence classes in the top row. Note that while equivalence class 52911|52918 is compatible with both transcripts ENSMUST00000134973 and ENSMUST00000138938, the latter did not pass filtering in the transcript-level analysis. The visualizations and empirical FDR values are strongly concordant between the EC-level and the transcript-level analyses. Note that the gene-level counts (size black dots) in the EC-level analysis are lower because equivalence classes that correspond to multiple genes are being removed.

For most other genes in which differential usage was identified in both analyses, the comparison between the transcript analysis and the equivalence class analysis is more ambiguous, as equivalence classes often do not allow to pinpoint the responsible transcript(s) driving the observed differential signal. For example, the gene P2X Purinoceptor 4 or P2rx4 (Ensembl ID ENSMUSG00000029470), which was discussed in Figure 6 of the transcript-level case study, was also identified in the EC-level analysis. Four P2rx4 equivalence classes passed feature-level filtering, which we will denote with EC1-EC4. EC1 uniquely corresponds to transcript ENSMUST00000195963, and both the transcript and EC are found to be differentially used in their respective analyses (Extended data Figure S27, panel A25). The other ECs correspond to two, three and four different transcripts, respectively, but not to transcript ENSMUST00000195963 (Extended data Figure S27, panel A25). For EC2 and EC3, strong evidence for differential usage was found. For EC4, there was no evidence for differential usage. However, based on these ECs alone, it would not be possible to unambiguously attribute these changes to single transcripts, complicating the biological interpretation (Extended data Figure S2725). In contrast, while the transcript-level analyses do allow for clear functional interpretation, whether these results reflect true underlying biology depends on the ability of the quantification tools to assign reads to transcripts (see Discussion). Finally, none of the four ECs are compatible with transcript ENSMUST00000132062, the third transcript that passed feature-level filtering in the transcript-level analysis. In the EC-level analysis, there were many equivalence classes corresponding to this transcript but with a relatively low count, such that these individual equivalence classes did not pass filtering. The fact that the transcript was retained in the transcript-level analysis can be explained by Salmon assigning many of the reads of these low count equivalence classes that correspond to multiple transcripts to transcript ENSMUST00000132062.

Case study on exon-level counts

Finally, we performed a demonstrational case study on a subset of the Drosophila melanogasterpasilla” bulk RNA-Seq dataset by Brooks et al.,57 which is the dataset that was used in the exon-level differential usage analysis of the DEXSeq publication and its corresponding R package vignette.19 In brief, this dataset consists of two experimental groups of three and four samples each. In addition, these samples were sequenced partly in single-end runs and partly in paired-end runs, which is corrected for by including a covariate to the DEXSeq and satuRn models. The dataset was obtained from the pasilla R data package,62 which contains a count matrix for the seven samples and 498 exons originating from 46 genes.

Compared to the original DEXSeq analysis, satuRn identified fewer differentially used exons between the two experimental groups, with three statistically significant findings at 5% FDR for satuRn compared to 20 significant findings with DEXSeq. While we did not benchmark satuRn and DEXSeq on exon-level data, this could correspond to DEXSeq being overly liberal in transcript-level bulk analyses, especially with small sample sizes (Figure 2 and Extended data Figure S225). More importantly, however, the ranking of the exons based on their respective p-values is almost identical between the two methods (Extended data Figure S2825). Again, this is supportive of a similar performance of DEXSeq and satuRn on small bulk RNA-Seq datasets, albeit with a different type 1 error rate. In Extended data Figure S29,25 we additionally demonstrate how satuRn can be used to visualize changes in the usage of exons.

Discussion

In this manuscript, we have proposed satuRn, a new software tool for DTU analysis. satuRn adopts a quasi-binomial GLM framework and obtains direct inference on DTU by modelling the relative usage of a transcript, in comparison to other transcripts from the same gene, between conditions of interest. We evaluated the performance of satuRn with respect to seven other DTU methods on three simulated bulk RNA-seq datasets, a real bulk RNA-seq dataset and three real scRNA-seq datasets. These benchmarks underscored the strong performance of satuRn, as well as its ability to control the FDR close to the nominal level. In addition, we showed that satuRn scales seamlessly to the large data volumes that are produced in contemporary (sc-)RNA-seq experiments. Furthermore, given the underlying GLM framework, our method can handle complex experimental designs that are commonplace in scRNA-seq experiments. Finally, satuRn can extract biologically relevant information from a large scRNA-seq dataset that would have remained obscured in a canonical DGE analysis.

Since most sequencing reads map to multiple transcripts, quantification tools such as Salmon or kallisto only provide an estimate of the expected number of fragments originating from each transcript. Incorporating quantification uncertainty has recently been shown to improve results in differential expression analysis of single-cell RNA-seq datasets.63 Currently, satuRn and all other DTU methods discussed in this manuscript, except for BANDITS,30 neglect the uncertainty on this abundance estimate. BANDITS models the abundance uncertainty; however, it had a markedly lower performance than our method in our benchmark evaluation (Extended data figure S125).

One challenge common to all DTU methods is that the power to detect differentially used transcripts depends strongly on sequencing depth sequencing depth of the scRNA-seq dataset. This becomes clear when comparing the performances for the three different scRNA-seq benchmarks in this manuscript. The performances on the Darmanis26 dataset (Extended data25 figure S9) are markedly lower than the performances on the other two datasets (Figure 4 and Extended data25 figure S8). However, when we stratify the benchmark results according to the percentage of zero counts at the gene level, we observe that performances are negatively associated with sparsity, with all methods having worse performances for transcripts with many zero counts, across all datasets (Extended data25 Figures S30 and S31). The overall difference in performance between the different datasets is thus mainly driven by the number of transcripts with low, middle, and high fraction of zeros. A closer inspection of the Darmanis dataset showed that, after filtering, the transcript-level counts matrix contains a much larger percentage of zero counts than the other datasets (Extended data25 Figures S32 and table S1). We also more frequently observed the scenario where the expression level of a gene could be attributed to a single isoform (Extended data25 Figures S32 and table S1). This effectively causes a binary transcript usage pattern at the level of individual cells, with transcript usages of 0% or 100% Although that this may reflect the true underlying biology, We argue that it is more likely to be a technical artefact as a consequence of more shallow sequencing, given the lower percentage of binary usage profiles in the Chen and Tasic datasets. The supposedly binary expression of transcripts due to coverage-dependent bias and the use of more stringent filtering criteria to reduce this bias has already been comprehensively reported by Najar et al.64

For a subset of the single-cell dataset of Tasic et al.,42 we performed a case study on both transcript-level counts and transcript compatibility counts (TCCs). TCCs have several interesting properties: they are fast to compute and avoid the ambiguity in assigning reads to transcripts with often highly similar RNA sequences. The latter can be particularly difficult with UMI-based RNA-seq (e.g., 10× Chromium) data, given that only one end of the paired-end sequencing read can be used for transcriptome alignment. Nevertheless, biological interpretation of equivalence classes that correspond to multiple transcripts of a gene is challenging. Functional interpretation will therefore be limited to pinpointing genes with differential usage of equivalence classes. Transcript-level analyses, however, do allow for clear functional interpretation but assume the correct attribution of reads to transcripts. An additional disadvantage of equivalence class-level analyses is that reads mapping to multiple genes must be discarded, which for our analysis removed approximately 15% of all counts.

We conclude with the following recommendations for DTU analysis from an applied perspective. In case of small bulk RNA-seq datasets, satuRn, DEXSeq and DoubleExpSeq can be used interchangeably. In case of datasets with more complex designs that require the DTU model to incorporate additional covariates, e.g., batch effects, DoubleExpSeq cannot be used. For single-cell datasets, using DEXSeq will become infeasible in terms of scalability and DoubleExpSeq may give overly liberal results. As such, we recommend satuRn for performing DTU analyses in large bulk and single-cell RNA-seq datasets.

Data availability

Underlying data

Zenodo: Datasets associated with this publication https://doi.org/10.5281/zenodo.6826603.65

This project contains the following underlying data:

  • - Case_study.zip (Transcript-level expression count matrix Tasic_caseStudy_transcript_counts.Rds and corresponding metadata files Tasic_metadata_1.xlsx and Tasic_metadata_2.csv of a subset of the dataset by Tasic et al.42)

  • - Performance_Chen.zip (Transcript-level expression matrices Chen_counts.Rds and Chen_scaledTPM.Rds, as well as the corresponding metadata file Chen_metadata.csv of the dataset by Chen et al.29)

  • - Performance_Darmanis.zip (Transcript-level expression count matrix Darmanis_counts.Rds and the corresponding metadata file Darmanis_metadata.Rdata of the dataset by Darmanis et al.26)

  • - Performance_Dmelanogaster.zip (Dmelanogaster_kallisto is a folder containing the full output of the quantification of the Dmelanogaster dataset40 as quantified with kallisto1 version 0.46.0. The corresponding metadata can be retrieved from Dmelanogaster_metadata_1.txt and Dmelanogaster_metadata_2.txt.)

  • - Performance_GTEx.zip (Transcript-level expression matrices GTEx_counts.gz and GTE_scaledTPM.gz, as well as the corresponding metadata file GTEx_metadata.txt of the GTEx dataset.41)

  • - Performance_Hsapiens.zip (Hsapiens_kallisto is a folder containing the full output of the quantification of the Hsapiens dataset40 as quantified with kallisto1 version 0.46.0. The corresponding metadata can be retrieved from Hsapiens_metadata_1.txt and Hsapiens_metadata_2.txt.)

  • - Performance_Love.zip (Love_kallisto is a folder containing the full output of the quantification of the dataset by Love et al.18 as quantified with Salmon2 version 0.1.0. The corresponding metadata can be retrieved from Love_metadata.rda. Effective transcript length estimates for BANDITS30 are available from Love_eff_len.rds.

  • - Performance_Tasic.zip (Transcript-level expression matrices Tasic_counts.Rds and Tasic_scaledTPM.Rds, as well as the corresponding metadata files Tasic_metadata_1.xlxs and Tasic_metadata_2.csv of a subset of the dataset by Tasic et al.42)

  • - Scalability_analysis.zip (Several.Rdata files containing the scalability results of the different DTU tools on datasets of different sizes.)

In addition, all folders except Scalability_analysis.zip contain intermediate DTU analysis results that are available as.Rdata files or, in the case of Case_study.zip, .Rds files. Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Extended data

Zenodo: Extended data for satuRn publication https://doi.org/10.5281/zenodo.6810615.25

This project contains the following extended data:

  • - Supplementary_Figures.pdf: The supplementary figures to the satuRn publication, including figure captions.

  • - DTU_Methods_Detail.pdf: A.pdf text file that describes the different DTU tools that were included in our benchmarks in greater detail as compared to the description in our main publication. For even higher detail, we refer to the respective original publications.

  • - GSEA_MSigDB.xlsx: The output of the Gene Set Enrichment Analyses (GSEA) for the case study of our publication as generated by the online MSigDB platform.

License: Data are available under the terms of the CC-BY 4.0 license.

Software availability

satuRn R package: https://github.com/statOmics/satuRn

Archived source code at time of publication: https://doi.org/10.5281/zenodo.4656084.66

Code to reproduce analyses and figures: https://github.com/statOmics/satuRnPaper

Archived analysis code at time of publication: https://doi.org/10.5281/zenodo.6826612.67

License: CC-BY 4.0 license.

Comments on this article Comments (0)

Version 2
VERSION 2 PUBLISHED 11 May 2021
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Gilis J, Vitting-Seerup K, Van den Berge K and Clement L. satuRn: Scalable analysis of differential transcript usage for bulk and single-cell RNA-sequencing applications [version 2; peer review: 2 approved] F1000Research 2022, 10:374 (https://doi.org/10.12688/f1000research.51749.2)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 2
VERSION 2
PUBLISHED 08 Aug 2022
Revised
Views
8
Cite
Reviewer Report 01 Feb 2023
Marek Cmero, Walter and Eliza Hall Institute of Medical Research, Melbourne, Vic, 3052, Australia;  Department of Medical Biology, The University of Melbourne, Melbourne, Vic, 3052, Australia 
Approved
VIEWS 8
In version 2 of the manuscript, the authors have comprehensively addressed the suggested revisions. There are some remaining minor issues; addressing these would help with clarity:
  1. The axis labels for Figure S20A should read "number of
... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Cmero M. Reviewer Report For: satuRn: Scalable analysis of differential transcript usage for bulk and single-cell RNA-sequencing applications [version 2; peer review: 2 approved]. F1000Research 2022, 10:374 (https://doi.org/10.5256/f1000research.136695.r146806)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Views
10
Cite
Reviewer Report 30 Jan 2023
Alejandro Reyes, Novartis Institutes for BioMedical Research (NIBR), Basel, Switzerland 
Approved
VIEWS 10
Thanks to the authors for addressing fully the concerns raised in my first revision. They now discuss the general limitations of scRNA-seq affect differential transcript usage analyses, and how each tool is able to capture events of different magnitudes.  Finally, ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Reyes A. Reviewer Report For: satuRn: Scalable analysis of differential transcript usage for bulk and single-cell RNA-sequencing applications [version 2; peer review: 2 approved]. F1000Research 2022, 10:374 (https://doi.org/10.5256/f1000research.136695.r146805)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Version 1
VERSION 1
PUBLISHED 11 May 2021
Views
54
Cite
Reviewer Report 21 Jun 2021
Alejandro Reyes, Novartis Institutes for BioMedical Research (NIBR), Basel, Switzerland 
Approved with Reservations
VIEWS 54
The authors propose a new method to test for differences in transcript usage for both bulk RNA-seq and single-cell RNA-seq data. The manuscript is clear and the method is sound and well explained. The authors provide an excellent implementation of ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Reyes A. Reviewer Report For: satuRn: Scalable analysis of differential transcript usage for bulk and single-cell RNA-sequencing applications [version 2; peer review: 2 approved]. F1000Research 2022, 10:374 (https://doi.org/10.5256/f1000research.54945.r85196)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 07 Sep 2022
    Jeroen Gilis, Applied Mathematics, Computer science and Statistics, Ghent University, Ghent, 9000, Belgium
    07 Sep 2022
    Author Response
    We are grateful for the comments and suggestions of both reviewers, which have markedly improved the readability of the manuscript and have led to several additional analyses and findings. We ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 07 Sep 2022
    Jeroen Gilis, Applied Mathematics, Computer science and Statistics, Ghent University, Ghent, 9000, Belgium
    07 Sep 2022
    Author Response
    We are grateful for the comments and suggestions of both reviewers, which have markedly improved the readability of the manuscript and have led to several additional analyses and findings. We ... Continue reading
Views
57
Cite
Reviewer Report 10 Jun 2021
Marek Cmero, Walter and Eliza Hall Institute of Medical Research, Melbourne, Vic, 3052, Australia;  Department of Medical Biology, The University of Melbourne, Melbourne, Vic, 3052, Australia 
Approved with Reservations
VIEWS 57
Gilis et al. present satuRn, a novel method for performing differential transcript usage analysis on bulk and single-cell RNA-seq data. The approach improves upon existing methods through computational scalability and support for multifactorial experimental designs, while also tightly controlling FDR ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Cmero M. Reviewer Report For: satuRn: Scalable analysis of differential transcript usage for bulk and single-cell RNA-sequencing applications [version 2; peer review: 2 approved]. F1000Research 2022, 10:374 (https://doi.org/10.5256/f1000research.54945.r85874)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 07 Sep 2022
    Jeroen Gilis, Applied Mathematics, Computer science and Statistics, Ghent University, Ghent, 9000, Belgium
    07 Sep 2022
    Author Response
    We are grateful for the comments and suggestions of both reviewers, which have markedly improved the readability of the manuscript and have led to several additional analyses and findings. We ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 07 Sep 2022
    Jeroen Gilis, Applied Mathematics, Computer science and Statistics, Ghent University, Ghent, 9000, Belgium
    07 Sep 2022
    Author Response
    We are grateful for the comments and suggestions of both reviewers, which have markedly improved the readability of the manuscript and have led to several additional analyses and findings. We ... Continue reading

Comments on this article Comments (0)

Version 2
VERSION 2 PUBLISHED 11 May 2021
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.