Label-aware distance mitigates temporal and spatial variability for clustering and visualization of single-cell gene expression data

Liang, Shaoheng; Dou, Jinzhuang; Iqbal, Ramiz; Chen, Ken

doi:10.1038/s42003-024-05988-y

Download PDF

Article
Open access
Published: 14 March 2024

Label-aware distance mitigates temporal and spatial variability for clustering and visualization of single-cell gene expression data

Communications Biology volume 7, Article number: 326 (2024) Cite this article

364 Accesses
1 Altmetric
Metrics details

Subjects

Abstract

Clustering and visualization are essential parts of single-cell gene expression data analysis. The Euclidean distance used in most distance-based methods is not optimal. The batch effect, i.e., the variability among samples gathered from different times, tissues, and patients, introduces large between-group distance and obscures the true identities of cells. To solve this problem, we introduce Label-Aware Distance (Lad), a metric using temporal/spatial locality of the batch effect to control for such factors. We validate Lad on simulated data as well as apply it to a mouse retina development dataset and a lung dataset. We also found the utility of our approach in understanding the progression of the Coronavirus Disease 2019 (COVID-19). Lad provides better cell embedding than state-of-the-art batch correction methods on longitudinal datasets. It can be used in distance-based clustering and visualization methods to combine the power of multiple samples to help make biological findings.

Probabilistic embedding, clustering, and alignment for integrating spatial transcriptomics data with PRECAST

Article Open access 18 January 2023

Reconstruction of the cell pseudo-space from single-cell RNA sequencing data with scSpace

Article Open access 29 April 2023

SIMBA: single-cell embedding along with features

Article Open access 29 May 2023

Introduction

Gene expression reflects the identity of a cell. Single-cell RNA sequencing (scRNA-seq) technologies profile thousands of cells simultaneously¹, enabling trajectory inference to reveal the course of cell development and transformation². Although a large amount of data has been gathered from different tissues among large cohorts of patients³, technical variances among separately assayed samples often overshadow the similarity of cells, resulting in disconnected trajectories (Fig. 1a) that hinders the discovery of underlying biological processes. This phenomenon often called the batch effect, complicates the single-cell sequencing data analysis. For samples collected from the same condition (i.e., same time and tissue but different participants), differences among samples are usually considered batch effects and get corrected, but over-correction is suspected in some cases⁴. For longitudinal data, where true biological changes and nuisance factors are entwined (Figure 1b), the distinction between biological and batch effects becomes more elusive.

**Fig. 1: Illustration of how Lad removes batch and reconstruct trajectories from longitudinal data.**

As a practical example, researchers have collected data from the developing retina of 13 mice at different ages⁵. Ideally, the embedding of such data should reveal a fluent trajectory of how pluripotent stem cells differentiate/evolve into multipotent stem cells, and finally to specific cell types. However, as a mouse does not survive the tissue extraction, each sample in the dataset is from a unique mouse. The batch effect exists among samples collected from different mice. Because there is no bijective (i.e., injective and surjective) mapping for cells from different samples, existing methods addressing batch effect in longitudinal data assume that the features are measured on the same set of entities (cells) at different time points⁶ are not applicable to single-cell data.

To address this issue, multiple methods have been published⁷. For example, Limma fits a linear model and removes the component for the batch effects⁸. Seurat utilizes mutual nearest neighbor (MNN, in CCA or RPCA space) to identify similar clusters in different batches and integrates those batches by removing the differences⁹. Harmony also integrates proximal clusters, but in an iterative way through soft clustering¹⁰. Liger uses nonnegative matrix factorization (NMF) to separate common and sample-specific features¹¹. A neural network approach, scVI, combines a variational autoencoder and a zero-inflated model to visualize and correct the data. Notably, Harmony generates integrated clusters and visualization without giving a corrected expression profile. It is deemed acceptable, however, because corrected datasets are often hard to interpret. Researchers usually only cluster and visualize the data using such methods, and recur to statistical tests that control for the batch effect in the downstream analyses¹². These methods have been utilized in a few large-cohort studies and satisfyingly removed the batch effect among samples^9,10. However, none of them are designed for longitudinal data.

In an effort to address this issue, we noticed the temporal locality among the single-cell samples. Specifically, samples gathered at closer time points are expected to be more biologically alike. If we decompose the difference between two samples to batch effect and biological effect. The former is uniform between all samples, but the latter increases with the distance between samples. Thus, the relative strength of the effects decreases over distance. In other words, the observed difference between two adjacent time points is more likely to be the batch effect, than that of two distant time points.

Here, we define a distance metric, termed Label-Aware Distance (Lad) which exploits the locality to precisely remove the batch effect (Figure 1c) but keeps biologically meaningful information that forms the trajectory. As a distance metric, it is naturally compatible with all state-of-the-art distance-based clustering and embedding methods. We compare Lad with Euclidean distance, Limma, Seurat integration (with CCA and RPCA), and Harmony on simulated data and mouse retina development data, and show applications of Lad on COVID-19 patient data and human fetal lung development data. The results clearly show the benefit of Lad to biology studies.

Results

The label-aware distance

Across single-cell samples, genes (and their combinations) are differentially affected by the batch effect. Traditional clustering and trajectory inference methods based on metrics that assign equal weights to all genes, such as the Euclidean distance, fail to address this discrepancy in data. We introduce Label-Aware Distance (Lad), where a weight matrix (covariance matrix) for genes is used to offset the batch effect. The matrix is directly inferred from the gene expression data and time labels or spatial coordinates (Methods), based on the temporal/spatial locality intuition: the batch effect is uniform across samples, while the biological effect increases over the distance of samples.

To validate Lad, we simulated a dataset with seven samples. We assume that all cells are differentiated from an initial state, stem cells named S. S differentiates into two multipotent cell types A and B, which differentiate into terminal types A1, A2 and B1, B2, respectively (Fig. 2a). We simulated 200 genes, and assigned a unique ideal gene expression profile for each cell type, denoted as s, a, b, a₁, a₂, b₁, and b₂. Because the development of the cell types is gradual¹³, we used weighted mean to represent the process in the simulated samples. For example, to simulate intermediate states between A and A1, we used (θa + (1 − θ)a₁) as its profile, where θ is set to be uniformly distributed in a range corresponds to the developing stage of a sample (Table 1). The ambiguous cell types (θ close to 0.5) are denoted as “A → A1”, while cells more similar to “A” or “A1” (θ close to 0 or 1) are labelled as them each.

**Fig. 2: Results on the simulated data.**

Table 1 Simulated dataset

Full size table

We randomly selected a set of genes to be affected by the batch effect and added different additive components to those genes in different samples. For every single cell, we further added random noise into the ideal profile and used a Poisson distribution to generate the gene expression. The number of samples at each time point and compositions of samples are shown in Table 1. It can be seen that the cell types gradually evolve over time.

We used Seurat to process the data with Lad, Euclidean distance, Limma, Seurat integration (with CCA and RPCA), and Harmony. In the process, 150 highly variable genes are selected, and 30 PCs are used. The result of Lad is shown in Figure 2b. Given the prior knowledge that all cell types originate from S, it can be clearly seen that it branches to A and B, and further to A1, A2, and B1, B2, respectively. Batch effect structure can still be seen in A1, A2, B1, and B2, but the downstream visualization is improved as samples are now organized by their time. Day 1 is at the center, day 4 is separated into A1 and A2 at the top-left and B1 and B2 at the bottom-right corner, while days 2–3 correspond to the trajectory of which the cells differentiate.

In contrast, Euclidean distance (on PCA), the baseline, does not correct for any batch effect. The difference between Euclidean distance and LAD is apparent when visualizing the similarity of cell types (Supplementary Fig. 1 and Supplementary Note 1). Consequently, the terminal types A1, A2, B1, and B2 each splits into two groups, corresponding to different samples. It also fails to illustrate the evolution of the cell types. Limma does not generate a visually appealing figure, though a reasonable integration is achieved with the branches in the data largely preserved. Seurat integration shows a high capacity for removing batch effects, as the cells from multiple days are mixed. However, it is an over-correction. For example, S, S → A and A are mixed, and so do A → A1 and A1, and A → A2 and A2. The same applies to the branch of B. It also leads to incorrect trajectory inferences, as A is now closer to A than A → A1. Overall, it makes it harder to delineate cell differentiation. Seurat with RPCA rather than CCA is slightly less prone to overcorrection but still fails to preserve the A1/2 and B1/2 branches. The result of Harmony is slightly better. The two batches are fairly mixed for A1, A2, B1, and B2. The trajectory of S → A to A, then to A → A1 and A → A2, and finally to A1 and A2 can be seen, although the B → B2 and B2 are misplaced. However, the A branch and B branch are still in disconnected clusters, leaving their relationship with S undiscovered. Overall, Lad performs the best. For time consumption, Lad and Limma use less than 1 second, while Harmony uses 9 seconds and Seurat integration uses 20 seconds.

Mouse retina development dataset

We applied Lad to a dataset including 13 single-cell specimens collected from the developing retina of one sample at E12 (day 12 embryo), two at E14, one at E16, two at E18, one at P0 (postnatal day 0), two at P2, one at P5, two at P8, and one at P14⁵. A total of 110,359 cells are collected and processed with Seurat, where 2,000 highly variable genes are selected, and 30 PCs are used.

The development of the mouse retina is well-understood by the field (Fig. 3a). Briefly, retinal progenitor cells (RPCs, including early RPCs and late RPCs) differentiate into Neurogenic cells, which further differentiate into photoreceptor precursors, Amacrine cells, horizontal cells, and retinal ganglion cells¹⁴. The photoreceptor precursors differentiate into cone cells, rod cells, and bipolar cells¹⁵. These cell types are all neural cells. Late RPCs also differentiate to Müller glia¹⁴. These cells form a neural network where each terminal cell type has a unique function. Cone cells and rod cells forming the input layer are photoreceptors that work in light and dark environments, respectively. The signal from cone cells and rod cells propagates through the bipolar cells first and then retinal ganglion cells to go to the brain. Horizontal cells provide horizontal connections between photoreceptors and bipolar cells, while amacrine cells perform similar functions between bipolar cells and retinal ganglion cells. These cells form two hidden layers. Müller glia is an auxiliary cell type that supports the aforementioned neural cells.

**Fig. 3: Results for the retina development dataset.**

The result of Lad shows a clear trajectory of the cells gradually evolving from E11 to P14 (Figure 3b), The aforementioned evolution trajectory of the cell types can also be seen along the trajectory. The Euclidean distance also performs relatively well on this dataset. However, a clear batch effect can be seen in uncorrected data (PCA), where cells are clustered by samples. As a result, a cell type is divided into multiple clusters, making it harder to delineate the differentiation trajectory of the cell types. Both Seurat integration and Harmony over-correct the batch effect. Although the samples are well mixed, many cell types are mixed. For example, early RPC, late RPC, and Müller glia are not distinguishable from the result, and so are horizontal cells, amacrine cells, bipolar cells, and retinal ganglion cells. Even worse, the bipolar cells, which should be differentiated from photoreceptor precursors, are misplaced directly under neurogenic cells. These mistakes are because these methods do not utilize the temporal locality of the samples. Quantitative benchmarks concur with these qualitative evaluations (Figure 3c) Lad shows comparable batch correction strength but conserves more biological content. In addition, trajectory analysis shows that Lad helps Monocle 3 infer pseudotimes more coherent with the actual times and prior biological knowledge about Nfia, Nfib, and Nfic (Supplementary Fig. 2, 3 and Supplementary Note 2)⁵. It should be noted that notwithstanding the better overall embedding, the batch structure is largely retained when zoomed in (Supplementary Fig. 4). Thus, the results may need to be interpreted with caution. For this dataset, Lad uses less than 2 seconds, while Harmony uses 5 minutes. Seurat integration costs more than 3 hours. We also included a similar analysis of developmental data on Zebrafish embryo data in Supplementary Note 3 and Supplementary Fig. 5.

COVID-19 immune compartment datasets

The global pandemic COVID-19 has reportedly infected 6.3 million people worldwide, with the death toll at 380 thousand. Understanding the immune response to the virus is essential to developing treatments. We applied Lad to a recently published dataset including immune compartment samples collected from 13 participants¹⁶ including 4 health controls (HC1-4), 3 moderate (M1-3) cases, and 6 severe (S1-6) cases. Although these are all different participants, we consider HC, M, and S a time course to reflect the progression of the disease. We used Seurat and Lad to process the data. During the process, 2,000 highly variable genes are selected, and 30 PCs are used.

The result is shown in fig. 4a. The largest group is macrophages, which recognize and destroy virus-infected cells. The result shows from top to bottom a gradual change from those in health controls, to moderate cases and severe cases. Some under-correction can be seen in the results. For example, HC3/4 are closer to, but not fully integrated with HC1/2. However, it would still be easier to identify macrophages with the corrected data.

**Fig. 4: Results on the COVID-19 dataset.**

Quantitative benchmarks give similar scores to all methods (Figure 4b), but only Lad delineates the gradual changes in macrophages among health controls, moderate cases, and severe cases (Figure 4a). In contrast, Euclidean distance yields disconnected groups of macrophages because of the batch effect, while Seurat integration (CCA and RPCA), albeit ranked top, and Harmony overcorrect the effect and confuse cells from moderate cases with those from health controls. Limma, on the other hand, appears to be underpowered, leaving many scattered macrophage clusters from different samples. The changes in gene expression that drive the differences can be attained by differential expression analysis. Similar trajectories can also be seen on T cells and plasma cells, which kill infected cells and produce antibodies, respectively. Using these pieces of information, researchers can identify the most effective form of immune cells and find ways to transform others into it to treat the disease.

Human lung data

Besides studies of the immune compartment, knowledge of lung development may also help cure the disease and restore the functionality of the lung¹⁷. Miller et al. has produced a single-cell dataset of cells from¹⁸. Cells from fetal human lungs are collected at week 11.5 (W11.5), W15, W18, and W21, and are available for the trachea, small airways in the lung, and the distal tip of the lung.

We explored the dataset using Lad. Because both temporal and spatial information of the samples is available, we use a vector [week, location] to label each sample, where week is the number of weeks mentioned above, and location is set to be 0, 2, and 4 for trachea, small airways, and distal lung, respectively. The result is shown in Fig. 5. A branching trajectory can be seen from W11.5 to W18 for distal lung and small airways mesenchymal cells. The cells from the two locations are similar at the early stage (W11.5) but become more distinct when they are more developed (W15 and W18). In contrast, affected by the batch effect, Euclidean distance and Harmony show W15 and W18 small airway mesenchymal cells as isolated clusters, with no connection with W11.5, while Seurat mixes all mesenchymal cells across the two locations and all times points, blurring the trajectory (Supplementary Fig. 6 and Supplementary Note 4). Similar trajectories also show for Epithelial cells, endothelial cells, and pericytes. Researchers may use these pieces of information to further study the changes in gene expression in the cell type developments and develop treatments.

**Fig. 5: Results on the human fetal lung development dataset.**

Discussion

In order to build a reliable trajectory of cell type development from a longitudinal dataset, the batch effect should be corrected for. This is a new problem as state-of-the-art batch effect correction methods cannot utilize the time/spacial information, and thus result in over-correction. Our method, Lad, utilizes such information to accurately identify and remove the batch effect while preserving the correct structure of the data. The results based on the Lad clearly show the evolution of the cell types through time. The discovered trajectory helps researchers make hypotheses of the physiological changes during development or pathological changes as a result of disease infection. The changes can be ascertained by differential gene expression analysis.

Lad assumes that the batch effect is shared by all the samples. This is a reasonable assumption because many kinds of batch effects have a biological basis. For example, the cellular stress response stimulated by the sample preparation affects certain genes. In the case the set of genes does change, the local linear approximation may be used. Because the sums and products of distances are guaranteed to be valid distances, multiple Lads each correcting for the batch effect in a specific set of samples can be combined as a consensus distance.

Lad is intrinsically a linear transformation, which may be insufficient for more complex batch effects such as nonlinear interactions of genes in multicenter clinical studies. Notwithstanding the better overall embedding, the batch structure is retained in some cases, necessitating extra caution in interpreting the results.” Nevertheless, it is a proof of concept that the time/spatial locality should be considered in batch correction for longitudinal datasets, which are on the rise. Nonlinear methods may be invented based on the same concept. For example, kernelization can be a direct extension to Lad. Researchers usually try multiple quality control, batch correction, and visualization methods to find the most suitable ones for their data, and Lad is an inexpensive option to be included. There is also a potential to cascade Lad with methods like Harmony, which essentially refines the cell-cell similarity graph. Although batch correction helps produce better visualization and clustering results, biases can be introduced during the process. Small false similarity between cells may also be added by the process, although most clustering and visualization methods will ignore them when building the similarity graphs based on only the most similar cells. Overall, rigorous statistical tests on original data are needed to ascertain the findings^19,20.

Although all the experiments are from a biology background, the scope of Lad is not confined to it. It can be applied to any longitudinal/spatial dataset affected by batch effects where the temporal/spatial locality holds. Lad also illustrates that batch the effect correction problem is related to the alternative clustering problems. Over the past few years, many advanced alternative clustering models have been introduced, and translating them to this context may result in better performance.

In summary, we defined a pairwise distance of the cells, namely Label-Aware Distance (Lad), where the effect of the unwanted clustering is controlled. Results show our method achieves more accurate clusters and better visualizations than state-of-the-art methods on longitudinal datasets. The Lad can be directly integrated into most clustering and visualization methods to enable more scientific findings.

Methods

The label-aware distance

Trajectory inference, in general, aims to find a graph G = (V, E) which reflects the hop-by-hop gradual change (E) of cells (V_i, V_j, ⋯∈ V) by optimizing

$$\min \mathop{\sum}\limits_{({V}_{i},{V}_{j})\in E}\parallel {x}_{i}-{x}_{j}\parallel$$

(1)

subjecting to a set of constraints^2,21. Here, x_i is the profile of cell i, which can be the whole gene expression, or the first few principal components (PCs). Euclidean distance is the most widely used metric, while the Mahalanobis distance

$$\parallel {x}_{i}-{x}_{j}{\parallel }_{{\Sigma }^{-1}}=\sqrt{{({x}_{i}-{x}_{j})}^{{\mathsf{T}}}{\Sigma }^{-1}({x}_{i}-{x}_{j})},$$

(2)

where Σ is the covariance matrix calculated from all the samples, may be used to account for different (co)variances among features. To account for the batch effect, we propose to redefine the Σ as

$$\tilde{\Sigma }=\mathop{\sum }\limits_{{{{{{{{\rm{cell}}}}}}}} \, i=1}^{n}\mathop{\sum}\limits_{{{{{{{{\rm{batch}}}}}}}}b\ne {C}_{i}}{W}_{ib}({x}_{i}-{m}_{b}){({x}_{i}-{m}_{b})}^{{\mathsf{T}}},$$

(3)

where C_i is the batch cell i is from, and m_b is the mean expression of cells in batch b. It sums over all b’s except for the one cell i belongs to. If weight W_ib ≡ 1, it degrades to the metric defined by Qi and Davidson²² for generating an alternative clustering. In essence, it removes the variances across the batches, while retaining the variance within each batch. For a longitudinal dataset, each sample b is collected from a time point t_b, and a cell i is from the time point ${t}_{{C}_{i}}$. To utilize the temporal/longitudinal locality, we set

$${W}_{ib}=\exp \left(-\frac{\parallel {t}_{{C}_{i}}-{t}_{b}{\parallel }^{2}}{2\,{l}^{2}}\right),$$

(4)

where l (set to 1 in our experiments) is the length scale within which two samples are considered temporally/spatially close. The sensitivity to l is generally small (Supplementary Note 5 and Supplementary Fig. 7), and a hyperparameter search can help find suitable ones. The covariance of proximal time points is weighted more and thus is suppressed in the refined distance. When the dataset contains both temporal and spatial labels, τ_i and t_j can be vectors that include both labels. Inversed Cholesky-decomposed $\tilde{\Sigma }$ can be used to transform the data. If first k PCs are used, the computational complexity is O(nk² + k³). Practically, we solve the linear system ${\tilde{\Sigma }}^{\frac{1}{2}}y=({x}_{i}-{x}_{j})$ to avoid numerical issues in finding the inverse (${\tilde{\Sigma }}^{\frac{1}{2}}$ is the Cholesky decomposition).

Gene expression data processing

We use Seurat²³, an R package, to analyze the gene expression data. The package provides functionalities to normalize data, find highly variable features (i.e., genes) by variance stabilizing transformation, scale the features, perform principal component analysis (PCA), and visualize the result with UMAP (uniform manifold approximation and projection). This is the de facto standard single-cell data analysis protocol. The normalization step, in particular, normalizes the summation of gene expression in each cell to be one. The scaling step standardizes each gene so that the average expression over all cells is zero, and the standard deviation is one. UMAP is a nonlinear embedding method to visualize data by their distance²⁴.

Also provided in Seurat is a data integration method that corrects batch effect⁹. It first projects samples into a common subspace using canonical correlation analysis (CCA), and then finds MNNs in the CCA subspace as “anchors” to correct the data. We refer to it as Seurat integration (not to be confused with the entire Seurat protocol). Harmony first projects the data into a lower-dimensional PCA space, and then iteratively removes batch effects. At each iteration, it clusters cells while maximizing the diversity of batches within each cluster and calculates a correction factor for each cell to remove the batch effect. Tran et al.⁷ show by systematic assessments that both methods are state-of-the-art. Thus, we compare Lad with them. To ensure good comparability, we implemented Lad with an interface to Seurat. This choice also makes Lad easy to use for biology researchers familiar with Seurat.

Preprocessing is done according to the requirements/recommendations of different methods. Seurat integration handles individually preprocessed data (and uses CCA/RPCA to find a consensus embedding), while other methods working on embedding spaces can only use data that are preprocessed combined. Limma requires log-transformed normalized data before scaling. Because normalization is done per cell, preprocessing separately or combined will give the same result.

Benchmarks

We added two sets of benchmarks: bio-conservation and batch integration. For bio-conservation, we use silhouette score to measure the overall concordance of cell distribution and ground truth labels and isolated labels silhouette score to measure how well each label is distinguished from all other labels using the average-width silhouette score. For batch integration, we use batch silhouette score to measure how well the batches are mixed and graph connectivity to quantify the connectivity of the subgraph per cell type label. The aggregated score is the arithmetic mean. We used the Scib-metric implementation of these metrics²⁵.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All datasets used in this work are public data. They are available in public repositories, including the mouse retina (GSE118614)²⁶, COVID-19 (GSE145926)²⁷, and human lung (E-MTAB-8221)²⁸. The source data behind Fig. 3c and 4b in the paper can be found in Supplementary Data 1.

Code availability

The software package is available at https://github.com/KChen-lab/ladand Zenodo²⁹. The scripts to generate all results are included in the repository.

References

Lim, B., Lin, Y. & Navin, N. Advancing cancer research and medicine with single-cell genomics. Cancer Cell 37, 456–470 (2020).
Article CAS PubMed PubMed Central Google Scholar
Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 37, 547–554 (2019).
Article CAS PubMed Google Scholar
Regev, A. et al. Science forum: the human cell atlas. Elife 6, e27041 (2017).
Article PubMed PubMed Central Google Scholar
Nygaard, V., Rødland, E. A. & Hovig, E. Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses. Biostatistics 17, 29–39 (2016).
Article MathSciNet PubMed Google Scholar
Clark, B. S. et al. Single-cell RNA-seq analysis of retinal development identifies NFI factors as regulating mitotic exit and late-born cell specification. Neuron 102, 1111–1126 (2019).
Article CAS PubMed PubMed Central Google Scholar
Müller, C. et al. Removing batch effects from longitudinal gene expression-quantile normalization plus combat as best approach for microarray transcriptome data. PloS One11.6, e0156594 (2016).
Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 1–32 (2020).
Article Google Scholar
Ritchie, M. E. et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47–e47 (2015).
Article PubMed PubMed Central Google Scholar
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
Article CAS PubMed PubMed Central Google Scholar
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with harmony. Nat. Methods 16.12, 1289-1296 (2019).
Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887 (2019).
Article CAS PubMed PubMed Central Google Scholar
Finak, G. et al. Mast: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 16, 278 (2015).
Article PubMed PubMed Central Google Scholar
Schiebinger, G. et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell 176, 928–943 (2019).
Article CAS PubMed PubMed Central Google Scholar
Stenkamp, D. L. Development of the vertebrate eye and retina. In Progress in Molecular Biology and Translational Science, 134, 397–414 (Elsevier, 2015).
Brzezinski, J. A. & Reh, T. A. Photoreceptor cell fate specification in vertebrates. Development 142, 3263–3273 (2015).
Article CAS PubMed PubMed Central Google Scholar
Liao, M. et al. Single-cell landscape of bronchoalveolar immune cells in patients with COVID-19. Nat. Med. 26.6, 842–844 (2020).
Golchin, A., Seyedjafari, E. & Ardeshirylajimi, A. Mesenchymal stem cell therapy for covid-19: present or future. Stem Cell Rev. Rep. 16, 427–433 (2020).
Miller, A. J. et al. In vitro and in vivo development of the human airway at single-cell resolution. Dev. Cell 53.1, 117–128 (2020).
Luecken, M. D. & Theis, F. J. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).
Article PubMed PubMed Central Google Scholar
Liang, S., Liang, Q., Chen, R. & Chen, K. Stratified test accurately identifies differentially expressed genes under batch effects in single-cell data. IEEE/ACM Trans. Comput. Biol. Bioinforma. 18, 2072–2079 (2021).
Article CAS Google Scholar
Qiu, X. et al. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 14, 979 (2017).
Article CAS PubMed PubMed Central Google Scholar
Qi, Z. & Davidson, I. A principled and flexible framework for finding alternative clusterings. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 717–726 (2009).
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
Article CAS PubMed PubMed Central Google Scholar
McInnes, L., Healy, J. & Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018).
Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
Article CAS PubMed Google Scholar
Clark, B. Single-cell rna-seq analysis of retinal development identifies NFI factors as regulating mitotic exit and late-born cell specification [dataset]. NCBI GEO database accession GSE118614 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE118614 (2019).
Zhang, Z. Single-cell landscape of bronchoalveolar immune cells in COVID-19 patients [dataset]. NCBI GEO database accession GSE145926 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE145926 (2020).
Czerwinski, M., Yu, Q. & Spence, J. SCRNA-SEQ of human fetal lung primary tissues and cell cultures derived from fetal bud tip progenitors under dual smad treatment [dataset]. E-MTAB database accession E-MTAB-8221 https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-8221 (2020).
Liang, S., Dou, J., Iqbal, R. & Chen, K. Label-aware distance mitigates temporal and spatial variability for clustering and visualization of single-cell gene expression data [software]. Zenodo DOI: 10.5281/zenodo.10646069 (2022).

Download references

Acknowledgements

This publication is part of the Human Cell Atlas - www.humancellatlas.org/publications/. This work was supported in part by grant number 2018-182735 to KC, Human Breast Cell Atlas Seed Network Grant (HCA3-0000000147) to KC from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation, grant RP180248 to KC from Cancer Prevention & Research Institute of Texas, The University of Texas MD Anderson Cancer Center Pre-Cancer Atlas Project to KC, and The University of Texas MD Anderson Cancer Center Colorectal Cancer Moonshot and P30 CA016672 (US National Institutes of Health/National Cancer Institute) to the University of Texas Anderson Cancer Center Core Support Grant.

Author information

Shaoheng Liang
Present address: Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA

Authors and Affiliations

Department of Bioinformatics and Computational Biology, MD Anderson Cancer Center, Houston, TX, USA
Shaoheng Liang, Jinzhuang Dou, Ramiz Iqbal & Ken Chen
Department of Computer Science, Rice University, Houston, TX, USA
Shaoheng Liang

Authors

Shaoheng Liang
View author publications
You can also search for this author in PubMed Google Scholar
Jinzhuang Dou
View author publications
You can also search for this author in PubMed Google Scholar
Ramiz Iqbal
View author publications
You can also search for this author in PubMed Google Scholar
Ken Chen
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

S.L. and K.C. conceptualized the problem. S.L., K.C., J.D., and R.I. contributed to the method. S.L. implemented the software and performed the experiments. S.L., K.C., J.D., and R.I. analyzed the results and drafted the manuscript.

Corresponding authors

Correspondence to Shaoheng Liang or Ken Chen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethics statement

The original studies had received sufficient ethical approval from the Institutional Animal Care and Use Committee of the Johns Hopkins University School of Medicine²⁶, the Research Ethics Committee of Shenzhen Third People’s Hospital²⁷, and The University of Michigan Institutional Review Board¹⁸. All participants provided written informed consent for sample collection and subsequent analyses²⁷.

Statistics and reproducibility

No statistical tests are involved. The computational experiments can be reproduced by using the code provided in the repository.

Peer review

Peer review information

Communications Biology thanks Laleh Haghverdi and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Chien-Yu Chen and George Inglis. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Peer Review File

Supplementary Information

Description of Additional Supplementary Files

Supplementary Data 1

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Liang, S., Dou, J., Iqbal, R. et al. Label-aware distance mitigates temporal and spatial variability for clustering and visualization of single-cell gene expression data. Commun Biol 7, 326 (2024). https://doi.org/10.1038/s42003-024-05988-y

Download citation

Received: 03 July 2023
Accepted: 28 February 2024
Published: 14 March 2024
DOI: https://doi.org/10.1038/s42003-024-05988-y

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

Introduction

Results

The label-aware distance

Mouse retina development dataset

COVID-19 immune compartment datasets

Human lung data

Discussion

Methods

The label-aware distance

Gene expression data processing

Benchmarks

Reporting summary

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Ethics statement

Statistics and reproducibility

Peer review

Peer review information

Additional information

Supplementary information

Rights and permissions

About this article

Cite this article

Share this article

Comments

Search

Quick links