Main

Systems immunology provides a holistic understanding of the immune system, spanning from single immunological components and pathways to cross-scale networks. Unlike reductionist approaches aimed at understanding individual parts, systems immunology aims to understand the properties that emerge when these parts work together, a challenge that requires specialized methodologies1,2.

Over the past century, many successful experimental strategies have been developed that were instrumental in defining cell types and cellular states within the immune system, in revealing its major molecular and functional components and in establishing causal relationships for the transcriptional and functional cascades that drive immune activation (Fig. 1). Over the past two decades, high-throughput, high-resolution omics technologies have revolutionized our understanding of immunology by enabling the simultaneous assessment of hundreds to thousands of cellular, functional and molecular parameters, with continuously increasing throughput and decreasing turnaround times.

Fig. 1: Milestone methods in immunology.
figure 1

Timeline of the most important technological developments in immunology research, with a special focus on the evolution of omics from the advent of microarrays to current single-cell approaches.

Sequencing-based technologies are used to assess genomic, transcriptomic and epigenomic information, and sophisticated technologies in proteomics, metabolomics, microbiomics and lipidomics have been introduced to immunological research. In the past decade, single-cell sequencing technologies have emerged, with single-cell transcriptomics leading the way3.

In this how-to guide, we provide a brief introduction to the use and integration of omics technologies in systems immunology and explain how current single-cell-level omics technologies can be applied to immunological questions in model systems and, increasingly, in human immunology and clinical trials for immune-mediated diseases. We focus on the use of transcriptomic technologies in systems immunology, particularly on single-cell assays, as mRNA constitutes the first functional and relatively easily accessible readout of the genome; as such, it can serve as a surrogate to bridge genomic and functional phenotypes and to enable the description and prediction of causal relationships and effectors of immune-cell function.

Evolution of omics in immunology

High-throughput, high-resolution omics technologies have been used in immunology since microarrays were introduced4,5. These early microarray-based techniques were applied widely, for example to understand genetic differences and evolution of Bacillus Calmette–Guérin (BCG) vaccines6, to examine systemic inflammation and the network of leukocytes in people with systemic lupus erythematosus5 and to characterize the activation network of macrophages in response to diverse stimuli7. However, in the late 2000s, microarray-based techniques were superseded by unbiased massive-scale whole RNA sequencing (RNA-seq)8,9. The identification of long non-coding RNAs as broad-acting regulatory components of inflammatory responses exemplified how these technologies can lead to the discovery of completely new molecular concepts in immunology10. Shortly thereafter, the first single-cell immune-cell transcriptomes were described11,12, which fundamentally changed the way immune-cell types and cellular states can be defined and how this information can be used to predict cellular activity and immune-cell function. Since then, the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq), single-cell DNA methylation profiling, single-cell lipidomics and single-cell metabolomics have been introduced as further means to characterize immune cells13,14,15.

Specific to immunological research, repertoire analyses of recombined B cell receptors (BCRs) and T cell receptors (TCRs) by BCR-seq and TCR-seq have a crucial role in understanding the complex mechanisms controlling the diversity and specificity of adaptive immune responses16. Combined with single-cell transcriptomic and antigen-binding analyses using sophisticated analytical tools17,18,19,20, BCR-seq and TCR-seq can shed light on the functional state of the adaptive immune repertoire and its specificity in response to, for example, pathogens or tumor antigens, which might become important features for the diagnosis and therapy of immune-mediated diseases21.

In parallel to the advances in single-cell genomics, the field of multiparameter antibody-based characterization of immune cells has deepened our understanding of immunology, mainly owing to the advent of heavy-metal-based cytometry by time of flight (CyTOF) and oligonucleotide-based cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq), complementing the development of high-dimensional fluorochrome-based flow cytometry. Furthermore, imaging mass cytometry is a highly valuable extension of CyTOF for characterizing immune cells in their natural environment and spatial context.

The first antibody-independent mass-spectrometry-based single-cell proteomics technologies, such as SCoPE2, have also recently been reported and present a logical next step for the characterization of immune cells22.

As technologies continue to improve, metabolome, proteome, microbiome and lipidome studies and the integration of the data that they produce are being used to tackle immunological questions at a system-wide scale. Nevertheless, sequencing-based technologies remain the most commonly used techniques, particularly at the single-cell level. The most accessible omics technologies for immunology researchers are listed in Tables 1–3 and have been reviewed elsewhere23,24,25,26.

Table 1 Overview of technologies: transcriptomics
Table 2 Overview of technologies: epigenomics
Table 3 Overview of technologies: other

Choosing an omics technology

When applying omics technologies in systems immunology, one first needs to define which omics technologies are best suited to answer the proposed question. Here, we introduce the respective technologies for the three major layers of biological information, namely the transcriptome, epigenome and genome.

Transcriptomics

Techniques for interrogating the transcriptome were the forerunners of the omics revolution. RNA sequencing remains the gold standard for unbiased, genome-wide assessment of gene expression at the population level, and many protocols exist for a variety of purposes27.

For questions focusing on the heterogeneity of cell populations in health and disease or cellular differentiation and developmental trajectories, single-cell transcriptomics has quickly gained popularity since its introduction28. Despite the cost and technical complexity, we encourage the use of single-cell technologies when studying heterogeneous cellular populations and molecular phenotypes (for example, in complex tissues or in response to a diversity of perturbations). If preliminary results indicate negligible cellular heterogeneity, bulk analysis is a viable option given its lower cost and thereby its potential to analyze larger sample sizes and to provide higher density transcriptomic information. Furthermore, for large to very large clinical studies exceeding hundreds of samples, bulk technologies in combination with deconvolution algorithms trained on a small set of single-cell resolved data are currently an effective way to gain important information about transcriptional regulation and function (for example, associated with a specific disease), pathological processes or therapeutic interventions, including vaccines29,30,31.
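As an illustration of the deconvolution idea mentioned above (not a re-implementation of the cited algorithms), the sketch below estimates cell-type fractions in a bulk expression profile from a signature matrix derived from single-cell data using non-negative least squares; all gene and cell-type labels and values are hypothetical.

```python
# Minimal sketch of bulk deconvolution: given a single-cell-derived signature
# matrix (mean expression per cell type), estimate the mixing fractions that
# best explain a bulk profile. Published tools add marker selection,
# normalization and statistical modeling on top of this principle.
import numpy as np
from scipy.optimize import nnls

# signature matrix: genes x cell types (hypothetical values)
signature = np.array([
    [120.0,   5.0,  30.0],   # gene A
    [  8.0, 200.0,  15.0],   # gene B
    [ 40.0,  35.0, 180.0],   # gene C
])
bulk = np.array([60.0, 110.0, 95.0])   # bulk expression of the same genes

weights, _ = nnls(signature, bulk)     # non-negative mixing weights
fractions = weights / weights.sum()    # normalize to cell-type proportions
print(dict(zip(["T cell", "B cell", "monocyte"], fractions.round(3))))
```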

Today, several complementary scRNA-seq approaches exist, each with specific advantages and applications for answering different questions (summarized in Table 1). As examples of the two most widely used scRNA-seq methods, plate-based full-length-mRNA techniques have the highest sensitivity (albeit limited in throughput) and enable isoform detection in isolated cell populations32, and 3′- or 5′-mRNA-capture approaches using microfluidics or nanoliter-well arrays enable high throughput at the cost of decreased sensitivity33,34.

Although profiling of the immune-receptor repertoire on a population level can inform us about clonal diversity of lymphocytes in homeostasis and disease16, its adaptation to single-cell resolution (scTCR-seq and scBCR-seq) enables clonotypic phenotype analysis35. Furthermore, oligonucleotide-coupled antibodies (CITE-seq36 or commercially available reagents, such as TotalSeq and AbSeq) enable the combination of scRNA-seq with analysis of surface protein expression.

Epigenomics

In addition to the transcriptome, the epigenome can be interrogated using omics technologies. The variety within this class of technologies reflects the complexity of epigenetic regulation37. Although ATAC-seq is seemingly the most prominent representative of epigenomics technologies38, many solutions to investigate DNA accessibility and conformation, histone modifications and transcription factor binding at the bulk and single-cell level have been developed and are summarized in Table 2. Furthermore, multi-omics methods to profile epigenetic markers alongside the transcriptome in single cells have undergone a rapid development from proof-of-concept reports to robust protocols applicable at a large scale39.

Genomics

Omics technologies are crucial to define genetic alterations as major drivers for human immune phenotypes (Table 3)40. Aside from whole-genome or whole-exome sequencing and targeted next-generation sequencing (NGS) panels, genotyping microarrays present a scalable and cost-effective solution for population genetics. Genetic variability has been investigated at the single-cell level (reviewed elsewhere41), but often these data suffer from sparsity and high levels of noise as the biological material assessed in a single cell is limited. Exploiting structural genomic rearrangements leading to abnormal copy numbers by copy-number variation (CNV) analyses or taking advantage of the higher copy number of the mitochondrial genome by targeted mitochondrial DNA sequencing (scMito-seq) are sophisticated approaches to overcome these limitations and can be used to infer cell fate through lineage tracing42,43,44,45.

The need for hypothesis-driven research

The potential of omics technologies in immunology seems endless, and the high-dimensional output of these technologies often triggers unbiased analytical approaches. Although exploratory data analysis can be essential to initially understand data structure and detect potential biases, we encourage researchers to follow established principles of hypothesis-driven science as outlined in the proposed systems-immunology cycle (Fig. 2)46. In this cycle, the application of (single-cell) multi-omics technologies follows the formulation of the hypothesis or question in combination with classical experimental design (for example, loss-of-function or gain-of-function experiments, defined clinical cohorts or clinical trials, such as vaccine trials or immunotherapy) to establish immune function and molecular phenotypes or to predict immunotherapy responses and outcomes. Although a hypothesis is seen by some as a liability, we stress its guiding function while acknowledging the risk of it blinding researchers to alternative questions or paths of analysis47. Admittedly, a hypothesis in omics-based immunological studies can be vague, such as proposing broad transcriptional differences in multiple peripheral immune cells in a case–control study of an inflammatory disease. Nevertheless, we argue that a hypothesis-driven modus operandi helps scientists formulate and focus on central questions and does not preclude the potential for independent discovery and the derivation of new hypotheses in secondary data usage.

Fig. 2: ‘How to’ in immunomics.
figure 2

The systems-immunology cycle, with representative examples for each step, from the first medical observation or phenotype to validation of results. DE, differential expression; ML, machine learning.

The major difference between this holistic approach and classical reductionist experimentation is the requirement for mathematical and computational modeling of big data. This step of the cycle could be termed ‘data driven’; however, without a well-formulated hypothesis and sound experimental design, cutting-edge multi-omics technologies are at risk of missing their mark. By contrast, hypothesis-driven approaches and well-conceived experimental setups result in high-resolution omics data that provide valuable, and often unanticipated, biological explanations while reducing the risk of failure and enabling the prioritization of follow-up and validation studies.

How to apply omics technology

Human immunology has already benefited substantially from omics technologies, such as the large endeavors of the Human Immunology Project Consortium48, the ImmVar study49, the Human Functional Genomics Project40,50 and the Milieu Intérieur study51, to study the variability of the human immune system and to better characterize genotype-phenotype associations. Bulk omics technologies, such as DNA-seq, RNA-seq and ATAC-seq, are now commonly included in clinical studies of immunological diseases52,53. The Human Cell Atlas was the first large initiative to integrate omics technologies54, and now single-cell-resolution technologies (in particular scRNA-seq) are increasingly included in systems-level immunological readouts in large clinical trials55.

Experimental design

For human immunology research, five scenarios can be envisioned (Fig. 2): (1) exploratory studies to define immune functions and immune-cell types, usually performed in small cohorts of up to 20 individuals56; (2) validation studies in humans assessing immunological findings derived from model systems43; (3) cross-sectional cohort studies, either in healthy or diseased individuals, to study human immune variation and immune deviation in the context of diseases57; (4) vaccine, immunotherapy and other therapy-response trials31; and (5) studies exploring genetic or environmental effects on human immune function58,59. Depending on the goals and the size of the study, the factors affecting data quality might differ. For example, inter-individual variation can strongly affect exploratory studies that include samples from only a few individuals, whereas for genome-wide assessments and genetic susceptibility studies, restricted diversity within a cohort might not fully represent the spectrum of human variation. Similar considerations need to be included in the design of (immuno)therapy trials. For example, the dynamics of immune activation, function and cellular distribution following a vaccine will differ between individuals, and genome-wide changes might have different kinetics, the capture of which requires not only highly standardized sampling schemes but also sophisticated analytical approaches60,61,62.

Necessity of teamwork and expertise

Compared with the lower-resolution methods that are often used as primary readouts in clinical studies63, the application of omics technologies requires consideration of many potential factors that affect data quality and thus thorough planning to harness the potential of these high-resolution, high-content technologies. As technologies are continuously evolving, a team of omics experts should be included in the design of clinical studies that address immunological questions. Furthermore, omics data generation and analysis should be included in educational programs in immunology46.

In addition to study design, sample handling according to well-defined standard operating procedures (SOPs), library production, sample multiplexing, sequencing strategies and depth, data pre-processing and downstream analyses, including metadata handling, need to be addressed (Box 1).

Batch effects

A major aspect, if not the most important one, when planning omics applications, particularly with increasing sample sizes generated across different institutions, is the effect of technical parameters, often referred to as 'batch effects'64,65. Given the vast number of measurements defining the feature space, ranging from ~100 features in targeted sequencing up to hundreds of thousands of features in chromatin-landscape analysis, and considering the sensitivity of omics technologies, batch effects can be introduced at any step in the sample-handling and data-generation process; thus, efforts to decrease batch effects are essential for good study design.

For single-cell transcriptomics, multiplexing of samples enables joint processing, which can help to reduce technical variability. A number of strategies for this purpose have been developed, including labeling of cells using either oligonucleotide-tagged antibodies (cell-hashing) or lipid-modified and cholesterol-modified oligonucleotides66, or the use of natural genetic variation67 to disentangle cells from multiple donors.
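As a simplified sketch of how hashtag-based multiplexing is resolved computationally, the example below assigns each cell to the sample whose hashtag oligonucleotide dominates its counts and flags putative doublets; the count matrix and thresholds are hypothetical, and dedicated demultiplexing methods use proper statistical models rather than fixed cut-offs.

```python
# Simplified hashtag-oligonucleotide (HTO) demultiplexing sketch: assign each
# cell to the dominant hashtag, call a doublet if two hashtags are both
# strongly detected, and call 'negative' if no hashtag is well detected.
import numpy as np

def demultiplex(hto_counts, sample_names, ratio_cutoff=3.0, min_counts=50):
    """hto_counts: cells x hashtags count matrix; thresholds are arbitrary."""
    calls = []
    for counts in hto_counts:
        order = np.argsort(counts)[::-1]            # hashtags, highest first
        top, second = counts[order[0]], counts[order[1]]
        if top < min_counts:
            calls.append("negative")
        elif second > 0 and top / second < ratio_cutoff:
            calls.append("doublet")
        else:
            calls.append(sample_names[order[0]])
    return calls

counts = np.array([[300, 4, 2], [150, 120, 3], [5, 2, 1]])
print(demultiplex(counts, ["donor1", "donor2", "donor3"]))
# ['donor1', 'doublet', 'negative']
```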

To avoid effects of circadian rhythm and seasonality, samples should be collected at similar times of day if possible68,69. For studies conducted over an extended period of time, seasonality might either be considered as a covariate of immune function70 or eliminated by sampling during the same time of the year. Although seemingly trivial, sampling itself needs to be highly standardized, as organ, location of biopsy, sampling devices, time from biopsy to sample processing and sample-freezing procedures need to be as uniform as possible71,72,73, and any deviations from the protocol must be recorded carefully for each included sample. Such technical and clinical metadata can later be useful to understand unanticipated variance and tackle batch effects during data analysis using batch-effect-removal algorithms65,74. In addition, if available, these metadata facilitate further reuse of the data.
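As a minimal sketch of how recorded technical metadata feed into batch-effect removal, the example below applies Scanpy's ComBat implementation using a batch covariate stored in the cell metadata; the input file and the 'sequencing_run' column are hypothetical, and ComBat is only one of several correction approaches referred to above.

```python
# Minimal sketch: use a recorded technical covariate (processing or sequencing
# batch) for batch-effect removal with Scanpy's ComBat implementation.
# Alternative integration methods similarly rely on an explicitly recorded
# batch label in the metadata.
import scanpy as sc

adata = sc.read_h5ad("cohort_counts.h5ad")           # hypothetical input
adata.obs["batch"] = adata.obs["sequencing_run"]     # recorded technical metadata

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.combat(adata, key="batch")                     # regress out the batch covariate
```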

Isolation procedures for RNA or DNA also require careful standardization. For example, batch effects might be introduced by handling some samples manually and others using automated sample handling. Similarly, if studies become too large to handle all samples in one run, individual batches might have differences due to the reagents or buffers that are used. Here, randomization of the samples extracted and processed in each batch can prevent the introduction of uncontrollable biases in the data.
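A minimal sketch of such a randomization is shown below: samples are shuffled and then dealt into processing batches so that cases and controls are balanced across batches; the sample identifiers, group labels and number of batches are hypothetical.

```python
# Minimal sketch: randomize samples across processing batches, stratified by
# group, so that biological condition and batch are not confounded.
import numpy as np
import pandas as pd

samples = pd.DataFrame({
    "sample_id": [f"S{i:02d}" for i in range(1, 13)],
    "group": ["case", "control"] * 6,
})

n_batches = 3
# shuffle, keep groups together with a stable sort, then deal round-robin so
# every batch receives a balanced mix of cases and controls
shuffled = samples.sample(frac=1, random_state=0).sort_values("group", kind="stable")
shuffled["batch"] = np.arange(len(shuffled)) % n_batches + 1
print(shuffled.sort_values(["batch", "group"]))
```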

Prior to the setup of larger (multi-center) studies, small pilot trials evaluating all necessary steps and predicting potential confounding effects of upscaling are advisable.

Taken together, batch effects have to be considered in the interpretation of omics data, and knowing their origin and how to minimize their effect on data analysis is critical for the production of robust results.

Metadata collection and standardized documentation

Collecting dense technical and clinical metadata on participants in clinical trials is becoming increasingly important when omics technologies are used. The variability observed in high-resolution omics data might remain impossible to interpret without comprehensive records covering both technical aspects, such as sampling method and device, library production protocol or experimental day or batch, and clinical parameters, including sex, age, body-mass index, disease history, comorbidities, medication, smoking history and additional clinical markers such as serum levels of inflammatory biomarkers and differential blood-cell counts (see refs. 57,75,76). Worse, missing technical or clinical metadata might cause misinterpretation of complex data and mislead subsequent research directions. It is also crucial for meta-analysis that researchers and publishers make metadata accessible while respecting privacy and data-protection regulations. Moreover, the use of accepted ontologies for clinical metadata helps to maintain the highest possible degree of consistency across studies77.
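A minimal sketch of how such standardized documentation can be enforced computationally is shown below: a sample-metadata table is checked for an agreed set of technical and clinical fields before analysis; the field names and file are examples only, not a proposed standard.

```python
# Minimal sketch: verify that a sample-metadata table contains the agreed
# technical and clinical fields and flag incomplete records before analysis.
import pandas as pd

REQUIRED_FIELDS = [
    "sample_id", "donor_id", "age", "sex", "disease", "medication",
    "sampling_date", "processing_batch", "library_protocol",
]

metadata = pd.read_csv("sample_metadata.csv")    # hypothetical metadata table

missing_columns = [c for c in REQUIRED_FIELDS if c not in metadata.columns]
if missing_columns:
    raise ValueError(f"missing metadata fields: {missing_columns}")

incomplete = metadata[metadata[REQUIRED_FIELDS].isna().any(axis=1)]
print(f"{len(incomplete)} samples with incomplete metadata")
```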

In principle, similar caution should be applied to sequencing, as library production, and even sequencing itself, involve many variables that affect downstream analysis (Box 1). Although sequencing core facilities usually know how to minimize batch effects, planning these steps together with them can further improve data quality.

Aside from the many pitfalls in data production, data processing and analysis also require high standards for reproducibility and documentation. The many options and consequential choices during data processing have noticeable effects on data content and quality and therefore need to be standardized for any given project and stringently reported to the community. Taking the simple example of the alignment of sequencing reads to a reference genome, it has been thoroughly demonstrated how the choice of the alignment algorithm and reference genome and transcriptome annotation can affect data quality and content78. Another clear but important example is the selection of gene biotypes, such as protein-coding, long non-coding or microRNAs, in gene expression quantification and downstream analyses. Focusing on protein-coding genes, as is quite common in transcriptomic analyses, might simplify the task of analyzing and interpreting gene expression data, but prevents assessment of regulation mediated by non-coding RNAs, despite the fact that total RNA libraries contain this information.
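To make such processing choices explicit and reportable, the sketch below restricts a count matrix to selected gene biotypes using an annotation table, for example one parsed from an Ensembl or GENCODE GTF; the file names and column names are hypothetical.

```python
# Minimal sketch: make the gene-biotype selection an explicit, documented step.
# Assumes an annotation table with 'gene_id' and 'gene_biotype' columns and a
# count matrix indexed by gene identifier.
import pandas as pd

annotation = pd.read_csv("gene_annotation.csv")            # hypothetical file
counts = pd.read_csv("counts.csv", index_col="gene_id")    # genes x samples

keep_biotypes = {"protein_coding", "lncRNA"}   # deliberate, reported choice
kept_genes = annotation.loc[annotation["gene_biotype"].isin(keep_biotypes), "gene_id"]
filtered = counts.loc[counts.index.intersection(kept_genes)]
print(f"kept {filtered.shape[0]} of {counts.shape[0]} genes")
```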

Sample size

Sample size is another important aspect to be considered in the design of omics studies in human systems immunology. Although exploratory pilot studies can work with low numbers of well-defined samples, studies addressing genetic susceptibility require large cohorts. Furthermore, the contrast between inter-individual variability and intra-individual (that is, inter-cellular) heterogeneity needs to be considered specifically for sample-size estimation in single-cell omics studies. Molecular profiling of a specific subset of cells from individuals with a heterogeneous disease requires many samples from a relatively large patient cohort. By comparison, an exploratory study of a clinically well-defined disease spanning a whole cellular compartment, such as peripheral blood immune cells and their cell states, will require large numbers of cells from each individual.

Moreover, the fact that the effect sizes among different features can vary considerably further complicates study size estimation. Approaches to power calculation, such as the scPower or powsimR frameworks79,80, should be taken into account during study design.
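scPower and powsimR are dedicated, R-based frameworks that model omics-specific properties such as sequencing depth and cell numbers; as a much simplified first orientation, the sketch below uses a classical two-group power calculation to translate an assumed effect size and a multiple-testing-adjusted significance threshold into a sample size.

```python
# Simplified sketch: classical two-group sample-size estimate as a first
# orientation. Dedicated frameworks (scPower, powsimR) model sequencing depth,
# cell numbers and expression distributions and should be preferred for real
# single-cell study designs.
from statsmodels.stats.power import TTestIndPower

n_features_tested = 5000
alpha = 0.05 / n_features_tested     # crude Bonferroni adjustment

n_per_group = TTestIndPower().solve_power(effect_size=0.8, alpha=alpha, power=0.8)
print(f"~{n_per_group:.0f} samples per group for a standardized effect size of 0.8")
```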

The high cost of sequencing-based omics techniques still limits the number of samples that can be analyzed. One potential strategy to avoid underpowered studies is to first evaluate the initial hypothesis in subcohorts of individuals in which the largest effect size is predicted, taking intra-sample and inter-sample heterogeneity into account, and to include additional individuals only if an interim analysis has shown differences between the groups. Multiplexing samples can substantially reduce cost and is particularly suitable for studies with many samples that each require only few cells. In addition, the initial sequencing data can be used to perform a more accurate power estimation and, if necessary, sequencing data from additional samples can be added to the study data set. This is particularly feasible for omics data for which samples can be safely stored for extended periods of time. This approach enables optimization of the trade-off between cost and the informative value of the study, as long as technical batch effects between study phases are minimized.

Challenges and opportunities for data analysis

Once data production and quality control have been completed, in-depth downstream analysis of the data can begin. In Box 2, we list the hardware requirements for such analyses and a selection of bioinformatics tools that we find particularly useful. With the vast numbers of new bioinformatics tools and the fast pace at which they are being published, the possibilities for analyses are seemingly endless81. We therefore strongly advocate for the formulation of an analysis plan with clearly defined and prioritized questions and well-established methods to address them (Fig. 3), if possible in consultation with experienced data scientists. Such a plan can greatly speed up the analysis and help computational team members who might not have the subject-specific biological knowledge to address the most relevant questions. In view of the enormous feature space and large amount of room for unexpected observations, this plan must be dynamic. But, even if it seems naive, writing a strategy that allows for adjustments prevents the analyst from getting lost in the many analytical possibilities. As already emphasized, we favor hypothesis-driven studies that combine omics data with computational modeling and experimental validation as a powerful approach for data generation, interpretation and efficient knowledge gain.

Once priorities and major questions are defined, and depending on the nature of the available data (be it transcriptomic, epigenetic or genetic data at bulk or single-cell resolution), different analytical pipelines and tools need to be applied (reviewed elsewhere26,81). The first analytical steps comprise data exploration using unbiased and unsupervised methodologies, such as hierarchical clustering or principal component analysis. Understanding the data independent of the initial hypothesis is vital to identify the axes of variance present in uncharted data and to grasp the dominating variables, such as disease classification in clinical studies or the experimental date in case of batch effects. Evaluation of the robustness of parameter settings (for example, quality filtering cut-offs, doublet identification, dimensionality-reduction parameters and clustering resolutions) and establishment of a suitable data model might take considerable time and should not be underestimated.
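As an illustration of these first exploratory steps for a single-cell transcriptomics data set, the sketch below runs a standard Scanpy workflow from quality filtering to clustering; the input file, the metadata columns ('disease', 'batch') and all cut-offs are placeholders whose robustness needs to be evaluated and reported for each data set.

```python
# Minimal sketch of first exploratory steps for an scRNA-seq data set with
# Scanpy. All parameters are placeholders to be tuned, checked for robustness
# and reported for the specific data set.
import scanpy as sc

adata = sc.read_h5ad("cohort_counts.h5ad")             # hypothetical input

sc.pp.filter_cells(adata, min_genes=200)               # basic quality filtering
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

sc.pp.pca(adata, n_comps=30)                           # unsupervised exploration
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.leiden(adata, resolution=1.0)                    # clustering resolution to vary
sc.tl.umap(adata)
sc.pl.umap(adata, color=["leiden", "disease", "batch"])  # inspect dominant axes of variance
```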

Fig. 3: Experimental and analytical plan.
figure 3

Proposed workflow for experimental and analytical planning in the systems-immunology cycle.

Subsequently, hypothesis testing using statistical methods to contrast gene expression levels between different groups within the study cohort presents a major readout82. An alternative to classical inferential hypothesis testing for defining differentially expressed genes between groups of samples is provided by gene-regulatory-network approaches, which enable the identification of subtle but robust patterns of expression change within the study despite small effect sizes83,84. Naturally, with the introduction of single-cell resolution in omics technologies, the complexity of data analysis has increased. New challenges have emerged, including unprecedented dimensionality as well as high sparsity and noise in the data. In addition, cell-type identification and annotation in the context of existing knowledge, integration of data across experiments and cell-type-associated expression of genes have demanded new analytical solutions85.
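As a minimal sketch of such hypothesis testing at single-cell resolution, the example below contrasts gene expression between patient groups within one annotated cell type using Scanpy's Wilcoxon test; the input file and the 'cell_type' and 'condition' columns are hypothetical, and pseudobulk approaches that aggregate cells per donor before testing are often statistically more appropriate for cohort studies.

```python
# Minimal sketch: differential expression between two patient groups within a
# single annotated cell type (Wilcoxon rank-sum test in Scanpy).
import scanpy as sc

adata = sc.read_h5ad("annotated_cohort.h5ad")           # hypothetical input
subset = adata[adata.obs["cell_type"] == "classical monocyte"].copy()

sc.tl.rank_genes_groups(
    subset, groupby="condition", groups=["disease"],
    reference="control", method="wilcoxon",
)
results = sc.get.rank_genes_groups_df(subset, group="disease")
print(results.head())                                   # top differentially expressed genes
```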

The results of such complex computer-aided analyses can be viewed as models of the data, which depend on numerous parameter settings and aim to represent the underlying biology. This is probably the major difference compared with classical readouts (for example, a cytokine measurement or cell surface marker expression). Uncertainties are inherent to all of these computational approaches and are controlled in two complementary ways. First, the use of different computational methods to describe the data structure presents an in silico validation of the model. Second, experimental validations are crucial, for example to test predicted cellular phenotypes by functional assays in the laboratory.

However, to propose such validation experiments, the data models must be biologically interpreted. Questions range from defining the signaling pathways that are changed within the study cohort86 and predicting the binding of transcription factors that induce certain transcriptional programs within regions of open chromatin87 to modeling potential ligand-receptor interactions between different cell types if single-cell data are available88,89. For almost all of these important tasks, many approaches and tools are available. When considering how to apply new predictive layers to unravel the underlying biology within the data, it is often advisable to collaborate or exchange information with the experts who have introduced new computational approaches.
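Dedicated tools exist for each of these interpretation tasks; as a minimal illustration of the principle behind over-representation analysis of pathways, the sketch below asks whether a pathway's gene set is enriched among a list of differentially expressed genes using a hypergeometric test, with all numbers chosen arbitrarily.

```python
# Minimal sketch of over-representation analysis: is a pathway enriched among
# differentially expressed (DE) genes? Dedicated tools add curated gene-set
# databases, multiple-testing correction and richer statistics.
from scipy.stats import hypergeom

n_genes_total = 20000      # background: all genes tested
n_de_genes = 400           # differentially expressed genes
pathway_size = 150         # genes annotated to the pathway
overlap = 12               # DE genes that belong to the pathway

# P(X >= overlap) under the hypergeometric null distribution
p_value = hypergeom.sf(overlap - 1, n_genes_total, pathway_size, n_de_genes)
print(f"enrichment P = {p_value:.2e}")
```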

Machine learning

Attempts to integrate multiple omics layers, for example transcriptomic with metabolomic90 or epigenomic data91, add yet another level of complexity to the analysis. As expected, the mathematical and computational models for integration are complicated and constantly evolving. We expect new, innovative and user-friendly approaches to be introduced in the next few years92. This is similarly true for machine-learning methods, which are becoming more common for the analysis of omics data93, particularly at the single-cell level81. As machine learning is itself a broad and rapidly growing field, it is of utmost importance to phrase the question to be answered clearly before applying machine-learning strategies. Reconstruction of gene regulatory networks to identify targetable hub genes is one possible application of machine learning87. Moreover, for the characterization of transcriptional alterations induced by, for example, a new infectious disease, deep-learning strategies can be used to map new data sets onto a reference from healthy individuals94.
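Deep-learning-based reference mapping handles batch effects and prediction uncertainty explicitly; as a minimal illustration of the underlying idea of transferring labels from an annotated reference to new data, the sketch below trains a simple classifier on the reference and predicts cell types in a query data set, assuming both objects are normalized in the same way and that the file names and the 'cell_type' column are hypothetical.

```python
# Minimal sketch of the reference-mapping idea: learn cell-type labels on an
# annotated reference and transfer them to a newly generated data set.
# Dedicated deep-learning reference-mapping methods are far more robust; this
# only illustrates the principle of label transfer.
import scanpy as sc
from sklearn.linear_model import LogisticRegression

reference = sc.read_h5ad("healthy_reference.h5ad")      # annotated reference
query = sc.read_h5ad("new_disease_cohort.h5ad")         # new data set

shared_genes = reference.var_names.intersection(query.var_names)
clf = LogisticRegression(max_iter=1000)
clf.fit(reference[:, shared_genes].X, reference.obs["cell_type"])
query.obs["predicted_cell_type"] = clf.predict(query[:, shared_genes].X)
```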

Data storage

By definition, omics data sets are large, and for large human cohorts or clinical studies, data storage and processing require substantial computing resources (Box 2). Moreover, omics data contain information that can be sufficient to re-identify individuals and are therefore regulated by national laws for privacy protection, such as the General Data Protection Regulation (GDPR) in the European Union and the Health Insurance Portability and Accountability Act (HIPAA) in the United States. It is therefore advisable to take the necessary ethical and legal precautions when dealing with human omics data. Data storage needs to occur either on highly protected in-house systems or in regulated repositories maintained by public organizations, such as the European Genome-phenome Archive (EGA) or the database of Genotypes and Phenotypes (dbGaP). Pre-processing of raw data, including quality control, alignment or pseudoalignment to reference genomes and transcriptomes or novel assemblies, and normalization, is computationally expensive and is often best carried out by collaborating genome centers. Once data are preprocessed and summarized, they are usually much smaller and can be handled on standard computer equipment, even laptops. Irrespective of the computing infrastructure used for analysis, data-privacy standards must be upheld at all times, and platforms that integrate user management, data access, data and metadata management, data storage and data analysis in a protected environment are the way forward. Such platforms can provide principles of findability, accessibility, interoperability and reusability (FAIR)95 and containerized environments to ensure reproducibility, with the option of serving as a safe place to train young scientists in computational biology applications96.

Data availability

Given the high computational component of current omics studies, the respective code and scripts are a critical part of the work. Although it is impractical to publish the entire code in the materials section of a publication, we strongly encourage the community to provide the entire source code used for the analysis in public repositories (for example, GitHub or Zenodo) to ensure full reproducibility of analyses. Furthermore, online platforms such as FastGenomics96 provide a place to store both preprocessed data and the accompanying code, allowing the reader to interactively reproduce the analysis in predefined containerized environments.

When it comes to data sharing, access and reuse, the omics field is currently leading the way147 and it is good to see that similar strategies are now being supported within the field of immunology research, for example for CyTOF and multiparameter flow cytometry data or immunophenotyping data for clinical studies148. Benchmarking new studies, or using previous knowledge to classify insights from new omics data, is becoming a standard procedure. This rich information from existing data can be further leveraged. For example, cohort-wide data sets of functional immunological and omics information can be used to assess human variation in gene expression as a predictor for gene function when comparing individuals with low or high expression of a gene of interest97.

Validation of omics data

Omics data are of value during the experimental validation phase of the systems-immunology cycle (Fig. 2), both in model systems and in human validation studies. The results from human omics studies can be validated at the molecular and mechanistic level by using well-defined genetic model systems that build on decades of immunological and genetic research. Applying omics approaches, for example in a specific mouse knockout condition, enables further exploration of related molecular alterations and extension of the human phenotype97. Nevertheless, mouse models do not always reflect human immunology and thus should not be used as the only means of validation. We therefore suggest the use of two or more validation strategies whenever possible. Mechanistic hypotheses can be directly evaluated within the model system with classical functional assays and extended by molecular-biology-based in vitro studies in cell culture. Insights gained from genetic model systems can then be transferred back into the human setting and further identified molecular details can be tested in the initially acquired human data sets.

Validation is also possible entirely within human data sets by making use of natural variation at a locus of interest related to the identified results, that is, a single-nucleotide polymorphism (SNP) existing in the human population that can be studied as a phenotype-linked quantitative trait locus (QTL)53. Alternatively, genetic models can be generated in human cells through gene editing and assessed by functional in vitro assays, or targeted CRISPR-mediated gene perturbations coupled with sequencing (Perturb-seq) can be applied, enabling entire pathways or molecular networks to be targeted within a single experiment while generating omics-level readouts98.
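In its simplest form, such a QTL-based validation regresses the expression of a gene of interest across donors on the genotype dosage at the SNP; the sketch below shows this with arbitrary example values, whereas real analyses use dedicated pipelines with covariate adjustment and stringent multiple-testing control.

```python
# Minimal sketch of a single expression-QTL test: regress a gene's expression
# across donors on genotype dosage (0, 1 or 2 copies of the alternative allele).
import numpy as np
from scipy.stats import linregress

genotype_dosage = np.array([0, 0, 1, 1, 1, 2, 2, 0, 1, 2])                 # per donor
expression = np.array([1.1, 0.9, 1.6, 1.4, 1.7, 2.3, 2.1, 1.0, 1.5, 2.4])  # same donors

result = linregress(genotype_dosage, expression)
print(f"effect size (slope) = {result.slope:.2f}, P = {result.pvalue:.3g}")
```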

Meta-analysis of available data sets

Another important validation approach is to link new data and findings to prior knowledge of published results (Fig. 2). Newly identified molecular phenotypes can be cross-checked in independent human studies with any classical immunological assay, such as flow cytometry or functional tests. Further, studies including omics-level information can be used to derive gene signatures for a certain cell state, a cell type or a disease, which are then tested for enrichment in the new data set. This approach is widely used in single-cell transcriptomics, as these data are ideal for generating such signatures99. Existing data are found in specialized repositories, such as the Gene Expression Omnibus (GEO), dbGaP and EGA. Sometimes, anonymized processed data, such as gene expression count data, are part of the initial publication or are available on interactive online platforms for easy access and exploration96 (https://data.humancellatlas.org/ and https://singlecell.broadinstitute.org/single_cell). Note that preprocessed data might not always be ideal for secondary use because, for example, realignment against newer versions of the genome, inclusion of sequences of pathogenic species or normalization considering different covariates is no longer possible. Under these circumstances, starting from raw data and using standardized pipelines100 is advisable. Other options for reusing existing data during validation include the investigation of newly identified genes or pathways of interest in existing data sets, or the reanalysis of similar public data sets with the same algorithms as applied to the new data to identify similarities and overlap with the new findings.
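As a minimal sketch of such signature-based validation in single-cell data, the example below scores every cell in a new data set with a published gene signature using Scanpy and compares the scores across conditions; the input file, the 'condition' column and the signature genes (an interferon-response example) are placeholders.

```python
# Minimal sketch: score cells in a new data set with a gene signature derived
# from a published study and compare the score across study groups.
import scanpy as sc

adata = sc.read_h5ad("new_study.h5ad")                   # hypothetical input
published_signature = ["ISG15", "IFI6", "MX1", "OAS1"]   # placeholder interferon signature

sc.tl.score_genes(adata, gene_list=published_signature, score_name="signature_score")
sc.pl.violin(adata, keys="signature_score", groupby="condition")
```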

Beyond comparison to existing data, the integration of newly generated data into existing data sets is another option for meta-analysis. This strategy is continuously performed and improved in ongoing projects of the Human Cell Atlas (HCA) consortium54. Whether data integration or validation cohorts will be the major way forward in clinical applications of omics technologies will be determined in the near future. On the basis of our experience in COVID-19 research75,101,102,103, we favor validation cohorts over data integration. While integration can be a powerful way to increase cellular resolution and enable the identification of rare cellular states, it carries the risk of erroneous over-correction and loss of biological signal. The validation-cohort approach accepts the limitations of the individual data sets but ensures the reproducibility of the observations. Both approaches have their merits and areas of application.

Validation cohorts and functional experiments

Research during the COVID-19 pandemic has taught us many principles concerning the use of omics, and particularly of single-cell multi-omics technologies. During the discovery phase, for example when studying an unknown disease, single-cell multi-omics technologies provide a comprehensive overview of systemic and local changes in the molecular phenotypes of affected tissues as well as of the complete immune compartment. Well-defined experimental settings for clinical studies, including independent validation cohorts in combination with functional immunological validation experiments, such as flow cytometry or functional assessments of individual cell types, and the potential use of animal models, can lead to the discovery of cellular alterations and molecular pathways with relevance to disease severity and trajectory and to subgroup-stratified prediction of response to potential drugs83.

Altogether, we suggest performing experiments that address molecular mechanisms and reusing existing data as the last part of the systems-immunology cycle, both to validate the findings of a study and to formulate subsequent hypotheses.

Conclusion

This guide is designed to provide immunologists with an entry point for using single-cell and bulk (multi-)omics technologies as a way toward a better understanding of the complex cellular and molecular interactions that operate within the immune system. Applications range from the comprehensive characterization of immune homeostasis and immune variation in whole populations, the molecular definition of cell types and states and the characterization of the dynamics of immune responses, locally and systemically, to the interaction of the immune system within organ systems. Increasingly sophisticated computational algorithms, combined with perturbation experiments and ever larger data sets, enable the identification of causal relationships within data104,105,106. Furthermore, the substantially increased quality of multi-omics technologies and the high potential to standardize these technologies have already led to their application in clinical settings, paving the way towards precision medicine107. Whether it is to decipher molecular and functional mechanisms in a new disease, such as COVID-19 (refs. 75,101,102,103), to identify therapeutic targets, to monitor therapeutic responses108 or to guide outcome prediction109, single-cell and bulk multi-omics technologies are suited to capture the immune system's complexity in action. As the omics community is well prepared for large-scale international collaborations, it is foreseeable that large scientific collaborative networks will build on the newest developments in experimental techniques as well as data-analysis approaches, which can even include specialized machine-learning concepts that preserve data ownership and privacy110. We expect an enormous acceleration in knowledge gain once insights from multi-omics data can be used across many laboratories and institutions worldwide, without the need to share primary data.