Introduction

Tuberculosis (TB) is the leading cause of death from a single infectious agent, causing over 1 million deaths per year1. Despite renewed efforts to combat the TB epidemic, the current decline in TB incidence of 1.5% per year has fallen far short of the 4–5% annual decline needed to meet the 2020 goals of the World Health Organization’s (WHO) End TB Strategy2. While approximately one-quarter (1.7 billion) of the world’s population has been infected with its causative agent Mycobacterium tuberculosis (M.tb), only 5 to 10% of infected individuals will develop active TB disease during their lifetime, with the remainder controlling the infection in a state known as latent TB infection (LTBI)3,4. Recent global workshops have re-emphasized that targeting TB transmission is critical to accelerating efforts to reduce the burden of TB disease throughout the world5,6. Two critical areas for understanding and preventing TB transmission are knowing where and when transmission occurs, and preventing infected individuals from progressing to active TB disease and thereafter transmitting the bacteria via the airborne route7,8.

Historically, successful control of TB at the national level has followed from reducing transmission to very low levels7,9. Studies of close contacts, and in particular household contacts, of active TB cases are a critical tool for identifying new active TB cases arising from recent transmission and for targeting therapy to prevent both subsequent disease and onward transmission. However, in high-incidence countries, where most of the burden of disease resides, more than 80% of TB transmission occurs outside of the home10,11. Genotyping M.tb isolates from active TB cases coupled with comparative genomic analysis has permitted population-level identification of hotspots of localized transmission, but these data are mostly available retrospectively and thus do not allow real-time monitoring of TB transmission in a community, particularly in areas of high incidence12. It thus remains unknown whether TB transmission can be appreciably disrupted with current methods in high-incidence settings. This is in contrast to low-incidence settings, where both contact studies and targeting of specific higher-incidence communities have been effective13.

Recent infection is the single strongest clinical risk factor for developing active TB disease in immunocompetent persons, who comprise the vast majority of LTBI and active TB cases14,15,16,17,18,19. However, time since exposure or infection is very difficult to ascertain in the clinical setting, and its estimate is often unreliable20. Moreover, there are no validated biomarkers of recent exposure or infection beyond conversion on a tuberculin skin test (TST) or IFN-\(\gamma\) release assay (IGRA), which requires longitudinal sampling. At the same time, treating all LTBI + individuals in areas of high TB incidence to prevent the development of active TB is not feasible and would entail unnecessary risk to the vast majority of LTBI + individuals who will never develop disease. Mass LTBI treatment also poses a theoretical risk of promoting drug-resistant M.tb strains. Prospective gene expression-based (RNA) signatures of the risk of developing active TB disease have recently been identified for LTBI + adolescents and adult healthy household contacts (HHCs)21,22,23. While the positive predictive value of these RNA signatures of active TB risk is higher than that of TST/IGRA, it is still far from ideal: to prevent one case of active TB, ~37–64 LTBI + people not at risk need to be treated (vs. ~85 for TST/IGRA)21,22,23,24. It is currently unknown whether these RNA signatures correlate with time since infection. Importantly, their positive predictive value for TB progression and the number needed to treat could be dramatically improved if combined with accurate knowledge of time since infection in the same individual.

Building on this prior work, we assess RNA expression as a potential biomarker of recent exposure to or infection with M.tb. Using our murine data and recently published studies in cynomolgus macaques and humans21,22,25,26, we show for the first time that RNA expression predicts recent infection/exposure in all three species. Moreover, in both macaques and humans, these RNA signatures of recent infection/exposure are independent of the recently identified signatures of individual prospective TB disease risk. Our work supports the need for future longitudinal studies of recent TB contacts to determine whether biomarkers of recent infection can provide prognostic information on TB disease risk in individuals and help map recent transmission in communities.

Results

Blood genome-wide RNA expression accurately discriminates early vs. late M.tb infection time periods in C57BL/6 mice

While several published studies have made genome-scale measurements of the in vivo host response to M.tb at several time points in mice27,28,29, none has addressed whether such measurements can predict the time point of infection. To determine whether it is possible to predict time since M.tb infection in mice via a blood RNA signature, we measured genome-wide RNA expression in whole blood of C57BL/6 mice following low-dose aerosol M.tb infection. Mouse cohorts were sacrificed every month post-infection for 5 months (n = 4 per time point) along with age-matched uninfected C57BL/6 mice (n = 1–2 per time point). While M.tb colony forming units (CFUs) were not measured, it is well characterized that in this mouse strain lung bacterial burden increases exponentially from the day of M.tb infection until the peak of the adaptive immune response in the lungs at 1 month post-infection, thereafter remaining stable for approximately 300 days30,31,32. Thus, lung CFUs do not predict time since infection in this model after one month post-infection.

Principal component analysis (PCA) of our whole dataset revealed that the blood transcriptional state during the first five months of M.tb infection was distinct from that of uninfected mice, with uninfected and infected mice being entirely separable along the first principal component (21.8% of data variance; Fig. 1A). When we performed PCA on only M.tb infected mice, we found that early (30–60 days) and late (90–150 days) time periods were transcriptionally distinct, being separable along the first and second principal components (18.8% and 16.5% of data variance, respectively; Fig. 1B). Only 1 mouse from the 60 day time point clustered with the late time period along the second principal component.
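As a minimal sketch of this step in R (assuming a log2-normalized expression matrix `expr` with samples in rows and probes in columns, and a sample annotation factor `group`; these object names are illustrative, not the original analysis script):

```r
# PCA of whole-blood expression, plotting the first two principal components
library(ggplot2)

pca <- prcomp(expr, center = TRUE, scale. = FALSE)
var_explained <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

pca_df <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2], group = group)
ggplot(pca_df, aes(PC1, PC2, colour = group)) +
  geom_point(size = 3) +
  labs(x = paste0("PC1 (", var_explained[1], "% of variance)"),
       y = paste0("PC2 (", var_explained[2], "% of variance)"))
```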

Figure 1

Blood genome-wide RNA expression discriminates early versus late M.tb infection time periods in C57BL/6 mice. Principal component analysis of genome-wide RNA expression measured via microarray in (A) all mice (n = 6 uninfected mice, n = 20 M.tb infected mice) stratified by infection status and (B) only M.tb infected mice stratified by time period post-infection. (C) ROC curve for out-of-bag performance of the Random Forest Classifier predicting time period post-infection (1–2 months versus 3–5 months; P from Wilcoxon test, 95% confidence interval shown). (D) Random Forest Regression out-of-bag predictions of monthly time point post-infection. Fit curve calculated via the Loess method with 95% CI shown.

To find a predictive RNA signature of time since M.tb infection, we used the Random Forest Classifier algorithm, without hyperparameter tuning due to the low sample size, to predict early (30–60 days) versus late (90–150 days) infection time period. This analysis treated time as a binary variable. Using out-of-bag predictions (approximately threefold cross-validation) to obtain an unbiased estimate of predictive performance, we found that we could predict early versus late infection time period with an area under the curve (AUC) of 0.99 (95% CI 0.96–1.00, P = 1.6 × 10⁻⁵; 87.5% sensitivity, 91.7% specificity for early infection; Fig. 1C). To assess whether each month post-infection could be predicted accurately, thus treating time as a continuous variable, we performed Random Forest Regression with threefold cross-validation and confirmed that days 30 and 60 were predicted to be earlier time points than days 90–150 (Fig. 1D). Days 90–150 were not resolved. Low group size precludes confident quantification of the degree to which days 30 and 60 can be separated. Probes used in these models, as well as their feature importance for the regression model, are shown in Table S1. We further assessed our classifier model’s predictions on uninfected mice, which showed their similarity to the late (90–150 days) time points (Figure S1). Taken together, these data indicate that we can broadly discriminate early and late M.tb infection in this cohort of C57BL/6 mice based on the whole blood transcriptomic response. Treating time as a binary variable (Fig. 1C) or as a continuous variable (Fig. 1D) gave similar results.
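A minimal sketch of these Random Forest analyses, assuming an expression matrix `expr_infected` for the infected mice and a numeric vector `day_post_infection` (illustrative object names; default hyperparameters, as stated above):

```r
# Out-of-bag (OOB) Random Forest classification and regression on time since infection
library(randomForest)
library(pROC)

set.seed(1)
period <- factor(ifelse(day_post_infection <= 60, "early", "late"))

# Classification: early vs. late time period, default hyperparameters
rf_clf <- randomForest(x = expr_infected, y = period)
oob_prob_early <- rf_clf$votes[, "early"]      # OOB vote fractions per sample
roc(period, oob_prob_early, ci = TRUE)         # AUC with 95% CI

# Regression: time as a continuous variable (days post-infection)
rf_reg <- randomForest(x = expr_infected, y = day_post_infection)
oob_day <- rf_reg$predicted                    # OOB regression predictions
```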

Blood RNA signature discriminates early versus late M.tb infection time periods in cynomolgus macaques

While inbred mice are a suitable model for studying molecular components of the immune response to M.tb, they do not replicate the variable clinical outcomes of M.tb infection in humans. Cynomolgus macaques, an outbred non-human primate model for TB, do exhibit heterogeneity in clinical outcomes, with approximately half of macaques progressing to symptomatic active TB disease that can be verified radiologically and bacteriologically within the first 6 months of infection, and the remainder controlling the infection in a latent state33,34. The lung pathology of M.tb infection in cynomolgus macaques also recapitulates several features of human lung pathology better than that of mice34.

To determine whether our findings in the murine model translated to the more human-like cynomolgus macaque model of M.tb infection, we mined publicly available data from a longitudinal study of M.tb infection in macaques25. In that study, cynomolgus macaques were infected with a low dose of M.tb in the lung, and their blood was sampled at 11 time points post-infection and 2 time points pre-infection for genome-wide RNA expression analysis. Importantly, while the study’s authors provided a broad, unsupervised analysis of their data according to time periods of infection, they did not assess our hypothesis that blood genome-wide RNA expression predicts time period or time point post-infection25. To test our hypothesis and allow comparison with our mouse data and recently available human data, we restricted our analysis to 8 time points from 20 days through 180 days (6 months) post-infection. To permit comparison of different computational models and allow a final unbiased estimate of predictive performance, we randomly divided the 38 macaques from this study into a training set and a test set, keeping the ratio of macaques with latent and active TB balanced in both groups (Figure S2).

Using ninefold cross-validation on the training set, we found that Regularized Logistic Regression, a linear method with regularization terms to reduce the number of probes in the signature, was not inferior to several nonlinear classification methods in predicting early (20–56 days) versus late (90–180 days) infection time period (Figure S3). We thus chose Regularized Logistic Regression to find a predictive RNA signature of time since M.tb infection in cynomolgus macaques. We found that this model predicted early (20–56 days) versus late (90–180 days) infection time period with an AUC of 0.78 in the training set (95% CI 0.72–0.85, P = 5.6 × 10⁻¹³; ninefold cross-validation; Fig. 2A), and an AUC of 0.81 in the test set (95% CI 0.71–0.91, P = 1.6 × 10⁻⁷; Fig. 2A).

Figure 2

Blood RNA signature discriminates early versus late M.tb infection time periods in cynomolgus macaques. (A) ROC curves for Regularized Logistic Regression prediction of time period post-infection (20–56 days versus 90–180 days) from RNA expression in cynomolgus macaques on ninefold cross-validation in the training set (blue curve; n = 107 early time period samples, n = 103 late time period samples) and final model prediction on test set (red curve; n = 44 early time period, n = 40 late time period) (P from Wilcoxon test). (B) Comparison between early (20–56 days) (n = 107 train, n = 44 test) versus predicted early (n = 104 train, n = 50 test) time period samples in proportion of samples from macaques that develop active TB (P from Fisher’s exact test). Regularized Linear Regression predictions of time point post-infection for (C) ninefold cross-validation in the training set (n = 210) and for (D) final model prediction on the test set (n = 84). (E–F) Predictions from models trained and evaluated only on samples from the first 90 days post-infection (n = 134 train, n = 55 test). Boxplots represent medians with interquartile ranges for the predictions at each time point (best fit line shown, P from Pearson test).

Importantly, our model was trained and tested on macaques irrespective of their present or future TB disease status. If our model partially predicted disease status rather than only time period post-infection, the proportion of samples from macaques with active disease would differ between predicted and actual early time period samples. However, we found that there was no difference in the proportion of samples from macaques with active disease in the predicted early time periods relative to the actual early time periods, in both the training and test sets (P = 1.0, P = 1.0 respectively; Fig. 2B). This was also true when focusing on late time period predictions (P = 1.0 training, P = 1.0 test; data not shown).
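A sketch of this comparison of proportions, using a 2 × 2 Fisher’s exact test (the counts below are placeholders for illustration only, not the study’s numbers):

```r
# Compare the proportion of samples from active-TB macaques among actual vs.
# predicted early time-period samples (placeholder counts, not real data)
tab <- matrix(c(20, 24,     # actual early:    active, latent
                19, 25),    # predicted early: active, latent
              nrow = 2, byrow = TRUE,
              dimnames = list(c("actual_early", "predicted_early"),
                              c("active", "latent")))
fisher.test(tab)   # two-sided by default
```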

Next, to assess whether each month post-infection could be predicted in cynomolgus macaques, treating time as a continuous variable, we performed Regularized Linear Regression with ninefold cross-validation on the training set and confirmed that days 20–56 were predicted as earlier time points than days 90–180, in both the training set and in the test set (Fig. 2C–D). As in our murine analysis, days 90–180 were not resolved. Quantitatively, the median absolute error (MAE) of the model was 38.5 days (Pearson’s r = 0.48, P = 1.6 × 10⁻¹³) on the training set and 35.7 days (r = 0.54, P = 1.1 × 10⁻⁷) on the test set. Probes selected and used by the final trained regression model to predict in the test set are shown in Table S2. To assess whether we could predict the specific time point of infection within the first 3 months, as suggested by our murine data, we trained a model on only time points from 20–90 days (Fig. 2E–F). The MAE of this model was 15.8 days on the training set and 14.3 days on the test set (r = 0.52, P = 1.9 × 10⁻¹⁰ and r = 0.46, P = 4.7 × 10⁻⁴, respectively).
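The error metrics above can be computed as in the following sketch (assuming vectors `predicted_day` and `actual_day` for the test-set samples; names are illustrative):

```r
# Median absolute error (in days) and Pearson correlation of regression predictions
mae <- median(abs(predicted_day - actual_day))
cor.test(predicted_day, actual_day, method = "pearson")
```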

We further assessed our early versus late classifier model’s predictions on the pre-infection and 3–10 day time points (Figure S4). This showed that the pre-infection and 3–10 day time points were more similar to the late (90–180 days) time period than to the distinct early (20–56 days) time period (Figure S4).

Taken together, these data indicate that we can broadly discriminate early and late M.tb infection in this cohort of cynomolgus macaques based on the whole blood transcriptomic response, and that we can moderately discriminate between the first two months of infection. These predictions do not depend on disease status, and the accuracy of the predictions is quantitatively lower in cynomolgus macaques than in C57BL/6 mice, as reflected by the AUC analyses. Treating time as a binary variable (Fig. 2A) or as a continuous variable (Fig. 2C–F) gave similar results.

Blood RNA expression of 250 genes predicts time since active TB exposure in humans

We next sought to determine whether our findings in mice and cynomolgus macaques could translate to humans. Whereas the day of infection is known in animal models, the precise time of the exposure resulting in infection is difficult to determine in humans, even in careful clinical studies. One surrogate for time of infection in humans is the time of IGRA or TST conversion in people who were known to be IGRA/TST negative previously. This would synchronize a human study cohort to the time of an initial systemic T cell response to M.tb. To test this hypothesis, we accessed public data from South African adolescents who acquired latent M.tb infection during longitudinal blood sampling every 6 months26. We found that Regularized Logistic Regression was unable to discriminate the first time point of known IGRA conversion from the time point 6 months after first known IGRA conversion (0.54 AUC, 95% CI 0.27–0.82, P = 0.64 on test set; Fig. 3A). Notably, the biological event of actual IGRA conversion in this cohort could have occurred at any time during the 6 months preceding the first IGRA-positive time point. Given our findings in mice and macaques that the RNA signature of time since M.tb infection occurs within a brief window of 2–3 months, we interpret these findings to mean that sampling blood every 6 months in humans is unlikely to yield a cohort in which the actual time of IGRA conversion is sufficiently synchronized to discover an RNA signature of time since IGRA conversion.

Figure 3

Blood RNA expression of 250 genes predicts time since active TB exposure in humans. (A) ROC curves for prediction of time since first known IGRA + (0 vs. 6 months) in South African adolescents who acquire M.tb infection for tenfold cross-validation in the training set (blue curve; n = 17 0 month samples, n = 21 6 month samples) and final model prediction on the test set (red curve; n = 10 0 month, n = 9 6 month) using Regularized Logistic Regression. (B) ROC curves for Regularized Logistic Regression prediction of time since active TB exposure (baseline vs. 6 months post-enrollment) in GC6-74 Gambia and Ethiopia test set (n = 37 baseline samples, n = 31 6 months samples) using expression of genes from published signatures that predict prospective risk of active TB. (C) ROC curves for Regularized Logistic Regression prediction of time since active TB exposure for tenfold cross-validation on the Gambia and Ethiopia training set (blue curve; n = 67 baseline, n = 48 6 months) and for final model prediction (contains 250 genes) on the Gambia and Ethiopia test set (red curve; n = 37 baseline, n = 31 6 months). (D) ROC curves for prediction of prospective risk of TB for tenfold cross-validation on the Gambia and Ethiopia training set (blue curve; n = 67 baseline, n = 48 6 months) and for final model prediction on the test set (red curve; n = 37 baseline, n = 31 6 months) using the 250-gene set that predicted time since active TB exposure. P values for all ROC curves are from Wilcoxon test, and 95% confidence intervals are shown.

Another study design that could identify RNA correlates of recent infection in humans is a household contact study, wherein healthy contacts of active TB cases are enrolled within a certain time from the date of diagnosis of the active TB case and sampled longitudinally. Important limitations of this design that could reduce the power to detect RNA correlates of recent infection are that the precise time of infection is not known and that individuals who are IGRA + at enrollment may have been infected either from the present exposure or in the more distant past. Cognizant of these limitations, we accessed publicly available data from the Grand Challenges 6-74 (GC6-74) study of healthy household contacts (HHCs) of patients with active pulmonary TB22. HHCs in this cohort were enrolled within 2 months of the diagnosis of the active TB index case and had blood samples drawn at baseline, 6 months and/or 18 months post-enrollment22. Because our mouse and macaque analyses suggest that blood transcriptional changes are most prominent in an early 3 month window post-infection, we focused our first analysis on the baseline and 6 month time points. This included data from the Gambian and Ethiopian cohorts but excluded data from the South African cohort because 6 month time points were not available for South Africa22. We used the same training/test split as the authors for the Gambian cohort but randomly split the Ethiopian cohort 50/50 between our training and test sets. Importantly, with this training/test split and our data pre-processing, we could predict risk of TB with 0.72 AUC (95% CI 0.60–0.83, P = 1.6 × 10⁻⁴; data not shown) in the training set by tenfold cross-validation and 0.70 AUC (95% CI 0.53–0.88, P = 0.0071; data not shown) in the test set using Regularized Logistic Regression. From the GC6-74 and Adolescent Cohort Study (ACS) signatures, we used together for this analysis the RISK4 genes (BLK, CD1C, GAS6 and SEPT4), the post hoc selected genes C1QC, TRAV27, ANKRD22 and OSBPL10, and the 16 correlate of risk (COR) predictive genes21,22. When we used these same genes to train a model to predict time since TB exposure, we obtained no predictive performance, whether the model was trained with these gene sets separately or together (P > 0.05 for all test set predictions; Fig. 3B). This suggests that genes selected for optimal prediction of prospective TB risk do not change across these two time points post-exposure.

To find a predictive RNA signature of time since TB exposure in these data, we used the Wilcoxon test on the training set, as the study authors did for TB risk prediction, to select transcripts that differed in expression between the baseline and 6 month time points22. Using Regularized Logistic Regression, we found that these genes predicted baseline versus 6 month time points with 0.90 AUC (95% CI 0.84–0.96, P = 1.9 × 10⁻¹³; tenfold cross-validation; Fig. 3C) in the training set and 0.69 AUC (95% CI 0.56–0.81, P = 0.0039; Fig. 3C) in the test set. We further used the final genes selected by the model on the training set (250 genes, Table S3) to train a model to predict risk of TB. As expected, these genes exhibited no direct predictive performance for risk of TB on the training or test sets (P = 0.85, P = 0.07, respectively; Fig. 3D). In summary, our findings with the household contact study design in humans parallel the results in macaques in that we can predict broad time period post-exposure via the whole blood transcriptomic response. Moreover, this transcriptomic signature of time period post-exposure to an active TB case is independent of the transcriptomic signature of risk of TB recently identified in the GC6-74 and ACS studies21,22.

Time since TB exposure in humans is associated with alteration in CD4 + T cell proportion and immune activation pathways

Cell-type deconvolution algorithms have recently been used with genome-wide RNA expression data to help identify changes in immune cell proportions in the blood that are associated with TB disease, prospective TB disease risk and treatment success26,35. To identify immune cell populations that are associated with time since TB exposure in the GC6-74 study, we used the leukocyte expression signature matrix ‘immunoStates’ and linear regression to infer leukocyte proportions for each subject’s sample35. We found that the proportion of CD4 + α/β T cells was increased at the 6 month versus baseline time point in the Gambian and Ethiopian cohorts (P = 0.0079; linear mixed model; Fig. 4A), but was not significantly changed at 18 months (P = 0.20 vs. baseline; linear mixed model, including the South African cohort; Fig. 4B). We saw no significant differences in NK cell proportion over time in the Gambian and Ethiopian cohorts (Fig. 4C–D). Likewise, no other cell types estimated by the ‘immunoStates’ signature matrix showed significant differences over time in these cohorts (P > 0.05, linear mixed model, data not shown). This result for CD4 + α/β T cells and NK cells is consistent with the conclusion that the RNA signature of time since TB exposure is independent from the RNA signature of prospective TB risk, since both T cells and NK cells are known to decrease in circulation in active TB disease26,36.

Figure 4

Time since TB exposure in humans is associated with alteration in CD4 + T cell proportion and immune activation pathways. Changes in CD4 + T cell percentages (A,B) and NK cell percentages (C,D) in GC6-74 healthy household contacts cohort at baseline (n = 104 in A,C; n = 272 in B,D), 6 month (A,C; n = 79) and 18 month (B,D; n = 64) time points after active TB exposure were determined by cell-type deconvolution (P from linear mixed model). Boxplots represent medians with interquartile ranges. (E) Top immunity related enriched canonical pathways in the 250-gene RNA signature of time since exposure to active TB index case (6 months vs. baseline) by IPA (P from Fisher’s Exact test).

Our RNA signature of baseline versus 6 month time points post-exposure included 250 genes selected by Regularized Logistic Regression (Table S3). We utilized Ingenuity Pathway Analysis (IPA) to identify pathways associated with these genes. The majority of enriched canonical pathways (-log(p value) > 2) were associated with immune cell signaling, including B cells (B cell receptor and PI3K signaling), T cells (T cell receptor, PKC\(\theta\), regulation of IL-2 expression, 4-1BB and CD28 signaling), cytokines (IL-6, IL-15, IL-12, TNF, IL-8, IL-10 and IL-17A), innate immune cells (dendritic cell maturation and LPS-stimulated MAPK signaling) and humoral immunity (Fc Epsilon RI Signaling) (Fig. 4E, Table S4). Other enriched canonical pathways were related to cellular injury and toxicity (apoptosis), metabolism, nervous system signaling, PPAR signaling, cell cycle regulation and intracellular & second messenger signaling (Table S4). Considering the overall direction of change in the immune pathways between the 6 month and baseline time points, the upregulation of several pro-inflammatory signaling pathways (IL-6, IL-8, FLT3 signaling, PI3K signaling in B Lymphocytes and Dendritic Cell maturation) and the decrease in anti-inflammatory signaling (PPAR signaling) suggest that an increase in peripheral blood immune activation occurs at the 6 month time point after exposure (Fig. 4E).

To compare transcriptional modules altered in humans to those altered in mice and macaques, we used the recently defined disco score to identify concordantly and discordantly altered modules between these species37. Several modules related to T cells and NK cells were enriched (adjusted P < 0.05) in each pairwise comparison between two species (human vs. mouse, human vs. monkey, and monkey vs. mouse) (Fig. 5). Several B cell-related modules were uniquely concordantly regulated between macaques and mice (Fig. 5).

Figure 5

Enriched transcriptional modules are concordantly or discordantly regulated during recent M.tb exposure or infection between mice, macaques or humans by disco analysis. P from CERNO statistical test.

Application of reduced 6-gene expression signature of time since active TB exposure to adolescent M.tb infection acquisition cohort confirms its identification of recent infection in humans

Implementation of our newly discovered RNA signature of time since active TB exposure using qRT-PCR would require a more parsimonious gene set than the 250 genes described above. To find a reduced gene signature, we ran a forward search using the MetaIntegrator R package38. This method identified 6 genes, RP11-552F3.12, PYURF, TRIM7, TUBGCP4, ZNF608 and BEAN1, that recapitulated the performance of the 250 gene signature for baseline versus 6 month time point discrimination, with 0.86 AUC (95% CI 0.80–0.93, P = 1.7 × 10⁻¹¹; Fig. 6A) in our GC6-74 training set and 0.68 AUC (95% CI 0.55–0.81, P = 0.0055; Fig. 6A) in the test set. Independent validation of this signature requires a cohort wherein recent M.tb infection is documented and time points are available to test whether the signature allows discrimination between recent and more remote infection. While the cohort of South African adolescents who acquired latent M.tb infection did not permit discovery of an RNA signature of recent M.tb infection, we reasoned that the whole cohort would be powered for validation of our signature discovered in the GC6-74 household contact study design26. Three genes from our 6-gene signature, TRIM7, ZNF608 and TUBGCP4, were represented by detected probes in the microarray used in this study. These 3 genes discriminated the first time point of known IGRA conversion from all pre-conversion time points (6 months and 12 months prior to known conversion) with 0.72 AUC (95% CI 0.58–0.87, P = 0.0030; Fig. 6B). These 3 genes likewise discriminated the first time point of known IGRA conversion from all other sampled time points (6 and 12 months prior to conversion and 6 and 12 months after known conversion) with 0.68 AUC (95% CI 0.56–0.81, P = 0.0039; Fig. 6B). Figure S5 shows the trajectory of the 3 gene score over time, which was highest at the first time point of known IGRA conversion.

Figure 6

Application of reduced 6-gene signature of time since active TB exposure to adolescent M.tb infection acquisition cohort confirms its identification of recent infection in humans. (A) ROC curves for 6-gene score prediction of time since active TB exposure in the Gambia and Ethiopia training set (blue curve; n = 67 baseline samples, n = 48 6 months samples) and for the Gambia and Ethiopia test set (red curve; n = 37 baseline, n = 31 6 months). (B) ROC curves for discrimination between time of first known IGRA + and all pre-conversion time points (blue curve; n = 27 0 month, n = 24 pre-conversion) and between time of first known IGRA + and all other time points (green curve; n = 27 0 month, n = 24 pre-conversion and n = 31 6 or 12 months after known conversion) in South African adolescents who acquire M.tb infection using 3-gene score from genes detected in microarray data. (C) ROC curves for prediction of prospective risk of TB using highest 6-gene score observed per individual in the ACS cohort (n = 74 nonprogressors, n = 31 progressors), GC6-74 Gambia and Ethiopia test set (n = 49 nonprogressors, n = 11 progressors) and GC6-74 South Africa cohort (n = 141 nonprogressors, n = 39 progressors). (D) ROC curves for discrimination of early and late time periods post-infection in mice (blue curve; n = 8 early mice, n = 12 late mice) and macaques (green curve; n = 151 early samples, n = 143 late samples) using genes from the 6-gene signature that were detected in the respective microarrays. P values for all ROC curves are from Wilcoxon test, and 95% confidence intervals are shown.

Given that recent infection is the single strongest clinical risk factor for developing TB disease in immunocompetent persons, the finding that time since exposure and risk of TB, as predicted by the blood transcriptomic response, are independent in the GC6-74 study of healthy household contacts suggests that these signatures could potentially be combined to better predict risk of TB when the time of exposure is unknown17,19,20. While the GC6-74 study was not powered for this particular secondary analysis, we assessed whether the highest 6-gene score during longitudinal sampling allowed discrimination of subjects who did or did not progress to active TB disease during study follow-up. The highest 6-gene score did not discriminate progressors from non-progressors in the test set, whether the subjects were from South Africa (AUC 0.51, 95% CI 0.40–0.63, P = 0.60; Fig. 6C) or from The Gambia or Ethiopia (AUC 0.63, 95% CI 0.46–0.80, P = 0.095; Fig. 6C). The same result was observed in the ACS cohort of IGRA + adolescents with unknown exposure history (AUC 0.55, 95% CI 0.43–0.66, P = 0.78; Fig. 6C).

We additionally tested whether the 6-gene signature discriminated early from late time periods post-infection, as defined in our analysis of animal models of M.tb infection. The 5 genes represented by detected probes of homologous genes in the mouse microarray data discriminated the early (30–60 days) versus late (90–150 days) infection time period with 0.80 AUC (95% CI 0.60–1.00, P = 0.012; Fig. 6D). Only 1 gene was represented by a detected probe in the macaque microarray data, and it alone did not discriminate the early versus late infection time period (0.52 AUC, 95% CI 0.45–0.58, P = 0.70; Fig. 6D). Of note, the macaque study utilized a human microarray platform, which may have reduced the detectability of the macaque homologues of these human genes25. Because of the different measurement platforms between the animal and human data (microarray vs. RNA-seq), we could not directly apply our human 250 gene signature to the animal data.

Discussion

Early clinical studies in the pre-antibiotic era in the relatively isolated Faroe Islands shed light on the clinical features of primary infection with M.tb in humans, which often include fever, elevated erythrocyte sedimentation rate, X-ray abnormalities and, less often, erythema nodosum39,40. With time of exposure to an active TB case pinpointed within a two week period, and sometimes to a single day, Poulsen determined that these clinical features accompany and/or follow TST conversion, which occurs within 6 weeks of exposure14,39,40. While these clinical features of initial M.tb infection are transient and not specific to M.tb infection, a method to determine that a person is currently in the first 1–2 years post initial infection would have great prognostic value for near-future TB disease and could allow real-time geospatial mapping of recent TB transmission in communities14,15,17,39,41,42,43.

In our proof-of-concept analysis, we sought to determine whether it is possible to develop an RNA-based blood test to detect recent exposure to or infection with M.tb. For TB disease risk prediction, we hypothesized that such a test could complement the recently developed RNA signatures of TB disease risk that are based on detecting incipient TB, which is the asymptomatic phase of early TB disease during which pathology progresses gradually before full-blown clinical TB21,22,44,45. Because RNA signatures of incipient TB are only sensitive in the 1–3 months preceding symptomatic TB diagnosis, their current proposed use involves serially screening infected individuals on a regular basis45. We envision that a blood test for recent infection could be used concurrently with signatures of incipient TB at initial testing to aid in the decision to start treatment or to rule out further screening in individuals who are initially negative for incipient TB. Using our mouse data and published macaque data, we have demonstrated a highly accurate RNA signature of recent infection with M.tb (1–2 vs. 3–5/6 months post-infection). Using the GC6-74 cohort of HHCs of patients with active pulmonary TB, we discovered 250-gene and 6-gene human RNA signatures of recent exposure (0–2 vs. 6–8 months post-diagnosis of index case) that validated within a held-out test set22. Using an independent cohort of adolescents who acquired M.tb infection during 6-month longitudinal sampling, we demonstrated that the 6-gene signature could discriminate the first known time point of IGRA conversion from pre-conversion time points and from 6–12 months later with modest accuracy (0.68 AUC)26. However, this 6-gene signature was unable to provide prognostic information of TB risk in the GC6-74 cohort or the ACS cohort. The incomplete time point sampling of most individuals and the 6-month sampling interval likely reduced the power to find an association between our 6-gene signature score and TB risk in these two studies. Nevertheless, we believe the sampling constraints and target populations of these studies (adult HHCs and adolescents with LTBI of unknown exposure history, both in highly endemic areas) are mostly in line with what may be feasible for applying transcriptional signatures of TB risk for targeted treatment to reduce TB incidence45.
Given that early blood transcriptional changes occurred within a short 3 month window in our mouse and macaque analyses, and the human data analyzed are not inconsistent with this brief timeline, we believe that blood RNA signatures for recent M.tb infection are too brief in duration to yield a useful biomarker to improve prediction of TB risk for targeted preventive therapy. Nevertheless, the finding that all three species exhibit an early blood transcriptional response to M.tb infection is highly novel and has implications for further research into common and different early immune responses to M.tb among these species.

Because the vast majority of TB disease burden can be accounted for epidemiologically by recent infection (within the past 1–2 years), we hypothesize that, on average, the factors influencing progression to disease have resolved by 2 years post initial infection in humans3,14. Therefore, we hypothesize that a biological correlate of recent infection that persists for the longest duration within the window when the outcome of early disease progression has not yet been resolved would have the highest chance of being useful as a complement to tests for incipient TB in predicting TB risk. Importantly, most or all biological correlates of recent infection will not play a causal role in determining who will or will not progress to TB disease, because most infected individuals never develop TB disease. This is supported by the fact that our RNA signature of recent infection was independent of known RNA signatures of TB risk. Our RNA signature of recent infection is likely characteristic of a very early phase of infection when many factors defining the outcome of infection have yet to resolve. A biosignature of recent M.tb infection that is useful in predicting individual TB disease risk will derive its utility not from showing who among recently infected individuals will progress to TB disease, but rather from discriminating between recently infected individuals who are still at some risk of TB disease and remotely infected individuals for whom the factors driving disease progression have resolved. The longer the duration of a biological correlate of recent infection, the higher its sensitivity will be for identifying recently infected individuals.

Our estimated cell type and pathway analyses suggest that both cellular and molecular signatures of immune activation are associated with recent exposure and could be interrogated by other modalities such as epigenetics. Immune cell differences between recently acquired and remotely acquired infection have been reported by others in single cohorts without longitudinal sampling20,46. The high enrichment of B cell signaling in our signature is interesting; a recent case-control study in a single cohort showed that several IgG and IgA antibodies to M.tb antigens strongly discriminated (AUC > 0.90) active TB contacts who converted on TST from non-converters, both at first known conversion and 3 months prior47.

Our analyses and these considerations suggest that sampling IGRA-negative, untreated HHCs every month (or more frequently) for one to two years, starting as soon as possible after the diagnosis of their respective index case, and determining IGRA conversion events would allow for the discovery of biosignatures of recent M.tb infection that could be useful for helping predict TB disease risk. Follow-up in such a cohort for TB progression would allow better assessment of how signatures of recent infection and signatures of incipient TB could be combined to improve TB risk prediction. The addition of chest X-rays with deep machine learning analysis could be useful to discover heretofore unknown, specific radiogenomic features of recent infection or incipient TB48,49. After IGRA conversion, staggered sampling at different times could reduce the study’s burden on individual subjects and allow more precise estimation of the duration of any biomarker. Most follow-up in such a study would have to be performed on those who decline preventive treatment, as treatment would need to be offered because recent infection would be precisely documented. Another potential benefit of such a study is that validated biomarkers that associate strongly with TST/IGRA conversion but precede conversion, such as the currently unvalidated IgG and IgA markers, could be used to identify M.tb infection before TST/IGRA conversion and thus reduce the burden of follow-up of recent contacts in TB control programs and potentially help reduce LTBI treatment time47.

If deployed in population screening efforts, a test for recent M.tb infection could also allow real-time geospatial mapping of recent TB transmission in communities. This could greatly help the application of current control methods to reduce TB transmission and disease in high incidence settings. While it is possible that our current 6-gene signature of recent M.tb infection could be evaluated in the future for this purpose, we think it would be more prudent to first find biomarkers of recent M.tb infection that have a longer duration and are useful for individual TB risk prediction. Being predictive of individual TB risk is more difficult to achieve than correlation with recent infection, as we have shown. Both applications are related, because the better a test for recent infection associates with disease progression in an otherwise unselected population of previously infected individuals, the more sensitive it will be for contact tracing and transmission studies. Nevertheless, biomarkers of varying duration could be jointly useful for the application of mapping recent transmission.

Our results in mice, macaques and humans, together with recent literature, suggest that future longitudinal studies of HHCs may be successful at identifying more accurate biomarkers of time since M.tb infection in humans. Our study represents one of only a handful of studies since Poulsen’s early work showing that there are biological events in the early human response to M.tb infection that can be reproducibly measured39. Future biomarker studies may enable the study of early events of infection in humans both routinely and ethically and permit the identification of immunological or other biological events that determine whether an exposed person will develop TB disease or control the infection5,50,51,52. This could greatly aid vaccine development for TB as no correlates of protection for TB are yet known53. We also expect that more accurate biomarkers of time since M.tb infection will be excellent tools to help better understand the human phenotypes of IGRA reversion and persistent resistance to IGRA conversion51,54.

Our current analysis has some limitations. Because most transmission occurs outside the household contact setting, many individuals in the GC6-74 study were TST + at enrollment (~ 51.4% in Ethiopia, ~ 36.3% in The Gambia), and follow-up TSTs in this study were incomplete, it is highly likely that many, and possibly the majority, of contacts in this study were not infected or re-infected by their index TB case10,11,55. However, the 6-gene RNA signature discovered in this cohort was validated in adolescents in whom recent M.tb infection was documented via IGRA conversion in 100% of study participants26. Finally, our current analysis excluded individuals with HIV co-infection.

Methods

Study design

The objective of this study was to identify blood RNA correlates of time since M.tb infection or exposure. We first infected mice with M.tb via the aerosol route and measured genome-wide RNA expression at pre-specified time points. Unsupervised analysis revealed potential discrimination between mice sacrificed at early time points (1–2 months) vs. late time points (3–5 months). Cross-validation without hyperparameter tuning identified an unbiased RNA signature that accurately predicted early vs. late time period post-infection. We then retrospectively mined publicly available data from a prospective M.tb infected cynomolgus macaque cohort and a prospective healthy household contact human cohort to identify RNA signatures that predicted these same time periods post-infection. The human RNA signature was validated in an independent cohort, adolescents who were recently infected with M.tb during longitudinal sampling.

Mice

Specific pathogen-free, 6–12 week old, female C57BL/6 wild-type mice (The Jackson Laboratory, Bar Harbor, ME) were maintained in ventilated cages inside a biosafety level 3 (BSL3) facility and provided with sterile food and water ad libitum. All protocols were approved by The Ohio State University’s Institutional Laboratory Animal Care and Use Committee. Mouse experiments were performed in accordance with the U.S. National Institutes of Health Guide for the Care and Use of Laboratory Animals.

Mouse aerosol infection and blood collection

M.tb Erdman (ATCC no. 35801) was obtained from the American Type Culture Collection. Stocks were grown according to published methods56. Mice were infected with M.tb Erdman using an inhalation exposure system (Glas-Col) calibrated to deliver 50 to 100 CFUs to the lungs of each mouse, as previously described56,57. At specific time points post-M.tb infection, infected and age-matched uninfected mice were sacrificed and blood collected (400 µL) from the heart into 1.2 mL Tempus reagent and stored at − 80 °C. No formal randomization was employed for choosing cages of mice to be sacrificed at each time point. For the M.tb infected mice, sample size per time point was determined by using the number we routinely use for well-powered molecular and immunological studies in inbred mice. No blinding was performed for the mouse study.

RNA processing and microarray hybridization

Whole blood RNA was processed, quantified using a NanoDrop 1000 Spectrophotometer (NanoDrop Technologies), and RNA integrity number (RIN) was determined with a 2100 Bioanalyzer (Agilent). Samples with RIN ≥ 6.5 were submitted for hybridization onto Illumina Mouse WG 6-V2 BeadChips and scanned on an Illumina BeadStation system. Microarray data are available in the Gene Expression Omnibus (GEO) database under accession number GSE124688.

Microarray data pre-processing

For our murine data, Illumina BeadStudio/GenomeStudio software was used to subtract background and scale the average signal intensity for each sample to the global median average intensity across all samples. Probes with a detection P value ≤ 0.01 in at least 10% of mice were retained for analysis. Thereafter, R scripts were used to quantile normalize the data, set all values < 10 to 10 and log2 transform the data. Probes were further filtered to those with at least a two-fold change in expression from the median in at least 10% of samples. For the macaque data (GSE84152), microarray data pre-processing was performed as previously described25. The data from the human adolescent cohort of IGRA converters (GSE116014) were pre-processed identically to the macaque data, except that data were quantile normalized and no batch correction was performed. When these adolescent data were used to validate the 6-gene signature, the data were downloaded at the gene level using the R MetaIntegrator package before additional pre-processing38.
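A minimal sketch of the R portion of this pre-processing (assuming a background-subtracted, detection-filtered intensity matrix `raw` with probes in rows and samples in columns; object names are illustrative):

```r
# Quantile normalization, flooring at 10, log2 transform, and fold-change filter
library(preprocessCore)

norm <- normalize.quantiles(as.matrix(raw))          # quantile normalization
dimnames(norm) <- dimnames(raw)
norm[norm < 10] <- 10                                # floor low intensities at 10
log_expr <- log2(norm)

# keep probes with >= 2-fold change from their median in >= 10% of samples
fc_from_median <- abs(sweep(log_expr, 1, apply(log_expr, 1, median)))
keep <- rowMeans(fc_from_median >= 1) >= 0.10
log_expr <- log_expr[keep, ]
```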

RNA-seq data pre-processing

Human data from the Grand Challenges 6–74 (GC6-74) cohort were downloaded at the gene count level from GEO (GSE94438). Genes with read count ≤ 5 in 50% of samples were excluded. Data were quantile normalized and log2 transformed. To facilitate comparisons with a common RNA-seq alignment pipeline, gene counts were obtained from the ARCHS4 resource when comparing data from the Adolescent Cohort Study (GSE79362) and GC6-74 cohorts using the 6-gene signature58. These data were otherwise processed identically.
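A sketch of the corresponding RNA-seq pre-processing steps (assuming an integer count matrix `counts` with genes in rows and samples in columns; the pseudocount for the log2 transform is an assumption for illustration):

```r
# Gene filtering, quantile normalization and log2 transform of RNA-seq counts
library(preprocessCore)

keep <- rowSums(counts <= 5) < 0.5 * ncol(counts)    # exclude genes with count <= 5 in 50% of samples
filtered <- counts[keep, ]

norm <- normalize.quantiles(as.matrix(filtered))
dimnames(norm) <- dimnames(filtered)
log_expr <- log2(norm + 1)                           # log2 transform (pseudocount assumed)
```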

Machine learning predictions

For predicting time since infection in mice, we used the Random Forest algorithm in R with default parameter values59. Out-of-bag predictions were used to estimate model accuracy, which corresponds approximately to threefold cross-validation.

To predict time since infection in macaques, we randomly partitioned the macaques into training (70%) and test (30%) sets. We compared several different machine learning algorithms using the R caret package60. These included: Random Forest (R ranger package61), Gradient Boosted Machines (R gbm package62), Support Vector Machines using Polynomial (R kernlab package63) or RBF kernels (R kernlab package63) and Regularized Logistic Regression (R glmnet package64). Ninefold cross-validation was used in the training set to optimize model hyperparameters and assess predictive performance, with all samples from an individual macaque being partitioned into the same held-out fold to ensure unbiased cross-validation. The caret package implementation did not permit tenfold cross-validation for this dataset (as was used for the human data), but the results should be equivalent. After Regularized Logistic Regression was shown to perform at least as well as the other methods in predicting time period post-infection, only Regularized Logistic Regression was used for predictions in the test set, with Regularized Linear Regression used for predicting each time point post-infection.
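A sketch of the grouped cross-validation set-up with caret and glmnet (illustrative object names `train_expr`, `train_period`, `train_animal_id` and `test_expr`; not the original analysis script):

```r
# Grouped 9-fold cross-validation with a glmnet (Regularized Logistic Regression) model
library(caret)
library(glmnet)

set.seed(1)
# Keep all samples from a given macaque in the same held-out fold
folds <- groupKFold(train_animal_id, k = 9)
ctrl  <- trainControl(method = "cv", index = folds,
                      classProbs = TRUE, summaryFunction = twoClassSummary)

fit <- train(x = train_expr,
             y = factor(train_period, levels = c("early", "late")),
             method = "glmnet", metric = "ROC", trControl = ctrl)

# Final-model predictions on the held-out test set
test_prob_early <- predict(fit, newdata = test_expr, type = "prob")[, "early"]
```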

To predict time since TB exposure, time since IGRA conversion or prospective risk of TB in humans, we used tenfold cross validation on the training set (either GC6-74 or Adolescent IGRA converter cohort), with each subject’s samples partitioned into the same held-out fold, to optimize Regularized Logistic Regression model hyperparameters before predicting on the test set. Prior to performing this procedure for time since TB exposure on the GC6-74 training set, we performed feature selection on genes by a Wilcoxon test (P < 0.05). Where longitudinal data were available for individual macaques or persons, each time point was considered as an independent sample.
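The per-gene Wilcoxon pre-filter can be sketched as follows (assuming `train_expr` with samples in rows and genes in columns, and a factor `train_time` with levels "baseline" and "month6"; names are illustrative):

```r
# Select genes differing between baseline and 6 month time points (P < 0.05)
pvals <- apply(train_expr, 2, function(g)
  wilcox.test(g[train_time == "baseline"], g[train_time == "month6"])$p.value)
selected_genes  <- colnames(train_expr)[pvals < 0.05]
train_expr_sel  <- train_expr[, selected_genes, drop = FALSE]
```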

Forward search to discover parsimonious 6-gene signature

A forward search was performed in the GC6-74 Gambia and Ethiopia training set on genes selected by a Wilcoxon test (P < 0.05) using the R MetaIntegrator package, as previously described38,65. The stopping threshold for the increase in AUC with the addition of each gene was varied until a signature comprising fewer than 10 genes and including both genes upregulated and genes downregulated at 6 months post-enrollment (vs. baseline) was obtained. The final signature’s score is calculated on normalized log2 expression values as a difference between upregulated and downregulated genes: (RP11-552F3.12 + PYURF + TRIM7 + TUBGCP4) − (ZNF608 + BEAN1). When applying this score to microarray data, multiple detected probes that mapped to these genes, identified using the R biomaRt package, were averaged66. Genes without corresponding detected probes were omitted from the calculation.
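The score as defined above can be computed as in the following sketch (assuming a normalized log2 expression matrix `expr_log2` with genes in rows and samples in columns; the object name is illustrative):

```r
# 6-gene score: sum of upregulated genes minus sum of downregulated genes
up   <- c("RP11-552F3.12", "PYURF", "TRIM7", "TUBGCP4")
down <- c("ZNF608", "BEAN1")

up_present   <- intersect(up, rownames(expr_log2))     # genes without detected
down_present <- intersect(down, rownames(expr_log2))   # probes are omitted

score_6gene <- colSums(expr_log2[up_present, , drop = FALSE]) -
               colSums(expr_log2[down_present, , drop = FALSE])
```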

Cell type deconvolution, pathway and transcriptional module analysis

Cell type proportions in blood were estimated from RNA-seq data as previously described using the R MetaIntegrator package26,35,38. Gene-level expression for this deconvolution was obtained from the ARCHS4 resource58. For pathway analysis, the 250 genes comprising the signature of time since exposure to an active TB case (6 months vs. baseline) were analyzed using canonical pathway analysis with QIAGEN’s Ingenuity Pathway Analysis platform (IPA, QIAGEN Redwood City, www.qiagen.com/ingenuity). To compare transcriptional modules that were concordantly or discordantly regulated between mice, macaques and humans at early and late post-exposure time points, we used the R disco and tmod packages with transcriptional modules from Li et al.67,68,69. Genes used in this analysis included all detected probes (mice and macaques) and genes (humans). Differential expression and ortholog assignment were performed as previously described37. The 6-month time point in the GC6-74 cohort was taken as the early time point in humans based on the results in Figs. 5 and S3, as this time point had the highest 6-gene score (data not shown).
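A generic sketch of linear-regression deconvolution against a leukocyte signature matrix (assuming `sig` is a genes × cell types basis matrix such as immunoStates and `expr` is a genes × samples expression matrix); this illustrates the principle only and is not the exact MetaIntegrator implementation used here:

```r
# Estimate cell-type proportions per sample by regressing expression on the signature matrix
common <- intersect(rownames(sig), rownames(expr))

cell_props <- apply(expr[common, , drop = FALSE], 2, function(y) {
  fit   <- lm(y ~ 0 + sig[common, ])        # linear regression, no intercept
  coefs <- pmax(coef(fit), 0)               # clip negative estimates at zero
  coefs / sum(coefs)                        # normalize to proportions
})
```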

Statistical analysis

All statistical analyses were performed in R (version 3.4.3). Prediction performance was evaluated using receiver operating characteristic (ROC) curves. Statistical significance of the area under the curve (AUC) was assessed using the one-sided Wilcoxon test via the R verification package70. ROC graphs and confidence intervals were obtained via the R pROC package71. Pearson’s test was used for correlation analysis. Fisher’s exact test (two-sided) was used to determine the statistical significance of comparisons between proportions in evaluating the independence of the time since infection signatures from risk of TB disease in macaques. We used linear mixed models to assess the significance of cell type proportion changes with time since TB exposure via the R lme4 package72. Subject and site were included as random effects, and time since exposure and site as fixed effects. These two-sided P values were obtained via the Satterthwaite approximation. The IPA canonical pathway P values were calculated by a one-sided Fisher’s exact test, with P < 0.01 considered significant. The transcriptional module P values were calculated using the CERNO statistical test, with P < 0.05 considered significant after Benjamini–Hochberg correction37. For all other statistical tests, P < 0.05 was considered significant.
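A sketch of the linear mixed model for one estimated cell-type proportion (assuming a data frame `df` with columns proportion, time, site and subject; as described above, site enters as both a fixed and a random effect, and lmerTest supplies the Satterthwaite-approximated P values):

```r
# Linear mixed model of cell-type proportion vs. time since exposure
library(lmerTest)

fit <- lmer(proportion ~ time + site + (1 | subject) + (1 | site), data = df)
summary(fit)   # two-sided P values via the Satterthwaite approximation
```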