Introduction

Tuberculosis (TB) continues to be highly prevalent in developing countries where the resources for TB screening and diagnosis are limited1,2. In order to interrupt the chain of transmission, a TB control and prevention system must include prompt diagnosis and treatment. However, findings from TB prevalence surveys reveal a high burden of undetected TB cases even in the presence of quality-assured programmatic TB diagnosis and treatment, suggesting that TB patients tend to undergo a long and sometimes unsuccessful journey of health care seeking3. One way of addressing the challenge of delayed and undiagnosed TB is to perform active case finding activities where TB cases are actively and systematically sought in the community4.

Under the resource constraints, systematic screening for active TB usually starts with initial screening using TB symptoms, followed by confirmatory diagnostic processes for those who are symptom positive. One method proposed by the World Health Organization (WHO) is to use prolonged cough as the initial screening rule while acknowledging its low sensitivity. The sensitivity and specificity of prolonged cough were 35% (95% CI 24–46%) and 95% (95% CI 93–97%) respectively in the general population regardless of human immunodeficiency virus (HIV) status5. An alternative method is to use the presence of any TB symptom as the initial rule, which showed a corresponding sensitivity and specificity of 77% (95% CI 68–86%) and 68% (95%CI 50–85%) respectively in the general population regardless of HIV status5. These two methods represent two extremes on the TB screening continuum, where either too many TB-negative individuals receive further testing (in the case of any TB symptom) or too few TB-positive cases are included (in the case of prolonged cough). This trade-off between sensitivity and specificity presents a challenge to the selection of an appropriate screening strategy especially if there is a budget constraint.

In addition to conventional screening strategies, prediction models have been increasingly used to assist development of new screening approaches6. We hypothesized that prediction models that include TB risk factors in addition to symptoms may have improved prediction accuracy than conventional methods. We also hypothesized that a prediction model can stratify people into different risk groups and provide the basis of cost-effective screening options. We developed and validated such a prediction model, aiming to improve prediction accuracy and to assist the implementation of systematic screening in the community for prevalent TB. We focused the present analysis on TB screening among individuals with negative or unknown HIV status in the community, as the screening algorithm for people living with HIV has been previously described7.

Methods

Study design and participants

The study population consisted of participants in a TB prevalence survey of the Zambia South Africa Tuberculosis and HIV/AIDS Reduction (ZAMSTAR) trial conducted from January to December 2010. A prevalence survey was done to measure TB prevalence (the primary outcome of the trial) and was described elsewhere8,9. The dataset contained information from participants from eight communities in South Africa (SA) and 16 communities in Zambia. Participants who were receiving TB treatment at the time of data collection were excluded to avoid bias from the effect of TB treatment on TB symptoms. Only participants with HIV negative or HIV unknown status were included in this study since HIV positive participants might present different TB symptoms and signs compared to HIV negative individuals10. Participants without information on previous TB history were excluded since we were interested in this factor as a predictor. Overall 26,864 individuals from SA and 29,381 from Zambia were included in the analysis (Appendix Fig. 1). The SA dataset was further separated into a training and a validation set. To meet the requirement of more than 100 cases in a validation set11, the SA dataset was randomly split into two thirds (355 cases among 17,912 participants) for training and one third (176 cases among 8,952 participants) for (internal) validation. The Zambian dataset (107 cases in 29,381 participants) was used as the second (external) validation set to evaluate the performance of the SA model when extrapolated to a different epidemiological setting with lower TB prevalence.

Measurement of outcome and predictors

The outcome of the study was prevalent active culture-confirmed TB. All ZAMSTAR participants were required to produce one sputum sample at the time of interview for Mycobacterium tuberculosis culture and speciation regardless of TB symptoms. The laboratory algorithm has been described elsewhere3,12. The results of the culture examination were used as the gold standard for diagnosing prevalent TB in the prediction model. Symptoms and risk factors of TB were considered as candidate predictor variables and the information was based on self-report only. TB-related symptoms included: cough less than two weeks (cough), cough more than two weeks (prolonged cough), fever, weight loss, night sweats, chest pain, shortness of breath, and sputum production. The second group of variables was well described risk factors of TB, including age, sex, ever smoking, ever drink alcohol, body mass index (BMI), and known diabetes mellitus (DM)13,14,15,16,17,18. Self-reported household TB history and personal TB history were also included as potential predictors. The two types of TB history were treated as mutually exclusive states as they reflected different levels of prior TB exposure. If an individual had a personal TB history and a household TB history, the person was classified as having a personal TB history since it is a stronger indication of TB exposure.

Statistical analysis

Multivariable logistic regression was used to estimate the relationship (odds ratio, OR) between multiple predictors and prevalent TB in the SA training dataset. Model selection was done using backward elimination based on the Akaike information criterion (AIC). The results of the final model were converted into a scoring system by dividing the coefficients by the smallest coefficient value in the results of final regression model19. The discrimination of the scoring system was showed in Receiver Operating Characteristic (ROC) curves. The area under the ROC curve (AUC) was calculated to compare the prediction performance of the newly developed model against prolonged cough or any TB symptom.

In addition to AUC, cost-effectiveness analyses were conducted to compare systematic screening algorithms that applied the different initial screening methods, namely the scoring system, prolonged cough, and any TB symptom, in the validation datasets. The algorithms represent possible systematic screening strategies in the community in the context of the healthcare systems in SA and Zambia (Appendix Fig. 2). For each algorithm the expected yield of TB cases and associated costs were calculated in the two validation datasets. The cost items included the costing from the confirmatory tests. The staff time and the patient costs were not included in this study because they might vary in different settings. (Appendix Table 1). The incremental cost-effectiveness ratio (ICER) and the average cost-effectiveness ratio (ACER) was calculated for each algorithm using different initial screening methods. All analyses were done under 1000 bootstrapped samples. Lastly, the median value of each screening algorithm from bootstrapping results was applied to several hypothetical populations with different TB prevalences to calculate the number of TB cases detected and associated costs at different cutoff values of the scoring system.

In the main analysis, BMI was not included as a potential predictor due to the high proportion of missing values (66%) in the SA dataset. Cumulating evidence suggests that BMI is strongly associated with the risk of TB17. Therefore, a sensitivity analysis was conducted using multiple imputation to account for missing BMI values. The missing values were imputed for ten times in the training dataset and two testing datasets were performed independently. Rubin’s rule was adopted to pool the results of regression models and the model performances20. The final prediction model in the sensitivity analysis included the predictors which were selected in more than half of the ten imputed training datasets21. The pooled sensitivities and specificities of the prediction models were applied to the cost-effectiveness analysis to estimate the benefit and cost in the two validation datasets.

Ethics approval and consent to participate

The original study (ZAMSTAR) had the ethics approval from University of Zambia, Stellenbosch University and the London School of Hygiene and Tropical Medicine. All the individuals participated in the study had provided written informed consent. This analysis had additional ethics approval from Stellenbosch University (S14/09/178). All methods used in this study were performed in accordance with the relevant guidelines and regulations.

Results

The mean age of participants was over 30 years, and the majority (>60%) were female (Table 1). The proportion of unknown HIV status was 42.9% and 18.5% in the SA and Zambian participants respectively. Up to 35% and 24% of the SA and Zambian participants reported any TB-related symptoms (cough, weight loss, night sweats, and fever). The prevalence of TB was 2·0% and 0·36% among the SA and Zambian participants respectively. Among prevalent TB cases, the probabilities of presenting symptoms were higher than non-TB participants in both countries. The mean BMI of the TB positive individuals were lower than those TB negative individuals. Participants from South Africa had higher probabilities of known personal and household TB history than those who were from Zambia (Table 1).

Table 1 Descriptive statistics in the two datasets used in the analysis.

Model development

After backward variable selection, the final prediction model included the following variables: cough for less than two weeks, prolonged (> = 2 weeks) cough, night sweats, weight loss, known personal TB history, known household TB history, ever smoking, and ever using alcohol. In the final multivariable model, prolonged cough (OR 3·2, 95% CI 2·3–4·4) was the predictor with the highest OR (Table 2). The possible sum of scores in the scoring system ranged from 0 to 13. The AUC of the scoring system in SA training dataset was 68% (95% CI 65–71%), which was higher than that for prolonged cough (57%; 95% CI 54–60%) and any TB symptom (60%; 95% CI 56–63%) in the same dataset (Fig. 1a). The AUC of the scoring system (68%) was similar to that of the original logistic regression model (67%).

Table 2 Predictors and corresponding scores in the final selected prediction model based on the South African training dataset.
Figure 1
figure 1

ROC Curve of screening options in different datasets. Performance of the screening strategies in South African training dataset (a), South African validation set (b) and Zambian dataset (c).

Internal and external validation

In the SA (internal) validation dataset, the scoring system had an AUC of 68% (95% CI 64–72%), which was higher than prolonged cough (AUC 58%, 95% CI 54–62%) and any TB related symptom (AUC 60%, 95% CI 56–64%) (Fig. 1b). The AUC of the scoring system in the Zambian (external) dataset was 66% (95% CI 60–72%). The value was higher than that of prolonged cough (AUC 57%, 95% CI 51–62%) and any TB symptom (AUC 59%, 95% CI 53–65%) (Fig. 1c). The distribution of total score among the SA and Zambian validation populations is right skewed, with the majority of people having a score of less than 3 (Fig. 2a,b). The probability of prevalent TB increases with increasing score. In the SA validation dataset, the proportion of population who would require a confirmatory test increases rapidly when the score cutoff falls below six points (Fig. 2c). At a cutoff score of six points, 9·4% of the total population would require confirmatory testing and 31·3% of total TB cases would be detected eventually by this scoring system. Using a cutoff value of five points in the Zambian dataset, the screening algorithm which incorporated the scoring system could detect over 40% of all TB cases from 12% of the population who would screen positive (Fig. 2d).

Figure 2
figure 2

Performance of scoring system in validation datasets. Number of individuals at each score cutoff point in South African validation set (a) and Zambian dataset (b). Proportion of total screened population requiring further confirmatory tests at various cutoff values of score and proportion of all true TB cases detected at corresponding cutoff scores in South African validation set (c) and Zambian dataset (d).

Cost-effectiveness analysis

In the SA (internal) validation dataset, a cost-effectiveness analysis was conducted for a systematic screening algorithm that used an initial screening method followed by Xpert MTB/RIF. With a decreasing cutoff value of the scoring system, the number of prevalent TB cases that could be detected increased, along with increased expenditure on systematic screening (Fig. 3a). The average cost per TB case detected (ACER) ranged from 429$USD to 1,848$USD under various cutoff values (Table 3). If the TB program is to maximize the number of TB cases detected given the budget, it should follow the right lower part of the cost effectiveness plane (i.e., the blue curve in Fig. 3a) depending on budget availability. It can be seen from the cost effectiveness plane that the conventional symptom-based strategies would be “dominated” by other options, since the TB program would never use them under any budget plan (Fig. 3a and Table 3). In the 1000 bootstrapped samples, the systematic screening algorithms using the scoring system dominated that using any TB symptom (99% of the samples) and prolonged cough (69% of the samples). Among the viable options (i.e., the blue curve in Fig. 3a), the cost per TB case detected increased if the program sets to find more TB cases (see Table 3 for the incremental rations comparing one option to the next viable option, or ICERs).

Figure 3
figure 3

Cost-effectiveness plane. Cost-effectiveness analysis in South African validation dataset using the algorithm with Xpert MTB/RIF as the confirmatory tool (a). Cost-effectiveness analysis in Zambian dataset using the algorithm with smear plus Xpert MTB/RIF as the confirmatory tool (b). Orange dots indicated the dominated options.

Table 3 Cost-effectiveness analysis in South African validation dataset (N = 8,952, TB prevalence = 1.97%) applying the diagnostic algorithm in Appendix Fig. 2(a), with Xpert MTB/RIF as the only confirmatory tool.

In the Zambian (external) validation dataset, the systematic screening algorithm of initial screening followed by sequential confirmatory tests of smear microscopy plus Xpert MTB/RIF was evaluated (Appendix Fig. 2b). Similar to the SA situation, the new scoring system dominated the conventional strategies of any TB symptoms (100% of the bootstrapped samples) and prolonged cough (58% of the samples) in terms of cost-effectiveness (Fig. 3b). The ACERs and ICERs of the screening strategies (including the scoring system and the symptom-based strategies) in the Zambian cohort were substantially larger compared to those in the SA cohort, because of the lower TB prevalence (Table 4).

Table 4 Cost-effectiveness analysis in Zambian validation dataset (N = 29,381 TB prevalence = 0.36%) applying the diagnostic algorithm in Appendix Fig. 2(b), with sputum smear as the first confirmatory tool and Xpert MTB/RIF as the secondary confirmatory tool among smear-negatives.

Lastly, the results of cost-effectiveness analysis in the hypothetical populations with different TB prevalence suggested that cost per TB case detected was highly affected by the background TB prevalence (Fig. 4). The ACERs were generally very high (>2,112$USD per TB case detected) when the background TB prevalence fell below 1%. When the background prevalence of TB was as high as 5%, the ACERs were generally below 800$USD and can be as low as 187$USD per TB case detected.

Figure 4
figure 4

Cost-effectiveness analysis in hypothetical population under various background prevalence of TB. Average cost-effectiveness ratio (ACER) of diagnosis algorithm with scoring system as screening tool and Xpert MTB/RIF as confirmatory tool in different TB prevalence setting. Performance of the scoring system in South African validation set was shown. *PC: Prolonged cough; **AS: Any TB-related symptoms (Cough, weight loss, night sweat, and fever).

Sensitivity analysis

In the sensitivity analysis, multiple imputation was used to impute missing predictor information (mainly for BMI) and the analysis repeated. Predictors selected included weight loss, night sweats, cough less than two weeks, prolonged cough, alcohol use, BMI, sex, known personal TB history, and known household history. In addition to weight loss, BMI was selected in the final model and had the highest score among those predictors (Appendix Table 2). Similar to the results in the main analysis, the scoring system from the sensitivity analysis had a higher AUC value than conventional screening tools in the SA and Zambian validation datasets (Appendix Fig. 3). The systematic screening algorithm using the new scoring system again dominated those that used the conventional tools in the cost-effectiveness analysis (Appendix Fig. 4).

Discussion

To our knowledge, this study is the first to employ prevalence survey data to develop and validate a prediction model for prevalent TB among HIV negative or HIV status unknown individuals. The new scoring system that incorporated TB symptoms and risk factors were found to have better discriminatory power over conventional symptom-based screening strategies, although the overall discriminatory performance was only modest with an AUC of 66–68%. Importantly, the scoring system also performed better than the conventional strategies in terms of the number of TB cases detected given a fixed budget. Nonetheless, the tradeoff between sensitivity and specificity was still observed in the scoring system, and the cost per TB case detected remained relatively high unless the background TB prevalence was very high.

Previous discussions on systematic screening for TB using identification of symptoms as the first screening step have focused on the use of single or multiple TB symptoms as the rule-in criteria. Systematic reviews have concluded that these symptoms criteria would either miss a majority of TB cases in the population (because of low sensitivity) or lead to an overuse of confirmatory diagnostics among individuals without TB (because of low specificity)4. In addition, a recent systematic review identified only a few prediction models of prevalent TB and most of the studies were of poor quality and not externally validated6. One recently reported paper used the same ZAMSTAR dataset to develop potential screening strategies for prevalent TB. However, only various combinations of TB-related symptoms as the screening methods were considered22. Few studies have attempted to explore the potential of a multivariable modelling approach that incorporate additional risk factor information beyond TB symptoms and the magnitude of association between the predictors and TB. One prediction model of TB was developed previously using patients attending an outpatient clinic in Brazil, focusing on TB symptoms as predictors23. However, the predictors selected in the final model were not reported and the results were not externally validated.

Our study was among the very few TB prediction models that applied both internal validation in the same setting as the model development cohort and external validation in a setting with somewhat different TB and HIV epidemiology. The discriminatory performance of the prediction model was well preserved in the SA validation set (AUC: 68%, 64–72%), probably because the original training set had a large sample size and would not suffer from overfitting. The model performance was slightly decreased in the Zambian validation set (AUC: 66%, 60–72%), reflecting different epidemiology and clinical manifestations of TB in Zambia. Nonetheless, our modelling approach could still be applied to other settings when data are available. Using prolonged cough as an indicator to predict TB was previously reported to detect 35% (95% CI: 24–46%) of TB cases5, but in our validation sets, the sensitivity of prolonged cough was only 14·0% (Zambian dataset) and 18·8% (SA validation set), which indicates that the performance of the current symptom screening methods might differ in various settings.

In the absence of a single best cutoff that could achieve both high sensitivity and specificity, the key question of systematic screening will be the determination of an adequate cutoff under a budget constraint. Our cost-effectiveness analysis could help the TB program select the screening strategy, depending on the program’s budget and the willingness to pay to detect one additional TB case (the “unit price”). For example, under a total budget of up to 12,832$USD in our internal validation (SA) cohort, using a cutoff value of 8 would be the best option if the program accepts the price of 428$USD per TB case detected (Fig. 3a and Table 3). If the budget would be anywhere between 12,832$USD and 18,944$USD, one would first apply the confirmatory test to all subjects with a score of > = 8 points, and then apply the confirmatory test to subjects with a score of seven points until the budget is used up (Fig. 3a). Moving from the cutoff of 8 to 7 represents an incremental ratio (ICER) of 611$USD per TB case detected, meaning that the cost per TB case detected would increase if the program wants to find more TB cases.

Going beyond the simple dichotomy of risk stratification in conventional symptom-based strategies (e.g., prolonged cough or not), the scoring system provides a more flexible option of risk stratification through adopting different cutoff values. In addition, we found that the use of the scoring system would result in more TB cases detected compared to conventional strategies given the same budget, although the improvement was only modest. Nonetheless, our cost-effectiveness analysis revealed high ACERs and ICERs for both the scoring system and the symptom-based strategies when the screening was followed by confirmatory TB diagnostics. High costs for detecting a prevalent TB case present practical challenges. Previous studies have also showed that cost per TB case detected was high in community wide screening24. Our hypothetical analysis of applying the scoring system to populations with different TB prevalence suggested that the ACER would be high unless the systematic screening is applied to a high prevalence setting (2% or higher).

Our study has several strengths when considering its application to active case finding programs. First, the setting is similar to a community-based case-finding program. Second, cost-effectiveness analyses were conducted comparing diagnostic algorithms using different initial screening methods and different cutoff values of the scoring system which can help select the most feasible initial screening method in a screening program depending on budget constraints. Third, the model was validated not only in a SA dataset but also in a Zambian dataset, and the results suggested that the better discriminatory performance of the scoring system observed in the SA training dataset could not be explained by chance or overfitting.

The study also has limitations. First, we only included the cost of confirmatory diagnostics and did not consider the cost of staff time of conducting systematic screening and the patient cost, nor did we consider the possible intermediate screening step using chest radiography. Second, the outcome of the study was undiagnosed and bacteriologically confirmable prevalent pulmonary TB. Some of the prevalent cases detected in systematic screening might be detected anyway by passive case detection. We also did not account for the potential impact of systematic screening on transmission and incidence of TB. Third, the information in the scoring system was only based on self-report; therefore the performance of the prediction model may not be optimal. We argue that for the scoring system to be used easily in the community, the predictor information should be readily available with minimal further testing (e.g., blood glucose or HIV testing). Fourth, we only included in our prediction model the risk factors available in the ZAMSTAR study. Fifth, the study population was in Sub-Saharan Africa only; thus, the model might not be applicable in South-east Asia or other regions. Further studies in different areas with various TB epidemiology and healthcare settings are needed. Lastly, we did not include people living with HIV in the present analysis, since these people may have different clinical manifestations and risk factors of TB and should be evaluated separately. In practice, our screening algorithm on HIV negative and HIV unknown individuals can be combined with existing or upcoming screening algorithm on people living with HIV when being implemented in the community7.

Conclusion

Using data from a large prevalence survey, we developed and validated the first scoring system to be used in systematic case finding among HIV negative or HIV unknown individuals. Although the scoring system dominated the conventional screening strategies by only a margin, it provided more flexible screening options to fill the shortcomings of current screening methods. Our scoring system and results from cost-effectiveness analysis can be used to inform targeted systematic case finding efforts in community settings with a similar TB and HIV epidemiology as in our study cohorts. Our study also provided a framework for future studies to follow and develop other prediction models based on locally available data. Depending on the available budget in TB screening programs, health care systems would be able to target high risk populations for diagnostic tests by choosing different cutoff values of the scoring system.