Introduction

The hazards of mechanical ventilation make it imperative to disconnect patients from the ventilator at the earliest feasible time [1, 2, 3, 4, 5, 6]. Studies, however, indicate that clinicians are slow to recognize a patient's ability to tolerate ventilator weaning [7, 8]. Psychology research has shown that delays in decision making result from overreliance on heuristics and insufficient attention to prior probability [9]. Minimizing delay in diagnosis is the primary reason that screening tests are performed [10, 11, 12]. To attain maximal benefit, screening tests should be performed when the prior (pretest) probability is very low (ideally < 20%) [10, 11, 12]. The tests used to screen for readiness to tolerate ventilator discontinuation are weaning-predictor tests [13].

Recently an Evidence-Based Medicine Task Force of the American College of Chest Physicians (ACCP) [14, 15] evaluated the usefulness of weaning-predictor tests using a meta-analysis. The ACCP Task Force focused predominantly on the weaning-predictor test that has been most frequently studied (> 25 studies): the ratio of frequency-to-tidal volume (f/VT), a measure of rapid shallow breathing [13, 16]. The Task Force calculated pooled likelihood ratios for f/VT and judged the summed values to signify that f/VT is not a reliable predictor of weaning outcome. The Task Force concluded that physicians should bypass measurement of all weaning-predictor tests and begin the weaning process with a trial of spontaneous breathing.

When assessing the reliability of weaning-predictor tests, it is critically important to recognize that weaning procedures constitute a form of diagnostic testing. Consequently evaluation of their reliability must comply with the canons developed for evaluating diagnostic tests [10, 11, 12]. In the assessment of published reports of weaning-predictor tests, the element most often ignored is the enormous influence of pretest probability on the test results. In their textbook on medical decision analysis Sox and colleagues [12] state, “Perhaps the most important idea in this book is the following: The interpretation of a test result depends on the pretest probability of disease.” The importance of this point is heightened whenever research is carried out on a diagnostic test that has already been accepted by clinicians and incorporated into their everyday practice [17].

The implications of pretest probability are greater for weaning than for many clinical situations because weaning involves a sequence of three diagnostic tests: measurement of predictors, followed by a weaning trial, followed by an extubation trial. The undertaking of three diagnostic tests in a sequential manner poses an enormous risk for the occurrence of test-referral bias [11, 12]. Test-referral bias arises when a test under evaluation (weaning-predictor test) influences which patients undergo either of the two subsequent tests. If tolerance of extubation is used as the gold standard for evaluating the reliability of the weaning-predictor test, the requirement to pass a weaning trial (e.g., T-tube trial) before extubation necessarily excludes all patients who fail a weaning trial. The study population is thereby skewed towards less severely ill patients, an effect termed spectrum bias [18]. This step alters not only pretest probability but also both the sensitivity and specificity of the weaning-predictor test [11, 12].

Failure to take into account the effects of spectrum and test-referral bias on pretest probability leads to fundamental misinterpretation of the reliability of weaning-predictor tests. We hypothesized that much of the variation among studies that have evaluated the reliability of f/VT in predicting weaning outcome is explained by spectrum and test-referral bias, as reflected by variation in pretest probability of successful outcome. We further hypothesized that once variation in pretest probability among subsequent studies of f/VT is taken into account, these studies confirm the sensitivity and specificity reported in the original 1991 study on f/VT.

Methods

All articles included in the meta-analysis of the ACCP Task Force were retrieved [13, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]. In five of these articles [18, 23, 26, 31, 32] the authors did not report data on pretest probability, sensitivity, and specificity. Beyond articles included in the ACCP Task Force's meta-analysis we retrieved additional articles [40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50] via Medline search of studies published up to May 2005 and by search of personal files. The studies evaluated here are listed in Table 1. The two present authors examined the full text of all articles. The following data were abstracted from each: number of patients studied, definition of study endpoint (toleration of weaning trial, extubation trial, or combination), threshold value of f/VT, sensitivity, specificity, positive-predictive value, negative-predictive value, prevalence of successful outcome, and whether primary clinicians were blinded to the data. Investigators varied in the efforts they took to blind physicians to f/VT measurements; most made no explicit attempt.

Table 1 Accuracy of f/VT in predicting weaning or extubation outcome. The listed studies are those that report data on the accuracy of f/VT as a predictor of weaning outcome. Four studies (nos. 9, 16, 17, and 18) report data under two different conditions in their articles; both sets of data are presented. Pretest probability of success in a study is the fraction of patients with a successful outcome out of the total population (both success and failure patients) included in the study (f/VT frequency-to-tidal volume ratio, PPV positive-predictive value, NPV negative-predictive value, WF weaning failure, EF extubation failure, PS pressure support, IMV intermittent mandatory ventilation, bpm breaths per min, MICU medical ICU, RICU respiratory ICU, Md ICU multidisciplinary ICU, SICU surgical ICU, PICU pediatric ICU, M-SICU medical-surgical ICU, CCU cardiac care unit, NS nonstated location, PLR positive likelihood ratio, NLR negative likelihood ratio)

When an “average” statistic is computed from a meta-analysis, erroneous interpretations can arise if there is significant heterogeneity among the included studies [51, 52]. When heterogeneity is significant, it is recommended to search for a factor that may be acting as an effect modifier [51]. We accordingly investigated whether the heterogeneity in the pretest probability of successful outcome among 20 studies included in the ACCP Task Force's meta-analysis was significant by means of χ2 analysis [53]. We subsequently show that the heterogeneity in pretest probability is significant, and that this may arise from test-referral bias and spectrum bias consequent to the sequential nature of diagnostic testing during weaning.
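The heterogeneity check described above can be illustrated with a Pearson χ2 statistic for the homogeneity of k proportions. This is a minimal sketch only: the function name and the three sets of counts are hypothetical, and the exact χ2 procedure of reference [53] is assumed rather than confirmed to take this form.

```python
def chi_square_heterogeneity(successes, totals):
    """Pearson chi-square statistic for homogeneity of k proportions.

    successes[i] = number of patients with a successful outcome in study i
    totals[i]    = total number of patients in study i
    Returns (statistic, degrees_of_freedom), with df = k - 1.
    """
    pooled = sum(successes) / sum(totals)  # pooled rate of successful outcome
    stat = 0.0
    for s, n in zip(successes, totals):
        expected_success = n * pooled
        expected_failure = n * (1 - pooled)
        stat += (s - expected_success) ** 2 / expected_success
        stat += ((n - s) - expected_failure) ** 2 / expected_failure
    return stat, len(totals) - 1


# Hypothetical counts for three studies (NOT taken from Table 1).
successes = [45, 90, 60]
totals = [100, 100, 100]
stat, df = chi_square_heterogeneity(successes, totals)
print(f"chi-square = {stat:.2f}, df = {df}")
```

With 20 studies this statistic would have 19 degrees of freedom, which matches the df reported in the Results; a large value relative to the χ2 distribution indicates that the studies do not share a common pretest probability.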

Bayes' theorem is an equation that describes the relationship between a physician's initial clinical gestalt of the probability of a particular condition (pretest probability) and the physician's revised probability after obtaining the result of a diagnostic test (posttest probability) [11, 54, 55]. It is used to estimate how much the uncertainty of weaning outcome changes from before measurement of a weaning-predictor test (pretest probability) to after obtaining the new information (conditional probability) [12]. In particular, Bayes' theorem is used to transform the information contained in sensitivity and specificity into a format that can be employed in diagnostic testing (calculation of posttest probability, in the format of positive- and negative-predictive value) [11].

To determine the influence of spectrum bias and test-referral bias on the reported posttest probabilities of f/VT we used pretest probability as an indirect measure of these two biases. In everyday practice a clinician's pretest probability of a clinical outcome is his or her clinical gestalt. When applying a Bayesian framework to the evaluation of studies of diagnostic tests, prevalence of the outcome under investigation is used as a surrogate for the pretest probability [56, 57, 58, 59, 60, 61]. Accordingly, we calculated pretest probability as the number of patients with a successful outcome divided by the total number of patients (those with a successful outcome plus those with an unsuccessful outcome) in a study. (The conclusions of our study would remain the same if the term “pretest probability” were deleted and replaced by “prevalence of successful outcome.” We chose to frame the analysis in terms of “pretest probability” because it is a more intuitive expression when conducting a Bayesian analysis, and because the two terms (pretest probability and prevalence of the condition) are used interchangeably in writings on diagnostic testing [56, 57, 58, 59, 60, 61].)

To assess whether subsequent studies of f/VT reproduce the sensitivity and specificity reported in the original study on f/VT [13], we used Bayes' theorem. The framework for this portion of the data analysis was based on the true-positive rate (sensitivity 0.97) and false-positive rate (1−specificity 0.36) in the original report [13]. Using these data and the formulae below (based on Bayes' theorem [12]), we calculated the posttest probability of f/VT (positive-predictive value and negative-predictive value) for 0.01-unit increments in pretest probability between 0.00 and 1.00:

$$ PPV=(PPS\times TPR)/\{(PPS\times TPR)+[(1 - PPS)\times FPR]\} $$
(1)

and

$$ NPV=[(1 - PPS)\times TNR]/\{[(1 - PPS)\times TNR] +(PPS\times FNR)\}, $$
(2)

where PPV = positive-predictive value, NPV = negative-predictive value, PPS = pretest probability of success, TPR = true-positive rate, FPR = false-positive rate, TNR = true-negative rate, and FNR = false-negative rate. The resulting values of positive- and negative-predictive value (which we refer to as the predicted values) were plotted against pretest probability. The upper and lower 95% confidence intervals were then calculated and superimposed on the plots [62].
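Equations 1 and 2 can be evaluated over the grid of pretest probabilities as in the following minimal sketch. The function and variable names are illustrative assumptions; the true-positive rate (0.97) and false-positive rate (0.36) are those of the original report [13]. The confidence intervals are not reproduced here because the method of reference [62] is not described in the text.

```python
TPR = 0.97       # true-positive rate (sensitivity) from the original report [13]
FPR = 0.36       # false-positive rate (1 - specificity) from the original report [13]
TNR = 1 - FPR    # true-negative rate (specificity)
FNR = 1 - TPR    # false-negative rate


def ppv(pps, tpr=TPR, fpr=FPR):
    """Eq. 1: positive-predictive value for pretest probability of success pps."""
    denom = pps * tpr + (1 - pps) * fpr
    return pps * tpr / denom if denom else float("nan")


def npv(pps, tnr=TNR, fnr=FNR):
    """Eq. 2: negative-predictive value for pretest probability of success pps."""
    denom = (1 - pps) * tnr + pps * fnr
    return (1 - pps) * tnr / denom if denom else float("nan")


# 0.01-unit increments in pretest probability between 0.00 and 1.00.
grid = [i / 100 for i in range(101)]
curve = [(p, ppv(p), npv(p)) for p in grid]

for p, pos, neg in curve[::25]:
    print(f"PPS = {p:.2f}  PPV = {pos:.3f}  NPV = {neg:.3f}")
```

The same two functions can serve for an internal-consistency check: passing a study's reported sensitivity, specificity, and pretest probability as arguments yields the predictive values that the study should have reported.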

We checked each study for internal consistency. We took the reported sensitivity, specificity, and pretest probability of success and entered them into the above formulae. All but two studies [39, 43] showed good internal consistency. Zeggwagh et al. [43] reported a positive-predictive value of 0.68 and a negative-predictive value of 0.86; we calculated respective values of 0.86 and 0.67. Farias et al. [39] reported a positive-predictive value of 0.53 and a negative-predictive value of 0.83; we calculated respective values of 0.91 and 0.36. Because of these inconsistencies we excluded these two studies from further data analysis. (The conclusions of our study would not change if these data [39, 43] were included.)

The values of negative-predictive value and positive-predictive value reported in each study of f/VT were entered on the above plots. We examined the influence of pretest probability of success on positive-predictive value and negative-predictive value of f/VT because those relationships are explicated by Bayes' theorem. In contrast, an equivalent governing framework to encompass the relationships between pretest probability and sensitivity and specificity (and thus likelihood ratio) has not been developed, and it seems unlikely that one can be developed.

We used a weighted Pearson's correlation analysis (adjusting for the number of patients contained in a study) to compare the relationship between pretest probability (prevalence of success) and reported values of positive- and negative-predictive value. Secondly, we used a weighted Pearson's correlation analysis to compare the relationship between the predicted values of positive- and negative-predictive value and the actual values reported in each study. Thirdly, we undertook a Bland-Altman analysis to determine whether the reported values of positive- and negative-predictive value fall within the 95% confidence intervals of the values predicted by entering (reported) pretest probabilities into Eqs. 1 and 2.
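A weighted Pearson's correlation of the kind described above can be sketched as follows. The choice of the number of patients as the weight follows the text; the specific weighting formula (weighted means, variances, and covariance) is an assumption, since the text does not state which formulation was used, and the example data are hypothetical.

```python
import math


def weighted_pearson(x, y, w):
    """Pearson correlation of x and y with non-negative weights w.

    Each study contributes to the means, variances, and covariance in
    proportion to its weight (here, its number of patients).
    """
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / sw
    var_x = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / sw
    var_y = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / sw
    return cov / math.sqrt(var_x * var_y)


# Hypothetical studies (NOT taken from Table 1): pretest probability,
# reported positive-predictive value, and number of patients.
pretest = [0.55, 0.75, 0.90]
reported_ppv = [0.70, 0.88, 0.96]
n_patients = [40, 100, 60]

r = weighted_pearson(pretest, reported_ppv, n_patients)
print(f"weighted r = {r:.2f}")
```

With equal weights the function reduces to the ordinary Pearson coefficient, so the weighting only changes the result when study sizes differ.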

Results

Pretest probability of successful outcome in the studies included in Table 1 varied from 0.45 to 0.98. For studies included in the ACCP Task Force's meta-analysis, pretest probability of successful outcome was 0.75 ± 0.15. The studies included in the ACCP Task Force's meta-analysis also demonstrated a significant degree of heterogeneity in the pretest probability of success (χ2 = 227.1, df 19, p < 0.00001; Fig. 1). Actual values of f/VT were listed in 15 studies [22, 24, 29, 34, 35, 36, 37, 40, 41, 44, 46, 47, 49, 50]. The mean value was lower in these studies than in the original report [13], 77.4 ± 25.2 vs. 89.1, providing evidence for the occurrence of spectrum bias.

Fig. 1

Pretest probability of successful outcome for studies included in the ACCP Task Force's meta-analysis. Error bars 95% confidence intervals. Numbering of studies corresponds to that in Table 1 and not to that in the references. The heterogeneity in pretest probability of success is statistically significant (p < 0.00001)

Reported specificity ranged from 0.00 to 0.89 (Table 1), with a mean of 0.52 ± 0.26 (excluding two studies with inconsistent data [39, 43]). The lower specificity in subsequent reports than in the original report on f/VT [13], 0.64, provides evidence for the occurrence of test-referral bias. Reported sensitivity ranged from 0.35 to 1.00 (Table 1), with a mean of 0.87 ± 0.14. Test-referral bias is also expected to produce an increase in sensitivity over that originally reported. The sensitivity of 0.97 in the original report on f/VT [13] approaches the ceiling of 1.00, not allowing much room to detect a further increase (allowing for usual biological noise created in any experiment). Seventeen subsequent studies [20, 21, 22, 24, 25, 27, 30, 36, 37, 38, 40, 41, 44, 45, 47] report sensitivity values for f/VT of at least 0.90, a finding consistent with test-referral bias.

Positive- and negative-predictive values of f/VT and pretest probability of success were reported by 27 investigators; four groups [21, 30, 36, 41] evaluated reliability of f/VT under two sets of conditions. The range in reported reliability was wide: negative-predictive values ranged from 0.00 to 1.00 and positive-predictive values ranged from 0.53 to 0.98 (Table 1). The reported positive-predictive value for f/VT was correlated with pretest probability of successful outcome (r = 0.69, p < 0.0001); likewise, the reported negative-predictive value for f/VT was correlated with pretest probability of successful outcome (r = −0.75, p < 0.0001).

Figures 2 and 3 show that most of the positive- and negative-predictive values in the studies fall close to or above the lower 95% confidence intervals of the values predicted by Bayes' theorem for pretest probability (using the sensitivity and specificity originally reported by Yang and Tobin [13]). For the studies included in the ACCP Task Force's meta-analysis the correlation between reported and predicted positive-predictive value was r = 0.86 (p < 0.0001, Fig. 4); the correlation between reported and predicted negative-predictive values was r = 0.82 (p < 0.0001, Fig. 5). For the entire group of 29 studies the correlation between reported and predicted positive-predictive value was r = 0.67 (p < 0.0001), and that between reported and predicted negative-predictive value was r = 0.66 (p < 0.0001).

Fig. 2

Positive-predictive value (posttest probability of successful outcome) for f/VT plotted against pretest probability of successful outcome. Closed symbols studies included in ACCP Task Force meta-analysis; open symbols additional studies (see Methods). The curve is based on the sensitivity and specificity originally reported by Yang and Tobin [13] and Bayes' formula for 0.01-unit increments in pretest probability between 0.00 and 1.00 [12]. The lines represent the upper and lower 95% confidence intervals for the predicted relationship of the positive predictive values against pretest probability. The observed positive-predictive value in a study is plotted against the pretest probability of weaning success (prevalence of successful outcome). Numbering of studies corresponds to that in Table 1 and not to that in references. Study nos. 5 [38], 6 [28], 11 [27], 18a [41], 18b [41], and 24 [45] include measurements of f/VT obtained during pressure support; nos. 14 [33] and 21 [42] include measurements obtained in pediatric patients; nos. 7 [34], 18a [41], 18b [41], and 28 [49] used f/VT threshold values less than 65

Fig. 3

Negative-predictive value (posttest probability of unsuccessful outcome) for f/VT. Closed symbols studies included in ACCP Task Force meta-analysis; open symbols additional studies (see Methods). The curve, its 95% confidence intervals, and placement of a study on the plot are described in the legend to Fig. 2. The observed negative-predictive value in a study is plotted against the pretest probability of weaning success (prevalence of successful outcome). Numbering of studies corresponds to that in Table 1 and not to that in the references. (See legend to Fig. 2 for the numbering of studies that include measurements of f/VT during pressure support, in pediatric patients, or operating at a threshold value below 65.) Study no. 11 [27] has a negative-predictive value of 0.00 and specificity of 0.00, which are predictable given its pretest probability of weaning success of 98.2%; the large number of subjects (n = 163) means that this study made a substantial contribution to the pooled likelihood ratio calculated in the meta-analysis of the ACCP Task Force

Fig. 4

The relationship between the reported values of positive-predictive value among the studies included in the ACCP Task Force's meta-analysis and the values predicted by observed pretest probability together with the sensitivity and specificity originally reported by Yang and Tobin [13]. The weighted Pearson's correlation is r = 0.86 (p < 0.0001)

Fig. 5

The relationship between the reported values of negative-predictive value among the studies included in the ACCP Task Force's meta-analysis and the values predicted by observed pretest probability together with the sensitivity and specificity originally reported by Yang and Tobin [13]. The weighted Pearson's correlation is r = 0.82 (p < 0.0001)

A Bland-Altman analysis was undertaken to determine the extent of agreement between reported values of positive- and negative-predictive value and the values predicted by (reported) pretest probability together with the sensitivity and specificity originally reported by Yang and Tobin [13]. For the studies included in the ACCP Task Force's meta-analysis, all of the reported positive-predictive values and all but two of the reported negative-predictive values fell within the 95% confidence interval of the values predicted. For the entire group of 29 studies, all of the reported positive- and negative-predictive values fell within the 95% confidence interval of the values predicted.

Discussion

The ACCP Task Force concluded that f/VT is not a reliable predictor of weaning success based on their meta-analysis of likelihood ratios. For a meta-analysis to be statistically valid, however, it must be free of significant heterogeneity (or control for it) [51, 52]. Figure 1 reveals marked heterogeneity (p < 0.00001) in pretest probability of successful outcome among studies in the meta-analysis. This heterogeneity in pretest probability accounts for most of the variation in reported reliability of f/VT. Once these data are entered into a Bayesian model with pretest probability as the operating point, the reported positive-predictive values are significantly correlated with the values predicted by the original report on f/VT [13], r = 0.86 (p < 0.0001); likewise, reported negative-predictive values are correlated with the values predicted, r = 0.82 (p < 0.0001) (Figs. 4 and 5). Moreover, the rate of successful outcome was 75% or higher in more than half the studies, reflecting the influence of spectrum bias and test-referral bias (Table 1). Thus the low values of likelihood ratios for f/VT reported by the Task Force are largely explained by their failure to correct for the occurrence of spectrum bias and test-referral bias.

A more fundamental conceptual problem arises with the ACCP Task Force's evaluation strategy. Their meta-analysis is not focused on the goal that a weaning-predictor test is designed to meet: to detect the earliest time a patient might tolerate a weaning trial. That is, a weaning-predictor test serves solely as a screening test. As discussed below, the most precise tool for evaluating screening-test reliability is sensitivity [11]. In contrast, the Task Force based their entire evaluation on likelihood ratio. Likelihood ratio, however, is not precisely suited to screening-test evaluation because it includes test components vital for screening (true-positive and false-negative rates) but also components not directly focused on screening (true-negative and false-positive rates); the latter cloud the contribution of the vital components [11].
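For reference, the standard definitions of the positive and negative likelihood ratios (PLR and NLR, as abbreviated in Table 1) show how each ratio mixes a screening component with a non-screening component. The numerical values below are computed from the sensitivity (0.97) and specificity (0.64) of the original report [13] purely as an illustration; they are not the pooled values calculated by the Task Force.

$$ PLR = TPR/FPR = 0.97/0.36 \approx 2.69 $$

$$ NLR = FNR/TNR = 0.03/0.64 \approx 0.047 $$

In each ratio only one term (TPR in the first, FNR in the second) belongs to sensitivity; the other term (FPR or TNR) depends entirely on specificity.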

Bayes' theorem and reliability of weaning predictors

Bayes' theorem uses new information (the conditional probability) to update old information (the pretest probability) [12]. Conditional probability refers to the probability that a particular event will occur (a patient will tolerate a weaning trial) given that some other condition has been met (obtaining a positive result on a weaning-predictor test) [11]. The updated result is termed the posttest probability, expressed as positive (or negative) predictive value. According to Bayes' theorem, three factors determine posttest probability: pretest probability, sensitivity, and specificity (of a weaning-predictor test) [11, 12].

Figures 2 and 3 convey the relationship between pretest probability and posttest probability of weaning success based on the theoretical framework of Bayes' theorem. The weighted Pearson's correlation analysis reveals that these two variables were closely related (p < 0.0001). For studies in the ACCP Task Force's meta-analysis pretest probability explained 74% of the variation in positive-predictive value and 62% of the variation in negative-predictive value of f/VT. (The remaining variation in the relationship between pretest probability and posttest probability among the studies probably resulted from population differences, differences in instrumentation, measurements during pressure support, and random variation.)

The information presented in Figs. 2 and 3 represents the interaction between two conceptual models. The overall map is generated by means of Bayes' theorem; the specific contour (interrupted) lines enclosing predicted values of posttest probability for every possible pretest probability are generated by the sensitivity and specificity reported by Yang and Tobin [13]. Without these conceptual models, the wide scatter in reported posttest probability by different investigators suggests that f/VT is an unreliable weaning-predictor test. When the two models are applied, however, the values of posttest probability reported by most investigators are largely those one would predict for each reported value of pretest probability (p < 0.0001). The importance of pretest probability was not taken into account by the Task Force when reaching their conclusion that f/VT is an unreliable predictor of weaning outcome. Yet, according to Bayes' theorem, no factor has a greater influence on posttest probability than pretest probability [12].

The importance of pretest probability is further emphasized by the results of the second weighted Pearson's correlation analysis. After adjusting for variation in pretest probability (among studies in the ACCP Task Force meta-analysis), this analysis revealed a significant relationship between the reported and predicted posttest probability of f/VT: the relationship for positive-predictive values was r = 0.86 (p < 0.0001) and that for negative-predictive values r = 0.82 (p < 0.0001; Figs. 4 and 5). The relationship was further confirmed by the Bland-Altman analysis (of studies in the meta-analysis): all of the reported positive-predictive values and all but two of the reported negative-predictive values fell within the 95% confidence interval of the predicted values. For the entire study group all of the reported values fell within the 95% confidence limits of the predicted values. The apparent discrepancy between the number of points lying outside the 95% confidence intervals on Figs. 2 and 5 and the Bland-Altman analysis is related to different entities being quantified. The outer curves on Figs. 2 and 3 represent the upper and lower 95% confidence interval for predicted posttest probability at a particular pretest probability. The Bland-Altman analysis (the usual method for quantifying the agreement between a prediction and a reference standard) measures the difference between predicted and reported posttest probability as related to the mean of these two values.
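The Bland-Altman comparison described above can be sketched as follows. This is an illustration under stated assumptions: the limits of agreement are taken as the mean difference ± 1.96 standard deviations (the conventional formulation, which the text does not spell out), the function name is illustrative, and the four pairs of values are hypothetical.

```python
import statistics


def bland_altman(reported, predicted):
    """Return the bias, limits of agreement, pairwise means, and outliers.

    For each study the difference (reported - predicted) is related to the
    mean of the two values. Limits of agreement are assumed to be
    bias +/- 1.96 * SD of the differences.
    """
    diffs = [r - p for r, p in zip(reported, predicted)]
    means = [(r + p) / 2 for r, p in zip(reported, predicted)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    outside = [i for i, d in enumerate(diffs) if not lower <= d <= upper]
    return bias, (lower, upper), means, outside


# Hypothetical positive-predictive values (NOT taken from Table 1).
reported = [0.80, 0.90, 0.85, 0.95]
predicted = [0.82, 0.88, 0.86, 0.93]

bias, (lower, upper), means, outside = bland_altman(reported, predicted)
print(f"bias = {bias:.4f}, limits = ({lower:.3f}, {upper:.3f}), outside = {outside}")
```

Because the limits are built from the spread of the differences across all studies, a point can lie outside the confidence band drawn around the predicted curve yet still fall within the Bland-Altman limits.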

Pretest probability: wide variation, and overall above 0.75

More than one-half the studies were conducted in populations in which the rate of successful outcome was 75% or higher (Table 1). Such a high pretest probability has a major influence on posttest probability [63, 64].

Consider a clinician who obtains a positive reading on a hypothetical weaning-predictor test that has sensitivity 0.90 and specificity 0.90. If pretest probability of weaning success is 0.40, according to Bayes' theorem posttest probability is 0.86. If pretest probability is 0.80, posttest probability is 0.97. The increase between pretest and posttest probability in the second instance (21%, 0.17/0.80) is only a fraction of that in the first instance (115%, 0.46/0.40) despite the sensitivity and specificity being identical. Thus a high pretest probability markedly decreases the apparent reliability of a weaning-predictor test.
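The two posttest probabilities in this hypothetical example follow from Eq. 1 with TPR = 0.90 and FPR = 1 − 0.90 = 0.10; the subscripts denote the pretest probability and are introduced here only for illustration:

$$ PPV_{0.40}=(0.40\times 0.90)/\{(0.40\times 0.90)+[(1 - 0.40)\times 0.10]\}=0.36/0.42\approx 0.86 $$

$$ PPV_{0.80}=(0.80\times 0.90)/\{(0.80\times 0.90)+[(1 - 0.80)\times 0.10]\}=0.72/0.74\approx 0.97 $$

The absolute gains are therefore 0.86 − 0.40 = 0.46 and 0.97 − 0.80 = 0.17, which give the relative increases of 115% and 21% quoted above.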

Spectrum and test-referral bias

When two or more diagnostic tests that are not conditionally independent are used in sequence, spectrum and test-referral bias become almost inevitable [11, 12, 17, 64, 65, 66]. Spectrum bias occurs when a new study population contains fewer (or more) sick patients than the population in which a diagnostic test was originally developed [11, 12, 18]. For example, researchers may obtain measurements of a test that was originally developed to predict the outcome of a weaning trial. The researchers then decide to assess the reliability of that same test in predicting the outcome of a trial of extubation. By design, the researchers must exclude patients who failed the weaning trial. By excluding sicker patients, those failing a weaning trial, the researchers change the spectrum of disease severity in the new population compared with that in the original study population and thus increase pretest probability of success. Evidence for the occurrence of spectrum bias is provided by the lower (average) value of f/VT in 15 studies (where data were reported) than in the original study of Yang and Tobin [13], 77.4 vs. 89.1.

A second form of bias, test-referral bias, occurs when the results of a test under evaluation are used to select patients for the gold-standard test [11, 12]. Consider a weaning-predictor test where its reliability is evaluated in terms of its ability to predict the successful toleration of extubation. If patients are required to pass a weaning trial before extubation, this study-entry requirement necessarily excludes all patients who fail. This step has three effects on the study population; firstly, fewer patients with negative results (of the weaning-predictor test) are included; secondly, relatively more patients with positive results are included; thirdly, pretest probability of success is increased [11, 12]. The first consequence produces a decrease in the specificity of the weaning-predictor test in this population compared with the population in which the test was originally developed. The second consequence increases the sensitivity of the test. (See S.F1 in the Electronic Supplementary Material, which provides a hypothetical example of how test-referral bias leads to changes in pretest probability, sensitivity, and specificity.)
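The three effects of test-referral bias can be illustrated numerically. The sketch below is not the example in the Electronic Supplementary Material: the 2 × 2 counts are hypothetical, and the bias is modeled in a simplified way by assuming that every test-positive patient, but only one-quarter of test-negative patients, proceeds to the gold-standard test.

```python
def summarize(tp, fn, fp, tn):
    """Sensitivity, specificity, and prevalence of success from 2 x 2 counts.

    tp: test predicts success, patient succeeds
    fn: test predicts failure, patient succeeds
    fp: test predicts success, patient fails
    tn: test predicts failure, patient fails
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    prevalence = (tp + fn) / (tp + fn + fp + tn)
    return sensitivity, specificity, prevalence


# Hypothetical unselected population of 100 patients.
full = {"tp": 56, "fn": 4, "fp": 16, "tn": 24}

# Assumption: only 25% of test-negative patients reach the gold standard.
referral_fraction = 0.25
referred = {
    "tp": full["tp"],
    "fn": full["fn"] * referral_fraction,
    "fp": full["fp"],
    "tn": full["tn"] * referral_fraction,
}

sens_full, spec_full, prev_full = summarize(**full)
sens_ref, spec_ref, prev_ref = summarize(**referred)

print(f"unselected: sens = {sens_full:.2f}, spec = {spec_full:.2f}, prev = {prev_full:.2f}")
print(f"referred:   sens = {sens_ref:.2f}, spec = {spec_ref:.2f}, prev = {prev_ref:.2f}")
```

Under these assumptions sensitivity rises, specificity falls, and the prevalence of success (pretest probability) rises, even though the test itself is unchanged.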

Specificity of f/VT in the original report was 0.64. Of subsequent studies free of major problems (excluding [31, 39, 43]), 18 report specificity values for f/VT that are less than 0.64 (Table 1). Sensitivity of f/VT in the original report was 0.97. Because sensitivity has a ceiling of 1.00, a value of 0.97 does not leave much room to detect a further increase in sensitivity (allowing for usual biological noise created in any experiment). Of subsequent studies free of major problems 17 report sensitivity values for f/VT that are greater than 0.90. These lower specificities and high sensitivities provide evidence for the occurrence of test-referral bias.

Screening testing and confirmatory testing

The ACCP Task Force recommendation to bypass a weaning-predictor test and go directly to a weaning trial [14, 15] contravenes a cardinal precept of diagnostic testing: use of a screening test followed by a confirmatory test [10, 11, 12]. Diagnostic testing is commonly seen as a monolithic entity: a test is a test is a test. In reality, diagnostic testing is expected to fulfill two very different demands [10, 11, 12]. One is screening: to pick up cases of a condition at the earliest possible time. This demand requires a test with high sensitivity [10, 11, 12]. The second is confirmation of a condition for which there is already a strong suspicion. This demand requires a test with high specificity [10, 11, 12]. With rare exceptions a single diagnostic test does not satisfy both demands [10, 11, 12]. Thus before evaluating a test's performance, it is imperative to ask to which demand it is directed.

A weaning-predictor test is used to spot the earliest point in time that a patient might tolerate a weaning trial [13]. It serves solely as a screening test. On its own a positive predictor-test result is not used as justification for extubation [67]. Before that step a weaning trial (a confirmatory test) is undertaken. The ideal time to undertake a screening test is when the pretest probability of weaning success is 20% or less [10]. In contrast, weaning trials are commonly performed when the pretest probability of success is 75% or more. None of the 29 studies in Table 1 had a pretest probability under 45%. This finding is not surprising. Physicians know that a weaning trial takes as long as 30 min–2 h to perform, and staff must be available to closely monitor the patient. Thus physicians do not initiate a weaning trial unless they think the patient has a reasonably high likelihood of success.

The development of a reliable screening test hinges on avoiding false-negative results (a test predicting failure, but the patient actually succeeds) [10, 11]. Simultaneously the test needs to pick up every possible true-positive result—the mindset is to miss no patient who can breathe without the ventilator. To capture the maximum meaningful number of true-positive results, the threshold for defining a positive screening test may be set deliberately high [10, 11]. This necessarily increases the number of false-positive results, producing a proportional decrease in specificity.

Sensitivity captures exactly the components that define the reliability of a screening test because it contains only the true-positive and false-negative rates. Likewise, specificity captures exactly the constituents of a reliable confirmatory test: avoidance of false-positive results (a test predicting success, but the patient actually fails) and maximizing true-negative rate [10, 11]. The studies listed in Table 1 reveal sensitivity values for f/VT that are at least 0.90 [22, 24, 25, 27, 30, 37, 44, 45, 47] or at least 0.97 [13, 20, 21, 36, 38, 40, 41]. Thus f/VT constitutes a reliable screening test. In contrast, the sensitivity of a weaning trial as a diagnostic test has never been tested.

Limitations

The studies shown in Figs. 2 and 3 include every study that provided the necessary information on f/VT. We recognize that a case could be made to exclude data from certain studies, for example, those conducted in infants [33, 39, 42], those that included measurements of f/VT during pressure-support ventilation [27, 28, 38, 45], or those in which pretest probability exceeded 88% [27, 36, 42, 47]. The reasons to exclude a study are necessarily arbitrary in nature. Because no study, other than the two studies with inconsistent data [39, 43], was excluded, the relationships between reported posttest probability of f/VT with both pretest probability and predicted posttest probability may be underestimates.

Our data analysis is framed in terms of pretest probability, although that value was not reported directly by authors of the primary studies. We took prevalence of successful outcome as a surrogate for pretest probability because these two terms are used interchangeably in the literature on diagnostic testing [56, 57, 58, 59, 60, 61]. The primary aim of the present study was to determine whether spectrum bias and test-referral bias explain some of the reported variation in f/VT reliability as a screening test (for weaning success). Evidence for spectrum and test-referral bias is provided by the lower values of f/VT and specificity, respectively, in subsequent reports than in the original study. The conclusion would remain the same were we to eliminate all mention of pretest probability, and express our findings in terms of prevalence.

Conclusion

Based on a meta-analysis of likelihood ratios, an ACCP Task Force concluded that f/VT is not a reliable predictor of weaning success. The included studies, however, exhibited significant heterogeneity (p < 0.00001), a factor that nullifies a meta-analysis. The heterogeneity in pretest probability (prevalence of successful outcome) most likely resulted from spectrum and test-referral bias. When data from 29 studies were entered into a Bayesian model with pretest probability as the operating point, the observed posttest probabilities were closely correlated with the values predicted by the original study on f/VT (p < 0.0001).

A separate problem was the Task Force's failure to focus on the goal of a weaning-predictor test: to screen for weanability. Likelihood ratio is not precisely suited to assessing screening-test reliability (because it includes constituents not directly relevant), whereas sensitivity solely captures the vital components. The average reported sensitivity of f/VT was 0.87. Thus contrary to the conclusion reached by the ACCP Task Force, the facts included in the aggregated studies show that f/VT is a reliable predictor of weaning success.