INTRODUCTION

Diagnostic error adversely affects patients and healthcare systems. To achieve diagnostic excellence, correct diagnostic test interpretation is a prerequisite. Two related heuristics designed to help interpret diagnostic tests—SpPin and SnNout—have been taught for decades by leaders in the field of evidence-based medicine.1,2 SpPin indicates that when Specificity is high, a Positive result rules in the disease in question, and SnNout indicates that when Sensitivity is high, a Negative result rules out the disease in question. Our experience over years teaching diagnostic reasoning to hundreds of medicine residents and faculty at eight academic medical centers is that these heuristics are universally known and frequently relied upon to evaluate the utility of diagnostic tests.

Unfortunately, relying on SpPin and SnNout can be maladaptive, increasing diagnostic error. Previous publications warning about limitations of SpPin and SnNout have focused on data quality issues (risk of bias, imprecision, and generalizability) or have used complicated formulas many find difficult to understand.3,4 This paper improves upon the existing literature, using simple examples without formulas to illustrate the limitations of SpPin and SnNout that exist even when data for test characteristics are of high quality (large representative sample with low risk of bias). In addition, we demonstrate that to effectively evaluate the utility of diagnostic tests, one must rely on likelihood ratios interpreted in the context of pretest probability, rather than rely on these heuristics.

THE ORIGINS OF SPPIN AND SNNOUT

The SnNout heuristic was conceived over three decades ago in the context of a test that was reported to have 100% sensitivity.1 Sensitivity is the probability of a positive test result among patients with the disease in question. Specificity is the probability of a negative test result among patients without the disease in question, and the SpPin heuristic was subsequently conceived as a counterpart to SnNout.1 SpPin and SnNout are guaranteed to work when the corresponding test characteristic is 100%. However, the use of the heuristics has expanded over time to include tests with sensitivity or specificity less than 100%, but with values still considered “high.”1,2,3

WHAT CONSTITUTES “RULING IN” AND “RULING OUT”?

When tests are less than 100% accurate (which is almost always the case), residual diagnostic uncertainty will exist. As a practical matter, we consider a disease ruled out when its probability is less than some low threshold (justifying abandonment of further testing for that disease) and ruled in when its probability is greater than some high threshold (justifying initiation of treatment for that disease without further testing).1,2 Therefore, a test’s utility for ruling in or ruling out disease depends on a patient’s posttest probability.

THE PROBLEMS WITH SPPIN AND SNNOUT

1) Neither Sensitivity Nor Specificity Should Be Considered in Isolation of the Other

Correctly assessing how a test result changes probability of disease requires information about test performance in patients both with and without the disease in question. Neither sensitivity nor specificity contains that information. Likelihood ratios do.

The likelihood ratio (LR) for a given test result is the probability of that result among patients with the disease in question divided by the probability of the same result among patients without that disease. When LR = 1, the result is equally likely in both groups and does not affect probability of disease (pretest probability = posttest probability). When LR > 1, probability of disease increases, and when LR < 1, probability of disease decreases. The further away from one in either direction, the greater the change in probability, with possible values for LR ranging from zero to infinity.

The following exercise illustrates the importance of relying on LR rather than sensitivity or specificity alone. Consider test characteristics of three available tests—A, B, and C—for a certain disease (Table 1). According to SpPin and SnNout, test A is best for ruling in disease (highest specificity) and test B is best for ruling out disease (highest sensitivity). In truth, test C is best at both because it has both the largest and smallest LR and thus generates the highest posttest probability when positive and lowest posttest probability when negative. SpPin and SnNout get it wrong.

Table 1 Likelihood Ratios* Are Superior to SpPin and SnNout—an Illustrative Example

A second exercise, using a real-world example, is further illuminating. Kernig’s sign has 5% sensitivity and 95% specificity for meningitis.5 According to SpPin and SnNout, this test can rule in meningitis when positive, but cannot rule out meningitis when negative. In truth, because the LR for a positive result (LR+) = 1 and the LR for a negative result (LR−) = 1, probability of meningitis does not change with either result. For any dichotomous test, whenever sensitivity + specificity = 100%, both likelihood ratios equal 1 and the test is uninformative.
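The Kernig’s sign arithmetic can be checked directly. The sketch below uses the standard formulas LR+ = sensitivity/(1 − specificity) and LR− = (1 − sensitivity)/specificity; the function name is ours, chosen for illustration.

```python
def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Return (LR+, LR-) for a dichotomous test."""
    # LR+ = P(positive | disease) / P(positive | no disease)
    lr_pos = sensitivity / (1 - specificity)
    # LR- = P(negative | disease) / P(negative | no disease)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg

# Kernig's sign: 5% sensitivity, 95% specificity
lr_pos, lr_neg = likelihood_ratios(0.05, 0.95)
print(round(lr_pos, 6), round(lr_neg, 6))  # 1.0 1.0 — neither result changes probability
```

Because both likelihood ratios equal 1, a positive or negative Kernig’s sign leaves the probability of meningitis exactly where it started.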

2) Pretest Probability Matters

When sensitivity and specificity are both high, SpPin and SnNout are still unreliable, particularly when a patient’s pretest probability is far from the rule-in (for SpPin) or rule-out (for SnNout) threshold. A classic example is a positive HIV antibody test in a patient with a very low pretest probability of HIV (1 in 10,000). Even if specificity = 99.8%—a seemingly clear example of SpPin—and sensitivity = 100% (LR+ = 500), posttest probability after a positive test is just 5%, which is clearly inadequate to rule in HIV and initiate treatment.6
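The HIV figure follows directly from Bayes’ rule (pretest odds × LR = posttest odds). A minimal sketch, using only the numbers stated above:

```python
# Pretest probability of 1 in 10,000; sensitivity 100%, specificity 99.8%
pretest_prob = 1 / 10_000
lr_positive = 1.0 / (1 - 0.998)  # sensitivity / (1 - specificity) = 500

# Bayes' rule: pretest odds x LR = posttest odds, then convert back to probability
pretest_odds = pretest_prob / (1 - pretest_prob)
posttest_odds = pretest_odds * lr_positive
posttest_prob = posttest_odds / (1 + posttest_odds)

print(f"{posttest_prob:.1%}")  # roughly 5%, far below any rule-in threshold
```

Despite an LR+ of 500, the very low pretest probability keeps the posttest probability near 5%, which is why SpPin fails here.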

3) Most Tests Are Not Truly Dichotomous

Sensitivity and specificity are numbers that imply there are only two possible test results. Such tests are rare in the real world. Physical exam maneuvers and imaging studies tend to be ordinal, with results such as “negative,” “indeterminate,” and “positive” (e.g., chest X-ray for pneumonia), and blood tests tend to be continuous, with essentially infinite possible results (e.g., B-type natriuretic peptide for heart failure).

Dichotomizing non-dichotomous tests introduces measurement error and leads to mistakes. The solution is to instead use multilevel LRs to maximize a test’s utility.7 For example, in a recent study evaluating ultrasound measurement of jugular venous pressure for diagnosis of elevated central venous pressure, authors dichotomized the test and reported a sensitivity of 73% and specificity of 79% (LR+ = 3.4, LR− = 0.34). While this test would not be considered very helpful according to SpPin or SnNout, a more useful reanalysis demonstrated six distinct levels of test results with unique LRs that ranged from zero to infinity.8
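Multilevel LRs extend the same ratio to each result level: the LR for a given level is the probability of that level among patients with the disease divided by its probability among patients without it. The counts below are hypothetical illustration values (not the data from the cited ultrasound study), chosen only to show the mechanics:

```python
# Hypothetical counts of result levels in 100 patients with and 100 without disease
disease =    {"negative": 5,  "indeterminate": 15, "positive": 80}
no_disease = {"negative": 70, "indeterminate": 20, "positive": 10}

n_disease = sum(disease.values())
n_no_disease = sum(no_disease.values())

# LR for each level = P(level | disease) / P(level | no disease)
for level in disease:
    lr = (disease[level] / n_disease) / (no_disease[level] / n_no_disease)
    print(f"{level}: LR = {lr:.2f}")
```

Note that the intermediate level carries its own LR rather than being forced into the “positive” or “negative” column, which is the information dichotomization throws away.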

OUT WITH THE OLD RULE, IN WITH THE OLDER RULE

Bayes’ Rule: Pretest Odds × Likelihood Ratio = Posttest Odds1,7

Bayes’ rule considers test performance in patients both with and without the disease in question, incorporates pretest probability, does not require dichotomization, and allows for easy comparison between posttest probability and decision-making thresholds. While the advantages of this approach have long been recognized, including by evidence-based medicine experts who simultaneously taught SpPin and SnNout,1,2 it seems that most learners over the past several decades have retained only the heuristics, perhaps due to their simplicity. Fortunately, with the availability of a handy nomogram1,2 and more recently, online calculators (e.g., https://sample-size.net/post-probability-calculator-test-new/), clinicians need not worry about memorizing formulas, converting between probability and odds, or making any calculations on their own.
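For readers who prefer code to a nomogram, the whole calculation reduces to a few lines. This is a minimal sketch of Bayes’ rule as stated above; the helper name `posttest_probability` is ours, chosen for illustration:

```python
def posttest_probability(pretest_prob: float, lr: float) -> float:
    """Apply a test result's likelihood ratio to a pretest probability."""
    pretest_odds = pretest_prob / (1 - pretest_prob)   # probability -> odds
    posttest_odds = pretest_odds * lr                  # Bayes' rule
    return posttest_odds / (1 + posttest_odds)         # odds -> probability

# e.g., pretest probability 30% and a test result with LR = 10
print(f"{posttest_probability(0.30, 10):.0%}")  # 81%
```

A sanity check on the definition: an LR of 1 leaves the probability unchanged, and larger or smaller LRs move it toward 100% or 0%, respectively.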

LIMITATIONS TO OUR APPROACH

First, when data quality issues are present, LR estimates will be unreliable. However, the same limitations will apply to SpPin and SnNout,3 and Bayes’ rule will still improve upon the heuristics by incorporating pretest probability. Second, accurately estimating a patient’s pretest probability can be difficult. Likewise, finding the correct LR for a test result can be difficult because diagnostic accuracy studies often report dichotomized test characteristics for non-dichotomous tests. However, previously described strategies for estimating pretest probability and for using multiple levels to interpret data from diagnostic accuracy studies can be used to help overcome these challenges.1,2,7

AN ALTERNATIVE TO OUR APPROACH: THE LIKELIHOOD RATIO HEURISTIC

When teaching diagnostic test interpretation and Bayes’ rule, some evidence-based medicine experts have promoted an alternative heuristic that goes something like this: LRs greater than 10 or less than 0.1 are very powerful and often conclusive; LRs ranging from 5 to 10 or 0.1 to 0.2 have a moderate effect on probability; LRs ranging from 2 to 5 or 0.2 to 0.5 have a small effect on probability; and LRs ranging from 0.5 to 2 are rarely helpful.9 We agree that it can be useful for learners to get a feel for the impact of different LRs in order to develop an innate sense of how “good” a test result is based on its LR. However, it is important to contextualize the utility of LRs in terms of the varied magnitudes of effect they will have at different pretest probabilities, the potential availability of other independent tests, and decision-making thresholds.1 Even tests with very modest LRs can appropriately change patient management, depending on these other factors.

CONCLUSION

Although SpPin and SnNout were conceived as well-intentioned teaching tools, their multiple flaws mean it is time to retire them. A reinvigorated emphasis on considering pretest probability and using multilevel LRs with Bayes’ rule is needed in medical education at all levels, as part of the greater effort in healthcare to achieve diagnostic excellence.