Estimation of sensitivity and specificity of diagnostic tests and disease prevalence when the true disease state is unknown

https://doi.org/10.1016/S0167-5877(00)00117-3

Abstract

The performance of a new diagnostic test is frequently evaluated by comparison to a perfect reference test (i.e. a gold standard). In many instances, however, a reference test is less than perfect. In this paper, we review methods for estimation of the accuracy of a diagnostic test when an imperfect reference test with known classification errors is available. Furthermore, we focus our presentation on available methods of estimation of test characteristics when the sensitivity and specificity of both tests are unknown. We present some of the available statistical methods for estimation of the accuracy of diagnostic tests when a reference test does not exist (including maximum likelihood estimation and Bayesian inference). We illustrate the application of the described methods using data from an evaluation of a nested polymerase chain reaction and microscopic examination of kidney imprints for detection of Nucleospora salmonis in rainbow trout.

Introduction

The sensitivity and specificity of a test are usually determined by comparison with a reference test (often referred to as a “gold standard”), which is supposed to determine the true disease state of the animals unambiguously (Office International des Epizooties, 1996; Greiner and Gardner, 2000). When a gold standard is available, sensitivity and specificity can be estimated directly (Kraemer, 1992). The true disease state, however, is rarely known in practice, because perfect test results may be difficult or impossible to obtain (Tyler and Cullor, 1989).
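
As a minimal sketch of the direct estimation mentioned above, assuming a perfect reference test has classified every animal, sensitivity and specificity reduce to two simple proportions. The function name and the counts below are hypothetical and serve only as illustration; they are not data from this paper.

```python
def gold_standard_accuracy(tp, fn, fp, tn):
    """Direct estimates of Se and Sp when the true disease state is known.

    tp, fn: diseased animals testing positive / negative on the new test
    fp, tn: non-diseased animals testing positive / negative on the new test
    """
    sensitivity = tp / (tp + fn)  # proportion of diseased animals detected
    specificity = tn / (tn + fp)  # proportion of non-diseased animals cleared
    return sensitivity, specificity


# Hypothetical counts, for illustration only.
se, sp = gold_standard_accuracy(tp=45, fn=5, fp=8, tn=142)
print(f"Se = {se:.3f}, Sp = {sp:.3f}")
```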

If classification errors in the reference test are ignored, serious bias may be introduced in the estimated accuracy of the new test (Staquet et al., 1981; Valenstein, 1990). However, when the error probabilities of the reference test are known, it is possible to obtain unbiased estimates of the accuracy of the test in question (Gart and Buck, 1966; Staquet et al., 1981). The estimation is based on the assumption that the classification errors in the reference and the new test are independent, conditional on the true disease state. Estimation is also possible when conditional independence is not assumed (Thibodeau, 1981).

Hui and Walter (1980) considered the case where two tests (both with unknown sensitivity and specificity) were simultaneously applied to individuals from two populations with different prevalences of disease. They showed that, assuming conditional independence, the sensitivity and specificity of both tests, as well as the true prevalence in both populations, could be estimated by maximum likelihood (ML). A thorough discussion of the applicability of the method in other settings (such as the case with one population and three or more tests) is given by Walter and Irwig (1988). Bayesian methodology has also been used for the model proposed by Hui and Walter and for similar models (Joseph et al., 1995; Johnson et al., 2000). Computations are accomplished by Gibbs sampling (Gelfand and Smith, 1990). Hui and Zhou (1998) presented an overview of available methods for diagnostic test evaluation with an emphasis on methodology for estimation of sensitivity and specificity (without need for the assumption of conditional independence).
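
To make the structure of the two-test, two-population model concrete, the cell probabilities under conditional independence can be written out as below. The symbols are notation assumed here for illustration (they are not taken from the source): \pi_k is the prevalence in population k, and Se_j and Sp_j are the sensitivity and specificity of test j.

```latex
% Cell probabilities in population k (k = 1, 2), assuming the two tests
% are conditionally independent given the true disease state.
\begin{align*}
P_k(T_1^+, T_2^+) &= \pi_k\, Se_1\, Se_2 + (1-\pi_k)(1-Sp_1)(1-Sp_2) \\
P_k(T_1^+, T_2^-) &= \pi_k\, Se_1 (1-Se_2) + (1-\pi_k)(1-Sp_1)\, Sp_2 \\
P_k(T_1^-, T_2^+) &= \pi_k (1-Se_1)\, Se_2 + (1-\pi_k)\, Sp_1 (1-Sp_2) \\
P_k(T_1^-, T_2^-) &= \pi_k (1-Se_1)(1-Se_2) + (1-\pi_k)\, Sp_1\, Sp_2
\end{align*}
```

The six unknowns (\pi_1, \pi_2, Se_1, Sp_1, Se_2, Sp_2) are matched by the six degrees of freedom supplied by two 2×2 tables, which is why two populations with different prevalences are needed.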

Although not widely adopted, the Hui and Walter model (and models similar to it) has been applied in statistical research (McClish and Quade, 1985; Vacek, 1985; Ashton and Moeschberger, 1988; Walter and Irwig, 1988; Qu et al., 1996; Sinclair and Gastwirth, 1996; Weng, 1996; Torrance-Rynard and Walter, 1997; Johnson and Pearson, 1999; Johnson et al., 2000) and in human medical science (van Ulsen et al., 1986; Shaw et al., 1987; Walter et al., 1991; de Bock et al., 1994; Faraone and Tsuang, 1994; Faraone et al., 1996; Line et al., 1997; McDermott et al., 1997; Mahoney et al., 1998; Rybicki et al., 1998). The methods have been introduced only recently in the evaluation of diagnostic tests used for detection of animal disease (Spangler et al., 1992; Agger et al., 1997; Chriel and Willeberg, 1997; Enøe et al., 1997; Sørensen et al., 1997; Willeberg et al., 1997; Georgiadis et al., 1998; Singer et al., 1998).

In this paper, we describe methods of estimating the sensitivity and specificity of diagnostic tests and disease prevalence when the true disease state is unknown. We present methods beginning with the case where an imperfect reference test is available, and we ultimately give special emphasis to the model and methods described by Hui and Walter (1980). Methods are illustrated using data from Georgiadis et al. (1998).

Reference test with known sensitivity and specificity

Consider the case where the true disease state of animals cannot be determined perfectly, but where the sensitivity (SeR) and the specificity (SpR) of a reference test are presumed known. When each individual animal in a random sample of size n is tested by a new diagnostic test and a reference test, four outcomes are possible: both tests positive (T1+, T2+; denoted a); one test positive and one negative (T1+, T2−; denoted b) and (T1−, T2+; denoted c); both tests negative (T1−, T2−; denoted d).
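
A minimal sketch of how the four cell counts can be combined with a known SeR and SpR is given below. It assumes conditional independence of the two tests and, as a labelling assumption not stated in the text above, that T1 is the new test and T2 is the reference test. The function name is hypothetical; the expressions follow from solving the cell probabilities for the unknowns, in the spirit of the estimators of Gart and Buck (1966) and Staquet et al. (1981).

```python
def adjust_for_imperfect_reference(a, b, c, d, se_ref, sp_ref):
    """Estimate prevalence, Se and Sp of a new test from a 2x2 table.

    Assumed cell layout (T1 = new test, T2 = reference test):
      a = T1+ T2+, b = T1+ T2-, c = T1- T2+, d = T1- T2-
    se_ref, sp_ref: known sensitivity and specificity of the reference test.
    """
    n = a + b + c + d
    youden = se_ref + sp_ref - 1.0
    if youden <= 0:
        raise ValueError("reference test must satisfy SeR + SpR > 1")

    # Expected numbers of diseased and non-diseased animals in the sample,
    # obtained by correcting the reference-positive count (a + c).
    n_diseased = ((a + c) - n * (1.0 - sp_ref)) / youden
    n_healthy = n - n_diseased
    if n_diseased <= 0 or n_healthy <= 0:
        raise ValueError("counts are incompatible with the stated SeR and SpR")

    prevalence = n_diseased / n
    se_new = (a * sp_ref - b * (1.0 - sp_ref)) / (youden * n_diseased)
    sp_new = (d * se_ref - c * (1.0 - se_ref)) / (youden * n_healthy)
    return prevalence, se_new, sp_new
```

With small samples these estimates can fall outside the interval from 0 to 1, so the results should be checked before they are reported.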

Methods of estimation and computational techniques for the Hui and Walter model

ML estimates are the set of parameter values that are most “likely” to have generated the observed data and are obtained by maximizing the likelihood function (Tanner, 1996). Variances are obtained by calculating the observed Fisher information matrix and inverting it (Gelman et al., 1995, p. 100). The square roots of the diagonal elements of the inverse matrix are the corresponding S.E.s. ML estimates have many optimal properties when sample sizes are large. They are asymptotically unbiased and efficient
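
One way to maximize the likelihood of the two-test model under conditional independence is an EM iteration, sketched below in Python. This is an illustrative sketch, not the authors' implementation: the function name, the starting values and the stopping rule are assumptions, and S.E.s are not computed. Because the likelihood has a mirror-image solution (with Se and Sp replaced by 1 − Sp and 1 − Se), the starting values are placed above 0.5 so that the iteration tends towards the plausible solution.

```python
def hui_walter_em(tables, max_iter=20000, tol=1e-10):
    """EM estimates for two conditionally independent tests, no gold standard.

    tables: one (a, b, c, d) tuple per population, with
      a = T1+ T2+, b = T1+ T2-, c = T1- T2+, d = T1- T2-.
    Returns a dict with prevalences, sensitivities and specificities.
    """
    patterns = [(1, 1), (1, 0), (0, 1), (0, 0)]  # (T1, T2) results for a, b, c, d
    prev = [0.5] * len(tables)                   # assumed starting values
    se, sp = [0.8, 0.8], [0.8, 0.8]

    for _ in range(max_iter):
        dis_total = non_total = 0.0
        dis_pos = [0.0, 0.0]  # expected diseased animals positive on each test
        non_neg = [0.0, 0.0]  # expected non-diseased animals negative on each test
        new_prev = []
        for k, counts in enumerate(tables):
            dis_k = 0.0
            for results, x in zip(patterns, counts):
                if x == 0:
                    continue
                # E-step: probability that an animal in this cell is diseased.
                like_d, like_n = prev[k], 1.0 - prev[k]
                for j, t in enumerate(results):
                    like_d *= se[j] if t else (1.0 - se[j])
                    like_n *= (1.0 - sp[j]) if t else sp[j]
                w = like_d / (like_d + like_n)
                dis_k += x * w
                non_total += x * (1.0 - w)
                for j, t in enumerate(results):
                    if t:
                        dis_pos[j] += x * w
                    else:
                        non_neg[j] += x * (1.0 - w)
            dis_total += dis_k
            new_prev.append(dis_k / sum(counts))

        # M-step: update parameters from the expected counts.
        new_se = [dis_pos[j] / dis_total for j in range(2)]
        new_sp = [non_neg[j] / non_total for j in range(2)]
        old = prev + se + sp
        prev, se, sp = new_prev, new_se, new_sp
        if max(abs(u - v) for u, v in zip(old, prev + se + sp)) < tol:
            break

    return {"prevalence": prev, "se": se, "sp": sp}
```

The function takes one tuple of four counts per population, for example `hui_walter_em([(a1, b1, c1, d1), (a2, b2, c2, d2)])`. EM was chosen for the sketch because each step has a closed form; a Newton–Raphson iteration would additionally yield the information matrix needed for S.E.s.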

Assumptions

The methods presented in this paper rest on several assumptions that, if not carefully considered, can seriously invalidate the results.

In the models proposed by Gart and Buck (1966), Staquet et al. (1981) and Hui and Walter (1980), the two tests are assumed to be conditionally independent. The assumption of conditional independence implies that given that an animal is diseased (or not), the probability of positive (or negative) outcomes for T1 is the same regardless of a

Illustrations

To illustrate the reviewed methods and models, we used data from Georgiadis et al. (1998) who evaluated a nested polymerase chain reaction (PCR) test (Barlough et al., 1995) and microscopic examination (ME) of kidney imprints for detection of the microsporidian parasite Nucleospora salmonis in rainbow trout.

Briefly, Georgiadis et al. (1998) used the NR and the EM algorithms to assess the accuracy of the PCR test and ME using the Hui and Walter two-population model. Thus, some of the results in

Conclusions

When evaluating a new diagnostic test, it is generally wise to assume that the sensitivity and specificity of the reference test are not precisely known, and to use available methods to estimate them as well. For this purpose, we advocate the use of the ML methods when two or more populations can be sampled, and when the assumptions for the Hui and Walter model can be justified. A simple Newton–Raphson approach should suffice when the cells of the 2×2 tables displaying the cross-classified test

Acknowledgements

The study was supported in part by the NRI Competitive Grants Program/USDA Award No. 98-35204-6535. Additionally, we thank S. Andersen, J. Barlough, W. Cox, I.A. Gardner, R.P. Hedrick, L.M. Pearson, R. Singh and M. Thurmond for valuable contributions to this work.

References (64)

  • Andersen, S., 1997. Re: Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard. Am. J. Epidemiol.
  • Ashton, J.J., Moeschberger, M.L., 1988. An SAS macro for estimating the error rates of two diagnostic tests, neither...
  • Barlough, J.E., et al., 1995. Nested polymerase chain reaction for detection of Enterocytozoon salmonis genomic DNA in chinook salmon Oncorhynchus tshawytscha. Dis. Aquat. Org.
  • Bedrick, E.J., et al., 1997. Bayesian binomial regression: predicting survival at a trauma center. Am. Statist.
  • Brenner, H., 1996. How independent are multiple “independent” diagnostic classifications. Statist. Med.
  • Brenner, H., et al., 1997. Variation of sensitivity, specificity, likelihood ratios and predictive values with disease prevalence. Statist. Med.
  • Brookmeyer, R., Gail, M.H., 1994. AIDS Epidemiology: A Quantitative Approach. Oxford University Press, London, 354...
  • Choi, B.C.K., 1997. Causal modeling to estimate sensitivity and specificity of a test when prevalence changes. Epidemiology.
  • Chriel, M., et al., 1997. Dependency between sensitivity, specificity and prevalence analysed by means of Gibbs sampling. Epidémiologie et Santé Animale.
  • Cook, R.D., 1977. Detection of influential observations in linear regression. Technometrics.
  • Dempster, A., et al., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B.
  • Enøe, C., et al., 1997. Estimation of the sensitivity and the specificity of two diagnostic tests for the detection of antibodies against Actinobacillus pleuropneumoniae serotype 2 in pigs by maximum-likelihood-estimation and Gibbs sampling. Epidémiologie et Santé Animale.
  • Faraone, S.V., et al., 1994. Measuring diagnostic accuracy in the absence of a gold standard. Am. J. Psychiatr.
  • Faraone, S.V., et al., 1996. Diagnostic accuracy and confusability analyses: an application to the diagnostic interview for genetic studies. Psychol. Med.
  • Gart, J.J., et al., 1966. Comparison of a screening test and a reference test in epidemiologic studies. II. A probabilistic model for the comparison of diagnostic tests. Am. J. Epidemiol.
  • Gastwirth, J.L., 1987. The statistical precision of medical screening tests. Statist. Sci.
  • Gastwirth, J.L., et al., 1991. Bayesian analysis of screening data: application to AIDS in blood donors. Can. J. Statist.
  • Gelfand, A., et al., 1990. Sampling-based approaches to calculating marginal densities. J. Am. Statist. Assoc.
  • Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B., 1995. Bayesian Data Analysis. Chapman & Hall, London, 528...
  • Geisser, S., et al., 1992. Optimal administration of dual screening tests for detecting a characteristic with special reference to low prevalence diseases. Biometrics.
  • Georgiadis, M.P., et al., 1998. Field evaluation of sensitivity and specificity of a polymerase chain reaction (PCR) for detection of N. salmonis in rainbow trout. J. Aquat. Anim. Health.
  • Hadgu, A., et al., 1998. A biomedical application of latent class models with random effects. Appl. Statist.