Original Article
Evaluation of a preliminary physical function item bank supported the expected advantages of the Patient-Reported Outcomes Measurement Information System (PROMIS)

https://doi.org/10.1016/j.jclinepi.2006.06.025

Abstract

Objective

The Patient-Reported Outcomes Measurement Information System (PROMIS) was initiated to improve precision, reduce respondent burden, and enhance the comparability of health outcomes measures. We used item response theory (IRT) to construct and evaluate a preliminary item bank for physical function assuming four subdomains.

Study Design and Setting

Data from seven samples (N = 17,726) using 136 items from nine questionnaires were evaluated. A generalized partial credit model was used to estimate item parameters, which were normed to a mean of 50 (SD = 10) in the US population. Item bank properties were evaluated through Computerized Adaptive Test (CAT) simulations.

Results

IRT requirements were fulfilled by 70 items covering activities of daily living, lower extremity functions, and central body functions. The original item context partly affected parameter stability. Items on upper body function and on the need for aids or devices did not fit the IRT model. In simulations, a 10-item CAT eliminated floor effects and decreased ceiling effects, achieving a small standard error (<2.2) across scores from 20 to 50 (reliability >0.95 for a representative US sample). This precision was not achieved over a similar range by any comparable fixed-length item set.

Conclusion

The methods of the PROMIS project are likely to substantially improve measures of physical function and to increase the efficiency of their administration using CAT.

Introduction

Over the past several decades, the use of patient-reported outcomes (PROs) in clinical studies has steadily increased, as has their importance in evaluating therapies and developing treatment plans. The plethora of outcome tools available today allows increasingly specific assessment of a range of domains related to health and well-being, but with two major limitations.

First, health outcomes research has produced a number of well-validated instruments [1], but the most precise and comprehensive questionnaires are rather lengthy and complex, creating a level of respondent burden that hampers recruitment, limits the representativeness of the patient population being studied, and causes substantial missing-data problems. This is particularly important when several constructs are measured. Thus, the most popular health profile instruments are relatively short questionnaires (e.g., SF-36® Health Survey [2], [3]), and even for the measurement of one specific domain, like physical function, brief questionnaires are mostly favored (e.g., Health Assessment Questionnaire [HAQ] [4]). These shorter questionnaires trade measurement precision, range, and other desirable attributes for practicality. The short forms are useful for measuring the health status of larger groups, but the loss of precision is of greater concern when groups are small or when scores are estimated for individual patients to guide clinical decision making [5].

A second major limitation has been that results from different questionnaires are difficult to compare, even when two similar instruments assess the same outcomes for the same illness, such as measuring the disability of rheumatic patients with the HAQ [4] or Western Ontario and McMaster Universities arthritis index (WOMAC) [6]. The situation is as if leukocyte counts assessed in different settings were not comparable with one another, but were dependent on the particular laboratory used. There is a strong need to develop a standardized, efficient approach to outcome measurement for a variety of clinical applications including population monitoring, clinical trials research, and individual patient monitoring, so that results can be compared across conditions, therapies, trials, and patients.

Use of Item Response Theory (IRT) to build item banks and Computerized Adaptive Tests (CATs) is believed to be a promising solution to both problems. An item bank consists of a set of items measuring the same concept, together with a description of the items' measurement properties based on IRT models [7]. IRT [8], [9] describes the probability of choosing each response on a questionnaire item as a function of the latent trait measured by the items (referred to as the IRT score or theta [θ]) [10], [11]. On the basis of the IRT models, the latent trait can be estimated from the responses to any subset of items in the bank [7]. Accordingly, researchers or clinicians can select the items that are most relevant for a given group or individual patient and score the responses on a common ruler that is independent of the choice of items. Further, if the item bank contains items from established questionnaires, scores on those questionnaires can be predicted from estimates of the latent trait; an IRT item bank thus also allows comparisons between results from different questionnaires. The item bank is not static, but can be continuously expanded and improved with additional items.
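To make the IRT machinery concrete: the study calibrates items with a generalized partial credit model (GPCM), in which each response category's probability depends on the latent trait θ, an item slope, and step difficulties. The sketch below implements the standard GPCM category probabilities; the parameter values are purely illustrative, not the paper's estimates.

```python
import math

def gpcm_probs(theta, discrimination, thresholds):
    """Category response probabilities for one polytomous item under
    the generalized partial credit model (GPCM).

    theta          -- latent physical-function score (IRT theta)
    discrimination -- item slope (a)
    thresholds     -- step difficulties (b_1 .. b_m)
    Returns [P(response = 0 | theta), ..., P(response = m | theta)].
    """
    # Cumulative sums of a * (theta - b_v); the v = 0 term is fixed at 0.
    cumsums = [0.0]
    for b in thresholds:
        cumsums.append(cumsums[-1] + discrimination * (theta - b))
    numer = [math.exp(c) for c in cumsums]
    total = sum(numer)
    return [n / total for n in numer]

# Example: a 3-category item ("no difficulty" .. "unable to do") with
# illustrative parameters, evaluated at average physical function.
probs = gpcm_probs(theta=0.0, discrimination=1.5, thresholds=[-1.0, 0.5])
```

Summing the category probabilities always yields 1, and for a respondent between the two step difficulties the middle category is the most likely, which is the behavior the calibration exploits.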

An IRT-based item bank also provides the foundation for CATs [12], [13], [14]. CATs make it possible to select the most informative items from the item bank for each individual patient according to his or her level of the latent trait, and to administer only those items. Thus, the higher precision needed for individual patient measurement is achieved while respondent burden is kept under control [15], [16], [17], [18], [19]. We demonstrated these advantages earlier for the Headache Impact Test [20], [21] and the Anxiety CAT [22], and have developed CATs for all SF-36 domains (see subsequent paper in this series).
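The adaptive loop just described can be sketched in a few lines. For brevity this sketch uses dichotomous 2PL items rather than the paper's polytomous GPCM, maximum Fisher information for item selection, and a grid-based expected-a-posteriori (EAP) score update; the bank and all parameter values are illustrative assumptions.

```python
import math, random

def p2pl(theta, a, b):
    # 2PL response probability (a simplification of the GPCM used in the paper)
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    # Fisher information of a 2PL item at theta
    p = p2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def eap(responses):
    # Expected-a-posteriori theta with a standard-normal prior,
    # evaluated on a fixed grid from -4 to 4.
    grid = [i / 10 for i in range(-40, 41)]
    post = []
    for t in grid:
        like = math.exp(-t * t / 2)              # N(0, 1) prior weight
        for (a, b), x in responses:
            p = p2pl(t, a, b)
            like *= p if x else (1.0 - p)
        post.append(like)
    z = sum(post)
    return sum(t * w for t, w in zip(grid, post)) / z

def run_cat(bank, true_theta, n_items=10, rng=None):
    """Repeatedly administer the most informative remaining item at the
    current theta estimate, then update the estimate."""
    rng = rng or random.Random(0)
    remaining, responses, theta = list(bank), [], 0.0
    for _ in range(n_items):
        item = max(remaining, key=lambda ab: information(theta, *ab))
        remaining.remove(item)
        x = rng.random() < p2pl(true_theta, *item)   # simulated answer
        responses.append((item, x))
        theta = eap(responses)
    return theta

# Toy bank: 50 items with difficulties spread across the trait range.
bank = [(1.5, -3 + 0.12 * i) for i in range(50)]
est = run_cat(bank, true_theta=1.0)
```

The essential point mirrors the simulation design in this study: only 10 of the 50 items are administered, yet each one is chosen where it is most informative for that respondent.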

To systematically apply IRT and CATs in its studies, the National Institutes of Health (NIH) recently initiated the development of the Patient-Reported Outcomes Measurement Information System (PROMIS) (http://nihroadmap.nih.gov/). This trans-NIH initiative aims “to revolutionize the way patient-reported outcome tools are selected and employed in clinical research and practice evaluation” (http://www.nihpromis.org). Five domains are being assessed initially: physical function, pain, fatigue, mental health, and role functioning, across six Primary Research Sites and a Statistical Coordinating Center. This study describes the pilot development and analysis of a preliminary item bank for physical function.

The physical function construct has been evaluated using IRT methods for more than a decade [23], [24], [25], [26], [27], [28], [29], [30], [31], [32]. Items covering a wide range of physical activity levels, from self-care (e.g., bathing and dressing) to performance of vigorous physical activities (e.g., running, strenuous sports), usually can be sufficiently calibrated on a common metric to satisfy the assumptions of IRT models. These studies, like others using IRT techniques to rescore PRO measurements [33], [34], generally show that the use of IRT techniques is superior to the use of classical test theory. In principle, the same measurement assumptions are made in classical test theory as in IRT. However, the explicit formulation of requirements in IRT may force us to refine our assumptions about a construct itself and which particular items or subdomains may or may not be included [35]. We are aware of just two completed projects to build IRT-based CATs (in rehabilitation research) for physical function [36], [37]. Thus, despite the work done, we are still at the early stages, compared to the ambitious goals of the PROMIS initiative. However, preliminary work, including this article, is essential to identify issues and resolve problems as a necessary precursor to reach these aims.

Within this article, we analyze steps by which an item bank for physical function can be built, and describe how IRT scores and traditional instruments can be compared and their relative utility assessed.


Methods

An overview of the following different steps of the analyses is shown in Fig. 1.

Item selection

All data sets were screened systematically for items covering the physical function construct. After review, 136 items were used for the data analysis.

Skewness

From the GPF sample, we excluded four SIP items (staying in bed most of the time, do not use stairs at all, walk only with help, do not walk at all) that were too easy to be applied to the general population (more than 95% respondents chose the easiest response choice). For the same reason, we excluded eleven items from the HIE sample (help with
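The exclusion rule applied above (more than 95% of respondents choosing the easiest response) can be written as a simple screen. The function name and data layout below are illustrative, not the project's actual code.

```python
def too_easy(item_responses, easiest_code=0, cutoff=0.95):
    """Flag an item whose easiest response category was chosen by more
    than `cutoff` of respondents, i.e., an item too easy (too skewed)
    to calibrate in a general-population sample."""
    n = len(item_responses)
    share = sum(1 for r in item_responses if r == easiest_code) / n
    return share > cutoff

# Example: 96 of 100 respondents pick the easiest category -> excluded.
responses = [0] * 96 + [1, 1, 2, 2]
flagged = too_easy(responses)
```

Items flagged this way contribute almost no information about where respondents differ, which is why they were dropped before calibration.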

Discussion

We have demonstrated that it is possible to use a variety of existing instruments to build a preliminary item bank with promising properties. Apart from practicality, the use of existing instruments will allow us to cross-calibrate IRT scores with scores on traditional scales [21]. The IRT methods also showed that existing instruments like the HAQ or SF-36 provide very good measurement precision (SE < 3.3) only over a limited range. Our simulation studies suggest that a 10-item CAT based on the
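On the T-score metric used here (mean 50, SD 10 in the US population), a standard error converts directly to a classical reliability via reliability = 1 − (SE/SD)². The one-line sketch below (the function name is ours) reproduces the thresholds quoted in this paper: SE < 2.2 corresponds to reliability > 0.95, and SE < 3.3 to roughly 0.89.

```python
def reliability_from_se(se, sd=10.0):
    """Classical reliability implied by a given standard error of
    measurement on the T-score metric (population SD assumed = 10)."""
    return 1.0 - (se / sd) ** 2
```

This relation is why a CAT that holds SE below 2.2 across the 20–50 score range can claim reliability above 0.95 there.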

Acknowledgments

This work was funded by the NIH through the NIH Roadmap for Medical Research, Grant U01 AR052158-01 (Improved Outcome Assessment in Arthritis and Aging, J. Fries (Principal Investigator) and J. Ware (Co-Principal Investigator), Project Officer D. Ader) and supported by Stanford University, QualityMetric Incorporated and Health Assessment Lab from their own research funds. Information on the PROMIS can be found at www.nihpromis.org.

References (93)

  • S.M. Haley et al. Short-form activity measure for post-acute care. Arch Phys Med Rehabil (2004)
  • S.M. Haley et al. Score comparability of short forms and computerized adaptive testing: simulation study with the activity measure for post-acute care. Arch Phys Med Rehabil (2004)
  • I. McDowell et al. Measuring health: a guide to rating scales and questionnaires (1996)
  • J.E. Ware et al. The MOS 36-item short-form health survey (SF-36). I. Conceptual framework and item selection. Med Care (1992)
  • J.E. Ware et al. A 12-item Short-Form Health Survey. Med Care (1996)
  • J.F. Fries et al. The dimensions of health outcomes: the health assessment questionnaire, disability and pain scales. J Rheumatol (1982)
  • C.A. McHorney et al. Individual-patient monitoring in clinical practice: are available health status surveys adequate? Qual Life Res (1995)
  • N. Bellamy et al. Validation study of WOMAC: a health status instrument for measuring clinically important patient relevant outcomes to antirheumatic drug therapy in patients with osteoarthritis of the hip or knee. J Rheumatol (1988)
  • J.B. Bjorner et al. Computerized adaptive testing and item banking
  • W.J. van der Linden et al. Handbook of modern item response theory (1997)
  • G.H. Fischer et al. Rasch models: foundations, recent developments, and applications (1995)
  • S.E. Embretson. The new rules of measurement. Psychol Assess (1996)
  • S.E. Embretson et al. Item response theory for psychologists (2000)
  • H. Wainer et al. Computerized adaptive testing: a primer (2000)
  • W.J. van der Linden et al. Computerized adaptive testing: theory and practice (2000)
  • J.E. Ware et al. Applications of computerized adaptive testing (CAT) to the assessment of headache impact. Qual Life Res (2003)
  • D. Cella et al. A discussion of item response theory and its applications in health status assessment. Med Care (2000)
  • R.K. Hambleton et al. Item response theory models and testing practices: current international status and future directions. Eur J Psychol Assess (1997)
  • R.K. Hambleton et al. Setting performance standards on complex educational assessments. Appl Psychol Meas (2000)
  • R.D. Hays et al. Item response theory and health outcomes measurement in the 21st century. Med Care (2000)
  • J.E. Ware et al. Practical implications of item response theory and computerized adaptive testing: a brief summary of ongoing studies of widely used headache impact scales. Med Care (2000)
  • J.B. Bjorner et al. Using item response theory to calibrate the Headache Impact Test (HIT) to the metric of traditional headache scales. Qual Life Res (2003)
  • O. Walter et al. Developmental steps for a computer adaptive test for anxiety (A-CAT). Diagnostica (2005)
  • W.P. Fisher et al. Equating the MOS SF36 and the LSU HSI Physical Functioning Scales. J Outcome Meas (1997)
  • C.V. Granger et al. Performance profiles of the functional independence measure. Am J Phys Med Rehabil (1993)
  • T. Tsuji et al. ADL structure for stroke patients in Japan based on the functional independence measure. Am J Phys Med Rehabil (1995)
  • C. Jenkinson et al. Can item response theory reduce patient burden when measuring health status in neurological disorders? Results from Rasch analysis of the SF-36 physical functioning scale (PF-10). J Neurol Neurosurg Psychiatr (2001)
  • L.B. Gray et al. An item response theory analysis of the Rosenberg Self-Esteem Scale. Pers Soc Psychol Bull (1997)
  • D.W. King et al. Enhancing the precision of the Mississippi scale for combat-related posttraumatic stress disorder: an application of item response theory. Psychol Assess (1993)
  • F. Wolfe et al. Development and validation of the health assessment questionnaire II: a revised version of the health assessment questionnaire. Arthritis Rheum (2004)
  • J.E. Ware et al. Item response theory and computerized adaptive testing: implications for outcomes measurement in rehabilitation. Rehabil Psychol (2005)
  • J.F. Fries et al. ARAMIS (the American Rheumatism Association Medical Information System). A prototypical national chronic-disease data bank. West J Med (1986)
  • M. Kosinski et al. Determining minimally important changes in generic and disease-specific health-related quality of life questionnaires in clinical trials of rheumatoid arthritis. Arthritis Rheum (2000)
  • A.R. Tarlov et al. The Medical Outcomes Study. An application of methods for monitoring the results of medical care. J Am Med Assoc (1989)
  • A.L. Stewart et al. Measuring Functioning and Well-Being: The Medical Outcomes Study Approach (1992)
  • National Committee for Quality Assurance (2004)