Evaluation of a preliminary physical function item bank supported the expected advantages of the Patient-Reported Outcomes Measurement Information System (PROMIS)
Introduction
Over the past several decades, the use of patient-reported outcomes (PROs) in clinical studies has steadily increased, as has their importance in evaluating therapies and developing treatment plans. The plethora of outcome tools available today allows increasingly specific measurement of a range of domains related to health and well-being, but with two major limitations.
First, health outcomes research has produced a number of well-validated instruments [1], but the most precise and comprehensive questionnaires are rather lengthy and complex, leading to a level of respondent burden that hampers recruitment, limits the representativeness of the patient population being studied, and leads to substantial problems of missing data. These problems are compounded when several different constructs must be measured. Thus, the most popular health profile instruments are relatively short questionnaires (e.g., SF-36® Health Survey [2], [3]), and even for the measurement of one specific domain, such as physical function, brief questionnaires are generally favored (e.g., Health Assessment Questionnaire [HAQ] [4]). These shorter questionnaires sacrifice measurement precision, range, and other desirable attributes in favor of practicality. The short forms are useful for measuring the health status of larger groups, but the loss of precision is of greater concern when groups are small or when scores are estimated for individual patients to guide clinical decision making [5].
A second major limitation has been that results from different questionnaires are difficult to compare, even when two similar instruments assess the same outcomes for the same illness, such as measuring the disability of rheumatic patients with the HAQ [4] or Western Ontario and McMaster Universities arthritis index (WOMAC) [6]. The situation is as if leukocyte counts assessed in different settings were not comparable with one another, but were dependent on the particular laboratory used. There is a strong need to develop a standardized, efficient approach to outcome measurement for a variety of clinical applications including population monitoring, clinical trials research, and individual patient monitoring, so that results can be compared across conditions, therapies, trials, and patients.
Use of Item Response Theory (IRT) to build item banks and Computerized Adaptive Tests (CATs) is believed to be a promising solution to both problems. An item bank consists of a set of items measuring the same concept and a description of the items' measurement properties based on IRT models [7]. IRT [8], [9] describes the probability of choosing each response on a questionnaire item as a function of the latent trait measured by the items (referred to as the IRT score or theta [θ]) [10], [11]. On the basis of the IRT models, the latent trait can be estimated from the responses to any subset of items in the bank [7]. Accordingly, researchers or clinicians can select the items that are most relevant for a given group or individual patient and score the responses on a general ruler that is independent of the choice of items. Further, if the item bank contains items from established questionnaires, scores on these questionnaires can be predicted from estimates of the latent trait. Thus, using an IRT item bank also allows comparisons between results from different questionnaires. The item bank is not static, but can be continuously expanded and improved with additional items.
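The core property described above, that the latent trait can be estimated from any calibrated subset of items, can be sketched with a minimal two-parameter logistic (2PL) IRT model. The item parameters and item labels below are hypothetical illustrations, not calibrations from the PROMIS bank:

```python
import numpy as np

def p_endorse(theta, a, b):
    """2PL model: probability of endorsing a dichotomous item,
    given latent trait theta, discrimination a, and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical calibrations (a, b) for three physical-function items.
items = [(1.8, -1.5),   # e.g., "bathing and dressing" (easy)
         (1.2,  0.0),   # e.g., "climbing stairs"
         (2.0,  1.4)]   # e.g., "running" (hard)

def estimate_theta(responses, items, grid=np.linspace(-4, 4, 801)):
    """Maximum-likelihood theta (grid search) from responses
    (1 = can do, 0 = cannot) to ANY subset of calibrated items --
    the property that puts all subsets on one common ruler."""
    loglik = np.zeros_like(grid)
    for resp, (a, b) in zip(responses, items):
        p = p_endorse(grid, a, b)
        loglik += resp * np.log(p) + (1 - resp) * np.log(1 - p)
    return grid[np.argmax(loglik)]
```

Because every item is calibrated on the same theta metric, a patient answering only the two easiest items and a patient answering all three receive scores that are directly comparable.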
An IRT-based item bank also provides the foundation for CATs [12], [13], [14]. CATs make it possible to select the most informative items from the item bank for every individual patient according to his or her degree of the latent trait, and to administer only those items. Thus, the higher precision needed for individual patient measurement is achieved, while at the same time respondent burden can be controlled [15], [16], [17], [18], [19]. We demonstrated these advantages earlier for the Headache Impact Test [20], [21], and the Anxiety CAT [22] and have developed CATs for all SF-36 domains (see subsequent paper in this series).
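A common way to operationalize "most informative item" in a CAT is to pick, at each step, the not-yet-administered item with maximum Fisher information at the current trait estimate. The sketch below uses a hypothetical bank of 2PL item parameters; it illustrates the selection rule only, not the specific algorithm used in the study:

```python
import numpy as np

def info(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * p * (1 - p).
    Information peaks where item difficulty b matches theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)

# Hypothetical bank: (discrimination a, difficulty b) pairs.
bank = [(1.5, -2.0), (1.8, -0.5), (1.2, 0.0), (2.0, 0.8), (1.6, 2.2)]

def next_item(theta_hat, administered):
    """Maximum-information selection: choose the unused item that is
    most informative at the current theta estimate."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: info(theta_hat, *bank[i]))
```

After each response, theta is re-estimated and the rule is applied again, so a high-functioning respondent is routed to hard items and never burdened with easy ones.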
To systematically apply IRT and CATs in its studies, the National Institutes of Health (NIH) recently initiated the development of the Patient-Reported Outcomes Measurement Information System (PROMIS) (http://nihroadmap.nih.gov/). This trans-NIH initiative aims “to revolutionize the way patient-reported outcome tools are selected and employed in clinical research and practice evaluation” (http://www.nihpromis.org). Five domains are being assessed initially: physical function, pain, fatigue, mental health, and role functioning; the work is distributed across six Primary Research Sites and a Statistical Coordinating Center. This study describes the pilot development and analysis of a preliminary item bank for physical function.
The physical function construct has been evaluated using IRT methods for more than a decade [23], [24], [25], [26], [27], [28], [29], [30], [31], [32]. Items covering a wide range of physical activity levels, from self-care (e.g., bathing and dressing) to performance of vigorous physical activities (e.g., running, strenuous sports), usually can be sufficiently calibrated on a common metric to satisfy the assumptions of IRT models. These studies, like others using IRT techniques to rescore PRO measurements [33], [34], generally show that the use of IRT techniques is superior to the use of classical test theory. In principle, the same measurement assumptions are made in classical test theory as in IRT. However, the explicit formulation of requirements in IRT may force us to refine our assumptions about a construct itself and which particular items or subdomains may or may not be included [35]. We are aware of just two completed projects to build IRT-based CATs (in rehabilitation research) for physical function [36], [37]. Thus, despite the work done, we are still at the early stages, compared to the ambitious goals of the PROMIS initiative. However, preliminary work, including this article, is essential to identify issues and resolve problems as a necessary precursor to reach these aims.
Within this article, we analyze steps by which an item bank for physical function can be built, and describe how IRT scores and traditional instruments can be compared and their relative utility assessed.
Methods
An overview of the following different steps of the analyses is shown in Fig. 1.
Item selection
All data sets were screened systematically for items covering the physical function construct. After review, 136 items were used for the data analysis.
Skewness
From the GPF sample, we excluded four SIP items (staying in bed most of the time, do not use stairs at all, walk only with help, do not walk at all) that were too easy to be applied to the general population (more than 95% of respondents chose the easiest response choice). For the same reason, we excluded eleven items from the HIE sample (help with
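The exclusion rule described above reduces to a simple screen on the marginal response distribution. A minimal sketch, with hypothetical endorsement fractions standing in for the actual sample data:

```python
# Hypothetical data: per item, the fraction of a general-population
# sample choosing the easiest (least-limited) response category.
responses = {
    "stay in bed most of the time": 0.98,
    "climb several flights of stairs": 0.55,
    "walk only with help": 0.97,
    "vigorous activities": 0.30,
}

def too_skewed(frac_easiest, cutoff=0.95):
    """Exclusion rule used above: drop items where more than 95% of
    respondents chose the easiest response choice."""
    return frac_easiest > cutoff

excluded = sorted(item for item, f in responses.items() if too_skewed(f))
```

Such items contribute almost no information in a general-population sample, although they may still be informative in more impaired clinical populations.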
Discussion
We have demonstrated that it is possible to use a variety of existing instruments to build a preliminary item bank with promising properties. Apart from practicality, the use of existing instruments will allow us to cross-calibrate IRT scores with scores on traditional scales [21]. The IRT methods also showed that existing instruments like the HAQ or SF-36 provide very good measurement precision (SE < 3.3) over only a limited range. Our simulation studies suggest that a 10-item CAT based on the
Acknowledgments
This work was funded by the NIH through the NIH Roadmap for Medical Research, Grant U01 AR052158-01 (Improved Outcome Assessment in Arthritis and Aging, J. Fries (Principal Investigator) and J. Ware (Co-Principal Investigator), Project Officer D. Ader) and supported by Stanford University, QualityMetric Incorporated and Health Assessment Lab from their own research funds. Information on the PROMIS can be found at www.nihpromis.org.
References (93)
- Future directions for item response theory. Int J Educ Res (1989).
- Evaluation of the MOS SF-36 physical functioning scale (PF-10): I. Unidimensionality and reproducibility of the Rasch item scale. J Clin Epidemiol (1994).
- Relationships between impairment and physical disability as measured by the functional independence measure. Arch Phys Med Rehabil (1993).
- The structure and stability of the Functional Independence Measure. Arch Phys Med Rehabil (1994).
- Evaluation of the MOS SF-36 Physical Functioning Scale (PF-10): II. Comparison of relative precision using Likert and Rasch scoring methods. J Clin Epidemiol (1997).
- Differential item functioning in the Danish translation of the SF-36. J Clin Epidemiol (1998).
- Comparison of Rasch and summated rating scales constructed from SF-36 physical functioning items in seven countries: results from the IQOLA Project. International Quality of Life Assessment. J Clin Epidemiol (1998).
- Assessing mobility in children using a computer adaptive testing version of the pediatric evaluation of disability inventory. Arch Phys Med Rehabil (2005).
- Development and evaluation of the Kansas City Cardiomyopathy Questionnaire: a new health status measure for heart failure. J Am Coll Cardiol (2000).
- Assessing physical function in the elderly. Clin Geriatr Med (1987).