Introduction

Acute otitis media (AOM) is a common childhood infection, with a peak incidence between 6 and 12 months of age. Five to fifteen percent of all children, depending on their age, suffer from recurrent acute infections of the middle ear (4 or more episodes per year) [1–4]. Repetitive episodes of pain, fever, and general illness during acute ear infections [5–8], as well as worries about potential long-term sequelae such as hearing loss and disturbed language development [9–13], may all compromise the quality of life of the child and its family [14–16]. Although several questionnaires have been used to assess the effects of recurrent acute otitis media (rAOM) in children, the lack of true health-related quality of life (HRQoL) questionnaires, as well as incomplete data on their reliability and validity, means that our current knowledge on the subject is limited for both research and clinical practice [17].

Assessment of functional health status (FHS) and HRQoL, as defined in Table 1 [18–26], has become increasingly important in clinical trials on the effectiveness of treatment in paediatric chronic conditions. The validation of FHS and HRQoL questionnaires, however, has so far mainly focused on reliability and construct validity. Responsiveness has been assessed for only a few paediatric HRQoL questionnaires, for conditions other than otitis media [27–31]. In order to evaluate treatment effects on FHS and HRQoL meaningfully, questionnaires are needed that are not only reliable and valid but also responsive to changes in FHS and HRQoL. In adult studies, various strategies have been used to assess responsiveness, which is defined as the ability to detect clinically important change over time and therefore involves both the assessment of sensitivity to change and the assignment of meaning to that change [32, 33]. Since none of these strategies is without limitations, we assess the responsiveness of FHS and HRQoL questionnaires using multiple strategies, categorized into distribution-based and anchor-based methods.

Table 1 Definitions of health-related quality of life and functional health status

Distribution-based methods express the amount of change relative to the amount of random variance of a questionnaire [34, 35], whereas anchor-based methods enhance interpretability of changes in questionnaire scores by linking meaning and clinical relevance to change scores [34, 36].

Both generic and disease-specific questionnaires have been used in studies of paediatric FHS or HRQoL. Generic questionnaires span a wide spectrum of quality of life components, bridging various health states and populations. Disease-specific questionnaires, on the other hand, assess health-related issues specific to particular conditions and may be able to detect changes that are small but clinically important; they provide a more detailed assessment of HRQoL, but cannot be used for comparisons across health conditions [37–39]. Both types of questionnaires are often combined in order to profit from the merits of each. However, there have been few head-to-head comparisons between generic and disease-specific HRQoL questionnaires in the setting of randomized controlled trials (RCTs) [40].

The current RCT on the effectiveness of pneumococcal vaccination in children with rAOM will address both the issues of using generic versus disease-specific questionnaires and responsiveness in evaluating treatment effects on HRQoL in RCTs. The results will lead to recommendations regarding the applicability of these questionnaires in clinical studies in children with rAOM.

Methods

Setting and procedure

FHS and HRQoL were assessed in 383 children with rAOM participating in a double-blind randomized, placebo-controlled trial on the effectiveness of pneumococcal conjugate vaccination versus control hepatitis vaccination. The study was conducted at the paediatric outpatient departments of a general hospital (Spaarne Hospital Haarlem) and a tertiary care hospital (University Medical Center Utrecht). Children were recruited for this trial through referral by general practitioners, paediatricians, or otolaryngologists, or were enrolled on the caregiver’s own initiative from April 1998 to February 2001.

Study population

Inclusion criteria: children were aged between 12 and 84 months and suffered from rAOM at study entry, defined in this study as at least 2 episodes of physician-diagnosed AOM in the year prior to study entry. Exclusion criteria were conditions with a known increased risk for AOM (known immunodeficiency other than IgA or IgG2 subclass deficiency, cystic fibrosis, immotile cilia syndrome, cleft palate, and chromosomal abnormalities such as Down syndrome) and severe adverse events upon vaccination in the past.

At each scheduled visit, two research physicians (C.N.M.B. and R.H.V.) collected data regarding the number of episodes of AOM (based on parental report at baseline and on physician report during follow-up), upper respiratory tract infections, and pneumonia. Information about the medical treatment, and ear, nose, and throat surgery in the preceding 6 months was also collected. The primary caregivers completed questionnaires assessing FHS and HRQoL of their child and family during the clinic visits at baseline and at 7, 14, and 26 months follow-up. Caregivers were requested to have the same person complete the questionnaires each time and to rate their child’s FHS and HRQoL with regard to their recurrent episodes of acute otitis media. Informed consent was obtained from caregivers of all children before study entry. Medical ethics committees of both participating hospitals approved the study protocol.

Questionnaires

Four generic questionnaires (RAND, FSQ Generic, FSQ Specific, TAIQOL) and one disease-specific questionnaire (OM-6) were used to assess FHS and HRQoL of the children in the study. Additionally, two disease-specific one-item numerical rating scales (NRS Child and NRS Caregiver) were used to obtain a global rating of HRQoL of the child and of the caregiver, respectively, related to rAOM. A newly composed disease-specific questionnaire, the Family Functioning Questionnaire (FFQ), was used to assess the impact of rAOM on family functioning. Table 2 summarises the characteristics of the questionnaires [14, 41–57].

Table 2 Characteristics of FHS and HRQoL questionnaires used in this study

Generic questionnaires

The RAND general health-rating index (RAND) and the Functional Status Questionnaire (FSQ) had already been translated and validated for Dutch children by Post et al. [41, 42] (Table 2). The RAND assesses general health perceptions of caregivers regarding their child [43]. The FSQ consists of two parts: one measuring functional limitations in general, not necessarily related to illness (FSQ Generic), and the other (paradoxically named FSQ Specific) measuring functional limitations that are attributable to any illness [43]. Functional limitations in both versions of the FSQ are mainly expressed as behavioural problems. During the course of the study, a new Dutch questionnaire on generic HRQoL became available: the TNO-AZL Infant Quality of Life (TAIQOL) questionnaire [51, 53]. For this reason, the TAIQOL was added to the previously selected set of questionnaires from July 1999. Although the full, original version of the TAIQOL was applied during the study, only those subscales are discussed that, based on their content, were assumed to be sensitive to the consequences of AOM. Six of the 12 TAIQOL subscales tap functional items that are often affected by AOM (OM-related): ‘Sleeping’, ‘Appetite’, ‘Liveliness’, ‘Problem behaviour’, ‘Positive mood’, and ‘Communication’ (items about speech and language capacity). Although the TAIQOL was developed for children aged up to 5 years, we also used it in children aged 6–7 years, as no appropriate alternative was available during the study.

Disease-specific questionnaires

To measure disease-specific FHS, the Otitis Media-6 (OM-6) [14, 55] was translated into Dutch according to principles of backward–forward translation [58–61]. This six-item questionnaire covers both acute and long-term effects of otitis media on children’s FHS.

A new questionnaire, the FFQ, was developed to assess the impact of rAOM in children on their caregivers and siblings. The content of the FFQ was based on previous work by Asmussen et al. [15, 62] on the impact of rAOM on family well-being. A panel of paediatric otorhinolaryngologists and paediatricians from our study sites selected the items most relevant according to their clinical experience. The FFQ is composed of six questions covering effects of the child’s rAOM on caregiver and family activities and two questions assessing these effects on the emotional behaviour of siblings. The response format was a 4-point Likert scale (scores 1–4), analogous to that of the RAND and OM-6 in our study.

Furthermore, two numerical rating scales (NRS) (0–100) were used: the NRS Child and the NRS Caregiver (see Table 2). The NRS Child [14] was translated into Dutch using the same principles of backward–forward translation that were applied to the OM-6. The NRS Caregiver was newly created in this study, following the example of the NRS Child of Rosenfeld et al. [14], and was added to the previously selected set of questionnaires from July 1999.

Finally, the Dutch version of the OM Functional Status Questionnaire specific (OM-FSQ [52]) was included as an anchor for responsiveness (the instrument is described in the section on responsiveness).

Questionnaire application

Questionnaires were completed in a randomly selected but fixed order during the follow-up assessments to prevent possible order effects [63, 64]: RAND, FSQ Generic and Specific, OM-6, NRS Child, FFQ, TAIQOL, OM-FSQ, NRS Caregiver. For all questionnaires, higher scores indicate better HRQoL or FHS. To allow comparisons between questionnaires, all scores were linearly transformed to 0–100 scales. For each questionnaire, the evaluation period was the 6 weeks before completion.
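This linear transformation to a common 0–100 scale can be sketched as follows; the function name and the 1–4 raw range (as used by the RAND, OM-6, and FFQ in this study) are illustrative only:

```python
def rescale_to_100(score, min_raw, max_raw):
    """Linearly transform a raw questionnaire score to a 0-100 scale,
    so that min_raw maps to 0 and max_raw maps to 100."""
    return 100.0 * (score - min_raw) / (max_raw - min_raw)

# Example: a mean item score of 2.5 on a 1-4 Likert scale
print(rescale_to_100(2.5, 1, 4))  # 50.0
```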

Statistical analyses

Floor and ceiling effects

Floor and ceiling effects were estimated for the baseline assessment of each questionnaire by calculating the percentages of respondents with the minimum and maximum scores, respectively. Questionnaires should exhibit minimal floor and ceiling effects to be optimally able to detect differences and change.
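A minimal sketch of this estimate (hypothetical function, assuming scores are already on the 0–100 scale):

```python
def floor_ceiling(scores, min_score=0.0, max_score=100.0):
    """Percentage of respondents scoring at the minimum (floor)
    and at the maximum (ceiling) of the scale."""
    n = len(scores)
    floor = 100.0 * sum(s == min_score for s in scores) / n
    ceiling = 100.0 * sum(s == max_score for s in scores) / n
    return floor, ceiling

# Four respondents: one at the floor, two at the ceiling
print(floor_ceiling([0, 50, 100, 100]))  # (25.0, 50.0)
```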

Reliability

First, internal consistency was assessed by calculating Cronbach’s alpha, which should be above 0.70 for each questionnaire or subscale [65]. Inter-item correlations of questionnaires were assessed to reveal item redundancy or ‘hidden’ subscales that may erroneously yield a high overall Cronbach’s alpha.
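As an illustration, Cronbach's alpha can be computed from item-level scores as below; this is a sketch using Python's standard library rather than the statistical package used in the study, with `items` a list of per-item score lists over the same respondents:

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(items)
    item_vars = sum(variance(item) for item in items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    return k / (k - 1) * (1 - item_vars / variance(totals))

# Two perfectly consistent items yield alpha = 1.0
print(cronbach_alpha([[1, 2, 3], [1, 2, 3]]))  # 1.0
```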

For the assessment of test–retest reliability, a subset of caregivers attending the outpatient ward from February 2000 to June 2001 (n = 160) was given a second set of the same questionnaires (retest) to complete at home within 2 weeks of the first set, which was filled out during the outpatient visit at 14 months (first test). Children with AOM at the first test were excluded, since differences in their scores could reflect real change and interfere with the assessment of reliability.

For the assessment of test–retest reliability, a time interval of 2–14 days is often considered long enough to prevent recall bias yet too short for relevant change to occur in chronic disease [66]. Test–retest reliability was computed as the intraclass correlation coefficient (ICC) between the two sets of questionnaires. An ICC of 0.80 was considered the required minimum for good reliability [65, 67].
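The paper does not state which ICC model was used; as a rough sketch under that caveat, a one-way random-effects ICC for two measurements per subject (test and retest) can be computed as:

```python
from statistics import mean

def icc_oneway(test, retest):
    """One-way random-effects ICC(1,1) for paired test-retest scores:
    (MSB - MSW) / (MSB + (k - 1) * MSW), with k = 2 measurements per subject."""
    n, k = len(test), 2
    pairs = list(zip(test, retest))
    grand = mean(test + retest)
    subj_means = [mean(pair) for pair in pairs]
    # between-subject and within-subject mean squares
    msb = k * sum((m - grand) ** 2 for m in subj_means) / (n - 1)
    msw = sum((x - m) ** 2 for pair, m in zip(pairs, subj_means) for x in pair) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Identical test and retest scores give perfect reliability
print(icc_oneway([1, 2, 3], [1, 2, 3]))  # 1.0
```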

Construct and discriminant validity

In order to demonstrate construct validity, hypotheses were formulated about the strength of correlations between questionnaires. A higher percentage of correct predictions indicates stronger support for construct validity. A correlation of 0.10–0.30 was defined as weak, 0.30–0.50 as moderate, and >0.50 as strong [68]. The correlation between FSQ Generic and NRS Caregiver was predicted to be weak since they were expected to assess two different constructs. Moderate to strong correlations (r > 0.40) were predicted between RAND and NRS Caregiver. Moderate to strong correlations were also expected between OM-6 and FSQ Specific, NRS Child, NRS Caregiver and FFQ, as all assess otitis media-related HRQoL or FHS. The correlation between FSQ Generic and FSQ Specific was expected to be strong (r > 0.50). The remaining correlations among the questionnaires were expected to be moderate (Table 5). Additionally, correlations between questionnaire scores and frequency of physician visits for upper respiratory tract infections as well as frequency of AOM episodes in the preceding 6 months were calculated. Since distributions of questionnaire scores were skewed, correlations were assessed using Spearman’s rho.
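Spearman's rho is the Pearson correlation of rank-transformed scores; a self-contained sketch (with average ranks for ties, as is conventional) might look like:

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            out[order[k]] = avg_rank
        i = j + 1
    return out

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

print(spearman_rho([1, 2, 3], [10, 20, 30]))  # 1.0 (perfect monotone agreement)
```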

Discriminant validity was assessed by dichotomizing the study participants into children with 2–3 versus 4 or more episodes of otitis media per year. Based on clinical and immunological data, children with 4 or more AOM episodes per year are considered ‘otitis prone’ [2, 69–71], reflecting a sub-group with an increased rate of upper respiratory tract infections, related medical interventions, and compromised child functioning [72, 73]. This group was assumed to score significantly poorer on all questionnaires than children with 2–3 otitis media episodes per year, which was assessed with Mann–Whitney tests for independent samples.

Responsiveness

Since pneumococcal conjugate vaccination showed no clinical effectiveness when compared to the control vaccine [74], the intervention could not be used as an external criterion of change. Data of both vaccine groups were instead pooled for the assessment of responsiveness to spontaneous remission. The clinical experience of a panel of 5 experts in the field of otitis media formed the basis for defining a reduction of 2 or more episodes of AOM per child per year as the external criterion for change, while a reduction of 1 episode or less identified no change. Responsiveness was evaluated for two intervals: from 0 to 7 months and from 7 to 14 months follow-up. The observed change in these episodes was multiplied by 12/7 (1.714) to obtain the estimated change per year.
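The annualization step is simple scaling; the function below is illustrative:

```python
def annualized_change(observed_change, months=7):
    """Scale the change in AOM episodes observed over a follow-up
    interval (here 7 months) to an estimated change per year."""
    return observed_change * 12.0 / months

# A reduction of 1 episode over 7 months is about 1.71 episodes per year
print(round(annualized_change(1), 3))  # 1.714
```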

The first step in the assessment of responsiveness was to explore the ability of questionnaires to detect change at all, i.e., their sensitivity to change. Secondly, meaning and clinical relevance of the change score were determined in accordance with recent recommendations, using both distribution- and anchor-based methods [36, 75–77]. Distribution-based methods express the amount of change relative to the amount of random variance of a questionnaire [34, 35]. Some ratios of change to random variance have, often empirically, been found to represent a minimally clinical important difference. Anchor-based methods enhance interpretability of changes in questionnaire scores by linking meaningful and clinically relevant indicators to change scores [34, 36].

The assessment of responsiveness will be described in further detail below.

Sensitivity to change

Sensitivity to change was assessed by calculating both the statistical significance of change scores, using a paired t-test or Wilcoxon matched-pairs test (for skewed distributions), and effect sizes (ES) using Guyatt’s responsiveness statistic [78] for changed subjects. In this statistic, the observed change in changed subjects is related to the observed random change, or random error, in unchanged subjects. A parametric effect size was computed as the mean change score of the changed group divided by the SD of the change scores of the unchanged group; a nonparametric effect size was computed as the median change score of the changed group divided by the interquartile range of the change scores of the unchanged group.

According to the benchmarks of Cohen [79], an effect size of 0.2 represents a small change, 0.5 a moderate change, and 0.8 or higher a large change.
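Guyatt's responsiveness statistic as described above can be sketched as follows (hypothetical helper names; the nonparametric variant uses the median and interquartile range):

```python
from statistics import mean, median, quantiles, stdev

def guyatt_es(change_changed, change_unchanged):
    """Parametric ES: mean change of changed subjects over the SD of
    change scores in unchanged subjects."""
    return mean(change_changed) / stdev(change_unchanged)

def guyatt_es_nonparametric(change_changed, change_unchanged):
    """Nonparametric ES: median change of changed subjects over the IQR
    of change scores in unchanged subjects."""
    q1, _, q3 = quantiles(change_unchanged, n=4)
    return median(change_changed) / (q3 - q1)

# Changed subjects improve by ~12 points; unchanged subjects fluctuate around 0
print(guyatt_es([10, 12, 14], [-2, 0, 2]))  # 6.0
```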

Clinical relevance of change scores

The interpretation of change is often assessed by calculating the minimally clinical important difference (MCID), which is the smallest difference in a questionnaire total or domain score that patients perceive as beneficial [80]. The MCID can be computed from both distribution-based and anchor-based methods. Several estimates of the MCID from both methods are reported, to assess the likely range of the MCID for each questionnaire.

Interpretation of change—distribution-based methods (ES-MCID and SEM-MCID)

The main distribution-based methods for assessing the MCID are the Effect Size and the Standard Error of Measurement. A change in questionnaire scores corresponding to an effect size (Guyatt’s Responsiveness Statistic) of 0.3–0.5 has been found to be consistent with other (empirical) estimates of the MCID [36, 81–83]. In this study the change in questionnaire scores corresponding with an effect size of 0.3 is used as the benchmark for the MCID (ES-MCID). A change of one Standard Error of Measurement (1-SEM) has empirically been found to correspond with the MCID of a questionnaire [77, 84–86]. The 1-SEM links the reliability of an instrument to the variance of scores in a population, as reflected in its formula: 1-SEM = SD(change scores of unchanged subjects) × √(1 − ICC). It estimates what part of the observed change may be due to random measurement error by combining the distribution of scores (SD) with instrument reliability (ICC); change larger than the SEM is therefore considered ‘real’ change. The SEM is here used as an estimate of the MCID (SEM-MCID). The ES-MCID and SEM-MCID support the interpretation of measured change, as they reflect the smallest change that is substantially larger than the random variability in the study population, based on the standard deviation of the unchanged subjects.
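Both distribution-based MCID estimates reduce to one-line formulas; a sketch with illustrative function names:

```python
import math

def sem_mcid(sd_unchanged, icc):
    """SEM-MCID: one standard error of measurement, SD * sqrt(1 - ICC),
    using the SD of change scores in unchanged subjects."""
    return sd_unchanged * math.sqrt(1.0 - icc)

def es_mcid(sd_unchanged, benchmark=0.3):
    """ES-MCID: the change score corresponding to a benchmark effect size of 0.3."""
    return benchmark * sd_unchanged

# An SD of 10 points and an ICC of 0.84 give a SEM-MCID of 4 points
print(round(sem_mcid(10, 0.84), 2))  # 4.0
```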

Interpretation of change—anchor-based methods

Anchor-based methods require an independent standard, the anchor, that is in itself easily interpretable and at least moderately correlated (>0.3) with the questionnaire being assessed. Changes in questionnaire scores were compared with changes in two clinically relevant anchors: AOM frequency (incidence of AOM episodes per child) and AOM severity, assessed with the Dutch version of the OM Functional Status Questionnaire specific (OM-FSQ) [52]. The OM-FSQ consists of three questions assessing clinical AOM severity: earache, sleeping problems, and other signs and symptoms (irritability, fussiness, fever) that may indicate the presence of an ear infection. In our population, the OM-FSQ demonstrated high internal consistency (Cronbach’s α = 0.88) and good test–retest reliability (ICC = 0.94). The OM-FSQ correlated weakly with the NRS Child (Spearman’s rho = 0.18), but moderately with the RAND (0.36), FSQ Generic (0.37), and NRS Caregiver (0.34), and strongly with the FSQ Specific (0.52), OM-6 (0.73) and FFQ (0.61).

In relation to the AOM frequency, an expert panel in the field of otitis media considered a reduction of 2 episodes per year as a small or minimal clinically important change, whereas a change of 3 to 4 episodes per year was considered moderate to large. In the study of Alsarraf et al. [52], the OM-FSQ total score was about 62 on a scale of 0–100 during an episode of AOM, increasing to 92 at 6 weeks and to 90 at 12 weeks after an episode of AOM with higher scores reflecting less severe ear-related symptoms. Therefore, a score change of 10–20 on the 0–100 scale of the OM-FSQ in the current population was considered to be a small clinically relevant change in AOM severity, a score change of 30–50 as moderate to large. Anchor-based estimates of the MCID were computed as the change in questionnaire scores associated with small changes in AOM frequency and OM-FSQ.
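The anchor-based estimate amounts to averaging questionnaire change scores over the subjects whose anchor change falls in the small-but-clinically-relevant band; a hypothetical sketch:

```python
from statistics import mean

def anchor_mcid(change_scores, anchor_changes, band_lo, band_hi):
    """Mean questionnaire change among subjects whose anchor change
    (e.g., reduction in AOM episodes per year) lies in the small-change band."""
    selected = [c for c, a in zip(change_scores, anchor_changes)
                if band_lo <= a <= band_hi]
    return mean(selected)

# Two subjects with a 2-episode reduction (the small-change band) are averaged
print(anchor_mcid([5, 10, 20], [2, 2, 5], 2, 2))  # 7.5
```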

For all analyses the Statistical Package for the Social Sciences (SPSS) version 10.1 was used.

Results

Population

The population characteristics summarized in Table 3 show that the majority of children suffered from 4 or more AOM episodes per year, and half of them suffered from chronic airway problems or atopic symptoms. Most children had undergone one or more ENT surgeries. Overall they seemed to suffer from more severe disease than the average child with 2–3 middle ear infections, as stated earlier.

Table 3 Characteristics of study population*

Floor and ceiling effects

Generally, the questionnaires demonstrated no floor effects. However, Table 4 shows that some questionnaires (FSQ Specific and FFQ) and most TAIQOL subscales showed moderate to large ceiling effects, indicating that improvement may go undetected even when actually present.

Table 4 Floor and ceiling effects*, internal consistency and test–retest reliability of the questionnaires

Reliability

Cronbach’s alpha coefficients were adequate to high (range 0.72–0.90) for the TAIQOL subscales and high (range 0.80–0.90) for all other questionnaires. The calculation of inter-item correlations revealed neither ‘hidden’ subscales nor item redundancy (i.e., no individual correlations were so high as to suggest a loss of content validity) (Table 4).

In order to assess test–retest reliability, 126 (79%) of 160 approached caregivers completed a second set of questionnaires, of which 113 (71%) were completed within 2 weeks. Seven children with AOM at the time of the outpatient visit (test 1) were excluded, resulting in 106 sets for analysis (Table 4). ICCs were moderate to high for all questionnaires (range 0.81–0.93) and the TAIQOL subscales (range 0.76–0.90), with the TAIQOL subscale ‘Liveliness’ (0.76) at the lower, borderline end of that range.

Construct and discriminant validity

Table 5 shows the calculated correlations between the questionnaires, which ranged from moderate to strong for the RAND, FSQ Generic, FSQ Specific, OM-6, and FFQ. These outcomes show that 14 (67%) of the hypothesized correlations were correct. False predictions mainly concerned the NRS Child and NRS Caregiver: their correlations with the other questionnaires were generally expected to be at least moderate, but were found to be weak. The disease-specific questionnaires (OM-6, NRS Child, FFQ and NRS Caregiver) showed moderate correlations (Spearman’s rho 0.39–0.49) with the frequency of AOM episodes in the preceding 6 months. Moderate correlations (Spearman’s rho 0.29–0.48) were also found between global FHS (RAND) and the disease-specific questionnaires on the one hand, and the number of physician visits for all upper respiratory tract infections (URTIs), a more global indicator of illness, on the other (Table 6).

Table 5 Construct validity: calculated correlations *  between the questionnaires**
Table 6 Construct validity: correlations* between questionnaire scores and frequency of physician visits for URTI** and of AOM** episodes

The RAND, FSQ Generic, FSQ Specific, OM-6 and FFQ were able to discriminate between children with moderately recurrent AOM (2–3 episodes per year) and “otitis-prone” children with severe, recurrent AOM (4 or more episodes per year) (Table 7). However, neither the two numerical rating scales (NRS Child and NRS Caregiver) nor the otitis media-related subscales of the TAIQOL discriminated between these two groups.

Table 7 Discriminant validity: scores of children with 2–3 vs. 4 or more AOM episodes in the preceding year*

Responsiveness

According to our external criterion of change (a reduction of 2 or more episodes of AOM per year), 270 children (70%) of 383 were classified as ‘changed’ for the first interval (0–7 months) and 126 children (33%) for the second interval (7–14 months). The two intervals differed considerably regarding the reduction of AOM incidence; during the 0–7 months follow-up the mean incidence per child decreased by 1.8 AOM episodes, whereas during 7–14 months follow-up the mean decrease was 0.35 episodes [74].

Sensitivity to change

Sensitivity to change, expressed as significant mean change and effect size, is presented in Table 8. Except for most TAIQOL subscales, generic as well as disease-specific questionnaires yielded significant change scores during both follow-up periods, ranging from 4.9 to 28.3 on a 0–100 scale. Absolute change scores were generally larger for the first follow-up period (range 0.4–28.3) than for the second (range −2.8 to 14.2).

Table 8 Sensitivity to change: mean change-scores* and effect sizes** for changed subjects

The effect sizes for the generic FHS questionnaires ranged from small to moderate (0.29–0.60). For the generic TAIQOL subscales, however, the effect sizes were lower, ranging from almost zero for the subscales ‘Appetite’ (0.0), ‘Problem behaviour’ (0.02) and ‘Positive mood’ (0.06) to small for ‘Sleeping’ (0.37) and ‘Liveliness’ (0.22). Effect sizes for the disease-specific questionnaires were moderate to large (0.55–0.95). For each questionnaire, the effect sizes were quite similar for the first (0–7 months) and second (7–14 months) intervals, even though absolute change scores were smaller for the second interval.

The TAIQOL was excluded from further analyses on the interpretation of change, due to its poor sensitivity to change.

Interpretation of change—distribution-based methods

Minimally clinical important differences (MCIDs) calculated with distribution-based methods are presented in Table 9. During the first interval, ES-MCIDs using an effect size of 0.3 as benchmark were somewhat smaller for generic questionnaires (range 5.0–7.4 on a 0–100 scale) than for disease-specific questionnaires (range 6.1–9.4). During the second interval, however, ES-MCIDs for generic and disease-specific questionnaires were comparable (range 4.0–6.7), indicating that for both types of questionnaires similar change scores are needed in order to be clinically relevant.

Table 9 Responsiveness—distribution-based indices for minimally clinical important difference (MCID) using 0.3 Effect Size (ES) and one standard error of measurement (SEM)

Except for the NRS Child and NRS Caregiver, the SEM-MCIDs were quite comparable with the ES-MCIDs for both generic and disease-specific questionnaires. Assuming that the estimated MCIDs using either an effect size of 0.3 or one SEM as benchmark are correct, our results suggest that the distribution-based MCID for generic as well as disease-specific questionnaires corresponds with a change of 3–9 points on a 0–100 scale (see Table 9).

Interpretation of change—anchor-based methods

Changes in AOM frequency (AOM incidence per child per year) were compared to the magnitude of change scores on the FHS and HRQoL questionnaires. A small change in AOM frequency of 2 AOM episodes, which is considered a MCID, corresponded with a 3–10 point change on a 0–100 scale for the generic questionnaires (Graph 1a) and a 5–15 point change for the disease-specific questionnaires, except for the NRS Child, which changed 29 points during the 0–7 months interval.

Graph 1
figure 1

 Responsiveness—change-scores per questionnaire corresponding with an anchor-based responsiveness index: (a) AOM frequency; (b) AOM severity (OM-FSQ score)

Likewise, a small improvement in AOM severity corresponded with change scores ranging from 2–10 points on a 0–100 scale for the generic questionnaires and with change scores from 4–8 points for the disease-specific questionnaires, except again for the NRS Child with 16 and 17 points change (Graph 1b).

Change scores corresponding with moderate to large changes in AOM frequency and severity are also presented in Graphs 1a, b. Comparing small with moderate to large change shows that, overall, the larger the change in AOM severity or frequency, the larger the change score on the questionnaires. This trend did not hold, however, for the FSQ Generic and the disease-specific NRS Child (e.g., a small change in AOM severity corresponded with a change score of 17 on the NRS Child, whereas a moderate to large change corresponded with a change score of 13).

Comparison of anchor- and distribution-based methods

Comparing the results of the anchor-based methods with those of the distribution-based methods (Graph 2) showed that generic questionnaires (RAND, FSQ Generic, and FSQ Specific), disease-specific questionnaires (OM-6 and FFQ) and the NRS Caregiver yielded quite similar estimates of the MCID for both methods (3–9 points on a 0–100 scale for distribution and 2–15 points for anchor-based methods) as well as for both follow-up periods (4–15 points for 0–7 months interval, 2–8 points for 7–14 months interval). Averaging these distribution-based and anchor-based estimates of MCID yields a point-estimate MCID for generic questionnaires of 6.0 (range 2–10) and for disease-specific questionnaires of 7.3 (range 3–15) on a 0–100 scale (excluding the NRS Child, as it had much larger estimates for the MCID).

Graph 2
figure 2

 Minimally clinical important difference (MCID) per questionnaire according to distribution-based (ES-MCID and SEM-MCID) and anchor-based (AOM frequency and AOM severity) methods

Discussion

In this study, the reliability and validity of generic as well as disease-specific FHS and HRQoL questionnaires have been assessed in the setting of a RCT concerning children with recurrent AOM. Most generic (RAND, FSQ-Generic and FSQ-Specific) and disease-specific (OM-6 and FFQ) questionnaires showed similar, good to excellent reliability and adequate construct and discriminant validity. Construct validity was poor for the numerical rating scales (NRS Child and NRS Caregiver), and discriminant validity was low to moderate for both NRS and the subscales of the TAIQOL considered to be otitis media-related (Tables 4, 5, 6 and 7).

Generic as well as disease-specific questionnaires proved sensitive to change in the incidence of AOM (Table 8). Effect sizes ranged from small to moderate for both generic and disease-specific questionnaires (Table 8). The MCIDs for generic and disease-specific questionnaires were quite similar (Table 9, Graphs 1 and 2). However, most otitis media-related subscales of the TAIQOL, the only true HRQoL questionnaire, proved insensitive to change.

Reliability and validity

Results on internal consistency and test–retest reliability of the RAND, FSQ Generic, FSQ Specific, TAIQOL and OM-6 found in this study were comparable with those of previous studies using these questionnaires [14, 41, 42, 51, 52]. The consistency of results across different paediatric populations supports the reliability of these questionnaires. In line with the poor discriminant validity of the otitis media-related TAIQOL subscales in this study, Fekkes et al. [51] found that the TAIQOL subscales ‘Problem behaviour’, ‘Positive mood’, and ‘Liveliness’ discriminated neither between healthy and preterm children nor between healthy and chronically ill children. The ability of the RAND, FSQ Generic and FSQ Specific to discriminate between children who differed in AOM frequency, on the other hand, supports the discriminant validity previously found in children with asthma and healthy children [41, 42]. However, the heterogeneity of methods used limits the comparability of the validity results of this study with those from previous studies.

The FFQ and NRS Caregiver are newly composed questionnaires to assess the influence of recurrent AOM on the caregiver and family. The FFQ demonstrated excellent reliability and validity, meeting the minimal required reliability coefficients of 0.90 for individual assessment [65, 87]. The strong correlation with the OM-6 supports its complementary usefulness in FHS and HRQoL assessment in children with rAOM. Results of the NRS Caregiver, however, were similarly poor as those observed for the NRS Child, which needs further exploration. Their global, single-item assessment of HRQoL may be too crude to reflect subtle differences in HRQoL [88, 89]. On the other hand, comments of the caregivers indicated that some of them may have misunderstood the NRS test-instructions. This is supported by the fact that improvement of construct validity occurred during follow-up assessments, presumably due to learning effects after reading the instructions a second time.

Responsiveness

So far, little attention has been given to the responsiveness of the questionnaires used in our study. Only Rosenfeld et al. [55] assessed effect sizes for the OM-6 (using a standardized response mean), and these were much larger (1.1–1.7) than the ones found in this study. This may be explained by the use of different indicators of change. Rosenfeld et al. [55] used an intervention with expected clinical effectiveness, to which proxies were not blinded, as the indicator of change. Since pneumococcal vaccination proved to be clinically ineffective [74], treatment could not be used as an external criterion for change. Instead, a change of 2 or more AOM episodes per year was used as the criterion to identify changed subjects. In addition, social desirability and expectancy bias may have influenced the outcome of the study of Rosenfeld et al. [55].
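The standardized response mean referred to above is a simple ratio: the mean change score divided by the standard deviation of the change scores. A minimal sketch, using purely illustrative OM-6-style scores (lower = better; none of these values come from the study):

```python
import statistics

def standardized_response_mean(baseline, follow_up):
    """SRM: mean of the individual change scores divided by their SD.

    Change is taken as baseline minus follow-up, so a drop in a
    symptom score (improvement) yields a positive SRM.
    """
    changes = [b - f for b, f in zip(baseline, follow_up)]
    return statistics.mean(changes) / statistics.stdev(changes)

# Hypothetical scores for six children before and after follow-up
baseline  = [4.0, 3.5, 5.0, 4.5, 3.0, 4.0]
follow_up = [2.0, 2.5, 5.0, 3.5, 1.0, 4.0]
print(round(standardized_response_mean(baseline, follow_up), 2))  # → 1.12
```

Because the denominator is the variability of change rather than the baseline variability, an intervention that improves everyone by a similar amount produces a large SRM, which is one reason unblinded, expectedly effective treatments (as in Rosenfeld et al.) can yield values in the 1.1–1.7 range.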

Although clinical criteria such as change in the incidence of AOM episodes have been suggested as adequate alternative criteria to identify change [34], the choice of any external criterion for change remains somewhat arbitrary. It is a surrogate measure that often reflects only one aspect of the QoL construct. The poor responsiveness of the TAIQOL subscales ‘Behavioural problems’, ‘Positive mood’ and ‘Liveliness’, for example, may indicate that our clinical indicator is less suitable as an external criterion for change in emotional and behavioural functioning. However, considering the overall poor responsiveness of the twelve TAIQOL subscales (results not shown), it seems more likely that these three subscales are simply poorly responsive in themselves.

Several studies have supported the empirically found link between one SEM and the MCID for HRQoL questionnaires [75, 81, 85, 86]. In this study, the MCIDs based on the value of one SEM largely corresponded with the MCIDs estimated using 0.3 ES as a benchmark, further supporting one SEM as an indicator of the MCID (Table 9). It should be realized, however, that both the SEM and the ES are purely statistical indicators, which relate change to random (error) variance. Interestingly, the anchor-based methods yielded similar estimates for the MCIDs (Graphs 1a, b, 2), which is in agreement with recent observations that one SEM equals the anchor-based MCID in patients with moderately severe illness [90]. By applying and comparing multiple methods as well as two evaluation periods, we have been able not only to demonstrate consistency in responsiveness but also to give ranges for minimal clinically important changes instead of point estimates. As there is no ‘gold standard’ for the assessment of responsiveness in FHS and HRQoL measurement, a range of scores gives a more realistic reflection of responsiveness than a point estimate. Point estimates can be misapplied by users who are unaware of the limited precision of the data used for estimating the MCID, or of the intrinsic limitations of dichotomising what is actually a continuum.
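The two distribution-based benchmarks compared above can be made concrete: one SEM scales the baseline SD by the unreliable fraction of the score variance, while the 0.3 ES benchmark is simply 0.3 times the baseline SD. A sketch with hypothetical values (a 15-point baseline SD and a test–retest reliability of 0.90; not figures from Table 9):

```python
import math

def sem_based_mcid(sd_baseline, reliability):
    """One SEM: SD(baseline) * sqrt(1 - reliability)."""
    return sd_baseline * math.sqrt(1.0 - reliability)

def es_based_mcid(sd_baseline, es_benchmark=0.3):
    """Effect-size benchmark: a fixed fraction of the baseline SD."""
    return es_benchmark * sd_baseline

sd, r = 15.0, 0.90  # hypothetical questionnaire characteristics
print(round(sem_based_mcid(sd, r), 2))  # one SEM → 4.74
print(round(es_based_mcid(sd), 2))      # 0.3 ES  → 4.5
```

Since sqrt(1 − r) equals 0.3 exactly when r = 0.91, the two benchmarks coincide for instruments with reliability near 0.90, which helps explain the close correspondence observed in this study.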

Generic versus disease-specific questionnaires

Although generic questionnaires are generally expected to be less sensitive to differences in FHS or HRQoL than disease-specific questionnaires [19, 37, 91, 92], in this study most disease-specific questionnaires performed only marginally better than the generic questionnaires on the discriminant validity test. Likewise, the responsiveness of generic questionnaires, and hence their usefulness as outcome measures in randomized trials, has been questioned [21]. Although in some studies generic measures were indeed found to be less responsive to treatment effects than specific measures [93–96], other studies did find comparable responsiveness [97–99]. In this study, only the smaller effect sizes for the FSQ Generic and FSQ Specific may indicate that the responsiveness of generic questionnaires is somewhat poorer than that of disease-specific questionnaires. Possibly, the higher sensitivity of the disease-specific questionnaires at the start of the study reflects the higher incidence of symptoms and functional limitations that are specific to AOM; during the study, AOM incidence decreased and AOM symptoms consequently became less prominent compared with other health problems. Overall, the generic questionnaires appeared to be as sensitive to clinical change as the disease-specific questionnaires, except for the TAIQOL.

For the FSQ Generic and FSQ Specific, but not for the RAND, which assesses general health perceptions, sensitivity to differences and change in FHS could be explained by their content, as they include many physical and emotional-behaviour items that may be affected by rAOM. The more relevant a questionnaire is to a particular condition, the more sensitive it is likely to be. The sensitivity of the RAND, which assesses general health and resistance to illness, may indicate that it captures the caregivers’ perception that the overall health of their child is worse than that of other children. It may also reflect the significant co-morbidity, such as chronic airway problems and atopic symptoms, in the study population (Table 3).

The reasons for the poor performance of the TAIQOL with regard to both discriminant validity and sensitivity to change are not obvious. Possibly, each subscale score represents an aspect of HRQoL that is too narrow to be sensitive to differences or change. Combining the subscales into more comprehensive constructs may then improve sensitivity. In addition, each item of the TAIQOL consists of two questions: a question about FHS is followed by a request to rate the child’s well-being in relation to this health status. Response shift bias may have modified the caregivers’ expectations about how their child feels in line with the child’s changing health; that is, caregivers may rate their child’s well-being as better than it actually is as they adapt to the situation. Studies are needed on factors besides questionnaire type (generic versus disease-specific) that may influence sensitivity to change or responsiveness, such as questionnaire structure and content, disease severity, co-morbidity and other population characteristics.

Bias and generalisability

There are several issues that need to be considered when interpreting the current results. First, the frequency of AOM episodes at enrolment was based on proxy report, whereas during the trial only physician-diagnosed episodes were counted. The number of AOM episodes in the year prior to inclusion is likely to be overestimated by proxies [100], which would lead to underestimation of HRQoL change scores because proxies may have evaluated the initial situation as worse than it objectively was. If such recall bias regarding AOM frequency was indeed present, however, it may also have influenced the caregivers’ reflection on subjective measures such as FHS and HRQoL, which would result in realistic or even overestimated change scores. Moreover, estimating responsiveness for the interval of 7–14 months, in which AOM frequency was not affected by recall bias since all episodes were physician-diagnosed, yielded similar results. This indicates that recall bias does not appear to have influenced responsiveness substantially.

Secondly, in assessing test–retest reliability, two different modes of questionnaire administration were used: completion at the clinic versus completion at home. A possible inclination to give more socially desirable answers at the clinic, as well as other effects such as greater distraction when filling in the questionnaires at home, may have caused differences in questionnaire scores between the first (test) and second (retest) assessment. Although this impact may be larger for single-item questionnaires such as the NRSs than for multiple-item questionnaires, which might explain their somewhat smaller ICCs, the overall impact on the ICCs appears to be small.
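How a mode-of-administration effect depresses a test–retest ICC can be illustrated with an agreement-type ICC (two-way, single-measure, absolute agreement, often labelled ICC(2,1)), which penalizes systematic shifts between administrations even when the rank order of subjects is preserved. A minimal sketch with illustrative data, not scores from this study:

```python
def icc_agreement(test, retest):
    """ICC(2,1): two-way, single-measure, absolute-agreement ICC
    for two administrations (test and retest) of one questionnaire."""
    n, k = len(test), 2
    rows = list(zip(test, retest))
    grand = sum(test + retest) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(test) / n, sum(retest) / n]
    # Mean squares from the two-way ANOVA decomposition
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    msb = ss_rows / (n - 1)                              # between subjects
    msc = ss_cols / (k - 1)                              # between administrations
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))  # residual
    return (msb - mse) / (msb + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical scores: home retest uniformly 1 point higher than clinic test
clinic = [10, 12, 14, 16, 18]
home   = [11, 13, 15, 17, 19]
print(round(icc_agreement(clinic, home), 2))  # → 0.95
```

Even though the two administrations agree perfectly on the ordering of subjects, the uniform one-point shift keeps the agreement ICC below 1, which is the mechanism by which differing administration modes could have lowered the ICCs reported here.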

Thirdly, during the trial, 8 children (4.2%) in the pneumococcal vaccine group and 13 (6.7%) in the control vaccine group were lost to follow-up. One child switched from the control to the pneumococcal vaccine group. It is unlikely that these small numbers of dropouts and crossovers influenced the trial results.

Furthermore, indices of validity and reliability are not fixed characteristics of FHS and HRQoL questionnaires but are influenced by the study design, the intervention and, in particular, the study population. Our study population had relatively severe ear disease with frequent episodes and was older than the average child with AOM. Assessment of the reliability and validity of the questionnaires in populations with less severe disease may reveal more ceiling effects and a lack of discriminant validity. Therefore, the results of this study should only be generalized to older paediatric populations (approximately 14–54 months) with moderate to severe recurrent acute ear infections.

Finally, of all the questionnaires in this study, only the FFQ demonstrated a reliability that meets the minimally required reliability coefficient for individual assessment of HRQoL. Although some authors suggest using FHS and HRQoL questionnaires for individual assessment in clinical practice as well [31], we do not support this approach. Routine use of these questionnaires has been suggested to facilitate the detection and discussion of psychological issues and to help guide decisions regarding, for example, referral. However, considering the complexity and many pitfalls of reproducibility and responsiveness assessment, individual use of HRQoL and FHS questionnaires in the follow-up of individual patients is neither reliable nor valid.

Recommendations for clinical use

In conclusion, the generic (RAND, FSQ Generic and FSQ Specific) as well as the disease-specific (OM-6, FFQ and, to a lesser extent, NRS Caregiver) questionnaires demonstrated similarly high reliability, adequate construct and discriminant validity, and sufficient responsiveness to justify their use in clinical studies of children with rAOM. However, the NRS as used in this study may be less adequate for assessment of HRQoL in this population. The TAIQOL, the only true generic HRQoL questionnaire, unfortunately showed poor discriminant validity and sensitivity to change, and needs extensive revision before further use in clinical outcome studies in children with otitis media. Using both a generic questionnaire (RAND or FSQ) and the OM-6 in clinical studies of FHS in children with rAOM is recommended, as this combines the merits of generalisability and sensitivity in outcome assessment and facilitates head-to-head comparisons of their performance in various paediatric populations with OM.

More studies are needed that assess the responsiveness of paediatric QoL questionnaires by multiple methods, both distribution-based and anchor-based, to increase our understanding of minimal clinically important changes in various paediatric conditions. Further studies on factors besides questionnaire type (generic versus disease-specific) that may influence sensitivity to change or responsiveness, such as questionnaire structure and content, disease severity, co-morbidity and other population characteristics, may increase our appreciation of the complex dynamics in HRQoL and FHS assessment.