Introduction

Health-related quality of life (HRQoL) is an important aid for evaluating clinical and policy interventions [1,2,3] and can be defined as “how well a person functions in their life and his or her perceived well-being in physical, mental, and social domains of health” [4]. Functioning refers to an individual’s ability to carry out some pre-defined activities, whereas well-being is understood as an individual’s subjective feeling(s) [4]. HRQoL measures that are accompanied by preference-based value sets generate utility scores that reflect preferences for health states on a cardinal scale where 0 represents being dead and 1 represents full health [1]. Preference-based HRQoL measures are widely recommended for use in publicly funded health care systems because of their role in cost-utility analysis, which informs reimbursement, regulatory, and pricing mechanisms [3, 5,6,7,8]. Furthermore, they are increasingly used as outcome measures in clinical trials and patient care [9,10,11,12] across a wide spectrum of conditions and environments [1,2,3, 13,14,15], partly because they are highly correlated with widely used health metrics, including morbidity, mortality, and healthcare costs [2, 3, 16]. However, there is discordance between the preference-based HRQoL measures that are recommended for use by health technology assessment agencies in different jurisdictions [17]. Guidelines for selecting preference-based HRQoL instruments for randomized trials and observational studies are lacking.

Preference-based HRQoL measures are made up of descriptive systems and accompanying valuation systems. The descriptive system defines HRQoL across a number of health states and the valuation system is a mathematical construct for scoring each possible health state described by the measure. The valuation or scoring system generates utility scores that reflect population preferences for living in a particular health state. While these utility scores are indexed on a cardinal scale where 1 indicates full health and 0 represents death, negative values are theoretically possible and represent health states considered worse than death [1,2,3, 18].

The choice of a preference-based HRQoL measure is a critical decision because of downstream consequences related to cost-utility analysis, its use for deriving quality-adjusted life-years (QALYs), and subsequent resource allocation decisions [19]. Thus, evidence comparing the performance of preference-based HRQoL measures is needed to justify the selection of the most appropriate assessment tool [20]. Furthermore, researchers attempting to measure health outcomes face a trade-off on whether to include a single or multiple HRQoL instruments in their studies. The latter option is not always possible because of budgetary and time constraints as well as evidence for lower completion rates when multiple instruments are used [21,22,23].

The Health Utilities Index Mark 3 (HUI3) and Short Form 6D (SF-6D) are two widely used preference-based HRQoL measures that are anchored on a cardinal scale (with 0 = dead and 1 = full health). Both generate utility scores that reflect population preferences for health states and that can be used to estimate QALYs for cost-utility analysis [1,2,3, 18]. HRQoL measures accompanied by preference-based value sets are often referred to as multi-attribute utility instruments in the literature [1,2,3, 18]. Scoring algorithms for these measures have been derived from nationally representative, community-based samples from different jurisdictions, such as Canada (HUI3) and the United Kingdom (SF-6D) [2, 3].

A recent review of guidance from health technology assessment (HTA) agencies on the preferred choice of preference-based HRQoL measures for cost-effectiveness based decision-making identified that, out of thirty-four guidelines, twenty-one recommended either the SF-6D (n = 11) or the HUI2 or HUI3 (n = 10) instruments [17]. Yet there is limited evidence from concurrent health status assessments using both the HUI3 and SF-6D instruments, and little consensus about the head-to-head performance of preference-based HRQoL measures across psychometric criteria.

Individuals born very preterm (VP; < 32 weeks’ gestation) or at very low birthweight (VLBW; < 1500 g) are at high risk of adverse functional, neurodevelopmental and behavioral outcomes [24,25,26,27,28], and their HRQoL is frequently examined because of increasing rates of preterm birth worldwide [13, 29, 30]. However, the agreement and discrepancies between the outputs of the HUI3 and SF-6D instruments have not been evaluated in head-to-head comparisons in a sample of VP or VLBW individuals.

This constrains efforts to enhance comparability and standardization of findings across different VP/VLBW studies, and reduces the transparency and reproducibility of outcomes research in this area. It is important to determine the most appropriate HRQoL instruments for individuals born VP/VLBW because preterm birth and low birthweight represent a growing public health concern. Increasing VP/VLBW rates coupled with improvements in survival rates place increased pressures on healthcare budgets worldwide [13, 29, 30]. At the same time, the evidence regarding agreement between these measures in general population samples is unclear.

The first study that described (dis)agreement between the outputs of the HUI3 and SF-6D was published over twenty years ago [3]; however, subsequent studies struggled to provide conclusive evidence and explain the source(s) of (dis)agreement. One explanation is that existing studies generally have not evaluated concurrent agreement of the HUI3 and SF-6D instruments in general population samples [31,32,33,34,35,36,37,38,39], since most studies recruited participants with specific conditions within clinical settings, such as tertiary care [31,32,33,34,35,36,37,38,39] or primary care [35]. Thus, existing findings may not be generalizable across general population contexts. To our knowledge, only one study has assessed levels of agreement between the HUI3 and SF-6D measures [40] among healthy individuals. However, given the overall study design employed, the evidence related to levels of agreement of the HUI3 and SF-6D measures within the general population is not conclusive [1]. Furthermore, results from other studies were based on patients recruited into clinical trials [31, 32, 38], which are prone to experimental design limitations [41, 42]. Finally, the majority of studies that assessed agreement between the outputs of the HUI3 and SF-6D were limited to one country or geographic region [31,32,33,34,35,36,37,38,39] and thus may have limited external validity.

Previous research called for comparative evaluations of the HUI3 and SF-6D measures across a diverse range of health conditions [43, 44]. Furthermore, research has advocated new comparative evaluation studies that use larger samples, maximizing power and enhancing comparability by combining data across multiple cohort studies [44]. To overcome the limitations associated with analyses restricted to a specific disease or disorder, conducted within limited clinical settings or within a single geographical region, the use of individual patient data (IPD) analysis consolidated over several geographically diverse cohorts offers advantages. This study uses IPD from European and Australian multi-site collaborative cohorts to inform the choice of the HUI3 and/or SF-6D measures for research studies that consider the consequences of VP/VLBW in adulthood, and to inform assessments of the cost-effectiveness of preventive or treatment interventions related to VP/VLBW status.

This study has the following aims: (a) to examine the agreement between the outputs of the HUI3 and SF-6D measures among adults born VP/VLBW and controls and to explain the sources of disagreement between instruments and (b) to provide useful information for the selection of preference-based HRQoL instruments for trials or research studies that ascertain the long-term consequences of VP/VLBW and birth at term or with normal birthweight.

Methods

Data

The following criteria were used to identify relevant prospective cohorts: (1) used two distinct preference-based measures to assess HRQoL in adulthood (defined as ≥ 18 years [45]) amongst individuals born VP/VLBW, (2) included a comparison control group of term-born and/or normal birthweight individuals, and (3) contributed data to the RECAP consortium (www.recappreterm.eu), a database of cohorts of individuals born VP/VLBW. Eligible cohorts were identified from two recent systematic reviews of preference-based HRQoL outcomes following preterm birth or low birthweight [46, 47]. The following two prospective cohort studies met the study inclusion criteria: The Bavarian Longitudinal Study (BLS) [48] and The Victorian Infant Collaborative Study (VICS) [49]. These two studies were designed to assess the associations of VP/VLBW status with various health outcomes [50]. Both received country-specific ethical approvals and obtained participants’ written informed consent in adulthood.

Table 1 describes the background eligibility criteria, age(s) at assessment, and the control groups for the BLS and VICS cohorts. Detailed descriptions of each participating cohort (the study’s population, methodology, types of data and variables) have been previously published [48, 49]. All variables of interest across BLS and VICS were harmonized, meaning that an identical set of definitions, scaling methods, and classifications was applied to all variables across the BLS and VICS cohorts.

Table 1 Background characteristics of cohorts

Outcome measures

Participants’ perceptions of their HRQoL were assessed using both the HUI3 and SF-12 [48, 49]. Study participants completed the unedited Health Utilities Index 15-item questionnaire for usual health status assessment, which was obtained from the Health Utilities Index developers and covers the HUI3 health status classification system. The HUI3 was developed to describe HRQoL in general population and clinical contexts and consists of eight attributes: ambulation, dexterity, cognition, vision, hearing, speech, emotion, and pain [51,52,53]. Within each attribute, the levels of function were scored on a 5- or 6-point scale ranging from optimal function to severe impairment. Responses within each of the eight attributes can be valued as single attribute utility (SAU) scores on a scale ranging from 0 to 1 [51]. Responses within each of the eight attributes can also be mapped onto an eight-attribute health status vector. Algorithms reflecting the preferences of the general public for the HUI3 health states can be used to convert responses to the measure’s eight attributes into multiplicative multi-attribute utility scores. The Canadian algorithms [51,52,53,54] were applied in both cohorts, reflecting the preferences of 504 adults in the general population who were living in the city of Hamilton, Ontario, and who had previously been asked to value selected HUI3 health states using both visual analogue scaling and standard gamble techniques. HUI3 multi-attribute utility scores are valued on a cardinal scale ranging between −0.36 and 1.0, with −0.36 representing the worst possible HUI3 health state, 0.0 representing dead, and 1.0 representing full health [53, 54].
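The multiplicative scoring step described above can be sketched as follows. This is a minimal illustration rather than the licensed scoring procedure: the function and argument names are hypothetical, the per-attribute multipliers in the example are made up, and the constants 1.371 and 0.371 are the values commonly quoted for the Canadian HUI3 function, which should be verified against [53, 54] before any use.

```python
import math

def hui3_multi_attribute_utility(multipliers, scale=1.371, shift=0.371):
    """Multiplicative multi-attribute utility on the dead (0) to full health (1) scale.

    `multipliers` holds one published lookup value per HUI3 attribute
    (vision, hearing, speech, ambulation, dexterity, emotion, cognition, pain).
    `scale` and `shift` are assumptions to be checked against the source tables.
    """
    if len(multipliers) != 8:
        raise ValueError("HUI3 has eight attributes; expected eight multipliers")
    return scale * math.prod(multipliers) - shift

# Full health: every multiplier equals 1, so the utility is 1.
full_health = hui3_multi_attribute_utility([1.0] * 8)

# Made-up multipliers for a respondent with mild impairment in two attributes.
example = hui3_multi_attribute_utility([1.0, 1.0, 1.0, 0.95, 1.0, 1.0, 0.92, 1.0])
```

Because the attribute multipliers are multiplied rather than summed, impairments in several attributes compound, which is one way a multiplicative function can produce scores below 0 for very severe health states.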

The SF-12 includes 12 of the 36 items contained within the SF-36. These have an identical dimension structure [55], and for each dimension, item responses are mapped onto a 0 to 100 scale. Responses to the SF-12 items were converted [56] into SF-6D multi-attribute utility scores using the UK SF-6D utility algorithms [55]. The SF-6D algorithms reduce the eight dimensions of the SF-36/12 to six by merging role limitations due to emotional and physical problems and eliminating general health perceptions. SF-6D multi-attribute utility scores are valued on a cardinal scale ranging between 0 and 1.0, with 0 representing dead and 1.0 representing full health [55]. For the SF-6D, only two out of six dimensions (physical functioning, role limitations) reflect physical aspects of health, while other dimensions (social functioning, pain, mental health, vitality, and emotional) relate to non-physical aspects of health. By contrast, most HUI3 attributes reflect the physical health of the individual (vision, hearing, speech, ambulation, and dexterity).

We used the following outcome variables of interest: HUI3 and SF-6D multi-attribute utility scores and the difference between HUI3 and SF-6D utility scores. The minimum clinically important difference in multi-attribute utility score is considered to be 0.03 for the HUI3 [57] and 0.04 for SF-6D [58, 59].

Empirical analyses

We combined IPD across the BLS and VICS cohorts. To identify whether our assessments of agreement between the HUI3 and SF-6D measures should be disentangled by birth status, we initially estimated the association between VP/VLBW status and HRQoL in adulthood using one-stage IPD analysis, which could be implemented either using fixed or random effects [60]. Fixed effects models were used because individuals born VP/VLBW and controls were enrolled across distinct geographical regions and time frames. This implies the presence of systematic differences across the BLS and VICS cohorts. However, we also utilized random effects as a robustness check. Models were adjusted for age and sex of the participants, mode of delivery (cesarean section vs vaginal delivery), and number of days in hospital after birth, as well as for the harmonized socio-demographic/socio-economic variables: maternal education level at birth or during childhood and maternal ethnicity.
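A one-stage fixed effects analysis of this kind can be thought of as a single regression on the pooled records with an indicator for cohort. The sketch below illustrates that idea only: it uses made-up records, hypothetical field names (`utility`, `vp_vlbw`, `cohort_vics`), ordinary least squares solved by hand, and it omits the covariate adjustment described above.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def one_stage_fixed_effects(records):
    """OLS of utility on VP/VLBW status with a fixed cohort effect.

    Returns [intercept, vp_vlbw coefficient, cohort coefficient].
    """
    x = [[1.0, float(r["vp_vlbw"]), float(r["cohort_vics"])] for r in records]
    y = [r["utility"] for r in records]
    p = 3
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(x, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Made-up pooled records generated exactly from 0.90 - 0.04*vp_vlbw + 0.02*cohort_vics.
records = [
    {"utility": 0.90 - 0.04 * vp + 0.02 * cohort, "vp_vlbw": vp, "cohort_vics": cohort}
    for vp in (0, 1) for cohort in (0, 1) for _ in range(3)
]
intercept, vp_effect, cohort_effect = one_stage_fixed_effects(records)
```

The cohort indicator absorbs systematic differences in level between the two cohorts, so the VP/VLBW coefficient is estimated from within-cohort contrasts.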

We computed means and standard deviations with t tests for unequal variances, and medians with Kruskal–Wallis tests, to assess differences between HUI3 and SF-6D multi-attribute utility scores within VP/VLBW individuals, controls, and the combined sample. To identify statistically significant predictors of the observed differences between HUI3 and SF-6D multi-attribute scores, we regressed these differences on covariates using generalized mixed models in a one-step approach. Models were estimated using multivariate linear fixed effects.
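For readers unfamiliar with the unequal-variance (Welch) t test mentioned above, the statistic and its approximate degrees of freedom can be computed as in the following sketch. The function name and the example scores are illustrative assumptions; the p-value step, which requires the t distribution, is left to a statistics package.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    var_a = statistics.variance(sample_a)  # sample variances (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    se_sq = var_a / n_a + var_b / n_b      # squared standard error of the difference
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se_sq)
    df = se_sq ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t_stat, df

# Made-up utility scores from two instruments.
hui3_scores = [0.95, 0.80, 1.00, 0.60, 0.90, 0.85]
sf6d_scores = [0.85, 0.75, 0.90, 0.65, 0.80, 0.78]
t_stat, df = welch_t(hui3_scores, sf6d_scores)
```

Unlike the pooled-variance t test, this form does not assume that the two sets of scores have the same spread, which matters here because the two instruments have different utility ranges.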

Furthermore, agreement between the HUI3 and SF-6D multi-attribute utility scores was investigated using intra-class correlation coefficients and Bland–Altman plots. The analysis was performed for VP/VLBW individuals and controls separately as well as for the combined sample. An intra-class correlation coefficient less than 0.75 is indicative of moderate agreement, while an intra-class correlation coefficient greater than 0.75 indicates good agreement [60, 61]. Bland–Altman plots display the mean of the two scores, \(\left( \frac{\text{HUI3} + \text{SF-6D}}{2} \right)\), against their difference (HUI3 − SF-6D). The line of mean difference estimates the systematic difference between the two instruments, with limits of agreement estimated as the mean difference plus/minus 1.96 standard deviations of the differences. Limits of agreement (LoA) reflect the expected range in which 95% of observed differences would lie, with wider limits of agreement indicating poorer agreement [62]. Good concordance between the HUI3 and SF-6D would show a mean difference close to zero with ≤ 5% of scatter points lying outside the limits of agreement.
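The Bland–Altman quantities described above can be computed from paired scores as in the sketch below. It is a minimal illustration on made-up data, with hypothetical function and key names; plotting is omitted.

```python
import statistics

def bland_altman(hui3, sf6d):
    """Mean difference, 95% limits of agreement, and share of points outside them."""
    if len(hui3) != len(sf6d) or len(hui3) < 2:
        raise ValueError("need two paired score lists of equal length (n >= 2)")
    diffs = [h - s for h, s in zip(hui3, sf6d)]        # HUI3 minus SF-6D (y-axis)
    means = [(h + s) / 2 for h, s in zip(hui3, sf6d)]  # mean of the two scores (x-axis)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)                  # sample SD of the differences
    lower = mean_diff - 1.96 * sd_diff
    upper = mean_diff + 1.96 * sd_diff
    outside = sum(1 for d in diffs if d < lower or d > upper) / len(diffs)
    return {
        "means": means,
        "diffs": diffs,
        "mean_diff": mean_diff,
        "loa": (lower, upper),
        "share_outside": outside,
    }

# Made-up paired utility scores for five respondents.
hui3 = [0.95, 0.80, 1.00, 0.60, 0.90]
sf6d = [0.85, 0.75, 0.90, 0.65, 0.80]
result = bland_altman(hui3, sf6d)
```

A positive `mean_diff` indicates that HUI3 scores are on average higher than SF-6D scores, and the width of `loa` can be compared directly with the minimum clinically important differences for each instrument.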

Analyses were performed using Stata version 17, and p-values of 0.05 or less were considered statistically significant.

Results

Baseline characteristics of prospective cohort studies

Table 1 displays baseline characteristics of the participants of the BLS and VICS cohorts. Years of birth ranged from 1985 to 1986 for BLS and 1991–1992 for VICS. Pooled data consisted of 778 HUI3 assessments (417 VP/VLBW individuals and 361 controls) and 780 SF-6D assessments (411 VP/VLBW and 369 controls). The mean age at assessment was 18 years for VICS participants and 26.3 years for BLS participants. Table 2 shows the characteristics of VP/VLBW individuals and controls with non-missing HUI3 and SF-6D multi-attribute utility scores. Within the meta-cohort no statistically significant differences were found by birth status across the following characteristics: age, sex of the participants, maternal education level at birth or during childhood, and maternal ethnicity.

Table 2 Characteristics of VP/VLBW individuals and controls within HUI3 and SF-6D Meta-cohorts

Relationship between VP/VLBW status and HRQoL using HUI3 vs SF-6D

Using a one-stage IPD meta-analysis, to identify whether our assessments of agreement between the HUI3 and SF-6D measures should be disentangled by VP/VLBW status, we initially estimated the association between VP/VLBW status and HRQoL in adulthood. The adjusted impact of VP/VLBW status on the HUI3 multi-attribute utility score was −0.04 (95% CI −0.06, −0.01), with no significant impact on the SF-6D multi-attribute utility score (Table 3). To understand the sources of the identified differences, we present additional evidence in Online Appendix A (Tables A.1, A.2). We also estimated random effects models; the results are reported in Online Appendix B. Further evidence on the association between VP/VLBW status and HRQoL in adulthood using HUI3 and SF-6D can be found in a recent study [63].

Table 3 One-stage IPD meta-analyses: Impact of preterm birth on HUI3-MAU score and SF-6D-MAU score, all cohorts combined

Comparison of HRQoL assessed by the HUI3 and SF-6D

Table 4 displays descriptive and inferential statistics for HUI3 and SF-6D multi-attribute utility scores for each group considered. Mean and median estimates for HUI3 multi-attribute utility scores were consistently higher compared with their respective SF-6D values. All differences were clinically [57,58,59] and statistically significant within the meta-cohort across all groups considered (p < 0.01). Table 5 shows the estimates from regressing differences between HUI3 and SF-6D multi-attribute scores on covariates. The evidence suggests that none of the variables considered was a statistically significant predictor of observed differences between multi-attribute scores.

Table 4 HUI3 and SF-6D utility scores, differences between scores and quantification of agreement by VP/VLBW, controls, and combined sample
Table 5 HUI3 and SF-6D utility scores, differences between scores and quantification of agreement by VP/VLBW, and controls

The correlation coefficient (ρ) between HUI3 and SF-6D multi-attribute utility scores for the BLS and VICS cohorts was computed. The evidence showed that ρ between the two multi-attribute utility scores within VICS was 0.45 (ρ = 0.51 for VP/VLBW individuals and ρ = 0.31 for controls), which was higher compared with ρ = 0.35 within the BLS cohort (ρ = 0.37 for VP/VLBW individuals and ρ = 0.33 for controls). Within the meta-cohort, the intra-class correlation coefficient (ICC) was 0.40 for the VP/VLBW sample, 0.29 for the controls, and 0.36 for the combined sample. Overall, the evidence suggests that the HUI3 and SF-6D multi-attribute scores had moderate or low correlation.
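To illustrate how an intra-class correlation between two sets of paired scores can be obtained, the sketch below computes a two-way random effects, absolute agreement, single-measure coefficient, often written ICC(2,1), from ANOVA mean squares. The choice of this ICC form is an assumption made for illustration, since the form used for the estimates above is not stated here; the function name and data are likewise hypothetical.

```python
def icc_2_1(scores_a, scores_b):
    """ICC(2,1) for two measurements per subject, from two-way ANOVA mean squares."""
    n, k = len(scores_a), 2
    if n != len(scores_b) or n < 2:
        raise ValueError("need two paired score lists of equal length (n >= 2)")
    values = list(scores_a) + list(scores_b)
    grand_mean = sum(values) / (n * k)
    subject_means = [(a + b) / 2 for a, b in zip(scores_a, scores_b)]
    instrument_means = [sum(scores_a) / n, sum(scores_b) / n]

    ss_subjects = k * sum((m - grand_mean) ** 2 for m in subject_means)
    ss_instruments = n * sum((m - grand_mean) ** 2 for m in instrument_means)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_error = ss_total - ss_subjects - ss_instruments

    ms_subjects = ss_subjects / (n - 1)
    ms_instruments = ss_instruments / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_subjects - ms_error) / (
        ms_subjects + (k - 1) * ms_error + k * (ms_instruments - ms_error) / n
    )

# Made-up paired utility scores.
hui3 = [0.95, 0.80, 1.00, 0.60, 0.90]
sf6d = [0.85, 0.75, 0.90, 0.65, 0.80]
icc = icc_2_1(hui3, sf6d)
```

Unlike a simple correlation coefficient, an absolute agreement ICC is lowered by a systematic offset between instruments, so two instruments can be well correlated and still show weak agreement.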

The Bland–Altman plots were constructed by birth status (see Fig. 1) and showed a mean difference of 0.06 (95% CI 0.04, 0.07) for controls, i.e., HUI3 multi-attribute utility scores for controls were higher than the SF-6D multi-attribute utility scores for controls. The mean difference for VP/VLBW individuals was 0.03 (95% CI 0.01, 0.05), meaning that HUI3 multi-attribute utility scores were higher than SF-6D multi-attribute utility scores in this group. In the Bland–Altman plot (Fig. 1), the data points deviate widely from the agreement line at low levels of mean utility and the relationship between the difference in HUI3 and SF-6D utilities shifts in magnitude but not in direction. The same pattern is observed by combining the VP/VLBW sample with controls (Fig. 2), generating a mean difference between the paired observations of 0.04 (95% CI 0.03, 0.06). Notably, in all groups considered, the Bland–Altman plots showed a funneling effect with stronger agreement as the mean overall utility score approached 1.0. However, in the Bland–Altman plots, the 95% LoA ranged from −0.30 to 0.37 within the VP/VLBW sample, −0.22 to 0.34 within controls, and −0.27 to 0.36 within the combined sample. Most importantly, in all three groups considered (VP/VLBW, controls, and the combined sample), the 95% limits of agreement were far wider than the clinically meaningful differences postulated for the HUI3 and SF-6D.
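As a worked illustration using only the figures reported above, the width of each 95% LoA interval can be set against the minimum clinically important differences of 0.03 (HUI3) and 0.04 (SF-6D):

\[
\begin{aligned}
\text{VP/VLBW:} \quad & 0.37 - (-0.30) = 0.67 \\
\text{Controls:} \quad & 0.34 - (-0.22) = 0.56 \\
\text{Combined:} \quad & 0.36 - (-0.27) = 0.63
\end{aligned}
\]

Even the narrowest interval (0.56, for controls) is fourteen times the larger of the two thresholds (0.56 / 0.04 = 14).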

Fig. 1 The Bland–Altman plots by VP/VLBW status

Fig. 2 The Bland–Altman plots for VP/VLBW and controls combined

Discussion

This study provides the first comparative evaluation of the HUI3 and SF-6D among adults born VP/VLBW and normal birthweight or term born controls. The results show a considerable degree of disagreement between the two sets of multi-attribute utility scores, consistent with previous reports for specific diseases [31, 33, 37, 40]. The patterns underlying differences vary, however, in a number of important aspects when compared with previous research. Our results identified less agreement compared with previous comparative evaluations of the HUI3 and SF-6D measures. Interestingly, our study found that agreement between the HUI3 and SF-6D measures was weaker in term-born or normal birthweight controls compared with VP/VLBW individuals.

Overall, the HUI3 and SF-6D measures disagree substantially: VP/VLBW status was found to be associated with minimally important decrements in utility score when health status was ascertained with the HUI3 but not with the SF-6D. Furthermore, results show discordance between the outputs of the HUI3 and SF-6D in VP/VLBW individuals, controls, and the combined sample. This implies that the HUI3 and SF-6D each provide unique information on different aspects of health status across the groups considered and suggests that the HUI3 better captures preterm-induced changes to HRQoL in adulthood.

The evidence consistently demonstrates that the HUI3 and SF-6D instruments are not interchangeable for use in cost-utility based decision-making for interventions that target adults born VP/VLBW [64, 65]. Because our study also investigated concordance between the HUI3 and SF-6D in term-born or normal birthweight controls, the findings imply that the measures might also not be interchangeable for use in more general population samples.

Furthermore, given the evidence provided in this study regarding the level of agreement between the HUI3 and SF-6D measures overall, our findings imply that studies focused on capturing the physical and cognitive effects of interventions should employ the HUI3 as a primary instrument, with the SF-6D as a potential supplementary measure. Our study implies that the HUI3 may be preferred to the SF-6D for studies aimed at quantifying physical and cognitive aspects of health, particularly since, for the SF-6D, only two out of six dimensions (physical functioning, role limitations) reflect physical aspects of health, while other dimensions (social functioning, pain, mental health, vitality, and emotional) relate to non-physical aspects of health. By contrast, most HUI3 attributes reflect the physical health of the individual (vision, hearing, speech, ambulation, and dexterity). Prioritization of a preferred multi-attribute utility measure might increase the value of research design and potentially reduce unnecessary research costs related to primary data collection. Our results indicate that the HUI3 and SF-6D instruments are not interchangeable for use in clinical, population research, and cost-effectiveness based decision-making that considers the long-term consequences of VP/VLBW status [64, 65].

Our overall results are consistent with the differences in the HUI3 and SF-6D descriptive systems. Specifically, given that the HUI3 explicitly asks about a person’s vision, dexterity, ambulation, and cognition, while SF-6D does not, it is perhaps expected that VP/VLBW individuals, who are known to have impaired outcomes associated with these attributes [24,25,26,27,28], have lower levels of utility according to the HUI3 than according to the SF-6D. The evidence shows that discrepancies in the health descriptive systems of the HUI3 and SF-6D instruments may drive the differences in multi-attribute utility scores of VP/VLBW individuals and controls in adulthood. Our study demonstrates that variation in the descriptive systems of the measures is likely to be a major contributory factor to variation in the utility scores. Results of this research corroborate the conclusions of a study that analyzed patients in several disease areas and found that the EQ-5D, SF-6D, HUI3, 15D, QWB, and AQoL-8D instruments measure related but different constructs [44]. Also, the study concluded that the instruments differ in their relationship to different health dimensions, and the differences are primarily the result of the instruments’ descriptive systems.

Our study advances the literature because we provide clear evidence that differences in descriptive systems explain, at least in part, disagreement found between the outputs of the HUI3 and SF-6D measures. The evidence shows that the discordance between the outputs is observed within both adults born VP/VLBW and controls. However, differences related to HUI3 and SF-6D valuation protocols and utility ranges may also partly contribute to the differences in multi-attribute utility scores we document in this study. Furthermore, the study is the first in the literature to use a meta-analysis in this context combining data from two longitudinal prospective cohort studies.

This study does not imply that the HUI3 measure is generally preferable to the SF-6D when health outcomes associated with clinical or public health interventions are ascertained. Rather, it provides insights for future research related to agreement between the HUI3 and SF-6D measures and suggests that the HUI3 classification system, unlike the SF-6D, is able to capture consequences of VP/VLBW status in adulthood, which is consistent with prior documented patterns reported in the disability literature [24,25,26,27,28]. We are not arguing against the use of the SF-6D or other preference-based HRQoL measures to investigate consequences of VP/VLBW status.

However, this study provides insight for stakeholders seeking to understand what instruments to use for comparative effectiveness research related to preterm birth and low birthweight. Further investigation is needed to understand the between-measure discrepancies attributable to descriptive classification systems for other measures, including the EQ-5D, which is widely recommended in HTA guidelines [8, 17, 66,67,68,69], to inform the methodological debate and guide the selection of the most appropriate HRQoL instruments. Overall, the current study highlights the need to carefully consider the outcomes of interest and the characteristics of the condition being studied when selecting an HRQoL instrument.

Strengths and limitations

The data structure made it possible to examine the agreement between measures within VP/VLBW individuals, normal birthweight or term born controls, and within the combined sample. We were able to assess the validity of the results by replicating the main finding across different populations, which strengthens the study’s conclusions and which had not been studied previously as far as we are aware. This is the major strength of this study. Furthermore, our study ascertained agreement between the outputs of the HUI3 and SF-6D measures using controls selected from the general populations in Germany and Australia. This implies that results of this study may be generalizable to populations from Germany and Australia.

Another strength of this study is that we were able to confirm VP/VLBW status in each participant due to the rigorous recruitment, data collection, and follow-up methods utilized by the participating cohorts, which also harmonized relevant socio-demographic factors. Furthermore, our study employed socioeconomically diverse samples of VP/VLBW individuals and controls. Finally, results of this study are not affected by biases associated with proxy parental reporting [70] because participating cohorts used self-reported HRQoL data.

It is important to note that the scoring algorithms for the HUI3 and SF-6D differ in certain respects. Thus, while our study shows that the utility differences we found are driven by the underlying concepts of health being measured, the methods employed are not able to measure the contributory effects of valuation protocols, i.e., differences in scoring algorithms. A further limitation is that our study included cohorts from only two countries. Thus, replication of this study with data from other countries, particularly low- or middle-income countries, would be a valuable contribution to the literature.

Our report did not include the EQ-5D in this comparative evaluation because no individual study that contributed to the RECAP platform assessed HRQoL using the EQ-5D. This is a limitation because a recent review identified that the EQ-5D is the most frequently recommended multi-attribute utility instrument in HTA guidelines [17]. Thus, our study is not able to provide comprehensive evidence regarding the most appropriate preference-based HRQoL measure to ascertain utility scores in adulthood for VP/VLBW individuals or for normal birthweight or term born controls. Comparing agreement of the EQ-5D, HUI3, and SF-6D for VP/VLBW individuals and normal birthweight or term born controls offers a fruitful direction for further investigation.

Conclusion

The evidence from two longitudinal cohort studies conducted in Australia and Germany demonstrates poor agreement between the HUI3 and SF-6D in VP/VLBW individuals and normal birthweight or term born controls. It may be beneficial to use both the HUI3 and SF-6D instruments when evaluating health outcomes of interventions related to gestational age at birth and/or birthweight. However, studies focused on measuring physical or cognitive aspects of health will likely benefit from prioritizing the use of the HUI3 in order to better detect and quantify the effects of health interventions or assess outcomes.