Key Points for Decision Makers

Our systematic review identified 110 versions of generic multidimensional patient-reported outcome measures (PROMs) for children (aged ≤ 18 years) spanning childhood age groups and conceptual bases of functional, disability and health status, quality of life and health-related quality of life.

A supplementary systematic review identified 21 preference-based value sets for ten PROMs designed to be accompanied by preference-based value sets.

Our catalogues of (1) PROMs categorised by target age group, conceptual bases and related characteristics (domain coverage, respondent type and design) and (2) value sets appraised for their development and statistical features can aid the development, selection and interpretation of appropriate childhood PROMs for clinical and population research and cost-effectiveness-based decision-making.

1 Background

Patient-reported outcome measures (PROMs) enable direct assessment of health status, health-related quality of life (HRQoL) or quality of life (QoL) by patients or individuals [1,2,3]. They are used extensively as outcome instruments and can support patient-centred care [2, 4]. The use of PROMs in childhood populations (aged ≤ 18 years) presents methodological challenges compared to application in adults. One challenge is to account for age-based biopsychosocial developmental differences between children and adults and across children of different ages [5,6,7]. Distinguishing the age or age group of the target childhood population is important for content validity or concept coverage of the childhood PROM, its capacity for child self-report, appropriate design, and the culturally appropriate age for child self-report [2].

There are many ways that PROMs can be categorised. They can be divided into generic measures and disease or condition-specific measures, and further into domain-specific measures and multidimensional measures [4, 8]. Generic measures have the advantage of allowing comparisons across conditions and interventions. Multidimensional measures can be either generic or condition-specific and capture the multifaceted nature of concepts including health, QoL or HRQoL. PROMs can thus be further categorised by the concept(s) they seek to measure through their descriptive systems. There are multiple definitions of these concepts in the literature, with terms often used interchangeably [3, 9]. In this paper, we draw on the categories arising from a synthesis of definitions of functioning, disability, health, HRQoL and QoL used by the World Health Organization (WHO). In this taxonomy, a QoL/HRQoL measure can be distinguished from a functioning, disability and health (FDH) measure by its reflection of the individual respondent’s perception, or subjective judgement of importance, of their assessed status [3]. Specifically, an FDH measure captures the ‘interactions among body structures and function, and activities and participation in the context of the environment and personal factors’, whilst QoL is ‘a person’s perception of [his/her] position in life’, where perception involves a subjective judgement over how the position relates to his/her goals, expectations, standards, enjoyment or concerns and not just a self-report of the position. HRQoL represents the individual’s ‘perception of his/her health and health-related states’ where ‘health’ represents a narrower domain than ‘position in life’ (p. 1086) [3].

PROMs can be accompanied by preference-based value sets which produce an overall index when applied to the multidimensional states generated by the descriptive systems [8]. PROMs accompanied by preference-based value sets are often referred to as multi-attribute utility instruments in the literature [10]. The value sets accompanying these measures use stated preference methods including standard gamble (SG) and time trade-off (TTO), which ask respondents to trade off a health state against mortality risk or life expectancy, respectively [11]. A key aspect of these methods is their ability to yield values that are anchored on a scale with 0 = dead and 1 = full health, as required for the use of values for estimating quality-adjusted life years (QALYs) in cost-utility analysis. Negative values are theoretically possible and represent health states considered worse than being dead. Discrete choice experiments (DCE) can also be used to elicit stated preferences for value sets for PROMs, although the resulting values lie on a latent scale unless accompanied by additional methods to allow anchoring at dead = 0 [12]. Decision makers around the world have shown a preference for country-specific value sets reflecting underlying differences in preferences of different populations [13,14,15].
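As a minimal illustration of why anchoring at 0 = dead and 1 = full health matters, the sketch below weights the time spent in each health state by its value and sums the results to give QALYs. The function name, the example profile and its values are hypothetical and are not drawn from any value set reviewed here; discounting is ignored for simplicity.

```python
def qalys(profile):
    """Sum of (value x years) over a sequence of (value, years) pairs.

    Values are assumed to be anchored at 0 = dead and 1 = full health.
    Negative values (states worse than dead) are allowed and reduce the total.
    """
    total = 0.0
    for value, years in profile:
        if value > 1:
            raise ValueError("values above 1 (full health) are not valid")
        if years < 0:
            raise ValueError("durations must be non-negative")
        total += value * years
    return total


# Hypothetical profile: 2 years at 0.8, 1 year at 0.5, 6 months at -0.1.
example = [(0.8, 2.0), (0.5, 1.0), (-0.1, 0.5)]
print(round(qalys(example), 3))
```

Values on a latent scale, such as unanchored DCE estimates, could not be used in this way because the products would have no interpretation in life-year equivalents.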

There are considerable methodological challenges that arise in valuing childhood PROMs. Among them, there is contention around whose preferences are relevant, with some instruments offering both adult- and child-elicited value sets [16]. Eliciting values from children themselves may help to ensure that health services are relevant to their needs [17, 18], particularly when adults’ and children’s preferences differ [19]. However, there are practical and ethical issues surrounding the age at which children can be asked to participate in stated preference studies, especially where these include tasks that involve trade-offs with life expectancy. Both SG and TTO may impose high cognitive and/or emotional burden on children relative to DCE or best–worst scaling (BWS) tasks [17]. Where values are used to inform the allocation of healthcare resources financed primarily from taxation, a common normative position is that these should be based on the preferences of the adult general public [12, 20]. However, valuing childhood PROMs using adults’ preferences also raises issues. For example, it is unclear whether the respondent should be asked to think about themselves as a child, their child or another child [20].

Several systematic reviews of generic multidimensional childhood PROMs have been published since 2000 [3, 4, 21,22,23,24,25,26]. The most recent, by Janssens et al., identified 63 unique versions of PROMs (including eight accompanied by value sets) published between 1992 and 2011 [4]. Their review tabulated the characteristics of PROMs, including target age range, respondent type, domain coverage and several design features (e.g. number of items, response options and recall period) [4]. That review also mapped the domains of the PROMs onto the WHO’s International Classification of Functioning, Disability and Health, Child and Youth version (ICF-CY) framework [27], but noted the shortcoming of conflating the domains of FDH and QoL/HRQoL measures that have a different conceptual basis. It did not distinguish between types of informant response as recommended by the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) guideline, i.e. proxy if the informant infers the child’s perception of health/QoL states and observer if the informant focuses on observable traits without inferring perception [2]. Moreover, little attention was paid to differences in characteristics of measures across age groups, or to the features of value sets. Fayed and colleagues distinguished between FDH and QoL/HRQoL measures; however, their search period was limited to 2004–2008 [3]. Cremeens et al. [21] and Grange et al. [24] limited their reviews to 3- to 8-year-olds and < 5-year-olds, respectively, which hinders comparisons of PROM characteristics across all childhood age groups. Finally, Chen and Ratcliffe previously reported on the features of nine childhood PROMs accompanied by value sets, but this was not a systematic review [10]. These gaps in previous studies inform the aim and objectives of this systematic review.

2 Aim and Objectives

This systematic review aims to generate a comprehensive catalogue of generic multidimensional childhood PROMs that rely on child self- and/or informant-report (hereafter childhood PROMs for brevity in accordance with the ISPOR guideline [2]) and the value sets that accompany them. In doing so, we aim to inform the development, selection and interpretation of childhood PROMs for clinical and population research and cost-effectiveness-based decision-making. Accordingly, the objectives are to:

1. Systematically identify generic multidimensional childhood PROMs and preference-based value sets that accompany them.

2. Categorise the PROMs according to conceptual basis and target age group and evaluate their methodological features according to the ISPOR good research practice recommendations on the use of childhood PROMs [2].

3. Catalogue the value sets that exist for childhood PROMs, compare the methods that have been used to produce them and report the characteristics of the resulting value sets.

3 Methods

A pre-specified protocol outlining the systematic review methods was developed and registered with the Prospective Register of Systematic Reviews (CRD42021230833). For reporting purposes, the review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [28]. The PRISMA checklist for this paper is provided in the Supplementary Information (see the electronic supplementary material).

3.1 Data Sources and Study Selection

Two independent systematic database searches were conducted. The first aimed to identify primary development studies for generic multidimensional childhood PROMs published between 1 January 2012 and 16 October 2020. Studies identified in the search were pooled with those published between 1 January 1992 and 20 March 2012 identified in the earlier systematic review by Janssens et al. [4]. A second search was conducted without date limits on 18 February 2021 to identify preference-based value sets for generic multidimensional childhood PROMs. This was constructed around measures identified by the first search that reported the development of a value set or the objective to develop a value set or to inform economic evaluations. Developers of PROMs were also contacted to confirm that, to the best of their knowledge, no published value set had been missed.

Both systematic searches covered seven academic databases (Medline, Embase, PsycInfo via Ovid, EconLit via Proquest, CINAHL via EBSCOHost, Scopus and Web of Science) and one grey literature database (PROQOLID). The Medline search strategies for the first and second searches are presented in Tables A and B in the Supplementary Information (see the electronic supplementary material). Searches were limited to English language texts. In both searches, references were imported into Endnote X9 and duplicates were removed. References were then exported into Covidence [29], where two researchers independently reviewed the titles and abstracts of identified articles. If an article received two approvals, it proceeded to the next stage, with disagreements referred to a third reviewer for the final assessment. Full text articles were independently reviewed by two reviewers, with disagreements again referred to the third reviewer for final assessment.

The inclusion criteria for the systematic search of PROMs were as follows: studies reporting the development of a generic multidimensional childhood (aged ≤ 18 years) PROM and studies published in the English language. Conference abstracts with sufficient information (i.e. containing at least the acronym/name of the instrument(s), the aim/motivation of development, whether a general childhood population was targeted, and the multidimensionality of the instrument) were included alongside full text articles. Studies in adults only and animals were excluded, as were commentaries, single case reports and secondary application of PROMs. References of included studies were searched, including those published before 1 January 2012. The inclusion criteria for the systematic search to identify value sets were as follows: studies reporting the development of a value set for one or more generic multidimensional childhood PROM; the value set was anchored on a 0–1 scale required for the use of the values in QALY estimation; and studies (including full text articles and a user’s manual [no conference abstracts were identified]) published in the English language. Studies that only reported values for a sub-set of the health states described by a childhood PROM were excluded.

3.2 Data Extraction

For the first review that identified generic multidimensional childhood PROMs, data for each measure were extracted independently by two reviewers per article from a pool of eight reviewers, with disagreements resolved through discussion. For the second review that identified preference-based value sets, data for each value set were extracted by one reviewer, with a 20% check performed by a second reviewer. For the identified generic multidimensional childhood PROMs, the following characteristics were extracted according to proformas in Excel: acronym/name of the instrument, name of author(s)/developer(s), development year and country(ies), aim/motivation of development, target age range, background concept (e.g. definition of health/QoL adopted before commencing PROM development), development methods (e.g. focus groups with children for item selection), role of children in development, level of child’s perception captured, respondent type (e.g. child, proxy, observer), administration mode (e.g. by self, by interviewer), recall period, domains covered, number of items, response options and method for score generation (other than value set application).

For the identified value sets of the generic multidimensional childhood PROMs, the following characteristics were extracted: acronym/name of the instrument, name of author(s)/developer(s), valuation year and country(ies), methods for selecting valued health states, number of valued health states, sampling methods for valuation including target population, recruitment strategies, response rates, sample size, representativeness of sample and reasons for exclusion, and stated preference data collection methods. The latter included elicitation technique(s) (e.g. SG, TTO, DCE), the approach taken to anchoring at 0–1, characteristics of respondents, perspective and administration mode, and modelling methods. Finally, the properties of the value sets were examined (e.g. value range, proportion of health states with negative values). As some of these properties were not reported in the papers, we coded the value set algorithms based on the published results and calculated the index scores for all potential health states according to the classification system of the instruments.
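The last step above, coding an algorithm and scoring every possible health state, can be sketched as follows. This is only an illustration under stated assumptions: it uses a simple additive form and a hypothetical three-dimension, three-level classification system with invented decrements, not the algorithm of any instrument in this review.

```python
from itertools import product

# Hypothetical additive value set: 3 dimensions, 3 levels each.
# Entry [d][l] is the decrement subtracted from 1 when dimension d is at level l + 1.
DECREMENTS = [
    [0.0, 0.15, 0.55],
    [0.0, 0.10, 0.35],
    [0.0, 0.20, 0.40],
]


def index_value(state, decrements=DECREMENTS):
    """Additive index: 1 minus the sum of level decrements (levels are 1-based)."""
    return 1.0 - sum(decrements[d][level - 1] for d, level in enumerate(state))


def summarise(decrements=DECREMENTS):
    """Score every possible state; report the value range and share of negative values."""
    levels = [range(1, len(dim) + 1) for dim in decrements]
    values = [index_value(state, decrements) for state in product(*levels)]
    n_negative = sum(v < 0 for v in values)
    return {
        "n_states": len(values),
        "min": min(values),
        "max": max(values),
        "pct_negative": 100.0 * n_negative / len(values),
    }


print(summarise())
```

Enumerating the full state space in this way gives the value range and the proportion of states valued below zero even when a valuation paper reports only model coefficients.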

3.3 Data Synthesis

Using a method similar to that of Fayed et al. [3], the generic multidimensional childhood PROMs identified were categorised by conceptual basis. The wording and response options for measure items were reviewed independently by two reviewers per article from a pool of eight reviewers. Measures with 75% or more of items that captured a child’s perception (i.e. relation to enjoyment, satisfaction, goals, expectations, standards or concerns) on health or broader position in life were labelled HRQoL or QoL measures, respectively. Measures with 25% or fewer items capturing a child’s perception were labelled FDH measures. Those with between 25 and 75% of items capturing a child’s perception were labelled hybrid HRQoL-FDH or QoL-FDH measures. Fayed and colleagues similarly described the hybrid nature of many measures: e.g. the General Health Questionnaire was classified as ‘FDH and QoL/HRQoL’, the HUI as ‘FDH (with one HRQoL/QoL attribute)’ and the KIDSCREEN as ‘HRQoL/QoL (with some functioning features)’. Instead of these mixed descriptions, this study specifies cut-off levels to allow for comparison across a wide range of measures.
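The cut-off rule described above can be written as a short function. The sketch below is an illustration only: the function and argument names are hypothetical, and it assumes that the choice between QoL and HRQoL has already been made separately. Integer arithmetic is used so that measures sitting exactly on the 25% or 75% boundary are classified as the text specifies.

```python
def conceptual_basis(n_perception_items, n_items, concept="QoL"):
    """Label a measure from the share of items capturing the child's perception.

    concept is "QoL" or "HRQoL" and is assumed to have been decided beforehand.
    """
    if n_items <= 0 or not 0 <= n_perception_items <= n_items:
        raise ValueError("item counts are inconsistent")
    if concept not in ("QoL", "HRQoL"):
        raise ValueError("concept must be 'QoL' or 'HRQoL'")
    # 75% or more of items capture perception: QoL or HRQoL measure.
    if 4 * n_perception_items >= 3 * n_items:
        return concept
    # 25% or fewer of items capture perception: FDH measure.
    if 4 * n_perception_items <= n_items:
        return "FDH"
    # Strictly between 25% and 75%: hybrid measure.
    return f"hybrid {concept}-FDH"


print(conceptual_basis(9, 12, "HRQoL"))  # 75% of items
print(conceptual_basis(6, 12, "QoL"))    # 50% of items
print(conceptual_basis(3, 12))           # 25% of items
```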

Distinguishing between QoL and HRQoL was particularly challenging given the overlap between health and position in life. Indeed, Fayed et al. do not make this distinction in their results, grouping QoL and HRQoL under the same category [3]. A two-step pragmatic approach was taken: (1) refer to the stated aim/motivation of the study prior to measure development or use (e.g. some studies gave the definition of QoL or HRQoL and underlying constructs [30, 31]), and if still unclear, then (2) categorise as HRQoL measures if more than 50% of the domains cover body functioning, disabilities and daily activities (e.g. mobility, pain, vision, anxiety, depressive symptoms, chronic illness, dressing and grooming) that would constitute the concept of FDH as defined by Fayed et al. [3].
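The two-step approach can be sketched as below. The names are hypothetical, and one assumption is made that the text does not state explicitly: a measure that fails the 50% test in step (2) is labelled QoL.

```python
def qol_or_hrqol(stated_concept, n_health_domains, n_domains):
    """Two-step choice between QoL and HRQoL.

    stated_concept: "QoL" or "HRQoL" if clearly defined by the developers, else None.
    n_health_domains: number of domains covering body functioning, disabilities
    and daily activities.
    """
    # Step 1: use the developers' stated aim when it is clearly defined.
    if stated_concept in ("QoL", "HRQoL"):
        return stated_concept
    if n_domains <= 0 or not 0 <= n_health_domains <= n_domains:
        raise ValueError("domain counts are inconsistent")
    # Step 2: HRQoL if more than 50% of domains are health-related.
    # Assumption: otherwise the measure is labelled QoL.
    return "HRQoL" if 2 * n_health_domains > n_domains else "QoL"


print(qol_or_hrqol("QoL", 5, 6))
print(qol_or_hrqol(None, 4, 6))
print(qol_or_hrqol(None, 3, 6))
```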

When step (1) in this pragmatic approach could not be applied (i.e. the study used conceptual labels without clearly defining them), we did not automatically apply the conceptual labels given to PROMs by the developer(s). Fayed et al. documented the discrepancies between the conceptual bases as labelled by developers and secondary applications of PROMs and those as classified by their review. For example, Fayed et al. noted that the HUI and the PedsQL were labelled as HRQoL/QoL measures in the primary and secondary literature but are more accurately labelled FDH measures [3]. In the current review, conceptual bases were similarly assigned according to extracted data rather than existing labels (unless clearly defined), but we did not document the discrepancies as was done by Fayed et al.

For measures that use informant responses, the informant type was classified according to the ISPOR guideline [2], namely proxy for QoL/HRQoL measures (since informants infer the child’s subjective perception) and observer for FDH measures (since no such inference is made). For hybrid QoL/HRQoL-FDH measures, the informant type was labelled hybrid proxy-observer.
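Because the informant type follows directly from the conceptual basis, the rule amounts to a lookup. The sketch below is illustrative; the label strings and function name are hypothetical conveniences rather than terms fixed by the ISPOR guideline.

```python
# Mapping from conceptual basis to informant type, as described in the text.
INFORMANT_TYPE = {
    "QoL": "proxy",
    "HRQoL": "proxy",
    "FDH": "observer",
    "hybrid QoL-FDH": "hybrid proxy-observer",
    "hybrid HRQoL-FDH": "hybrid proxy-observer",
}


def informant_type(basis):
    """Return the informant type for a measure that uses informant responses."""
    try:
        return INFORMANT_TYPE[basis]
    except KeyError:
        raise ValueError(f"unknown conceptual basis: {basis!r}") from None


print(informant_type("FDH"))
print(informant_type("hybrid HRQoL-FDH"))
```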

Following the ISPOR good research practices guideline [2], we explored differences in measurement characteristics for PROMs targeted at different age groups. The ISPOR guideline’s age cut-offs were used to categorise the measures by their target age group: less than 5 years, covering infants, toddlers or pre-schoolers; 5–7 years for younger pre-adolescents; 8–11 years for older pre-adolescents; and 12–18 years for adolescents. We then explored how the characteristics of the measures (including range of domains covered, respondent and informant type, and other design features) varied by conceptual basis and target age. In cases where target ages were not clearly specified, inferences were made from the study aim and development methods to assign the target age category.
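The age categorisation can be sketched as an overlap test between a measure's target age range and the four cut-off bands. The function name and labels are hypothetical, and the sketch assumes ages are given as whole completed years with inclusive bounds; a measure returning more than one band would fall under multi-age coverage.

```python
# Age bands from the ISPOR cut-offs, in completed years (inclusive bounds assumed).
AGE_GROUPS = [
    ("<5 years", 0, 4),
    ("5-7 years", 5, 7),
    ("8-11 years", 8, 11),
    ("12-18 years", 12, 18),
]


def target_age_groups(min_age, max_age):
    """Return the labels of every age band that the target range overlaps."""
    if min_age < 0 or max_age > 18 or min_age > max_age:
        raise ValueError("expected 0 <= min_age <= max_age <= 18")
    return [
        label
        for label, low, high in AGE_GROUPS
        if min_age <= high and max_age >= low
    ]


print(target_age_groups(0, 3))
print(target_age_groups(8, 18))
```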

The level of content validity of each PROM was assessed according to whether children of the target age were involved in (1) the development of the measure, including qualitative research for domain and item elicitation, and (2) assessment of comprehension using cognitive interviews and/or pilot/feasibility studies. Note that appraisals of other psychometric properties such as construct validity and internal consistency are not reported here but will be reported in forthcoming work. Cultural issues present in the initial instrument development (as opposed to subsequent translations and secondary cross-cultural adaptations of the developed instrument) were similarly described.

For the review of value sets for generic multidimensional childhood PROMs, data were analysed using narrative synthesis. Summary tables were used to describe the bibliography/setting, sampling methods, preference data collection methods and statistical features of the identified value sets. Separate summaries of the modelling methods are presented for two categories of preference elicitation techniques: (1) TTO, SG and rating scales (RS); and (2) DCEs and BWS.

4 Results

4.1 Search Results

Figure 1 shows the PRISMA flow diagram for the first search that identified primary studies developing generic multidimensional childhood PROMs published between 1 January 2012 and 16 October 2020.

Fig. 1 The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram for database searches for studies developing generic multidimensional childhood patient-reported outcome measures published between 1 January 2012 and 16 October 2020

Thirty-four eligible studies were identified that described 26 PROMs. Sixty-three PROMs included in the review by Janssens et al. [4] that met the inclusion criteria were included to give 89 PROMs in total. Fourteen PROMs were accompanied or designed to be accompanied by preference-based value sets. Figure 2 shows the PRISMA flow diagram for the identification of value sets for these 14 PROMs. Nineteen studies were identified in total that described 21 value sets.

Fig. 2 The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram for database searches for valuation studies for generic multidimensional childhood PROMs identified by search strategies. PROM patient-reported outcome measure, QALY quality-adjusted life year

4.2 Overview of Generic Multidimensional Childhood Patient-Reported Outcome Measures (PROMs)

Table 1 provides an overview of the characteristics of the identified generic multidimensional childhood PROMs that are not accompanied by or designed to be accompanied by preference-based value sets, while Table 2 does so for PROMs designed to be accompanied by preference-based value sets. Some measures have several versions, and Tables 1 and 2 together include all distinct versions, providing a total of 110 unique measures (for example, the PedsQL 4.0 Generic Core Scales are composed of age-specific modules for toddlers, young children, children and teens [32]). Of the 110 measures, 52 are FDH measures, 29 are QoL measures, 12 are HRQoL measures, nine are hybrid QoL-FDH measures, and eight are hybrid HRQoL-FDH measures. Seventeen measures (reflecting two versions each for HUI2 and HUI3 and two variants of the EQ-5D-Y) are designed to be accompanied by preference-based value sets. The measures were primarily developed in high-income countries (using the 2021 World Bank classifications [33]), with only seven (6%) developed in lower- or middle-income country (LMIC) settings or as part of an international development process [34,35,36,37,38,39,40].

Table 1 Summary of characteristics of included generic multidimensional childhood patient-reported outcome measures not designed to be accompanied by preference-based value setsa
Table 2 Summary of characteristics of included generic multidimensional childhood patient-reported outcome measures designed to be accompanied by preference-based value setsa

4.3 Characteristics of Measure by Age Category

Age groups 5–7 and 8–11 were combined for the subsequent descriptions of measure characteristics because relatively few measures targeted these age groups. This generated three rather than four age categories for the subsequent descriptions. Overall, there were 20 measures that targeted children less than 5 years, 29 that targeted children aged 5–11 years and 24 that targeted adolescents aged 12–18 years. Thirty-seven measures covered multiple age groups. Table C in the Supplementary Information (see the electronic supplementary material) organises the included measures by the above four target age group categories.

4.3.1 Conceptual Basis

Figure 3 shows the number of identified PROMs by age category according to their conceptual basis. The proportion of measures designed to elicit the child’s perception of their status (i.e. QoL, HRQoL and QoL/HRQoL-FDH measures) varied between four out of 20 (20%) for the infants, toddlers or pre-schoolers category and 18 out of 24 (75%) for adolescents. Twenty of 37 (54%) measures with multi-age group coverage were designed to elicit the child’s perception of their status.

Fig. 3 Number of identified generic multidimensional childhood PROMs by conceptual basis and age category. FDH functioning, disability and health, HRQoL health-related quality of life, PROM patient-reported outcome measure, QoL quality of life

4.3.2 Domain Coverage

Table 3 lists in alphabetical order the domains covered by generic multidimensional childhood PROMs stratified by age category and conceptual basis. Moreover, domains that are unique to each age category are presented in bold and underlined.

Table 3 Domains covered by included patient-reported outcome measures by age category and conceptual basis. Domains unique to age category are highlighted in bold and underlined

Twenty-five domains were unique to infants, toddlers or pre-schoolers, of which 19 were unique to FDH measures and six to HRQoL measures. There were ten unique domains for pre-adolescents, 25 for adolescents and 16 for the multi-age coverage category. The unique domains were concentrated in FDH measures (19 of 25; 76%) for infants, toddlers or pre-schoolers, and in QoL measures (16 of 25; 64%) for adolescents.

4.3.3 Respondent/Informant Type and Other Measure Characteristics

Table 4 summarises the respondent and informant types and design features of measures by age category.

Table 4 Frequencies of alternative characteristics of measures by target age category

Measures targeting younger age groups were more likely to rely on informant report. For example, of the 25 measures compatible with observer report only, 16 (64%) targeted children aged less than 5, and none targeted adolescents. Of the 44 measures designed primarily for child report, 18 (41%) targeted adolescents, and one (2%) targeted children aged less than 5.

Of 76 measures compatible with child self-report, 49 (65%) were designed for self-administration using non-electronic or electronic modes of data collection. Overall, there were 14 measures (13%) that used electronic data collection from children or informants: seven using computers at study locations [41,42,43,44,45,46,47], four using tablets or mobiles [48,49,50,51] and three via the internet from non-study locations [52, 53].

Shorter recall periods were used for younger age groups. Of 13 measures compatible with child self-report that had a recall period of current/today, seven (54%) targeted children aged 5–11, while two (15%) targeted adolescents.

The numbers of domains and items tended to be lower for measures targeting younger age groups. Of 42 measures compatible with child self-report that had more than five domains, 13 (31%) targeted adolescents and eight (19%) targeted children aged 5–11. Of 28 measures compatible with child self-report that had more than 30 items, ten (36%) targeted adolescents and six (21%) targeted children aged 5–11. Two measures used computerized adaptive tests [46, 47] that tailored the number of items according to respondents’ previous answers.

Most measures (102 of 110; 93%) used some form of ordinal scale with a pre-defined number of response points (e.g. Likert scale). Six measures used a visual analogue scale (VAS) alongside a Likert scale. Eleven measures used pictorial or narrative aids for child self-reports.

The vast majority of measures without value sets (78 of 93; 84%) used unweighted sums, averages or proportion of maximum scores as scoring methods. Four measures without value sets elicited importance scores from children, with items weighted to produce the overall QoL score [30, 49, 54, 55]. The KIDSCREEN and PROMIS groups of measures used item response theory scoring [44,45,46,47, 56,57,58,59].

4.4 Content Validity of Measures

Table D in the Supplementary Information (see the electronic supplementary material) describes the background concepts and development methods for measures, including whether and how children (or parents/primary caregivers in case of infants, toddlers or pre-schoolers) were involved in the development and pilot-testing processes. The development of 21 of the 110 measures (19%) involved qualitative research with or surveys of children for domain and item elicitation. Nine measures were based on adapting existing adult measures: one conducted focus groups with adolescents [37]; two cognitive interviews [6, 38]; one feasibility testing [60]; and five made little or no mention of children’s involvement [54, 61,62,63,64]. Twenty-three measures reported no involvement of children, relying on clinical and/or research expertise for domain/item elicitation, using statistical techniques for item selection (e.g. for short-form development), or provided little detail on development. Among measures with multi-age group coverage, nine of the 37 measures (24%) conducted qualitative research with children.

Seven measures addressed cultural issues in initial measure development. The ACHWM aimed to develop a culturally appropriate model of health and wellbeing for Aboriginal communities in Canada [48]. The QOLQA and QOLQA-Taiwan aimed to develop measures appropriate for Asian adolescents [35, 65]. The development of the IQI involved interviews with parents of infants aged 0–3 years from New Zealand, Singapore and the UK to assess cross-cultural interpretability [51]. Izutsu et al. adapted the existing English version of the WHOQOL-BREF for Bangladeshi adolescents and identified the need for interviewer administration due to low literacy rates in this context [36]. Finally, the KIDSCREEN measures were developed across 13 European countries specifically for cross-cultural relevance [57], while the development of the EQ-5D-Y similarly included seven countries [38].

4.5 Characteristics of Preference-Based Value Sets for Childhood PROMs

The characteristics of 21 preference-based value sets developed by 19 studies of ten generic multidimensional childhood PROMs are summarised in Table 5. The CH-6D, CHSCS-PS, EQ-5D-Y-5L and TANDI, which were designed as generic multidimensional childhood PROMs with the development of an accompanying value set as an objective, did not have value sets publicly available at the time of the review. The 21 value sets were predominantly developed in high-income countries, with the exception of those developed for Fiji and Tonga (AQoL-6D Adolescents) [37] and China (CHU9D) [66]. BWS and DCEs have been used more frequently in recent valuation studies, whereas TTO and SG techniques were largely used at least a decade ago, unless used for the specific purpose of anchoring the DCE/BWS to the QALY scale. The respondent types from which the value sets were derived differ across measures, ranging from school children and students (16D, AQoL-6D, CHU9D [Australia and China]) through to parents/caregivers (17D, HUI2, IQI), with general population adult samples forming the most prevalent source of values overall (AHUHM, CHU9D [UK and The Netherlands], EQ-5D-Y, HUI2, HUI3, QWB, IQI).

Table 5 An overview of country-specific value sets for generic multidimensional childhood PROMsa

The size of the samples from which the value sets were derived differed markedly between valuation studies, ranging from 115 (17D, parents of children [67]) to 4155 (EQ-5D-Y US value set, adults [68]). Some studies conducted complementary TTO experiments to assist with anchoring [66, 69,70,71] or used a particular type of DCE that includes a life year attribute (which is commonly referred to as DCETTO) [72] to assist with anchoring. Studies using BWS and DCE generally used larger sample sizes than more traditional approaches to health state valuation such as RS, TTO and SG (supplementary Figure S1, see the electronic supplementary material) [73]. Studies employing DCE and BWS approaches in relatively large samples also tended to be administered in self-complete online surveys. In contrast, TTO and SG were more likely to be administered in smaller samples and in an interview format due to their greater complexity and iterative nature [73].

Seven value sets were elicited from adolescents using RS for the 16D [18], TTO for AQoL-6D Adolescents (four value sets) [37] and BWS for CHU9D [66, 74]. One study elicited values from young adults to anchor the Ratcliffe 2015 value set for the CHU9D [74]. Amongst the remaining studies that sampled adult populations, five studies eliciting country-specific value sets for the EQ-5D-Y or HUI2 asked adult respondents to express their preferences from the perspective of a child aged 7 and 10 years [68] or 10 years [70, 71, 75, 76]. The HUI2 value sets [75, 76] added further context by specifically asking the adult to imagine the child living in the health state for the remainder of their life expectancy, with death at age 70. Two value sets were elicited from adults on behalf of the child [67, 77]. The remaining seven studies [72, 78,79,80,81,82,83] elicited health state preferences from adults who were asked to value health states from the perspective of their own health.

There were noticeable variations in the features of the value sets depending on the setting and valuation protocols adopted. For example, there are several value sets for the CHU9D, including the UK general adult population value set based on SG [79], the Dutch general adult population value set based on the DCETTO [72] and Australian [74] and Chinese [66] adolescent value sets based on BWS. Variation in the rank order of the most important domains is evident and illustrated in Table 5, with adolescents in China ranking the ability to join in activities and tired domains as most important, adolescents in Australia placing greater importance on mental health domains (e.g. sad, annoyed) and adults in both the UK and the Netherlands placing greater importance on physical health domains (e.g. pain and sleep). However, these differences may be due to differences in valuation protocol, sample characteristics and preferences between countries.

The range of health state values and the percentage of negative health state values (health states considered worse than being dead) differ across value sets within measures (Fig. 4 shows the value range by perspective, and supplementary Figure S2 shows the range by elicitation method). For example, the worst health state described by the CHU9D had a value of 0.337 in the UK general population value set, in contrast to the Australian [69], Chinese [66] and The Netherlands [72] value sets, where the worst health state had values of −0.1059, 0.0563 and −0.568, respectively. The UK general population and Chinese adolescent CHU9D value sets have no health states valued at worse than being dead, whereas the Australian adolescent and The Netherlands general population value sets have 0.19% and 5.17% of health states, respectively, valued at worse than being dead.
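The two summary statistics compared above (value range and percentage of states worse than dead) can be illustrated with a minimal Python sketch. It assumes a simple additive value set in which each dimension level carries a fixed decrement from full health. The dimension names, the number of levels and every decrement are hypothetical and are not taken from any published value set.

```python
from itertools import product

# Hypothetical decrements: 3 dimensions, 3 levels each (index 0 = no problems).
# All numbers are invented for illustration only.
DECREMENTS = {
    "pain": [0.0, 0.15, 0.45],
    "mobility": [0.0, 0.10, 0.35],
    "sad": [0.0, 0.12, 0.30],
}


def state_value(decrements, levels):
    """Additive model: full health (1.0) minus the decrement for each dimension's level."""
    return 1.0 - sum(decrements[dim][lvl] for dim, lvl in zip(decrements, levels))


def summarise(decrements):
    """Enumerate every health state and report the value range and % of negative values."""
    level_ranges = [range(len(levels)) for levels in decrements.values()]
    values = [state_value(decrements, combo) for combo in product(*level_ranges)]
    n_negative = sum(v < 0 for v in values)
    return {
        "n_states": len(values),
        "min": min(values),
        "max": max(values),
        "pct_worse_than_dead": 100.0 * n_negative / len(values),
    }


summary = summarise(DECREMENTS)
```

Under these invented decrements there are 27 states, and only the worst state falls below zero. Real descriptive systems are much larger, but the same enumeration applies.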

Fig. 4 Preference-based index value range by perspective

Similarly, when drawing comparisons between different instruments, marked differences are evident in the range of values for value sets and in the proportions of health states that are valued at worse than being dead, with most studies reporting no negative health state values [18, 37, 66, 67, 71, 77, 79, 83]. However, some studies report relatively high proportions of health states valued at worse than being dead: e.g. 21% for the Slovenian EQ-5D-Y value set [70], whilst the publicly available Canadian algorithm for the HUI3 suggests that as many as 78% of health states are valued as worse than being dead [80]. The latter’s value range and proportion of negative values are noticeably greater than those of all other value sets. A potential reason for the high proportion could be the multiplicative function used for HUI3 valuation tasks as compared to the more common additive functional form used by other measures (supplementary Figs. S3a and S3b). More broadly, differences in modelling approaches can be as important as differences in the stated-preference methods used in determining the overall characteristics of value sets. Supplementary Tables E and F contain further detail on the valuation tasks and modelling methods for studies using two classes of methods for value set development (RS, TTO and SG [Table E, see the electronic supplementary material]; and DCE and BWS [Table F, see the electronic supplementary material]).
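The contrast between additive and multiplicative functional forms can be sketched in a few lines of Python. This is an illustration of the two general shapes only. The function names, the `scale` parameter and all input numbers are assumptions made here, and the sketch does not reproduce the HUI3 or any other published scoring algorithm.

```python
from math import prod


def additive_value(decrements):
    """Additive form: subtract each dimension's decrement from full health (1.0)."""
    return 1.0 - sum(decrements)


def multiplicative_value(multipliers, scale):
    """Multiplicative form: scale * product of per-dimension multipliers - (scale - 1).

    Each multiplier lies in (0, 1], with 1.0 meaning no problems on that dimension.
    With scale > 1, the value is negative whenever the product of the multipliers
    falls below (scale - 1) / scale.
    """
    return scale * prod(multipliers) - (scale - 1.0)


# Hypothetical inputs, for illustration only.
full_health = multiplicative_value([1.0, 1.0, 1.0], scale=1.4)   # approximately 1.0
mild_additive = additive_value([0.1, 0.2])                       # approximately 0.7
```

With the assumed `scale` of 1.4, any state whose multipliers have a product below roughly 0.286 is valued below zero. Whether a multiplicative form yields more negative values than an additive one therefore depends entirely on the estimated parameters.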

5 Discussion

This systematic review has generated an up-to-date and comprehensive catalogue of generic multidimensional childhood PROMs and the value sets that accompany them. The outputs can be used to inform the development, selection and interpretation of measures and value sets across a range of research contexts, including research designed to inform patient-centred care and research designed to inform cost-effectiveness-based decision-making. Specifically, the description provided of PROM characteristics such as target age range and domain coverage, alongside evidence on their psychometric performance explored in a forthcoming systematic review, should aid decisions around measure selection and, where appropriate measures are currently not available, measure development. Likewise, the granular descriptions and categorisations of measures and value sets should aid interpretation of the outputs produced by measures across a range of methodological characteristics.

The review is the first to comprehensively catalogue generic multidimensional childhood PROMs according to the conceptual basis using definitions identified and outlined by the author team. It hence builds on the previous work by Fayed et al., who highlighted ways in which the terms health status, QoL and HRQoL are often used interchangeably in the literature despite important conceptual differences [3]. This is similarly noted by Karimi and Brazier [9], although their suggestion that the term HRQoL be reserved only to describe health utilities does not distinguish between measurement (individual perception incorporated into the descriptive systems of HRQoL measures) and valuation (societal preferences used to derive value sets) considerations. The latter may be appropriate for informing societal resource allocation decisions, while the former is more suited to informing clinical decisions for individuals [84], although the dichotomy is not absolute. Therefore, appropriate selection of childhood PROMs for research and decision making necessitates careful consideration of the context within which they are to be used.

The ISPOR guidelines for application of childhood PROMs recommend end users consider the applicability of characteristics of measures across children of different ages [2]. For example, a key consideration is the health/QoL domain coverage of PROMs targeted at particular age groups. This review describes the methods adopted to elicit relevant domains and items from children of different ages or their parents/caregivers and updates the content validity assessment conducted by Janssens et al. [85]. Moreover, our review catalogues the resulting domains by conceptual basis and target age group, highlighting those that are unique to each category and aiding instrument selection by the end user’s target childhood population and domains of interest. Previous reviews have listed domain coverage, but only for individual instruments and not by conceptual and age categories [4, 10, 22, 25, 26]. The review identifies unique domains across developmental stages from infancy to adolescence, with a focus on observable FDH aspects (e.g. responsiveness, appetite) during the first 5 years of life replaced by increased focus on self-perception of activities and immediate relations during pre-adolescence (e.g. activity impairment, relation with siblings), followed by heightened awareness of personal independence and wider relations during adolescence (e.g. personal competence, opposite-sex relationships).

The emphasis on age-specific content should be balanced by the ability to generate outcomes that are comparable across childhood age groups and between children and adults; the latter is particularly important for research informing life-course healthcare decision-making [86]. The development of early childhood (age 1–5 years) versions of PROMIS sought to balance the two priorities by eliciting relevant domains and items from parents while minimising changes to the recall period and question wording of adult PROMIS measures [87]. Likewise, we identified nine measures that adapted existing adult measures, three after conducting qualitative research or cognitive interviews with children [6, 37, 38]. Significantly, two of these measures that were adapted with childhood input—QoL-C for age 4–9 years [6] and EQ-5D-Y for age 8–15 years [38]—are based on the EQ-5D, the reference case outcome measure for cost-utility analyses in the UK [13].

Another key age-dependent characteristic is the feasibility of child self-report, which depends not only on chronological age but also on children’s cognitive capacity and their educational abilities (reading, comprehension and writing skills). Unsurprisingly, only two of 20 measures targeting children aged less than 5 years provide for self-report [88, 89], compared to 21 of 29 measures targeting pre-adolescents and all measures targeting adolescents. Where informant reports are used, this review distinguishes between proxy and observer report depending on the extent to which the informant is asked to infer a subjective perception over health/QoL items for the child. Hence, there is close overlap between the conceptual basis of the measure and informant type. This distinction is recommended by the ISPOR guideline [2], but is yet to be widely implemented. Janssens et al., for example, did not include the observer/proxy distinction in their discussion (p. 325) [4]. Moreover, based on empirical findings that child–informant agreement is poorer for subjective health/QoL constructs (e.g. emotion, pain) than observable ones (e.g. mobility, self-care) [90, 91], the ISPOR guideline recommends instruments involving observer report over those involving proxy report where informant report is found necessary (p. 470) [2]. Of the 61 identified measures compatible with informant report, 19 (31%) involved proxies; of the 33 measures compatible only with informant report (i.e. not designed for child self-report), nine (27%) involved proxies. This suggests that many existing measures do not adhere to the ISPOR recommendation, although more recently developed measures for infants and early childhood specifically cite this recommendation as the reason for including only observable domains [39, 87]. A key rationale for incorporating informant (specifically parent) report is that informants often have vital insights into the child’s health needs and medical history and hence this facilitates healthcare decision-making [2, 92]. Future research should examine to what extent this decision-making is impaired by limiting the informant focus to observable domains.

Incorporating age-appropriate instrument design is another ISPOR good research practice principle [2]. Electronic data collection is recommended as the preferred administration mode, particularly in pre-adolescent age groups where self-report is possible [2, 21, 93]. Accordingly, three measures specifically targeting pre-adolescents incorporated electronic data collection in self-reports [41, 42, 49] as did another seven with multi-age group coverage including pre-adolescents [43,44,45,46,47, 50, 53]. The ISPOR guideline notes the difficulties children face in comprehending extended periods and recommends a recall period of 24 h or less. A recent systematic review by Coombes et al. on the design of childhood PROMs in primary studies similarly recommends a recall period of 48 h or less for children aged 5–7 years [94]. These recommendations were applied (i.e. recall period of current/today or general) by half of the measures in our review that elicit pre-adolescent self-reports. The other half use recall periods ranging from the past week to the past 2 months. Concerning instrument length, relatively few instrument developers were concerned about the administrative burden of long instruments and used this as a rationale for development of short forms [53, 55, 58, 95]. In all, the number of instruments adhering to design recommendations is generally low.

For childhood PROMs accompanied by preference-based value sets, a key challenge lies in how to elicit societal preference weights. This review reveals an increasing trend towards using ordinal approaches (e.g. DCE and BWS) in health state valuation. These approaches (in particular BWS) have different cognitive demands to iterative approaches and may be more suitable for children to self-complete [96]. Given death is normally not included in the hypothetical choice tasks, this also aligns with ethical considerations for children. It enables preferences of children to be directly elicited and has been adopted by the studies that derived value sets for the CHU9D in Australia and China [66, 74]. However, given the raw values from these ordinal tasks are estimated on a latent utility scale, there is a need to have a standalone valuation survey using traditional direct approaches to health state valuation (e.g. TTO or SG) with (normally) a small convenience sample to facilitate the re-scaling exercise. This enables the value sets to produce cardinal health state values compatible with the 0–1 (dead to full health) QALY scale (for a comprehensive overview, see [73, 97]). For the most recently published EQ-5D-Y value sets, a similar approach was used, but instead of using BWS, a DCE was adopted in the main valuation task [70, 71].
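The re-scaling step described above can be illustrated with a minimal Python sketch. It assumes one simple anchoring strategy: a linear map that fixes full health at 1.0 and sets the worst health state to a value obtained from a separate TTO or SG exercise. The function name, parameter names and all numbers are hypothetical. Published valuation studies may use other anchoring or mapping approaches.

```python
def rescale_to_qaly(latent, latent_full, latent_worst, anchor_worst):
    """Linearly map a latent-scale value onto the QALY scale.

    Full health maps to 1.0 and the worst state maps to `anchor_worst`,
    an assumed cardinal value taken from a separate TTO/SG anchoring task.
    """
    if latent_full == latent_worst:
        raise ValueError("full-health and worst-state latent values must differ")
    # Share of the latent distance from full health to the worst state.
    share = (latent_full - latent) / (latent_full - latent_worst)
    return 1.0 - share * (1.0 - anchor_worst)


# Hypothetical latent values from an ordinal task (e.g. DCE or BWS).
value_mid = rescale_to_qaly(latent=-0.5, latent_full=2.0, latent_worst=-3.0,
                            anchor_worst=-0.1)
```

In this invented example the state lies halfway between full health and the worst state on the latent scale, so it is placed halfway between 1.0 and the assumed anchor of −0.1 on the QALY scale.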

This review identified that the majority of value sets (14/21 value sets) were derived using adult samples; in half of these, the adult was asked to take the perspective of a child, and half were from the adult’s own perspective. Decisions about the appropriate perspective to adopt when developing value sets for economic evaluation need to consider the preference of health technology assessment (HTA) agencies or decision makers in each country. There are currently notable differences in the approaches taken for each country. For example, CHU9D value sets developed in the UK and The Netherlands are both based on the perspective and preference of the general public (aged ≥ 18 years) [72, 79]. The SG was used for the UK value set [79], whilst a DCETTO was used for The Netherlands value set [72]. The DCETTO does not require a separate task or data manipulation for states considered worse than dead [12, 98]; however, this approach was developed for adults, and may not be suitable for children and adolescents given ethical concerns about the inclusion of death. Notably, this issue also applies to the TTO. It will be important for future value set generation that the methods align to decision maker preferences as results will differ depending on approach taken. This also means that caution is required when comparing values generated across country value sets. Comparability of utilities for children with utilities generated for adults will also need to be considered by decision makers. The differences in perspectives used and normative judgments around appropriate perspectives will need to be balanced with considerations of comparability.

The review has also identified a trend over the last decade towards DCE and BWS exercises involving relatively larger samples and a reliance on online surveys. Whilst there are considerable practical advantages with online sampling, care is needed to ensure the quality of responses obtained, with protocols such as those developed by the EuroQol Group playing a critical role in quality assurance [99]. Qualitative studies such as ‘think aloud’ also provide information to understand participant decisions in the absence of a face-to-face interaction [100].

Across all categories of PROMs, regardless of the construct (or concept) being measured or target age(s), a noticeable feature is the scarcity of measures developed in LMIC settings (though this could be partly related to the reviews’ sole inclusion of English language studies). Only five LMICs were covered by PROM development and only three by value set development. A previous systematic review of studies applying generic childhood PROMs accompanied by value sets found a similar scarcity, with only 8% of samples coming from LMICs [16] despite the substantially higher childhood disease burdens in these settings [101]. Further development of PROMs and value sets relevant to LMICs is a pressing research need, particularly when many characteristics of PROMs are contingent upon local contexts (e.g. low literacy rates hindering self-administration [36]) and when national HTA agencies demand value sets derived from samples relevant to local decision-making [13].

5.1 Strengths and Limitations

The contributions and strengths of this systematic review relative to previous reviews of childhood PROMs have been discussed under specific themes above (e.g. categorising PROMs by conceptual basis and target age group). There are nonetheless limitations warranting further research as well as caveats that should be borne in mind by readers. First, in contrast to previous reviews [21, 22, 24,25,26, 102], this review did not assess the psychometric properties of identified PROMs. Assessment of psychometric properties is recommended for PROM selection in research and decision-making [2]. In common with the work conducted by Janssens et al. [85], a separate systematic search and synthesis of the psychometric properties of the PROMs identified by this review is planned. This review provides partial evidence of the content validity of PROMs (Table D, see the electronic supplementary material), which will be combined with evidence of other aspects of content validity such as relevance and comprehensibility, as well as evidence for other psychometric properties [103]. Second, the research is constrained by a lack of consensus surrounding terms that describe the conceptual bases underpinning PROMs [9]. Hence, the methods used in this work to establish the categories of FDH, QoL and HRQoL measures [3], and the cut-off levels for hybrid categories, may be disputed. That said, a sensitivity analysis that changed the cut-off levels required for the proportion of items within measures that captured a child’s perception for them to be labelled QoL/HRQoL-FDH hybrid measures (from 25–75% to 33–67%) altered the label of only one measure (Kiddy-KINDL parent questionnaire [89]), and this had minimal impact on subsequent analyses.
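The cut-off sensitivity analysis mentioned above can be sketched as follows. The sketch assumes that a measure is labelled by the share of its items capturing the child’s perception: FDH below the lower cut-off, QoL/HRQoL above the upper cut-off and hybrid in between. The treatment of values exactly on a cut-off, the function names and the example shares are all assumptions made for illustration and are not data from the review.

```python
def classify(perception_share, lower=0.25, upper=0.75):
    """Label a measure by the share of its items that capture perception (0 to 1)."""
    if not 0.0 <= perception_share <= 1.0:
        raise ValueError("perception_share must lie between 0 and 1")
    if perception_share < lower:
        return "FDH"
    if perception_share > upper:
        return "QoL/HRQoL"
    return "hybrid"  # assumption: values on a cut-off count as hybrid


def relabelled(shares, base=(0.25, 0.75), alt=(0.33, 0.67)):
    """Return the measures whose label changes when the cut-offs are altered."""
    changed = {}
    for name, share in shares.items():
        before = classify(share, *base)
        after = classify(share, *alt)
        if before != after:
            changed[name] = (before, after)
    return changed


# Hypothetical item shares, invented for illustration only.
EXAMPLE_SHARES = {"measure_a": 0.10, "measure_b": 0.30,
                  "measure_c": 0.50, "measure_d": 0.90}
changes = relabelled(EXAMPLE_SHARES)
```

With these invented shares, only `measure_b` changes label (from hybrid to FDH) when the cut-offs move from 25–75% to 33–67%.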

Third, this review covered generic multidimensional childhood PROMs only and did not cover multidimensional childhood PROMs developed for specific conditions. Future research could usefully generate a parallel catalogue of condition-specific multidimensional childhood PROMs to inform research in specific clinical areas [104]. Fourth, the delineation of age and age categories throughout the paper is framed in chronological terms. Consideration of developmental age or markers of reading or educational ability could inform the selection of multidimensional childhood PROMs in some contexts, but this would require analyses of individual-level data from studies that also report developmental or educational outcomes. A final caveat is that this review does not report overall scores on the methodological or reporting quality of included studies through, for example, the Checklist for Reporting Valuation Studies (CREATE) score for valuation studies [105]. This was because CREATE was not developed to encompass values for childhood PROMs and misses key additional elements of the methodological choices required in this context. For PROM development studies, the planned systematic review of psychometric properties of measures will apply the Consensus-based Standards for the selection of health status Measurement Instruments (COSMIN) checklist to assess their methodological quality [103].

6 Conclusion

The catalogue of generic multidimensional childhood PROMs generated by this systematic review creates a valuable resource for researchers, practitioners and policy makers in selecting the most appropriate measure(s) for application within childhood populations according to their conceptual basis, target age, design and needs of the end user. This information should be viewed in conjunction with evidence surrounding the psychometric performance of measures to be presented in the follow-up systematic review. The description of PROM characteristics should inform decisions about the need to develop measures with alternative features or targeted at particular age groups. Moreover, the identified value sets covering all childhood age groups can be used to inform cost-utility analyses of childhood interventions. However, many methodological questions remain regarding the methods to use in valuing childhood PROMs, and this is currently a very active area of research. Childhood PROMs and the value sets that accompany them have important roles in informing, respectively, individualised care and cost-effectiveness-based decision-making.