Key Points for Decision Makers

The agreement of TTO scores between the face-to-face and web surveys was not good, although, according to the test-retest survey, the two modes had similar reliability. DCE and DCE with duration showed similarly good agreement (more than 70%), regardless of survey mode.

However, if the positions of the cards (health states) were the same in the DCE, the agreement between the two choices was approximately 90%. When the card positions were changed, the agreement was approximately 60%. This suggests that about 30% of respondents chose the “position” of the card rather than the health state it described.

1 Introduction

Utility is typically measured using a preference-based measure (PBM), such as the EQ-5D [1], HUI [2], and SF-6D [3, 4]. Many other PBMs have also been developed, including generic PBMs [5, 6], disease-specific PBMs [7, 8], PBMs for children and adolescents [9, 10], and PBMs for social care [11]. Patients are asked to respond to a PBM, and their responses are converted to utilities using a value set, so that general population preferences for health states can be calculated. Value sets are typically constructed using direct valuation methods such as standard gamble (SG), time trade-off (TTO), discrete choice experiment (DCE), and best-worst scaling (BWS).

Valuation surveys using TTO are typically conducted face-to-face. This survey method is expensive and time-consuming. In contrast, valuation surveys using DCE sometimes employ online surveys, with respondents sampled from a large online panel. Online surveys are cheaper than face-to-face surveys and accumulate responses quickly, but their responses are sometimes unreliable (e.g., people take shortcuts to complete tasks quickly) [12]. The advantages and disadvantages of face-to-face and online surveys have been discussed, but their respective characteristics remain unclear.

Previous studies have examined the test-retest reliability of direct valuation methods [13,14,15,16], and the characteristics of face-to-face and online surveys have been compared [17]. However, no study has combined both types of experiment. Therefore, we valued EQ-5D-5L (a self-assessed, health-related quality-of-life questionnaire) health states through face-to-face and online surveys using (a) composite TTO (cTTO), (b) DCE, and (c) DCE with duration. The survey was repeated two weeks after the first survey was completed, following a test-retest design. By comparing these data, we clarified the characteristics of face-to-face and online surveys and derived practical guidance.

2 Methods

2.1 EQ-5D-5L

The EQ-5D-5L is one of the most widely used PBMs worldwide and comprises five items: “mobility,” “self-care,” “usual activities,” “pain/discomfort,” and “anxiety/depression.” Each item is assessed at five levels (no problems, mild, moderate, severe, and extreme problems/unable to). Responses are converted to EQ-5D-5L index values using a predetermined algorithm based on the societal preferences of the general population. The instrument describes 5^5 = 3125 health states. The index value of the worst health state [55555] is −0.025 in Japan [18], which is the highest value compared with the scores in other countries. Health states with negative index values also exist.
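
As an illustration of how a five-digit state description is scored, the following is a minimal sketch assuming a simple additive decrement model; the decrements are hypothetical placeholders and do not reproduce the Japanese value set [18].

```python
# Minimal sketch: mapping an EQ-5D-5L state (e.g., "21345") to an index value
# with an additive decrement model. The decrements below are hypothetical
# placeholders and do NOT reproduce any published value set.

DIMENSIONS = ["MO", "SC", "UA", "PD", "AD"]  # mobility, self-care, usual activities,
                                             # pain/discomfort, anxiety/depression

# Hypothetical utility decrements for levels 2-5 of each dimension.
DECREMENTS = {dim: {2: 0.05, 3: 0.10, 4: 0.20, 5: 0.30} for dim in DIMENSIONS}

def index_value(state: str) -> float:
    """Return a hypothetical index value for a five-digit state such as '21345'."""
    assert len(state) == 5 and all(c in "12345" for c in state)
    value = 1.0  # full health [11111]
    for dim, level in zip(DIMENSIONS, (int(c) for c in state)):
        if level > 1:
            value -= DECREMENTS[dim][level]
    return value

print(index_value("11111"))  # 1.0
print(index_value("55555"))  # lowest value under these placeholder decrements
```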

2.2 Composite TTO, DCE, and DCE with Duration

We valued EQ-5D-5L health states through face-to-face and online surveys using (a) composite TTO (cTTO) [19], (b) DCE, and (c) DCE with duration [20]. In the TTO survey, participants always started with a conventional TTO task: a choice between living for 10 years in the health state described by the EQ-5D-5L and living x years in full health. If a participant considered the presented EQ-5D-5L state to be better than immediate death (i.e., x > 0), the value of x was varied until indifference was reached, and the value of the EQ-5D-5L state was given by x/10. If a participant considered immediate death to be better than living for 10 years in the EQ-5D-5L state (i.e., the state was valued below 0), a lead-time TTO [21, 22] was started, which allowed the estimation of negative values. In the lead-time TTO, a series of choices was offered between “y years of life in full health” and “10 years in full health followed by 10 years in the presented EQ-5D-5L state.” The value of y was varied until indifference was reached, and the value of the EQ-5D-5L state was given by (y − 10)/10.
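
To make the scoring rules concrete, the following is a minimal sketch of how an indifference point could be converted to a cTTO value under the 10-year conventional and 10-year lead-time designs described above; the function name and ranges are illustrative, not part of the survey software.

```python
def ctto_value(indifference_years: float, lead_time: bool) -> float:
    """Convert an indifference point (in years) to a cTTO value.

    Conventional TTO: indifference at x years in full health versus 10 years in
    the presented state gives x / 10 (values between 0 and 1).
    Lead-time TTO (used here for states considered worse than dead):
    indifference at y years in full health versus 10 years in full health
    followed by 10 years in the presented state gives (y - 10) / 10
    (values between -1 and 0).
    """
    assert 0 <= indifference_years <= 10, "this sketch assumes a 10-year horizon"
    if lead_time:
        return (indifference_years - 10) / 10
    return indifference_years / 10

print(ctto_value(7.5, lead_time=False))  # 0.75
print(ctto_value(4.0, lead_time=True))   # -0.6
```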

The DCE survey presented two health states (A and B) described by the EQ-5D-5L. For the DCE with duration, expected life years (1, 4, 7, or 10 years) were combined with the EQ-5D-5L description. Participants chose the option they preferred.

2.3 Experiment Design

For the TTO, all respondents were asked to value the same set of ten health states described by the EQ-5D-5L. The ten health states (shown in the tables in the Results section) were selected from those used in the EQ-5D-5L valuation survey, considering the distribution of misery scores (the sum of the levels across the five domains). The order of the ten health states was randomized. For the DCE and DCE with duration surveys, all respondents were asked about the same set of 15 health state pairs (shown in the tables in the Results section), also selected from those used in the EQ-5D-5L valuation survey. For each respondent, the order in which the health state pairs were presented was randomized. To avoid positioning effects, the positions of the two health states (left or right) were randomized (1:1) in both the first and second phases. The cTTO and the two DCE tasks were administered a second time, two weeks after the first administration, so that first and second responses could be compared within the same face-to-face or online mode.
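
As a sketch of the per-respondent randomization just described, assuming a simple 1:1 coin flip for left/right placement and an independent shuffle of task order; the pairs shown are illustrative, not the actual design.

```python
import random

def randomize_dce_tasks(pairs, rng):
    """Return the pairs in random order, each with a randomized left/right position."""
    tasks = []
    for state_a, state_b in pairs:
        if rng.random() < 0.5:           # ~1:1 randomization of card positions
            left, right = state_a, state_b
        else:
            left, right = state_b, state_a
        tasks.append((left, right))
    rng.shuffle(tasks)                   # randomize presentation order
    return tasks

rng = random.Random(2021)
example_pairs = [("21345", "31224"), ("11121", "12211"), ("55555", "44444")]
print(randomize_dce_tasks(example_pairs, rng))
```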

2.4 Face-to-Face Survey

Respondents (aged 20–69 years) were recruited through a panel owned by a research company based on non-random quota sampling by sex and age. The inclusion criteria were as follows: (1) age 20–69 years, (2) current Japanese residency, (3) ability to visit the survey room in Tokyo, (4) ability to provide informed consent, and (5) ability to complete tasks in Japanese. As it was challenging to recruit elderly people for this survey owing to the COVID-19 outbreak, respondents aged > 69 years were excluded. The survey was conducted between October and December 2021.

The target sample size was 334 each for TTO, DCE, and DCE with duration (a total of approximately 1000); thus, 334 respondents per method were expected to participate in both the first and second surveys. This target was not based on statistical considerations. Respondents were asked to visit a survey center in Tokyo, where computer-assisted personal interviewing (CAPI) was performed with an interviewer's support in a one-on-one setting lasting 30–60 minutes. Three training TTO tasks were completed before the actual TTO tasks, valuing “being in a wheelchair,” “a state much better than being in a wheelchair,” and “a state much worse than being in a wheelchair, so bad that one would prefer to die immediately.” Responses were collected automatically as electronic data.

The same individuals who responded to the first survey were asked to participate in the second survey, in which they evaluated the same health states using the same valuation method. The second-stage survey repeated a similar process two weeks after the first survey.

2.5 Online Survey

A self-administered online survey was also conducted. Respondents (aged 20–69 years) were recruited through a Japanese web panel based on quota sampling by sex and age; the age range was set to be consistent with the face-to-face survey. The target sample size was 334 each for TTO, DCE, and DCE with duration (the number of respondents who participated in both the first and second surveys), as in the face-to-face survey. Because we expected the dropout rate of the web survey to be much higher than that of the face-to-face survey, more respondents were recruited in the first stage to obtain 334 respondents who completed both stages. The sample size was not based on statistical considerations. In the TTO survey, the three training tasks also had to be completed. For the second-stage survey, an invitation email was sent asking the same people who responded to the first survey to participate; they evaluated the same health states using the same valuation method.

2.6 Statistical Analysis

The numbers and percentages of background factors were calculated, along with the number of respondents who participated in the first survey and the percentage of dropouts. The median total response time (with interquartile range [IQR]) needed to complete all TTO, DCE, or DCE with duration tasks was calculated. In the web survey, the distribution of total response time was skewed because some respondents interrupted the tasks; therefore, the median was preferred.

Mean TTO scores and the mean root square error (MRSE) for the ten health states were computed and compared between the face-to-face and web surveys and between the first and second surveys. The mean number of responses before the TTO tasks were completed was also calculated.
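
The text does not give an explicit formula for the MRSE; the sketch below, on made-up data, assumes a root-mean-square definition (the square root of the mean squared difference between first- and second-survey scores for a given health state), so it is illustrative rather than a reproduction of the analysis.

```python
import math

def mean_score(scores):
    return sum(scores) / len(scores)

def mrse(first, second):
    """Root-mean-square difference between paired first- and second-survey scores."""
    assert len(first) == len(second)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(first, second)) / len(first))

# Illustrative data: TTO scores of the same respondents for one health state.
first_survey  = [0.7, 0.5, 0.8, -0.2, 0.4]
second_survey = [0.6, 0.5, 0.9,  0.0, 0.3]
print(mean_score(first_survey), mean_score(second_survey))
print(mrse(first_survey, second_survey))
```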

As described above, the retest was performed two weeks after the first-stage survey. Reliability was evaluated by calculating the percentage of agreement and the intraclass correlation coefficient (ICC). To examine the relationship between the number of agreements, background factors, and response time, a negative binomial regression was performed, including background factors such as sex, age, employment, marital status, educational history, household income, and response time as covariates. Ethical approval was obtained from the National Institute of Public Health.
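
As an illustration of two of these analyses, the following is a minimal sketch on simulated data: a percentage-of-agreement calculation and a negative binomial regression via statsmodels. The variable names (agreements, age, female, response_time_min) are assumptions for illustration, not the study's actual coding, and the dispersion parameter is left at the statsmodels default.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "agreements": rng.integers(5, 16, size=n),      # agreements out of 15 DCE pairs
    "age": rng.integers(20, 70, size=n),
    "female": rng.integers(0, 2, size=n),
    "response_time_min": rng.uniform(2, 20, size=n),
})

# (a) Percentage of agreement for one illustrative choice task.
first_choice = rng.integers(0, 2, size=n)
second_choice = rng.integers(0, 2, size=n)
print(f"agreement: {100 * np.mean(first_choice == second_choice):.1f}%")

# (b) Negative binomial regression of the number of agreements on covariates.
X = sm.add_constant(df[["age", "female", "response_time_min"]])
model = sm.GLM(df["agreements"], X, family=sm.families.NegativeBinomial())
print(model.fit().summary())
```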

3 Results

The collected samples included 1047 respondents (349 for TTO, 348 for DCE, and 350 for DCE with duration) for the face-to-face survey and 1912 respondents (627 for TTO, 650 for DCE, and 635 for DCE with duration) for the web survey. The number of respondents who participated in the second survey was 1043 (349 for TTO, 347 for DCE, and 347 for DCE with duration) for the face-to-face survey and 1306 (433 for TTO, 432 for DCE, and 441 for DCE with duration) for the web survey. The percentage of respondents who completed both the first and second face-to-face surveys was 100% for TTO, 99.7% for DCE, and 99.1% for DCE with duration; the corresponding percentages for the web survey were 69.1% for TTO, 66.5% for DCE, and 69.4% for DCE with duration. Thus, the retention rate of the web survey was lower than that of the face-to-face survey.

The response time of the web survey was considerably shorter than that of the face-to-face survey, regardless of the valuation method. In the first phase, the median total response time for the 10 TTO questions was 20.9 minutes (IQR 17.8–23.6 minutes) in the face-to-face survey and 7.3 minutes (IQR 5.2–10.3 minutes) in the web survey. The corresponding times for the 15 DCE and DCE with duration tasks were 15.9 minutes (IQR 14.2–17.4 minutes) for the face-to-face DCE survey, 4.4 minutes (IQR 2.9–6.6 minutes) for the web DCE survey, 18.7 minutes (IQR 17.1–20.5 minutes) for the face-to-face DCE with duration survey, and 4.6 minutes (IQR 3.0–6.8 minutes) for the web DCE with duration survey. Respondents spent a similar amount of time on DCE and DCE with duration, independent of the expected-lifetime attribute.

3.1 Demographic Factors

Respondents’ background characteristics are presented in Table 1. Median household income ranged from Japanese yen (JPY) 5 million to JPY 7 million, slightly higher than the household income of all Japanese families (JPY 4.4 million in 2019). Many factors were comparable to those of the general population. According to the 2019 Labor Force Survey, full-time workers accounted for 31.6% of all workers and part-time workers for 13.7%. In total, 24.3% of Japanese individuals had graduated from a university or graduate school as of 2017, and 61.3% of Japanese people were married and 31.6% unmarried in 2015.

Table 1 Demographic factors

We also examined the background characteristics of respondents who dropped out. The distribution of background factors did not differ from that of the full sample. Dropout appeared to occur at random, suggesting that it did not substantially influence the results.

3.2 Comparison of TTO Between Face-to-Face and Web Survey

Figure 1 compares the TTO scores between the face-to-face and web surveys. The two sets of scores differed systematically, regardless of health state: TTO scores from the face-to-face survey were higher than those from the web survey. The average number of responses before completion of the TTO tasks was 5.7 for the face-to-face survey and 5.4 for the web survey; the difference was not statistically significant. The distribution of the number of responses is shown in Fig. 2. The frequencies of small numbers of responses (one or two) and of a large number (nine) were higher in the web survey than in the face-to-face survey. In the web survey, a TTO score of −1 was frequently observed, because a respondent who repeatedly selected the same option eventually reached a score of −1.

Fig. 1

Time trade-off (TTO) scores between face-to-face and web surveys at the first and the second surveys

Fig. 2

Number of responses before the completion of time trade-off (TTO) tasks

3.3 Comparison of TTO Between the First and Second Survey

Table 2 shows the percentage of agreement of the TTO scores between the first and second surveys. The percentages for both the face-to-face and web surveys were low, indicating poor reliability; the percentage for the web survey was slightly higher than that for the face-to-face survey. The MRSE between the first and second face-to-face surveys ranged from 0.11 (11211) to 0.36 (51152), and the MRSE of the web survey ranged from 0.34 (11211) to 0.38 (35332). In contrast to the percentage of agreement, the MRSE of the face-to-face survey was smaller than that of the web survey, which may imply that the face-to-face responses were more homogeneous. The ICC of the face-to-face survey was 0.37, indicating poor agreement, whereas that of the web survey was 0.59. Although individual-level agreement was not good, the mean TTO scores shown in Fig. 1 overlapped between the first and second surveys, regardless of survey mode. The agreement of individual TTO scores was poor; however, reliable mean scores were obtained.

Table 2 Percentage of agreement: TTO

3.4 Comparison of DCE and DCE with Duration Between Face-to-Face and Web Surveys and Between the First and Second Surveys

Table 3 shows the percentage of agreement of DCE and DCE with duration between the first and second surveys. The percentage of agreement was similar for DCE and DCE with duration, even though the expected-lifetime attribute was added to the health state in the DCE with duration survey. Both the face-to-face and web surveys showed more than 70% agreement, indicating that both were reliable. This percentage did not depend on the health state pair presented.

Table 3 Percentage of agreement: DCE and DCE with duration

Next, we considered positioning effects. Table 4 shows a 2 × 2 cross-tabulation of two factors: whether the positions of the cards (health states) changed between the first and second surveys, and whether the choices agreed between the two surveys. If the positions of the cards were the same, agreement between the two choices was approximately 90%, which was considered highly reliable. In contrast, when the positions of the cards were changed, the percentage of agreement was approximately 60%. This suggests that approximately 30% (90% minus 60%) of respondents tended to choose the “position” of the card rather than the “health state” described on it. The percentage of respondents who chose the left-side card was 51.4% in the face-to-face survey and 50.3% in the web survey, so the left or right position itself did not appear to influence respondents’ choices.

Table 4 Positioning effect of DCE and DCE with duration
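
The following is a minimal sketch, on made-up records, of how the 2 × 2 cross-tabulation behind Table 4 could be built; the column names position_changed and choice_agreed are illustrative assumptions, not the study's actual variables.

```python
import pandas as pd

# Each row is one DCE pair answered by one respondent in both surveys:
# did the card positions change, and did the choice agree between surveys?
records = pd.DataFrame({
    "position_changed": [False, False, True, True, False, True, False, True],
    "choice_agreed":    [True,  True,  False, True, True,  False, True, True],
})

# Row percentages: agreement rate within each positioning condition.
table = pd.crosstab(records["position_changed"], records["choice_agreed"],
                    normalize="index") * 100
print(table.round(1))
```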

3.5 Effect of Background Factor and Response Time for the Web Survey

Table 5 presents the relationships between the number of agreements, background factors, and response time. We could not identify meaningful factors that influenced the number of agreements. Response time did not significantly increase the number of agreements, except in the DCE survey. This suggests that a cut-off time for screening out unreliable responses could not be determined.

Table 5 Relationship between the number of agreements, background factors, and response time

4 Discussion

We compared responses between face-to-face and web surveys and between the first and second surveys for TTO, DCE, and DCE with duration. This relatively large survey examined the test-retest reliability of these valuation methods. A published paper comparing face-to-face and web surveys [17] showed that TTO scores from online surveys were higher than those from face-to-face surveys; this pattern differs from that observed in our study. However, similar to that survey, the results of the two survey modes differed significantly. Regarding the test-retest reliability of TTO scores, Robinson [13] reported ICCs ranging from 0.40 to 0.88 for TTO. In a report from Korea [14], the ICC of TTO scores was 0.827. An Indonesian survey [15] showed that the MRSE between test and retest ranged from 0.00 (state 24553) to 0.23 (state 55233). When the SF-6Dv2 valuation method was used [16], the ICC was 0.958 for TTO, with identical values accounting for 59.3% of observations. Compared with these studies, the ICC of our survey may be slightly smaller; however, this is influenced by the presented health states and the experimental design. Regarding DCE, one survey [15] showed that of 1582 DCE responses in the first test, 1147 (72.5%) were chosen identically in the second test. Another survey indicated that 76.4% of choices were identical for DCETTO. These results are comparable to our data.

When TTO is used for valuation surveys, attention must be paid to the discrepancy between face-to-face and web surveys. We could not determine the reason for this difference, but one possibility is that respondents in the web survey tried to finish the TTO tasks quickly or repeatedly chose the same option to save time, more so than in the face-to-face survey. We cannot determine which measurement is correct; however, if the face-to-face survey is taken as the standard, web surveys are not recommended because of systematic errors.

DCE and DCE with duration are occasionally used in valuation surveys, with responses collected via the internet. Our results support this practice. However, in the DCE and DCE with duration surveys, about 30% of respondents did not consider the health states on the cards but chose based only on the cards' position. Randomization of card positions may be needed to reduce this bias, regardless of survey mode. In both the TTO and DCE surveys, a cut-off time for controlling the quality of responses could not be determined.

One limitation of this survey is that approximately 30% of web survey respondents dropped out before the second survey and could not be re-recruited. The demographic factors were similar between respondents who dropped out and those who did not; however, it is possible that respondents who participated in both the first and second surveys were more motivated. In addition, only 10 health states were evaluated in the TTO survey and 15 health state pairs in the DCE and DCE with duration surveys. Other studies have evaluated many more health states in order to create value sets. Compared with these studies, the number of health states was small, although our health states covered a range of severities in terms of misery scores. The generalizability of our results may therefore be limited. Finally, the survey was conducted in Japan. We believe that this survey is reliable, but caution is needed when extrapolating our results to other countries, as response patterns may differ among countries.

5 Conclusion

In this survey, we clarified the characteristics of TTO, DCE, and DCE with duration in both face-to-face and web surveys. First, we found a discrepancy in TTO scores between the face-to-face and web surveys. The ICC was poor for the face-to-face survey and somewhat better for the web survey; however, the mean TTO scores were reliable. Both DCE and DCE with duration showed good agreement between the two surveys, so web-based DCE and DCE with duration surveys may be acceptable. However, positioning effects were not negligible. Because response time did not influence reliability, it was challenging to set a cut-off time.