Key Points

Efficiently allocating scarce resources for chronic and progressive diseases such as type 2 diabetes mellitus (T2DM) is challenged by limited time and resources and an unusual degree of decision-making uncertainty (e.g., clinical and economic implications that extend far beyond trial durations, patient heterogeneity, evolving practice patterns, and practice patterns that differ between trials and ordinary use).

To extrapolate trial data to longer decision-making time horizons, economic modeling is routinely used. While economic models of T2DM would ideally be user friendly, transparent, fast, and accurate (i.e., good external validity), the complexity of T2DM generally requires comprehensive models (including parallel sets of complications and sophisticated treatment-switching algorithms) to ensure good predictive accuracy. Established T2DM models are generally slow and relatively opaque, which imposes an additional demand on economic stakeholders for case-specific expertise to evaluate the suitability of manufacturer-submitted models and in some cases to run the models with tight deadlines.

To address a need that some economic stakeholders have for greater user friendliness and faster run times, the IHE Diabetes Cohort Model was constructed using the cohort rather than the micro-simulation approach. A well-known limitation of cohort modeling, however, is an inability to adequately model patient heterogeneity (at least not without a health state explosion) and a potential for biased cost-effectiveness estimates.

In exercises designed to evaluate the potential magnitude of bias of the IHE Diabetes Cohort Model, we compared results generated for a set of simulation scenarios with those of a micro-simulation model (Economic and Health Outcomes Model of T2DM), chosen because the structures are otherwise generally similar and because it was possible to harmonize the models even more to minimize between-model simulation differences. We found systematic differences in simulated costs and quality-adjusted life-years, but little evidence of systematic differences in the incremental costs and quality-adjusted life-years that underlie cost-effectiveness metrics or in incremental cost-effectiveness ratios and net monetary benefits themselves.

1 Introduction

Type 2 diabetes mellitus (T2DM) is a chronic and progressive disease hallmarked by hyperglycemia. Chronic hyperglycemia, together with common co-morbidities such as obesity, hypertension, and dyslipidemia, is associated with high risks for serious micro- and macrovascular complications and premature mortality [1, 2]. Currently, T2DM cannot be cured and treatment consists primarily of managing blood glucose and cardiovascular risk factors (e.g., blood pressure and serum lipids) to postpone or prevent the development of disease complications [3].

The economic burden of T2DM is substantial [4,5,6], and cost-effectively allocating scarce resources among competing uses is challenged not only by the limited time and resources available to economic stakeholders in general but also by an unusual degree of decision-making uncertainty (e.g., clinical and economic implications that extend far beyond trial durations, a large number of interdependent micro- and macrovascular complications with multiple treatment targets, patient heterogeneity, evolving practice patterns, and practice patterns that can differ widely between trials and ordinary use) [7].

The evidence used by economic stakeholders to make decisions is routinely generated using economic models that support extrapolation of trial data to time horizons sufficient to capture the full costs and benefits of intervention (often lifetime). A large number of economic models of T2DM are available [8]. Ideally, these models would be user friendly, transparent, fast, and accurate (i.e., good external validity). To obtain good predictive accuracy given the complexity of T2DM, however, these models must include a large set of interdependent micro- and macrovascular complications and sophisticated long-term treatment managers that challenge these goals.

Economic simulation models can generally be divided according to whether they represent the hypothetical patients as unique individuals (micro-simulation) or as average members of a representative cohort (cohort modeling) [9, 10]. Both approaches have well-known advantages and disadvantages. Micro-simulation models can accommodate patient heterogeneity and interdependent health states while maintaining a compact form because individual hypothetical patients can be assigned and carry with them a large amount of personal information, which enables simulation of personalized treatment pathways and event risks and realistic patient histories [9, 10]. The primary disadvantages of the micro-simulation approach in T2DM are lengthy model code (often in a high-level programming language rather than the more accessible Microsoft Excel® [Microsoft, Seattle, WA, USA]), computational intensiveness [9, 11], and an additional demand on the economic stakeholder for case-specific disease and programming expertise to evaluate the suitability of manufacturer-submitted models. Indeed, the code underlying most current models of T2DM is generally impenetrable to most non-programmers and run times (measured in hours and sometimes days) can be limiting. The Canadian Agency for Drugs and Technologies in Health (CADTH), for example, has announced pending updates to its Category 1 Requirements that include model run times for the base-case analysis and key scenario analyses of less than 1 business day and programming in Microsoft Excel® [12].

Cohort models can approximate a micro-simulation model if the disease is discretized into enough health states, but this risks “state explosion” and the paradoxical possibility that the model is less manageable and transparent than a corresponding micro-simulation model [10]; thus the micro-simulation approach has been widely used for T2DM [8, 13, 14]. Pragmatic cohort models can be constructed without a “state explosion”, however, even for complicated diseases, without a complete sacrifice of predictive accuracy. The IHE Diabetes Cohort Model (IHE-DCM) [15] was designed and constructed in Microsoft Excel® (Microsoft) with this in mind to address reasonable concern about a lack of transparency in micro-simulation models and has demonstrated external validity on par with micro-simulation models of T2DM [15]. The benefits include fewer parameters, faster run times, and convenient use of Microsoft Excel®, all of which can be appealing to stakeholders tasked with understanding (and potentially running) the models under time pressure [16]. The primary disadvantage of the cohort modeling approach is the potential for biased estimates of the incremental cost-effectiveness ratio (ICER), which arises when there is “uncaptured” patient heterogeneity that forces the cohort approach to simulate non-linear relationships with average patient characteristics [16]. To manage the large number of parallel health states, Visual Basic for Applications was used to program key model functions as macros, thus sacrificing some of the potential gains in transparency.
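
The source of this potential bias can be illustrated with a minimal Python sketch. The risk function, its coefficients, and the HbA1c distribution below are hypothetical and are not taken from IHE-DCM, ECHO-T2DM, or any published risk equation. The sketch only shows that, when risk is convex in a biomarker, evaluating risk once at the cohort mean gives a lower value than averaging the risks of heterogeneous individuals (Jensen's inequality).

```python
import math
import random


def annual_event_risk(hba1c: float) -> float:
    """Hypothetical convex risk curve (exponential in HbA1c); illustration only."""
    return 0.01 * math.exp(0.25 * (hba1c - 7.0))


random.seed(1)
# Hypothetical heterogeneous population: HbA1c (%) spread around a mean of 8.
patients = [random.gauss(8.0, 1.5) for _ in range(10_000)]
mean_hba1c = sum(patients) / len(patients)

# Cohort approach: a single "average patient" carries the mean biomarker value.
risk_at_mean = annual_event_risk(mean_hba1c)

# Micro-simulation approach: each patient carries an individual value.
mean_of_risks = sum(annual_event_risk(x) for x in patients) / len(patients)

print(f"risk at mean HbA1c   : {risk_at_mean:.4f}")
print(f"mean of patient risks: {mean_of_risks:.4f}")
```

Under these assumptions the cohort-style calculation understates average risk, which is the direction one would expect to translate into somewhat longer simulated survival in a cohort model.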

The IHE-DCM has been used to estimate long-term cost consequences of T2DM in Sweden [17], to estimate the cost-effectiveness of anti-hyperglycemic treatments [18,19,20,21,22], and to support HTA submissions in Sweden, Norway, and Canada [23,24,25]. Given the possibility that the cohort modeling approach produces biased estimates for complex diseases like T2DM, stakeholders can benefit from an empirical investigation of the likely magnitude and direction (i.e., the potential penalty to be traded against the other benefits). Indeed, CADTH conjectured that “there may be a significant degree of bias …” involved in a recent application using IHE-DCM, owing in part to the model design (including absence of patient variability and the non-linear relationship between biomarkers and outcomes) [25]. The Norwegian Medicines Agency had similar reservations about the cohort approach, though they concluded that IHE-DCM was appropriate given shorter run times and greater transparency [23].

2 Objective

The objective of this study was to inform decision makers by investigating the direction and magnitude of bias of IHE-DCM cost-effectiveness estimates attributable to using the cohort modeling approach.

3 Methods

We borrowed well-established cross-validation tools [11, 13, 26] to examine whether cost-effectiveness estimates generated by IHE-DCM are tangibly biased by comparing IHE-DCM results from a set of scenarios inspired by the 9th Mount Hood Diabetes Challenge with corresponding results produced by an otherwise similar micro-simulation model—the Economic and Health Outcomes Model of T2DM (ECHO-T2DM). Similar analyses have been performed previously for other diseases, including chronic obstructive pulmonary disease [27], human immunodeficiency virus [28], and hepatitis C [29]. While such an exercise cannot provide a definitive (and universal) answer to concerns about possible bias, and it does not address the academic discussion of how much accuracy is reasonable to swap for increased transparency [30], it can provide a careful examination of how two otherwise similar models respond to the same stimuli (both absolutely and incrementally) and thus inform stakeholders charged with interpreting evidence generated by IHE-DCM.

3.1 The Models

IHE-DCM uses the cohort approach to model the cost-effectiveness of competing treatment alternatives for representative hypothetical patients with T2DM [18,19,20,21,22]. It is constructed with Markov health states representing important microvascular complications (retinopathy, neuropathy, and nephropathy), macrovascular complications (myocardial infarction, ischemic heart disease, heart failure, and stroke), and death, updated in annual cycles. Microvascular event risks are sourced primarily from the National Institutes of Health model [31] and Bagust et al. [32]. Multiple sets of macrovascular and mortality event risks are supported in the model [33,34,35,36], of which the UK Prospective Diabetes Study Outcomes Model 2 equations [36] were used in this exercise. Treatment effects are applied as changes in biomarkers (applied during the first year of treatment) and biomarker evolution is simulated until the predefined time horizon is reached. Treatment algorithms allow for treatment intensification when glycemic goals are not met. Unit costs and quality-adjusted life-year (QALY) disutility weights are applied based on health outcomes. The simulation time horizon is user defined and probabilistic sensitivity analysis (PSA) is supported for treatment effects, risk coefficients, biomarker drifts, adverse event rates, unit costs, and QALYs. A more complete description can be found in the Electronic Supplementary Material (ESM). IHE-DCM performed in line with micro-simulation models in internal and external validation exercises covering 12 long-term clinical studies, though there was a tendency to overestimate the macrovascular outcomes [15]. Model validity has been described formally using the Assessment of the Validation Status of Health-Economic decision modeling tool [37] (see the ESM).
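
The general mechanics of a Markov cohort model updated in annual cycles can be sketched in a few lines of Python. The three states and all transition probabilities below are hypothetical and far simpler than the parallel complication sub-models described above; the sketch is not a representation of IHE-DCM itself.

```python
# Hypothetical three-state annual-cycle cohort model (illustration only).
STATES = ["no complication", "complication", "dead"]

# Row i holds the annual probabilities of moving from state i to each state.
TRANSITIONS = [
    [0.90, 0.07, 0.03],
    [0.00, 0.90, 0.10],
    [0.00, 0.00, 1.00],
]


def run_cohort(start, transitions, years):
    """Return the state distribution at baseline and after each annual cycle."""
    trace = [list(start)]
    for _ in range(years):
        current = trace[-1]
        nxt = [
            sum(current[i] * transitions[i][j] for i in range(len(current)))
            for j in range(len(current))
        ]
        trace.append(nxt)
    return trace


trace = run_cohort([1.0, 0.0, 0.0], TRANSITIONS, years=40)
for year in (0, 10, 40):
    shares = ", ".join(f"{s}: {p:.3f}" for s, p in zip(STATES, trace[year]))
    print(f"year {year}: {shares}")
```

Costs and QALYs would then be accumulated by weighting each year's state shares with unit costs and utility weights. Each additional patient characteristic that must be tracked multiplies the number of states, which is the "state explosion" problem in practice.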

ECHO-T2DM was chosen as the micro-simulation model because it has a similar (albeit not identical) structure (e.g., health states, biomarkers, risk predictions, as well as outcomes) and model features (e.g., treatment intensification following poor glycemic control), an ability to simulate common risk equations (both models support multiple sets), and flexibility. Furthermore, as both models were available to the study authors, the models could be modified to further improve standardization and reduce noise attributable to factors other than the modeling approach (something not possible when cross-validating against previously published results in the literature). ECHO-T2DM is validated [38, 39] and has participated in the 5th through 9th Mount Hood Diabetes Network Challenges [8, 13, 26]. A full description can be found in the ESM and tests of its validity are described using the Assessment of the Validation Status of Health-Economic decision modeling tool [37] (see the ESM).

The main differences in the models and the steps taken to harmonize them are presented in Table 1. Briefly, we harmonized the model structures used in this exercise by: (1) selecting the same sets of macrovascular and mortality risk prediction equations (UKPDS 82) [36], (2) simplifying the ECHO-T2DM insulin treatment algorithm to duplicate the simpler regimen supported by IHE-DCM, and (3) aligning diverse inputs such as microvascular risk elasticities with glycosylated hemoglobin (HbA1c) and systolic blood pressure and drifts of clinical biomarkers. However, the models simulate end-stage renal disease risk and estimated glomerular filtration rate (eGFR) progression differently, which could not be resolved directly, thus eGFR progression in IHE-DCM was loaded as closely as possible to ECHO-T2DM. Health states for kidney disease and foot ulcer also differed, which was handled by disabling the cost and QALY consequences for micro- and macroalbuminuria in IHE-DCM and for chronic kidney disease (CKD) stages as well as foot ulcer in ECHO-T2DM. Because these standardizations entail that the simulated versions of the models are somewhat artificial, a sensitivity analysis was performed using the models “as intended” (i.e., not harmonized).

Table 1 Key model differences (IHE Diabetes Cohort Model [IHE-DCM] vs Economic and Health Outcomes Model of T2DM [ECHO-T2DM] and method of standardization)

3.2 Reference Case

A set of simulation scenarios was designed with inspiration from the “Reference Case” simulation developed for the 9th Mount Hood Diabetes Challenge Network (convened in Dusseldorf, Germany in 2018) [40] and based loosely on the Action in Diabetes and Vascular Disease: Preterax and Diamicron Modified Release Controlled Evaluation (ADVANCE) trial [41]. The Mount Hood Diabetes Challenge Network Reference Case was chosen because it is well known in diabetes modeling circles. In a first step, the Reference Case was simulated exactly as per the Challenge instructions [42], which importantly extends the reach of this analysis by supporting comparison with the 11 other diabetes models that have uploaded results to the online Mount Hood Diabetes Network Registry [8] (because of the harmonization, the results reported here for ECHO-T2DM differ slightly from those online).

Baseline patient characteristics were sourced from the Challenge instructions and, as necessary, from ADVANCE trial publications (see Table 2). Quality-adjusted life-year disutility weights were sourced entirely from the Challenge instructions (see Table 1 of the ESM). The Mount Hood Challenge simulation consisted of a control arm compared with five hypothetical treatment profiles, the first four of which considered changes in individual biomarkers one at a time and the last of which included the combined set of biomarker changes. For this application, we simulated the combined set of biomarker changes (see Table 3). As per the Challenge instructions, male and female individuals were simulated separately (though baseline characteristics were otherwise identical), biomarkers were kept constant over time, and the simulation time horizon was 40 years. We supplemented the Reference Case by including a vector of unit costs reflecting the Canadian treatment setting (see Table 1 in the ESM), which enabled consideration of cost-effectiveness metrics. Fictional, but not unreasonable, annual costs were applied for the control and intervention arms (CAN$1000 vs CAN$2500). A probabilistic sensitivity analysis was used in the base case for both models, which is consistent both with micro-simulation modeling and with ordinary use of IHE-DCM (though it may differ from common practice with cohort modeling in general). Preliminary simulations found that cost-effectiveness metrics (ICER for IHE-DCM and net monetary benefits [NMB] for ECHO-T2DM, based on model functionalities) stabilized at or well before 500 cohorts (with 1000 individuals per cohort for ECHO-T2DM). Conservatively, 1000 cohorts (and 2000 individuals per cohort for ECHO-T2DM) were chosen (see Fig. 7 in the ESM).
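
One way such a stability check could be carried out is to track the running mean of a cost-effectiveness metric as PSA replicates accumulate. The Python sketch below is an assumption about the general approach rather than the procedure actually used: the tolerance rule in `stable_from` is hypothetical and the NMB replicates are randomly generated, not model output.

```python
import random


def running_means(values):
    """Cumulative mean after each additional PSA replicate."""
    means, total = [], 0.0
    for n, v in enumerate(values, start=1):
        total += v
        means.append(total / n)
    return means


def stable_from(values, rel_tol=0.02):
    """Smallest replicate count from which every running mean stays within
    rel_tol of the final mean (a hypothetical rule, for illustration)."""
    means = running_means(values)
    final = means[-1]
    n_stable = len(means)
    for n in range(len(means), 0, -1):
        if abs(means[n - 1] - final) > rel_tol * abs(final):
            break
        n_stable = n
    return n_stable


random.seed(7)
# Hypothetical NMB replicates (CAN$); the mean and spread are invented.
nmb_replicates = [random.gauss(30_000, 15_000) for _ in range(1000)]
print("running mean stable from replicate", stable_from(nmb_replicates))
```

A rule of this kind works for NMB, which is well defined in every replicate; running ICERs are usually computed as the ratio of running mean costs to running mean QALYs instead.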

Table 2 Baseline patient characteristics
Table 3 Treatment profiles

3.3 Expanded Reference Case

The restriction of homogeneous patients at baseline (and the absence of biomarker evolution and rescue medication) in the Reference Case artificially limits a key difference between cohort and micro-simulation modeling and limits generalizability of the exercise. Inspired by the Mount Hood Reference Case, we created a more realistic simulation scenario that captures patient heterogeneity, natural evolution of biomarkers, and treatment intensification. We also added biomarker treatment effects for HbA1c and eGFR to the control arm (see Table 3). Because cost-effectiveness is rarely estimated separately for male and female individuals in T2DM, the sexes were pooled. Treatment intensification starting with basal insulin and followed by a basal and bolus insulin regimen was applied when HbA1c was ≥ 8% (see Table 2 in the ESM). Note: these results are not comparable to those stored in the Mount Hood Diabetes Network Registry [8].

In addition to the base case, 18 additional scenarios were created and simulated to evaluate whether systematic differences between the models (and modeling approaches) could be identified and, if so, which model features drive them. The scenarios are presented in Table 4 and can broadly be sub-divided into tests of the treatment algorithm, the importance of PSA, economic parameters (i.e., costs of treatment and QALY disutility weights), different patient sub-groups (male vs. female individuals, early disease, and late disease), and differences in the CKD sub-model. Baseline patient characteristics for early and late disease are presented in Table 2. As these scenarios are each based on model harmonization to minimize between-model differences unrelated to the cohort vs micro-simulation approaches, we also simulated a less artificial scenario in which the models were simulated as intended.

Table 4 Simulation scenarios: expanded reference case

3.4 Analysis

We compared estimated model outcomes (including costs, QALYs, and ICERs and NMBs defined based on QALYs gained) under the maintained assumption that systematic differences can largely be attributed to the modeling approach (cohort vs micro-simulation) given our attempts to otherwise harmonize the models and input parameters. Numerical differences between models were calculated and assessed, for costs and QALYs at both the absolute and incremental levels. Mean differences were calculated across the base case and all scenarios in the Expanded Reference Case. Because harmonization was incomplete, however, some noise will inevitably enter, thus we assessed concordance statistically using three different methods (for the Reference Case, only visual assessment was performed):

  1. We plotted the mean and 95% confidence intervals for incremental costs, incremental QALYs, and NMB estimated for both models for the base case and for the 18 scenario analyses. The proportion of point estimates for each model falling within the 95% confidence interval produced by the other model was generated for each outcome (ICERs were excluded because more than 2.5% of replications produced negative values). Ninety percent was considered a threshold for concordance.

  2. At an anonymous reviewer’s suggestion, we conducted a formal hypothesis test for costs, QALYs, and NMB using the paired t test with a null hypothesis of concordance (significance level of 5%). We performed the test for ICERs as well because they are of considerable interest to decision makers, but one of the scenarios had to be omitted because it produced cost savings and QALY gains (i.e., a negative ICER). To ensure that violation of normality does not invalidate the results, the non-parametric Wilcoxon Signed Rank Test was also performed. Because the results of modeling different simulation scenarios are not akin to independent draws from a population (i.e., there is considerable dependence), this test is over-powered and thus too likely to reject the null hypothesis of concordance.

  3. At the same reviewer’s suggestion, we also performed a test loosely based on methods proposed by Corro Ramos and colleagues [43] in which we calculated the number of PSA iterations for each model for which the estimated NMB falls within the 95% confidence interval produced by the other model for the base case scenario (ICERs were excluded because 95% confidence intervals could not be generated). Note, the Corro Ramos et al. approach is designed to assess the validity of model estimates by comparing with clinical data rather than predictions from a different model.
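
The interval-based concordance checks in the list above (the first and third methods) reduce to counting how often one model's estimate lies inside the other model's 95% confidence interval. A minimal Python sketch follows; the function names and all numbers are hypothetical and serve only to show the calculation against the 90% threshold.

```python
def within_interval(value, interval):
    """True when value lies inside the closed interval (low, high)."""
    low, high = interval
    return low <= value <= high


def concordance_share(points_a, intervals_b):
    """Share of model A point estimates inside model B's 95% interval,
    matched scenario by scenario."""
    hits = sum(within_interval(p, ci) for p, ci in zip(points_a, intervals_b))
    return hits / len(points_a)


# Hypothetical incremental-cost results (CAN$) for four scenarios.
model_a_means = [3700, 4100, 2500, 6900]
model_b_intervals = [(1500, 8600), (2000, 9000), (2600, 7400), (3000, 11000)]

share = concordance_share(model_a_means, model_b_intervals)
print(f"{share:.0%} of model A estimates fall inside model B's intervals")
print("concordant at the 90% threshold:", share >= 0.90)
```

For the third method, the same function can be applied with `points_a` holding the NMB from each PSA iteration of one model and `intervals_b` repeating the single base-case interval of the other model.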

Because important differences can be masked when looking only at the aggregate level, we also compared cumulative event incidences in the Expanded Reference Case for IHE-DCM and ECHO-T2DM (95% confidence intervals are not generated by IHE-DCM). Specifically, the proportion of the 14 IHE-DCM-predicted cumulative event incidence rates in the base case falling within the 95% confidence intervals for the corresponding ECHO-T2DM micro-simulation estimates was calculated. Ninety percent was considered a threshold for concordance. Biomarker evolution curves were examined to ensure that the simulations were properly implemented.

4 Results

4.1 Comparison of Model Implementation

Run times differed substantially by model. On a personal computer with 16 GB of random access memory and an i7 processor, run times for the base case analysis were approximately 45 min for IHE-DCM and 30 h for ECHO-T2DM. For the scenario analysis without PSA (i.e., running only one cohort), run times were less than 1 min for IHE-DCM and between 2 and 3 min for ECHO-T2DM. In part because there are more parameters in micro-simulation though also because ECHO-T2DM has more model features, the analysts (authors AN and AL) noted that loading and double checking ECHO-T2DM took longer than IHE-DCM.

4.2 Reference Case

Key results for the Reference Case are presented in Tables 3 and 4 of the ESM for male and female individuals, respectively. Estimated life-years predicted by IHE-DCM were approximately 1 year longer for male individuals and 0.6 years longer for female individuals for both treatment arms than for ECHO-T2DM, which is consistent with the larger predicted QALYs and total costs. The between-model differences were smaller at the incremental level. Incremental predicted life-years were 0.61 and 0.47 years for IHE-DCM vs 0.71 and 0.55 years for ECHO-T2DM, for male and female individuals, respectively. The between-model differences in incremental predicted QALYs were smaller by about half. Incremental predicted total costs differed by CAN$294 for male and CAN$462 for female individuals, which yielded ICERs (per QALY gained) of CAN$29,309 for IHE-DCM vs CAN$27,654 for ECHO-T2DM for male individuals and CAN$38,680 for IHE-DCM vs CAN$37,109 for ECHO-T2DM for female individuals. At a willingness-to-pay threshold of CAN$50,000, NMBs (based on QALYs gained) were CAN$13,293 for IHE-DCM vs CAN$15,452 for ECHO-T2DM for male individuals and CAN$6199 for IHE-DCM vs CAN$7518 for ECHO-T2DM for female individuals. The cumulative incidences for micro- and macrovascular complications are presented in Figs. 3 and 4 of the ESM. With the exception of kidney complications, IHE-DCM predictions fell well within the 95% confidence intervals for the corresponding ECHO-T2DM estimates.
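
The reported ICERs and NMBs can be cross-checked against the reported between-model gap in incremental costs. Assuming that each ICER is the ratio of incremental cost to incremental QALYs and that NMB equals the willingness-to-pay threshold times incremental QALYs minus incremental cost, the two definitions can be solved for the implied increments. The Python sketch below does this with the figures quoted above; the helper function is illustrative and the implied values are subject to rounding in the reported numbers.

```python
WTP = 50_000  # willingness-to-pay threshold (CAN$ per QALY), as reported

# Reported (ICER, NMB) pairs in CAN$ for the Reference Case.
reported = {
    ("male", "IHE-DCM"): (29_309, 13_293),
    ("male", "ECHO-T2DM"): (27_654, 15_452),
    ("female", "IHE-DCM"): (38_680, 6_199),
    ("female", "ECHO-T2DM"): (37_109, 7_518),
}


def implied_increments(icer, nmb, wtp=WTP):
    """Solve ICER = dC / dQ and NMB = wtp * dQ - dC for (dQ, dC)."""
    d_qaly = nmb / (wtp - icer)
    d_cost = icer * d_qaly
    return d_qaly, d_cost


gaps = {}
for sex in ("male", "female"):
    _, cost_ihe = implied_increments(*reported[(sex, "IHE-DCM")])
    _, cost_echo = implied_increments(*reported[(sex, "ECHO-T2DM")])
    gaps[sex] = abs(cost_echo - cost_ihe)
    print(f"{sex}: implied between-model gap in incremental cost "
          f"= CAN${gaps[sex]:.0f}")
```

The implied gaps come out within a few dollars of the reported CAN$294 and CAN$462, which suggests that the reported metrics are internally consistent under these definitions.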

4.3 Expanded Reference Case

Key results for the Expanded Reference Case are presented in Table 5. Predicted absolute life-years, QALYs, and total costs were (as with the Reference Case) larger for IHE-DCM for both treatment arms. Incremental (between-arm) differences were again smaller, though the between-model gaps were larger than in the Reference Case (0.46 vs 0.60 life-years gained, 0.67 vs 0.72 QALYs gained, and net cost increases of CAN$3719 vs CAN$5098 for IHE-DCM and ECHO-T2DM, respectively). Uncertainty as indicated by 95% confidence intervals was similar for the two models for costs, but about twice as high for IHE-DCM for QALYs (with the difference largely attributable to hypoglycemia event rates). Estimated ICERs were CAN$5542 for IHE-DCM and CAN$7059 for ECHO-T2DM and NMBs were CAN$28,834 and CAN$31,009, respectively. While 95% confidence intervals could not be calculated for ICERs, the 95% confidence intervals for NMBs were also about twice as wide for IHE-DCM and the lower bound was below 0 (−CAN$5833).

Table 5 Detailed cost-effectiveness estimates for the Expanded Reference Case, by model

Estimated survival curves were visually similar, though slightly higher for IHE-DCM (see Fig. 1). Estimated 40-year cumulative incidence rates for IHE-DCM fell within the 95% confidence intervals for ECHO-T2DM predictions for each outcome, though IHE-DCM generated generally lower estimates than ECHO-T2DM (see Fig. 2). These cumulative incidences are also presented in a scatterplot in Fig. 6 of the ESM, with the values for IHE-DCM on the horizontal axis and for ECHO-T2DM on the vertical axis. Points along the 45-degree line indicate equality and the dotted lines plot the best-fitting linear regression lines.

Fig. 1

Forty-year intervention and comparator survival for Expanded Reference Case, by model. ECHO-T2DM Economic and Health Outcomes Model of T2DM, IHE-DCM IHE Diabetes Cohort Model

Fig. 2

Forty-year predicted cumulative incidence rates for Expanded Reference Case, by model. The whiskers represent the estimated 95% confidence interval of the cumulative incidence in the Economic and Health Outcomes Model of T2DM (ECHO-T2DM). BDR background diabetic retinopathy, CHF congestive heart failure, ESRD end-stage renal disease, GPR gross proteinuria (macroalbuminuria), IHD ischemic heart disease, IHE-DCM IHE Diabetes Cohort Model, LEA lower extremity amputation, MA microalbuminuria, ME macular edema, MI myocardial infarction, PDR proliferative diabetic retinopathy, PVD peripheral vascular disease

A cost-effectiveness plane is presented in Fig. 3, with each point representing incremental QALYs and costs for one of the 1000 cohort replicates for the two models (IHE-DCM in black and ECHO-T2DM in red). Though uncertainty is larger for IHE-DCM, the scatterplots largely coincide. Cost-effectiveness acceptability curves are largely similar as well (see Fig. 5 in the ESM). Both models predict a low probability of cost savings, but the predicted probabilities that the intervention is cost-effective are about 70% at a willingness-to-pay of CAN$10,000 per QALY gained, rising to 96% for IHE-DCM and 100% for ECHO-T2DM at a willingness-to-pay of CAN$50,000 per QALY gained. The modified Corro Ramos et al. test found that estimated NMB for IHE-DCM fell within the 95% confidence interval generated by ECHO-T2DM for 72% of the PSA iterations. For ECHO-T2DM, estimated NMB fell within the 95% confidence interval generated by IHE-DCM for 98% of the replications.
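
The probabilities read off a cost-effectiveness acceptability curve are the shares of PSA replicates with a positive NMB at each willingness-to-pay value. The Python sketch below shows the calculation on randomly generated replicates; the distributions are hypothetical and will not reproduce the percentages reported above.

```python
import random


def prob_cost_effective(d_costs, d_qalys, wtp):
    """Share of PSA replicates with positive net monetary benefit at wtp."""
    nmbs = [wtp * q - c for c, q in zip(d_costs, d_qalys)]
    return sum(nmb > 0 for nmb in nmbs) / len(nmbs)


random.seed(3)
# Hypothetical PSA replicates: incremental costs (CAN$) and incremental QALYs.
d_costs = [random.gauss(4_000, 2_500) for _ in range(1000)]
d_qalys = [random.gauss(0.7, 0.3) for _ in range(1000)]

for wtp in (0, 10_000, 50_000):
    share = prob_cost_effective(d_costs, d_qalys, wtp)
    print(f"WTP CAN${wtp:>6,}: P(cost-effective) = {share:.2f}")
```

At a willingness-to-pay of zero the NMB is simply the negative of incremental cost, so the first value printed is the probability of cost savings.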

Fig. 3

Cost-effectiveness plane for Expanded Reference Case, by model (one sample point per cohort replication). ECHO-T2DM Economic and Health Outcomes Model of T2DM, IHE-DCM IHE Diabetes Cohort Model, QALY quality-adjusted life-year

Similarities at the aggregate level may mask some differences at the granular level. For example, IHE-DCM simulated greater cost offsets for avoided stroke and ischemic heart disease events than ECHO-T2DM, but ECHO-T2DM predicted cost offsets for CKD while IHE-DCM predicted a modest cost increase. Simulated biomarker evolution curves diverged over time, especially for HbA1c and body mass index, where rescue insulin medication started at the same time for the whole cohort in IHE-DCM and induced stair-step patterns; the divergence arose largely because of differential survival in the heterogeneous ECHO-T2DM simulated population (see Fig. 8 of the ESM).

Results of the scenario analysis demonstrated that the two models responded in predictable (and mostly similar) ways to the parameter changes. IHE-DCM produced consistently greater life-years, QALYs, and absolute costs for both treatment arms than ECHO-T2DM (summary results are presented in Table 5 in the ESM) and IHE-DCM also generated consistently lower mean incremental costs, QALYs, and NMBs. Mean ICERs for the base case and the 18 scenarios (excluding one scenario for which intervention was dominant for both models) were CAN$10,299 for IHE-DCM and CAN$10,417 for ECHO-T2DM. IHE-DCM generated a lower ICER in ten of the 18 cases (with well-behaved ICERs).

Individually, the results of the scenarios were generally predictable and robust. Sub-group analysis was notable, for example, only because the early disease cohort was associated with a noticeable change in incremental costs (especially for IHE-DCM). This affected predicted ICERs in relative terms, though the effect was less for the NMB (CAN$41,300 for IHE-DCM vs CAN$43,411 for ECHO-T2DM). The results were most affected by assumptions about CKD, where structural differences could be least standardized. Keeping eGFR constant over time increased the ICERs for both models compared with the base case, with between-model differences driven largely by changes in incremental costs. Using the model “as intended” (rather than standardized) had limited impact on the results.

Mean and 95% confidence intervals (note, only for scenarios with PSA activated) for incremental costs, incremental QALYs, and the NMB are plotted in Fig. 4. Neither model had a mean value that fell outside of the 95% confidence interval for the other model in the base case or any of the 18 scenarios. Paired t tests uniformly rejected the null hypothesis of between-model equality of the absolute costs (p < 0.001) and QALYs (p < 0.001), incremental costs (p < 0.001) and QALYs (p < 0.001), and the NMB (p < 0.009). For the scenarios with well-behaved ICER estimates, however, the t test failed to reject between-model equality (p < 0.68).
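
For readers unfamiliar with the paired t test, the statistic is the mean of the scenario-by-scenario differences divided by its standard error. The Python sketch below computes it with the standard library only; the five paired values are hypothetical and are not the study results. As noted in the Methods, the scenarios are not independent draws, so a statistic of this kind overstates the evidence against concordance.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / se, n - 1


# Hypothetical incremental QALYs from two models for the same five scenarios.
model_a = [0.67, 0.55, 0.80, 0.62, 0.71]
model_b = [0.72, 0.61, 0.83, 0.70, 0.74]

t_stat, df = paired_t_statistic(model_a, model_b)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

With 4 degrees of freedom the two-sided 5% critical value is 2.776, so a small but consistent difference across scenarios, as in this invented example, is enough to reject equality.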

Fig. 4

Mean estimates using the IHE Diabetes Cohort Model (IHE-DCM) and Economic and Health Outcomes Model of T2DM (ECHO-T2DM) for Expanded Reference Case with 95% confidence interval for ECHO-T2DM, by scenario. a Incremental costs. b Incremental quality-adjusted life-years (QALYs). c Net monetary benefits associated with intervention (willingness to pay of CAN$50,000 per QALY). CKD chronic kidney disease, HbA1c glycosylated hemoglobin, PSA probabilistic sensitivity analysis

5 Discussion

Using well-established cross-validation tools [11] modified to allow structural standardization of the models, we examined whether IHE-DCM produces systematically biased estimates of cost-effectiveness related to the cohort approach. In a simple Reference Case performed to enable comparison with the results of 11 other models that participated in the 9th Mount Hood Diabetes Challenge, IHE-DCM produced consistently greater absolute survival, QALYs, and costs than ECHO-T2DM, which is consistent with the difference between modeling homogeneous patients and heterogeneous patients when event risks are non-linear (specifically convex) in key parameters [16]. Between-model differences were generally small at the incremental level (i.e., differences between the two comparator arms) used to construct cost-effectiveness metrics, however, and the ICER and NMB were also similar between models. As expected, IHE-DCM was considerably faster than ECHO-T2DM, with a run time of approximately 45 min compared with 30 h using ECHO-T2DM, an important aspect for many stakeholders under time constraints.

This same pattern was observed for the more realistic Expanded Reference Case and 18 scenario analyses, and both models responded to changes in model parameters similarly and predictably. This was supported statistically; incremental costs, incremental QALYs, and NMBs for each model fell uniformly within the 95% confidence interval generated by the other model. There was more uncertainty in the results of IHE-DCM, which was driven in large part by uncertainty in the parameter estimate for the hypoglycemia event rate (eliminating it roughly halved the confidence interval). The estimates of ECHO-T2DM fall within even half of the 95% confidence intervals generated by IHE-DCM. Estimates in the base case by IHE-DCM of the 40-year cumulative incidence of study outcomes, moreover, fell within the 95% confidence intervals generated by ECHO-T2DM. While the paired t tests did find statistically significant between-model differences in incremental costs, incremental QALYs, and the NMB for these 19 scenarios, the paired t test is grossly overpowered to reject the null hypothesis in this setting as the simulation scenarios (i.e., the sample draws) are not independent of each other. Interestingly, however, the paired t test failed to reject between-model equality for the ICER (p < 0.68) for the 18 scenarios for which both incremental costs and incremental QALYs were positive (producing a meaningful ICER). Further underscoring this absence of clear bias in cost-effectiveness estimates, there was no discernible pattern as to which model produced more favorable cost-effectiveness estimates, with each more favorable in roughly half of the scenarios.
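For readers less familiar with the two cost-effectiveness metrics compared here, the following sketch shows how they are conventionally computed from incremental costs and incremental QALYs. The function names and the input values are hypothetical; only the willingness-to-pay threshold of CAN$50,000 per QALY is taken from this study (Fig. 4).

```python
from typing import Optional

WTP_PER_QALY = 50_000.0  # CAN$ per QALY, the threshold used for the NMB in Fig. 4


def net_monetary_benefit(inc_cost: float, inc_qaly: float,
                         wtp: float = WTP_PER_QALY) -> float:
    """NMB = willingness to pay x incremental QALYs - incremental costs."""
    return wtp * inc_qaly - inc_cost


def icer(inc_cost: float, inc_qaly: float) -> Optional[float]:
    """Return incremental cost per QALY gained.

    The ratio is reported only when both increments are positive,
    mirroring the restriction applied to the ICER comparison in the text;
    otherwise None is returned.
    """
    if inc_cost > 0 and inc_qaly > 0:
        return inc_cost / inc_qaly
    return None


# Hypothetical scenario with positive incremental costs and QALYs.
nmb_a = net_monetary_benefit(inc_cost=5_000.0, inc_qaly=0.25)
icer_a = icer(inc_cost=5_000.0, inc_qaly=0.25)

# Hypothetical cost-saving scenario: the NMB is still defined, the ICER is not reported.
nmb_b = net_monetary_benefit(inc_cost=-1_000.0, inc_qaly=0.125)
icer_b = icer(inc_cost=-1_000.0, inc_qaly=0.125)

print(nmb_a, icer_a)
print(nmb_b, icer_b)
```

The second scenario illustrates why one of the 19 scenarios drops out of the ICER comparison while remaining in the NMB comparison: the NMB is defined for any combination of incremental costs and QALYs, whereas the ratio is readily interpretable only when both are positive.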

The trade-off between cohort modeling and micro-simulation is sometimes (perhaps mistakenly) cast as a choice between time and transparency vs accuracy. Both models satisfy International Society for Pharmacoeconomics and Outcomes Research recommendations for model transparency, which accept complexity and call instead for a technical report that describes the structure, components, equations, and computer code that would enable experts to reproduce the model (full technical transparency) and non-technical documentation that, at a minimum, describes the type of model and intended applications, funding sources, model structure, inputs and outputs, data sources, model validation, and model limitations [11]. While transparency in a general sense is hard to quantify, and no fit-for-purpose diabetes models are likely to achieve “transparency” in a general sense, analysts (authors AN and AL) generally considered that IHE-DCM was easier to grasp and work with (and is constructed with approximately 50% fewer lines of code).

This analysis has several strengths, including the use of two models that were relatively similar and required limited standardization. Many of the remaining differences could be standardized to minimize the extent to which results would be driven by model differences other than the unit of observation. The scenarios were inspired by the Mount Hood Reference Case, which permits comparison (at least of the Reference Case results) with 11 health economic models of diabetes that participated in the 9th Mount Hood Diabetes Challenge Network. Finally, a wide range of scenarios was considered that explored different aspects of the model to enhance generalizability.

The models could not be entirely standardized, however, and remaining differences must be considered when interpreting the results of this analysis (i.e., between-model differences may reflect more than just the potential bias related to cohort vs micro-simulation modeling). In particular, the main structural difference is the modeling of CKD, for which the models use different methods of simulating disease progression (transition probability vs biomarker driven), and this clearly affects the results. Indeed, for the cumulative incidence, the CKD outcomes (micro- and macroalbuminuria and end-stage renal disease) were clear outliers and the mean estimates for IHE-DCM were just within the 95% confidence interval of ECHO-T2DM. In addition, foot ulcer is included only in ECHO-T2DM. To mitigate the impact on the analysis, the associated costs and QALY weights were set to 0. The indirect impact on overall results was limited because foot ulcer affected only the risk of congestive heart failure (though patients simulated to develop congestive heart failure had in turn increased risks for ischemic heart disease, myocardial infarction, and mortality) and the simulated incidence of foot ulcer was low. Finally, while the scenarios were constructed to mimic a cost-effectiveness analysis, the simulated scenarios are purely hypothetical.

While this study cannot provide a definitive (and universal) answer to concerns about possible bias, and it does not address the academic discussion of how much accuracy is reasonable to swap for increased transparency [30], this exercise provides a careful examination of how two otherwise similar models respond to the same set of stimuli (both absolutely and incrementally), which can be valuable for stakeholders charged with interpreting evidence produced by IHE-DCM.

6 Conclusions

The IHE-DCM was faster to load and to run than the micro-simulation model used in this study (ECHO-T2DM) and the modeling details are likely to be more easily understood by external reviewers, which can be an advantage for economic stakeholders with limited time and resources. Despite systematic differences in absolute predicted survival, QALYs, and costs, estimated cost-effectiveness metrics were similar, suggesting that any bias related to the cohort approach is small in the outcomes that matter most. We believe that both models are suitable for use in cost-effectiveness evaluations for interventions in T2DM; the selection of one over the other should be made on the basis of stakeholder needs, resources, and preferences.