Introduction

Endocrine therapies are the mainstay of treatment in hormone receptor-positive (HR+), human epidermal growth factor receptor 2 (HER2)-negative metastatic breast cancer (mBC) except in life-threatening situations qualifying the patient to receive chemotherapy [1].

Clinical trials investigating therapies for mBC often use progression-free survival (PFS) as the primary endpoint [2], since patients with mBC have a relatively long median survival of around 3 years. Given the desire to rapidly translate promising new agents into clinical practice, endpoints are needed that can be measured in a timely manner. It is therefore currently debated whether endpoints based on disease progression, including PFS, time-to-progression (TTP), or time-to-treatment failure (TTF), are appropriate to demonstrate clinical benefit. These endpoints make study outcomes available early and can serve as sensitive parameters for the benefit of a study medication because they are not influenced by subsequent lines of therapy or cross-over [2, 3]. Further advantages are the widespread use and comparability of PFS and TTP: they are the most frequently used primary endpoints in phase III trials and are accepted worldwide for the approval of new drugs [4,5,6].

However, the prolongation of overall survival (OS) is one of the most important therapeutic goals [7]. OS is regarded as an unambiguous criterion, but it has certain disadvantages as a primary endpoint in the metastatic setting of breast cancer: the need for large numbers of patients, the long follow-up required until results become available, and the multiple subsequent therapies patients receive, which can confound OS. These limitations cause difficulties particularly in first-line studies [8, 9].

Health technology assessment (HTA) agencies worldwide generally accept PFS as an endpoint in clinical trials [10]. In contrast, the German Institute for Quality and Efficiency in Health Care (IQWiG) and the Federal Joint Committee (Gemeinsamer Bundesausschuss, G-BA) do not accept endpoints based on disease progression as patient-relevant outcomes within the benefit assessment of pharmaceuticals because they are measured by imaging techniques. Patient relevance of such endpoints might be accepted if progression were measured via symptoms experienced by the patient. This would, however, mean omitting the re-evaluation of metastases in the course of clinical trials, which physicians consider unethical and which does not comply with guideline recommendations [11]. Solutions that reconcile these differing requirements still have to be developed.

IQWiG has suggested methods for the validation of surrogate endpoints in the HTA context [12]. The aim of this study was to apply these methods in the indication of HR+, HER2-negative mBC in order to validate PFS as a surrogate endpoint for OS.

Materials and methods

Literature search

A systematic search of MEDLINE, EMBASE, and five EBM Reviews sources was conducted in September 2016 in accordance with PRISMA guidelines (Appendix A.1). The following keywords and associated subject headings were used: “breast cancer” and “metastatic” or “locally advanced” in combination with “fulvestrant” or “letrozole” or “tamoxifen” or “exemestane” or “anastrozole” (Online Appendices A.2–A.4). Inclusion criteria for trials are listed in Table 1.

Table 1 Inclusion criteria for trials in the systematic literature search

Randomized controlled trials (RCTs) were included if at least 80% of the study population met the inclusion criteria. Where information on HER2 status or HR status was missing, the proportion of patients meeting the inclusion criteria was extrapolated from epidemiological data. If HER2 status was unknown, 81.9% of HR+ patients were assumed to be HER2-negative [13]; for patients with both unknown HER2 status and unknown hormone receptor status, 64.5% were assumed to be HR+ and HER2-negative [13]. Trials with TTP or comparable endpoints were considered if the definition was identical to that of PFS (time from randomization to objective disease progression or death from any cause). Only studies reporting PFS according to Response Evaluation Criteria In Solid Tumors (RECIST) [14] were included to ensure standardized and comparable endpoint evaluation. Overall survival had to be reported in the studies and had to be defined as the time from the date of randomization to the date of death from any cause.
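The extrapolation rule can be illustrated with a short Python sketch. The function names and any patient counts passed to them are hypothetical; only the two proportions (81.9% and 64.5%) and the 80% threshold come from the text. The sketch assumes that patients with known eligible status count fully and that the two groups with missing status are weighted by the epidemiological proportions.

```python
# Epidemiological proportions stated in the text [13].
P_HER2_NEG_GIVEN_HR_POS = 0.819   # HR+ patients with unknown HER2 status
P_HR_POS_HER2_NEG = 0.645         # patients with unknown HR and HER2 status
INCLUSION_THRESHOLD = 0.80        # minimum eligible share of the study population


def estimated_eligible_share(n_known_eligible, n_hr_pos_her2_unknown,
                             n_both_unknown, n_total):
    """Estimate the share of HR+, HER2-negative patients in a trial.

    Hypothetical helper: known eligible patients count fully, patients
    with missing status are weighted by the assumed proportions.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    expected = (n_known_eligible
                + P_HER2_NEG_GIVEN_HR_POS * n_hr_pos_her2_unknown
                + P_HR_POS_HER2_NEG * n_both_unknown)
    return expected / n_total


def meets_inclusion_threshold(share):
    """True if the estimated eligible share reaches the 80% threshold."""
    return share >= INCLUSION_THRESHOLD
```

Under these assumptions, a trial in which every patient is HR+ with unknown HER2 status would have an estimated eligible share of 0.819 and pass the threshold, whereas a trial in which both statuses are unknown for every patient (share 0.645) would not.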

Two reviewers independently assessed citations to determine relevance to the research question. Included studies were cross-checked for relevance by physicians. If several publications for one study were available, data from the latest publication or publications reporting final data cuts were used. Data from included studies were extracted by one reviewer; another reviewer checked for consistency against the original source. Risk of bias on study level was assessed and summarized for all included individual studies (Online Appendix A.5).

Statistical methods

As part of a rapid report, the German IQWiG presented methods for the validation of surrogate endpoints and recommendations for correlation-based procedures [12]. Health technology assessments in Germany are based on these methods. The methods include the evaluation of the certainty of conclusion of study results and the correlation between the effect estimates of the surrogate endpoint (e.g., PFS) and the true outcome (e.g., OS) on trial level, where correlation is estimated by the sample Pearson correlation coefficient r. Requirements for a successful surrogate validation are a high correlation (lower confidence limit (LCL) of r > 0.85) and a high certainty of conclusion of the results of the included studies. If the correlation is low (upper confidence limit < 0.7), no statement on surrogate validation is possible. In all other cases, where the correlation is in the medium range and the validity of the surrogate endpoint is therefore unclear according to IQWiG methodology, IQWiG proposes to apply the concept of the surrogate threshold effect (STE) [15], which allows conclusions on true endpoints by means of surrogate endpoints. The STE is defined as the minimal treatment effect on the surrogate endpoint that explains a non-zero (i.e., significant) effect on the true endpoint. In this context, the STE represents the maximum value of the hazard ratio for PFS (HRPFS) that needs to be observed in a trial to allow the conclusion of a significant effect on OS.
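The three-way classification of the correlation can be sketched in Python as follows. The thresholds 0.85 and 0.7 are taken from the text; the use of the Fisher z-transformation for the confidence interval of r is an assumption, since the text does not state how the interval was obtained, and both function names are hypothetical.

```python
import math
from statistics import NormalDist


def pearson_ci(r, n_trials, level=0.95):
    """Confidence interval for Pearson r via the Fisher z-transformation.

    Assumption: the text does not specify the interval method.
    """
    if n_trials <= 3:
        raise ValueError("need more than 3 trials")
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n_trials - 3)
    q = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.tanh(z - q * se), math.tanh(z + q * se)


def classify_correlation(lcl, ucl):
    """Apply the thresholds quoted in the text (0.85 and 0.7)."""
    if lcl > 0.85:
        return "high"    # validation possible, given high certainty of results
    if ucl < 0.7:
        return "low"     # no statement on surrogate validity possible
    return "medium"      # validity unclear; the STE concept applies
```

For r = 0.72 across 16 trials, this sketch yields an interval of approximately 0.35 to 0.90, which falls into the medium category.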

First, we tested the correlation between both outcomes (H0: r = 0 vs. H1: r ≠ 0). Second, if the correlation was medium, we fitted a mixed-effects meta-regression model to the data with HRPFS as moderator and the hazard ratio of OS (HROS) as outcome variable, weighted by the standard error (SE) of OS, using the restricted maximum likelihood (REML) estimator for the amount of heterogeneity. Since the SE is usually not reported, we recalculated it from the 95% confidence interval (CI) of the hazard ratio as (log(HR) − log(HRLCL))/z(0.975), where z(0.975) is the 97.5th percentile of the standard normal distribution. Based on the regression fit, we calculated a prediction band for HROS at a significance level of α = 0.05. The meta-regression model and prediction values were implemented in R [16] using the functions rma.uni and predict.rma from the metafor package [17]. The STE resulted from the intersection of the upper prediction limit curve with the horizontal line HROS = 1 (zero effect).
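The SE recalculation and the search for the intersection point can be sketched in Python. This is not the R/metafor implementation used in the study: the regression coefficients and variances below are hypothetical placeholders, the model is assumed to be linear on the log hazard ratio scale, and the prediction limit is assumed to add the heterogeneity variance to the variance of the fitted mean.

```python
import math
from statistics import NormalDist

Z975 = NormalDist().inv_cdf(0.975)  # 97.5th percentile of the standard normal


def se_from_ci(hr, hr_lcl):
    """SE of log(HR) recovered from the lower 95% confidence limit."""
    return (math.log(hr) - math.log(hr_lcl)) / Z975


def upper_prediction_limit(log_hr_pfs, b0, b1, var_b0, var_b1, cov_b01,
                           tau2, z=Z975):
    """Upper prediction limit for log(HR_OS) at a given log(HR_PFS).

    Assumes a linear meta-regression on the log scale with residual
    heterogeneity tau2; all coefficients are supplied by the caller.
    """
    mean = b0 + b1 * log_hr_pfs
    var_mean = var_b0 + 2.0 * log_hr_pfs * cov_b01 + log_hr_pfs ** 2 * var_b1
    return mean + z * math.sqrt(var_mean + tau2)


def surrogate_threshold_effect(limit_fn, lo=0.05, hi=1.0, tol=1e-10):
    """Bisect for the HR_PFS at which the upper limit crosses HR_OS = 1."""
    if limit_fn(math.log(lo)) >= 0 or limit_fn(math.log(hi)) <= 0:
        return None  # no crossing in the bracket: STE cannot be calculated
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if limit_fn(math.log(mid)) < 0:
            lo = mid   # upper limit still below log(1): move right
        else:
            hi = mid
    return lo


# Hypothetical coefficients, NOT the fitted values from this study.
coefs = dict(b0=0.0, b1=0.5, var_b0=0.004, var_b1=0.02, cov_b01=0.005,
             tau2=0.009)
ste = surrogate_threshold_effect(lambda x: upper_prediction_limit(x, **coefs))
```

Returning None when the upper limit never drops below zero on the log scale mirrors the situation in which an STE cannot be calculated because the upper limit of HROS exceeds 1 for every value of HRPFS.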

In sensitivity analyses, we investigated whether the factors HER2 status (reported vs. not reported), line of treatment (only first-line vs. others), and therapy option (studies comparing combination therapy with monotherapy vs. studies comparing two monotherapies) accounted for heterogeneity.

Results

Systematic literature search

The search identified 9071 citations from MEDLINE®, EMBASE, and EBM Review databases. We included 16 studies (26 full texts) for analysis (Fig. 1).

Fig. 1 Flow diagram of study selection process. N Number of patients

Characteristics of the included trials are summarized in Table 2. The 16 trials included 5324 patients in total. In ten trials, HER2 status was reported for the entire study population. Six trials were included in the analysis because at least 80% of the study population met the inclusion criteria according to the extrapolation based on epidemiological data (see Materials and methods). Six trials (2875 patients) evaluated treatments exclusively in the first-line setting for locally advanced or metastatic disease, and ten trials (2449 patients) included pretreated patients or patients in various lines of treatment. Almost all trials included postmenopausal women only; two trials included a small (2.9%) [18] or unknown [19] number of premenopausal women treated with GnRH agonists.

Table 2 Overview of trial characteristics

Twelve trials compared combination therapy with monotherapy, while four trials compared monotherapy with monotherapy. Combination treatments were add-ons to endocrine therapy and comprised various compound classes.

Endpoints were reported for the intention-to-treat population (seven trials), the full analysis set (three trials), the modified intention-to-treat population (two trials), or all randomized patients (three trials). For one trial, no information was given on the analysis population.

Statistical analysis

In the main analysis (pool of 16 identified trials), the correlation between the hazard ratios of PFS and OS was statistically significant (r = 0.72, 95% CI 0.35–0.90, p = 0.0016), indicating a positive linear relationship between the surrogate endpoint and the patient-relevant endpoint. According to the definition in IQWiG’s rapid report, the correlation was only of medium size; the validity of the surrogate endpoint was therefore unclear, and an STE analysis was applied. The meta-regression showed low residual heterogeneity (τ² = 0.009, I² = 25%) and a significant moderator coefficient βPFS (p = 0.0206). The STE for HRPFS was 0.60 (Fig. 2). Thus, for trials that meet the above-mentioned inclusion criteria in this specific indication and whose upper confidence limit of HRPFS is below the STE, a significant effect on OS can be concluded by means of the surrogate endpoint PFS.
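The resulting decision rule is simple enough to state as a small Python sketch. The threshold 0.60 is the STE reported above; the function name and the example confidence limits are hypothetical and do not correspond to any of the included trials.

```python
STE_HR_PFS = 0.60  # surrogate threshold effect reported in the main analysis


def supports_os_conclusion(hr_pfs_ucl, ste=STE_HR_PFS):
    """True if the upper 95% confidence limit of HR_PFS lies below the STE."""
    return hr_pfs_ucl < ste


# Hypothetical upper confidence limits of HR_PFS for two made-up trials.
examples = {"trial_a": 0.55, "trial_b": 0.70}
decisions = {name: supports_os_conclusion(ucl) for name, ucl in examples.items()}
```

The rule compares the upper confidence limit rather than the point estimate with the STE, so a trial with a point estimate below 0.60 but a wider interval would not qualify.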

Fig. 2 Meta-regression showing the relationship between hazard ratios of PFS and OS. Circle sizes are scaled by the inverse of the standard error of HROS. Numbers in parentheses refer to the studies in Table 2. STE is defined as the maximum value of HRPFS for which HROS is still significant, i.e., upper confidence limit of HROS < 1. CI Confidence interval, HR hazard ratio, OS overall survival, PFS progression-free survival, r Pearson correlation coefficient, STE surrogate threshold effect

Sensitivity analyses were performed to check the robustness of the main analysis with respect to available information on HER2 status (sensitivity analysis 1), line of treatment (sensitivity analysis 2), or therapy option (sensitivity analysis 3) (Table 3). Because of the smaller sample sizes in the subpools, STE values deviate from the value in the main analysis, but the correlation in all subpools is positive and of at least medium magnitude, confirming the positive relationship between OS and PFS. In all subpools the STE is below 1 except in sensitivity analysis 2b (Table 3), where the STE cannot be calculated (upper confidence limit of HROS > 1 for any value of HRPFS). Hence, the meta-regression analyses in all specified subpools did not show heterogeneity regarding the observed factors and confirm the results of the main analysis.

Table 3 Overview of sensitivity analyses

Discussion

PFS is an accepted endpoint with a definition based on standardized criteria according to RECIST [14]. PFS is not influenced by subsequent therapies, its results are available in a timely manner, and fewer patients are needed than for OS. In addition, PFS results are widely accepted for the approval [4, 5] as well as the HTA evaluation of new drugs [10]. The exception is the German HTA bodies, which assume that patient relevance is not proven because PFS is evaluated by imaging and not by symptoms.

From a physician’s point of view, PFS has a high relevance for patients. In case of a progression, the patient’s therapy needs to be changed, which entails different adverse effects and requires new procedures and adjustments of schedules. A proven progression also has a significant impact on the psychological well-being and quality of life [35].

Additionally, prolonging OS and maintaining quality of life continue to be the focus of treatment in the metastatic situation of breast cancer [7]. To quickly transfer PFS results from trials of innovative therapies to clinical practice, it would be advantageous if a validation of progression-based endpoints as surrogate endpoints for OS were available, which was the aim of this study.

The methods used in this work have some limitations. It is possible that the pool of included studies does not contain all publicly available data because the search was limited to three literature databases and included no further sources. In addition, several aspects frequently led to the exclusion of studies. One reason was poor reporting, for example if data for only one of the required endpoints were published. Other reasons were a lack of information on HER2 status, leading to non-conformity with the defined patient population, and PFS/TTP assessment that was not performed according to RECIST criteria. Older studies in particular were often not in accordance with the inclusion criteria.

The sensitivity analyses show that the STE values vary strongly when only very small study subsets are considered. Nevertheless, the values are not so far apart that they would point in the opposite direction, i.e., STE > 1. Furthermore, the STE is sensitive to outlier observations when the number of studies in the model is low. The generation of the randomization sequence and the adequacy of allocation concealment were rarely reported in the individual studies. To what extent this affects the endpoints OS and PFS, and ultimately the STE, remains unclear.

According to IQWiG’s method description, the entire 95% CI of the PFS hazard ratio has to be below the STE in order to take into account the uncertainty with which both estimators are affected. Gillhaus et al. [36] described that this approach reduces the α error, but also considerably reduces the power of the STE concept. Statistical power could be increased by using a less stringent significance level α (e.g., 0.1 or 0.2 instead of 0.05) for the prediction band of HROS in the meta-regression model. However, such an inference can only be made if the hypothetical trial is conducted in patients with HR+, HER2-negative mBC treated with endocrine therapies alone or in combination with other targeted treatments. The model is not intended to predict the hazard ratio of OS or differences in median OS.
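How the choice of α affects the width of the prediction band can be illustrated with a brief Python sketch. It assumes that the band is built from two-sided standard normal quantiles, which the text does not state explicitly; the function name is hypothetical.

```python
from statistics import NormalDist


def band_multiplier(alpha):
    """Two-sided standard normal quantile used to scale the prediction band."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


# Multipliers for the alpha levels mentioned in the text.
multipliers = {alpha: round(band_multiplier(alpha), 3)
               for alpha in (0.05, 0.1, 0.2)}
```

Under this assumption the multiplier drops from about 1.96 at α = 0.05 to about 1.645 at α = 0.1 and 1.282 at α = 0.2. A narrower band lowers the upper prediction limit, so the intersection with HROS = 1 moves toward larger values of HRPFS.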

In general, OS results always need critical appraisal. Especially in mBC, an improvement in OS with a new therapy option is difficult to measure. Factors such as the heterogeneity of the disease, the complexity of therapy with integration of local therapies (surgery, radiotherapy), a wide range of systemic therapies, and long survival in the metastatic situation with numerous different sequential courses of therapy may have an impact on OS results. A model calculation has shown that the probability of demonstrating a significant OS benefit decreases to less than 30% for a post-progression survival (PPS) of more than 12 months [37]. However, survival of several years has been reached, especially in mBC. In addition, depending on the required statistical power, thousands of patients need to be recruited to identify a survival benefit. In the age of individualized therapy with numerous specific subgroups, such studies are hardly feasible. The authors of the model calculation also conclude that the interpretation of OS is only useful if the PPS is really short [37].

A further point to take into account is the clinical relevance of OS results. The STE calculated in this publication only allows conclusions to be drawn on OS in the above-mentioned settings and on the statistical significance of OS. However, it is not possible to predict the difference in median survival times and its clinical relevance. Therefore, it is possible that the final result for OS in a trial is statistically significant but might not be considered clinically relevant. For example, a difference of 3 months in median OS is clinically relevant in an indication with very short survival times like metastatic pancreatic carcinoma [38]. MBC has comparably long survival times of 2–3 years [39], and a difference of 3 months in median OS would normally not be considered clinically relevant. Even if a clinically meaningful difference in median OS was achieved, a proven prolongation of life with a simultaneous significant deterioration in quality of life is not always a desirable therapeutic goal [40].

In conclusion, despite minor methodological limitations, we were able to calculate an STE (0.60) that allows conclusions on OS to be drawn through the surrogate endpoint PFS in trials with HR+, HER2-negative mBC treated with endocrine therapies alone or in combination.

This means that for a hypothetical or future trial demonstrating an upper confidence limit of HRPFS < 0.60, a significant effect on OS can be concluded. However, only final OS results can confirm whether a clinically relevant difference in survival time is reached. Looking ahead, it will be desirable to relate the current results to ongoing clinical studies examining the addition of CDK 4/6 inhibitors to endocrine therapy, since most of them still lack statistically significant, mature OS data. As long as OS results are not available, conclusions may be drawn from PFS using the STE. For obtaining rapid results on a new drug, PFS remains a relevant endpoint with high clinical relevance.