Introduction

Fractures are crucial symptoms of osteoporosis, which is usually a process without apparent clinical symptoms, being often called a “silent bone thief.” A typical osteoporotic fracture is a low-trauma fracture after a fall from a standing height. The most common osteoporotic fractures affect the hip, the spine, the forearm, and the arm, and are thus classified as major osteoporotic fractures. Every fracture causes serious health problems for affected patient, including pain, functional impairment, and the necessity of long-term medical care, while hip and spine fractures are especially serious in their effects. Fractures in the two locations may cause permanent functional deterioration and also terminate patient’s life; a considerable number of subjects die during the first months after hip/spine fracture [1]. One should also remember that a prior fracture increases the risk of subsequent fracture. The history of fragility fractures is an important risk factor for subsequent fractures [2]. Given that osteoporosis is a medical condition which is usually clinically undiagnosed for longer time periods, the ability to determine fracture risk is of paramount importance. Some prognostic tools have been developed, including FRAX [3] and Garvan calculators [4, 5]. These tools are available on the Internet and enable to establish fracture risk (Garvan) or fracture probability limited by life expectancy (FRAX). FRAX measures the probability of hip or major fractures for women and men over a decade, Garvan establishes the risk for hip and any fractures in women and men over 5 and 10 years, and estimates any fracture risk over 5- and 10-year periods only in women. The practical role of such estimations in everyday practice is to identify subjects at increased fracture risk and thus with a need of immediate therapy. Methods, designed to establish fracture risk, were presented and discussed in some reports [6,7,8]. In general, an early assessment of fracture risk is very important for therapy success and may also be helpful for early therapy initiation. An important point is how precisely the baseline fracture risk assessment will correspond to the actual fracture prevalence values during 5- or 10-year observation periods. In some studies, the problem of precise fracture risk assessment was raised and discussed [9,10,11]. Several studies compared the FRAX and Garvan tools [12,13,14,15,16,17,18,19,20,21,22] in various populations. In general, the studies, aimed to show the predictive potentials of various fracture risk estimation tools, may be useful for clinical practitioners.

A long-term observation seems to be optimal for the practical verification of the prognostic apparatus of the tools designed to estimate fracture risks. We present results of a longitudinal observation of the epidemiological female sample of the RAC-OST-POL Study, gathered in 2012 [23].

The aim of our study was to establish how precisely the FRAX and Garvan tools predicted hip and major fracture risk (FRAX) and any fracture risk (Garvan) over one decade.

Material and methods

The study group included a population-based postmenopausal sample from the RAC-OST-POL Study. The detailed clinical characteristics were presented earlier [23]. At baseline, 978 subjects were recruited. The “core” cohort of 625 women, included in the RAC-OST-POL study, was randomly selected from the local population of the Racibórz district. The total number of women at the district was 57 357 and the total population of the qualifiable women (> 55 years), inhabiting the region at the time of enrollment into the study was 17,500. Next, 1750 subjects were randomly selected and invited via regular e-mail exchange to participate in the study. A blind list of women, selected for the study, was provided by the local government and each woman was assigned a number without showing her name. A group of 625 women responded positively to the invitation and declared their intention to take part in the study. Together with the women invited by email, additional 353 female volunteers were also included. We considered that addition as beneficial and increasing the data volume, obtained from a larger population. Prior to that decision, we checked whether the method of recruitment (random or nonrandom) had depended on the prior fracture prevalence. The chi-squared test showed that fracture history did not differ between the groups (33% in the randomized population and 29% in the non-randomly included population, chi-squared value = 1.24, p = 0.26). Finally, all the women were combined into a larger group of 978 women as the process of inclusion was not associated with the baseline fracture prevalence. We also compared fracture probabilities obtained by the FRAX tool and the fracture risk from the Garvan calculator nomograms for any fracture risk and hip fracture risk, and the values did not differ in any significant way in regard to the randomized or nonrandomized subcohort.

Each year between 2011 and 2020, all the patients were enquired via phone about any fractures during the previous year. All the data were gathered by one investigator (WP). After the 10-year follow-up, 640 women remained under observation. The total drop-out was then 34.5%, while some subjects were lost to follow-up, due to the following reasons: the loss of contact, 25.6%; death, 7.5%; refusal to cooperate, 1%; and the lack of baseline DXA scans, 0.4%). Table 1 presents the baseline clinical characteristics for the subjects who completed the entire follow-up period. In order to assess whether the drop-out effect could significantly bias the results at the end of follow-up period, the most important clinical features of women, who did not complete the study, were additionally analyzed and compared with the final study group. The women, who were lost to follow-up, were older in comparison to the subjects included in the final analysis (67.4 ± 8.6 vs. 65.0 ± 6.95; p < 0.001, respectively). However, they did not differ significantly in terms of the crucial risk factors for fracture, e.g., FN T-score (− 1.34 ± 0.9 vs. − 1.24 ± 0.9; p = 0.08), the incidence of fractures before the enrolment into the study (27.2% vs. 30.4%; p = 0.30), the incidence of falls during the year preceding enrolment (35.7% vs. 32.4%; p = 0.30), and the frequency of steroid use (7.1% vs. 5.0%; p = 0.18).

Table 1 Clinical characteristics of the whole study group and subgroups with and without fractures during follow-up

At baseline, bone densitometry (DXA) was performed at the proximal femur, using a Lunar DPX (GE Healthcare, Madison, WI, USA) device. The coefficients of variation (CV%) for the femoral neck and for total hip were 1.6% and 0.82%, respectively. Based on the obtained DXA results and the gathered information about clinical risk factors, the fracture risk was established, using the FRAX calculator specific for the Polish female population and the Garvan tool. The DXA results and the prevalence of clinical risk factors, recorded at the baseline examination, that influenced the results of fracture risk calculation, is presented in Table 2.

Table 2 The DXA results [mean ± SD] and the prevalence [n (%)] of clinical risk factors recorded at baseline examination that influence the results of fracture risk calculation in the whole study group and subgroups with and without fracture during follow-up

During the period of observation, 190 osteoporotic fractures occurred in 129 subjects. Major osteoporotic fractures (at hip, arm, forearm, or spine location) were noted in 97 women. The fractures occurred in the following skeletal sites: forearm, 81; spine, 30; ankle, 25; hip, 15; arm, 13; rib, 9; foot (except of digits), 7; clavicula, 7; and pelvis, 3. Single fractures were recorded in 91 women, whereas 24 women reported 2 fractures, 7 women, 3; 5 women, 4; and 2 women reported 5 fractures.

Statistics

All the statistical calculations were performed with the use of the Statistica 13.3 software (StatSoft, Tulsa, OK, USA) and of the PQStat v.1.8.2.238 (PQStat Software, Plewiska, Poland; https://pqstat.pl). The mean values and standard deviations were used for the descriptive statistics of continuous variables. The prevalence of qualitative features was presented as the number of subjects with percentage values. The values of sensitivity, specificity, and balanced accuracy (BAcc) were used to compare the predictive powers of the compared fracture risk assessment tools. BAcc was chosen instead of accuracy (Acc), as the analyzed dataset was characterized by a high degree of imbalance [24]. Patients with any fractures accounted for 20.16% of all analyzed subjects, with major fractures 15.16%, and with hip fractures only 2.19%. BAcc, which is the average of sensitivity and specificity, is recommended as more valuable for evaluating imbalanced data than “classic” Acc. To verify the prediction accuracy of the analyzed diagnostic tools, the receiver operating characteristic (ROC) was studied, as well as the area under the curve (AUC) was calculated using of the DeLong method. The alternative cutoffs, determining high or low fracture risks, were established on the basis of ROC curves.

When identifying the cutoff points, different methods were used, including the analysis of the distance from the top left corner and the Youden index J calculated as [25, 26]:

$$J={max}_{t}\left({Sensitivity}_{t}+{Specificity}_{t}-1\right),$$

where t denotes the threshold value, for which the index is maximal.

Conceptually, the distance from the top left corner is the minimum distance between the upper left corner of a square of side 1, i.e., the place where the sensitivity and specificity are the highest, as well as the point of the ROC curve. This index is calculated by the following formula:

$$\sqrt{{\left(1-Sensitivity\right)}^{2}+{\left(1-Specificity\right)}^{2}.}$$

A p-value at a level of 0.05 was regarded as statistically significant.

Results

The mean values of 10-year fracture risk/probability, established at baseline for major (FRAX) or any (Garvan) fracture risks and for hip fracture risks (both tools), are presented in Table 3. In addition, the frequency distributions of risk probabilities for the women with and without a fracture, determined for the FRAX and Garvan tools, are presented in Fig. 1.

Table 3 The mean values of 10-year fracture risk/probability established at baseline for the compared calculators
Fig. 1
figure 1

Frequency distributions of FRAX and Garvan risk probabilities for women with and without a fracture

In order to compare the predictive values of both calculators in relation to the fractures observed during the 10-year follow-up period, all the subjects were first categorized into low or high fracture risk, according to the thresholds of 3% for hip fractures and of 10% for major/any fracture [22, 27]. Table 4 presents information about the number of subjects classified in the high-risk category and the number of observed fractures during the follow-up period.

Table 4 The number of subjects classified at high-risk category and the number of observed fractures during follow-up period

It is easy to notice that, in the case of the Garvan tool, the vast majority of observed fractures were correctly predicted, while for the FRAX calculator the predictability was much lower, reaching merely 19.59% for major osteoporotic fractures. On the other hand, the Garvan tool showed a much higher fracture prediction power than it was actually observed. The detailed values of the diagnostic indexes for the analyzed predictive tools are provided in Table 5.

Table 5 Diagnostics indexes (with 95% CI) for FRAX (major and hip) and Garvan (any and hip) at conventional (≥ 3 and ≥ 10 for hip and major/any fractures, respectively) and newly proposed (≥ 6; ≥ 1.4; ≥ 14.4; and ≥ 8.8, respectively) cutoff points

Considering such a big discrepancy between the diagnostic tools with the same “fixed” cutoff points, resulting in “overestimation” in fracture prediction by the Garvan tool and “underestimation” by FRAX major fracture risks, ROC analyses were additionally performed. They focused on optimal cutoff points, separate for each diagnostic tool, to improve the accuracy of fracture prediction. The achieved ROC curves for each diagnostic tool are presented in Fig. 2.

Fig. 2
figure 2

Analysis of receiver operator characteristic (ROC) curves for FRAX and Garvan. *Null hypothesis: true area = 0.5. **The numbers in parentheses indicate specificity and sensitivity values for the proposed cutoff

According to [18, 20], an acceptable model should have both specificity and sensitivity of 50–79%. The following fracture risk threshold values were obtained, corresponding to the optimal diagnostic accuracy: FRAX major fracture risk, 6%; FRAX hip fracture risk, 1.4%; Garvan any fracture risk, 14.4%; and Garvan hip fracture risk, 8.8%. Table 5 shows also the sensitivity, specificity, and balanced accuracy (BAcc) values after the given cutoffs were applied. It may be seen that the BAcc value increased, compared to the values obtained for the previously tested thresholds in all the cases tested.

Hence, the presented analyses confirm the validity of using different cutoff thresholds for both calculators. The relation between the observed fractures and the estimated number of subjects expected to have fractures, predicted on the basis of “new” cutoffs, is presented in Table 6. Compared to the data in Table 4, there is a clear increase in correctly predicted fractures, using the FRAX tool. At the same time, the number of high-risk individuals, predicted by the Garvan tool, decreased while maintaining an acceptable level of correctly predicted fractures (close to that obtained by the FRAX calculator).

Table 6 The number of subjects classified at high-risk category with respect to the proposed cutoff values and the number of observed fractures during the follow-up period

Discussion

The predictive potentials of FRAX and Garvan tools are presented in our study via a prospective observation. In general, both tools underestimated the fracture incidence, while the hip fracture rates were more accurately predicted. The Garvan tool better identified therapy demanding subjects than the FRAX calculator. The current study established therapeutic thresholds, helping identify high fracture risk candidates to therapy more accurately than the commonly used thresholds. The thresholds, established in the study and adjusted to the national population, are the most important finding of the current study. We plan to perform an additional longitudinal observation in another, independent population in order to establish the ability of fracture prediction, demonstrated by the FRAX and Garvan tools in comparison to the original local tool, called POL-RISK [28]. We consider that such an external validation may be helpful for practitioners. Such validation should be helpful in accurate identification of subjects who need treatment. In result, a significant number of patients might avoid a fracture.

We consider that the most important issue in management of osteoporotic patients is accurate fracture prediction. Fracture risk depends on several well-defined factors, including the bone status, assessed by bone densitometry, and the so-called clinical risk factors. The crucial aim of osteoporotic therapy is protection against fracture; the primary objective is to prevent the first fracture, while the secondary aim is to avoid the next fracture. These obvious aims are not easy to achieve in daily practice. The number of osteoporotic individuals is rather high and it is not possible to recommend and start treatment in all of them. So, the issue of how to establish precise indications for therapy onset is still open. One should remember that the primary aim is to avoid fracture, but independently of the purely medical approach, cost-effectiveness has to be considered as well. In general, the thresholds of 3% for hip and of 20% for any (major) fracture risk are used. In national recommendations, the threshold for the initiation of pharmacological therapy in case of all fracture risk is 10% [27].

In our opinion, the most important point in regard to such clear recommendations is the verification of how the tools, used for fracture risk assessment, predict the real fracture incidence. We should also remember that there are two problems; first, how to reveal those individuals who really need treatment, and second, how to avoid treatment in patients who in fact are not at high fracture risk. An optimal way to establish the actual values of fracture risk assessment is a prospective observation of a big population plus an analysis of whether fractures really occurred in those who were at high risk at baseline. One should also remember that fractures always occur in a minority of patients. Therefore, the primary aim is to start the therapy in high-risk individuals. The results of the current study enable to establish optimal thresholds for therapy onset, based on specificity and sensitivity. This pure medical point of view, which could be a recommendation used in daily practice, should also be approached from the cost-effectiveness perspective. How to find a balance between medical efficacy and economical calculation? Interestingly enough, an application has been proposed by the Garvan tool (www.fractureriskcalculator.com). The thresholds, used to establish therapy initiation points, include the reimbursement of medications. Regarding the hip fracture risk, therapy is recommended when the risk exceeds 9%, and for any fracture risk, it is above 26%. Below 3% and 14%, respectively, the reimbursement of medications is not possible. Between 3 and 9% (hip fracture risk) and 14 and 26% (any fracture risk), the decision should be individualized. The thresholds, proposed in the current study, should be considered as pure medical ones. Therefore, an additional analysis is needed in future, taking also into account the therapy costs, in order to establish optimal thresholds for daily clinical practice. The current study shows that the traditional thresholds are used, i.e., 3% for hip fracture risk and 10–20% for any or major fracture risk, it is possible to identify only a small proportion of subjects who need therapy (e.g., those who sustained a fracture during the follow-up period in our study cohort). For the hip, according to the FRAX calculator figures, only 50% women (7 with predicted fracture in comparison to 14 fractures in follow-up) would be treated, and the respective figure for the Garvan tool was 71% (10 versus 14). Regarding the assessment of major fracture risk, according to the FRAX calculator, respective value was 17.6% (17 fractures predicted versus 97 observed) and for any fracture risk, assessed by the Garvan tool, it was 82.2% (106 versus 129). These data indicates that even when the commonly approved thresholds are used, the Garvan tool better identifies the candidates for necessary treatment.

The issue of fracture prediction has been presented in many published reports [14,15,16,17,18,19,20,21,22]. However, only some of them provide prospective data [14, 15, 18, 19]. In a study of 1422 women, observed for 8.8 years, fracture prediction accuracy was clearly higher for the Garvan tool than for the FRAX calculator [14]. The Garvan tool accurately predicted any fracture risk and overestimated the hip fracture risk. The FRAX calculator underestimated both major and hip fracture risks. The AUC for hip fractures was 0.70 and 0.64 for major fractures (FRAX), and the respective results for the Garvan tool were 0.67 and 0.64. Also, in an Australian observation of 506 women, continued for 5 years, the Garvan tool better predicted fractures than the FRAX tool [15]. The AUCs were 0.693 and 0.689, respectively. In another study, performed with participation of 809 women, observed for 10 years, both the Garvan and FRAX tools underestimated major (any) fracture risk and accurately predicted hip fracture risk [18]. Comparably to our results, the Garvan tool better predicted the later, actual fractures than the FRAX calculator. The ratios for the observed/predicted fractures were 139/184 and 52/115 for the Garvan and FRAX tool, respectively. The AUCs were 0.753 and 0.956 for the FRAX and Garvan tool, respectively. Comparable results were presented in a subsequent study [19], where the FRAX and Garvan tools demonstrated good discriminatory values for hip fracture risk but only a moderate discriminatory ability for major fracture risk (FRAX) and for any fracture risk (Garvan). The respective values of AUC for any (major) fracture risks were 0.708 and 0.721, respectively, and did not significantly differ. The adequate values for hip fracture risk were 0.841 and 0.761, and differed significantly. Regardless of the presented data, the authors showed an interesting analysis of the proportion of the subjects treated and untreated, according to the fracture risk established at baseline: 73.3% (FRAX) and 72.8% (Garvan) of the patients who eventually suffered from a fracture would not have been treated. We believe that the above considerations demonstrate the need to establish other optimal thresholds for therapy onset instead of the to-date’s common thresholds.

The study has certain limitations. We observed only women, some data, obtained via phone interviews, might be incorrect, spine radiograms were not available in all the patients, either at baseline or during follow-up; thus, some vertebral fractures may have been omitted. However, a large epidemiological sample was followed for 10 years and a significant number of fractures were identified, providing valuable results. The new thresholds for therapy onset should enable more patients with high fracture risk to be qualified for treatment.

In conclusion, the described prospective study established new optimal thresholds for the initiation of therapy. The improved identification of patients in need of therapy should decrease new fracture incidence rates.