Introduction

Osteoporosis, which is characterized by low bone mass, decreased bone strength, and increased fracture risk, has become an important public health problem [1, 2]. A study conducted in Turkey by the Turkish Osteoporosis Society reported that the incidence of osteoporosis in people over the age of 50 is 7.5% in men and 12.9% in women [3].

The results from evaluating the disease by physical examination, laboratory tests, and imaging methods are not always consistent with the patient’s well-being and functional capacity. In this respect, the measurement of quality of life has an important role in the approach to diseases today [4]. Although generic scales such as Short Form 36 are used to evaluate the quality of life in patients with osteoporosis, they may be insufficient to measure the specific characteristics of osteoporosis. There are multiple disease-specific scales used to measure the quality of life in osteoporotic patients [5,6,7,8]. The European Society of Osteoporosis Quality of Life Questionnaire-41 (QUALEFFO-41) has been the most commonly used disease-specific scale in the literature for many years and is an established and self-reported questionnaire [9]. This questionnaire was primarily developed for patients with osteoporosis and associated vertebral fractures. However, in the years that followed, this questionnaire was used in patients with osteoporosis without vertebral fracture, and its reliability was demonstrated [10,11,12]. QUALEFFO-41 includes many questions and takes a long time to complete, which has revealed the requirement for a shorter and more practical version. Therefore, the questionnaire was updated in 2006, resulting in the QUALEFFO-31 version, and the validity of the current questionnaire was demonstrated [13].

To our knowledge, validity and reliability studies of the new QUALEFFO-31 version have been performed in Spain [14], Taiwan [10], Hong Kong [15], and China [16] to date. This study aims to assess the reliability and validity of the Turkish version of QUALEFFO-31 and to evaluate the capacity of the questionnaire to distinguish patients with osteoporosis.

Method

Patients

The plan for the study was to enroll 150 patients who presented to Bezmialem Vakıf University Hospital outpatient clinics between February 2020 and May 2020. Inclusion criteria were being 50 years of age or older, having a dual-energy X-ray absorptiometry result obtained within the last 6 months, being independent and mobile, and having the cognitive capacity to understand and complete the questionnaire. Exclusion criteria were comorbidities that can significantly compromise quality of life and bone health, such as malignancy, chronic inflammatory diseases, neuromuscular diseases, or metabolic bone diseases other than osteoporosis. Demographic data such as age, sex, smoking status, and body mass index were collected at the baseline evaluation of the patients included in the study. Lateral spine radiographs were taken for all patients, and fractures were evaluated morphometrically: a loss exceeding 20% in anterior, middle, or posterior vertebral height was accepted as a fracture. Patients were classified into bone mineral density (BMD) categories according to the World Health Organization criteria, based on at least one of their lumbar or femoral measurements: low bone density (osteopenia or osteoporosis), T score < −1; osteoporosis, T score ≤ −2.5.

Translation

First, permission to translate the questionnaire into Turkish and validate it was obtained from the lead author, Paul Lips, who developed the questionnaire. Subsequently, the questionnaire was adapted to Turkish using the methodology of Beaton et al. [17]. The original English version was translated into Turkish independently by two translators whose native language was Turkish. The first translator was a physiatrist who was familiar with the content and concept of the questionnaire. The other translator was a professional translator without a medical background. The two translations were compared and synthesized into a single Turkish version. Subsequently, this Turkish version was translated back into English by two different bilingual translators whose native language was English and who had not previously seen the original version of the questionnaire. The resulting versions were evaluated by a team that included all the translators and four physiatrists, and the preliminary questionnaire was formed by consensus. This preliminary questionnaire was tested with 30 patients in a pilot study. Patients gave feedback regarding the clarity and compatibility of every item of the questionnaire. After the vocabulary, terminology, information errors, and parts that were difficult to understand were revised, the questionnaire was finalized.

Questionnaires

The Turkish version of QUALEFFO-31 comprises 31 items including 4 items on the pain domain, 18 items on the physical function domain, and 9 items on the mental function domain. High scores indicate poor quality of life. A Likert scale with 4 options is used for items 16 and 18 in the physical function domain, while a Likert scale with 5 options is used for all the other items. Scores of the 2nd, 3rd, 4th, 6th, 8th, and 9th items in the mental function domain are calculated in a reverse manner (1 is the most unhealthy while 5 is the healthiest). Each domain score is calculated by converting the sum of the scores of the items included in the domain into a scale of 0–100.
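The scoring rules described above can be illustrated with a minimal sketch. The text does not give the exact conversion formula, so the linear rescaling below (0 when every item is at its minimum, 100 when every item is at its maximum) is an assumption, as are the function names; the official QUALEFFO-31 scoring algorithm may differ, for example in how it handles missing answers.

```python
from typing import List

# Positions (1-based, within the mental function domain) of the items that
# the text says are scored in reverse.
REVERSED_MENTAL_ITEMS = {2, 3, 4, 6, 8, 9}


def reverse_score(value: int, n_options: int = 5) -> int:
    """Flip a Likert response so that a higher value always means worse health."""
    return n_options + 1 - value


def domain_score(responses: List[int], n_options: List[int]) -> float:
    """Rescale the sum of item responses linearly to the 0-100 range.

    Assumption: each item is coded 1..n_options, so the lowest possible sum
    is the number of items and the highest is the sum of the option counts.
    """
    raw = sum(responses)
    lowest = len(responses)
    highest = sum(n_options)
    return 100.0 * (raw - lowest) / (highest - lowest)


def mental_domain_score(responses: List[int]) -> float:
    """Score the 9 mental function items after reversing the listed items."""
    adjusted = [
        reverse_score(r) if position in REVERSED_MENTAL_ITEMS else r
        for position, r in enumerate(responses, start=1)
    ]
    return domain_score(adjusted, [5] * len(adjusted))
```

For the physical function domain, `n_options` would contain 4 for items 16 and 18 and 5 for the remaining items, which is why the option counts are passed per item rather than as a single constant.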

Short Form-36 (SF-36) is a self-reported instrument comprising eight domains: bodily pain, physical functioning, social functioning, general health, mental health, vitality, role restrictions because of physical problems, and role restrictions because of emotional problems. SF-36 is scored from 0 to 100: 0 indicates a poor health status, whereas 100 indicates a good health status. The reliability and validity study of the SF-36 Turkish version was performed by Kocyigit et al. [18].

All patients included in the study answered both questionnaires in the same order (QUALEFFO-31 followed by SF-36).

Sample size

There are no established rules for determining the required sample size in questionnaire validation studies, and the suggested number of participants per item varies between 5 and 25 [19, 20]. Recruitment of at least 50 participants is considered necessary for most analyses [19, 21]. In previous validation studies of QUALEFFO-31, the number of participants ranged from 118 to 200 [10, 14,15,16]. In light of these data and considering the patient population of the study center, a participant-to-item ratio of 5 and a sample size of 150 were chosen.

Statistical analysis

IBM SPSS Statistics for Windows, version 26.0 (IBM Corp., Armonk, NY, USA), was used to analyze the data. Descriptive statistics are presented as mean ± standard deviation for continuous data and as frequency and percentage for categorical variables.

Internal consistency and test–retest analyses were used for the reliability study. Internal consistency was assessed using Cronbach’s α; values between 0.70 and 0.95 were considered acceptable [22]. For retest reliability, the intraclass correlation coefficient (ICC) was calculated using a mixed-effects model with absolute agreement. Thirty patients (apart from the patients participating in the translation process) completed the questionnaire a second time, 2 weeks after the first administration. This period was chosen to be long enough for the patients to forget their answers, but short enough that their health status would not change. ICC values below 0.5 were considered to indicate weak reliability, those between 0.5 and 0.75 moderate reliability, those between 0.75 and 0.90 good reliability, and those above 0.90 excellent reliability [23].
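As an illustration of the two reliability statistics, the sketch below computes Cronbach’s α from item responses and a test–retest ICC from two administrations. It is not the SPSS procedure used in the study: the function names are hypothetical, and the ICC shown is the single-measure, absolute-agreement form from a two-way layout, which is an assumption because the text does not state whether single or average measures were reported.

```python
from statistics import mean, variance
from typing import Sequence


def cronbach_alpha(items: Sequence[Sequence[float]]) -> float:
    """items[i][p] is the response of participant p to item i."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def icc_absolute_agreement(test: Sequence[float], retest: Sequence[float]) -> float:
    """Single-measure, absolute-agreement ICC for two sessions (k = 2)."""
    n, k = len(test), 2
    rows = list(zip(test, retest))
    grand = mean(list(test) + list(retest))
    row_means = [mean(row) for row in rows]
    col_means = [mean(test), mean(retest)]
    # Two-way ANOVA decomposition: participants (rows), sessions (columns), error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in rows for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def classify_icc(icc: float) -> str:
    """Apply the reliability bands quoted in the text."""
    if icc < 0.5:
        return "weak"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:
        return "good"
    return "excellent"
```

Absolute agreement penalizes a systematic shift between sessions: if every patient scored exactly one point higher at retest, a consistency-type ICC would still be 1, whereas this coefficient falls below 1.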

For the validity study, convergent-discriminant validity, concurrent validity, factor analysis, and known-group validity analyses were performed. Convergent validity was accepted if the correlation of each item with its own domain was above 0.4. Discriminant validity was accepted if the correlation of each item with its own domain was greater than its correlation with other domains. The number of items providing convergent and discriminant validity in the domains was divided by the number of all items to calculate convergent and discriminant validity ratios. For concurrent validity, the correlation between the domains of QUALEFFO-31 and SF-36 was calculated using the Pearson or Spearman coefficient, depending on the distribution of the data. A correlation coefficient between 0.3 and 0.5 was considered a low correlation, 0.5–0.7 a moderate one, and 0.7–0.9 a strong one [23,24,25]. Exploratory factor analysis (EFA) was performed using the SPSS program. Principal axis factoring and Promax with Kaiser normalization were used as the extraction and rotation methods, respectively. In addition, confirmatory factor analysis (CFA) was conducted with the IBM® SPSS® Amos™ (Version 24) program to investigate the factorial structure of the questionnaire via structural equation modeling. The fit and validity of the questionnaire were investigated via CFA using the comparative fit index (CFI), relative chi-square index (CMIN/DF), normed fit index (NFI), goodness of fit index (GFI), relative fit index (RFI), and root mean square error of approximation (RMSEA). CMIN/DF makes the chi-square less dependent on sample size and is obtained by dividing the chi-square by the degrees of freedom (DF). A value of 5 or less can be considered sufficient for a model to be accepted [26]. GFI is a measure of fit between the hypothesized model and the observed covariance matrix. The GFI value ranges from 0 to 1, and values exceeding 0.90 are considered to indicate a good model [27].
CFI compares the fit of the model that allows correlations between latent variables with the fit of a null model that ignores covariance. The CFI value ranges from 0 to 1, and values approaching 1 indicate better goodness of fit. NFI investigates the compatibility of the assumed model with the baseline model and ranges from 0 to 1; higher values indicate better goodness of fit. RMSEA is an absolute fit index that evaluates how far an assumed model is from a perfect model, and values closer to 0 indicate better goodness of fit [28]. For the known-group validation, the patients were grouped based on the presence of fracture or osteoporosis. Differences in QUALEFFO scores between the groups were evaluated using the Mann–Whitney U test or the independent-samples t-test. Kruskal–Wallis and one-way ANOVA tests were used for subgroup analyses.
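The convergent and discriminant validity criteria described above can be sketched as follows. Several details are assumptions rather than the study’s procedure: Spearman’s rho is used throughout, each item is correlated with the uncorrected sum score of a domain (the item itself is not removed from its own domain total), ratios are reported per domain, and all names are hypothetical. Items with no variance would cause a division by zero and are not handled.

```python
from typing import Dict, List, Tuple


def ranks(values: List[float]) -> List[float]:
    """Average ranks: tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for index in order[i:j + 1]:
            result[index] = average_rank
        i = j + 1
    return result


def pearson(x: List[float], y: List[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def validity_ratios(
    domains: Dict[str, List[List[float]]], threshold: float = 0.4
) -> Dict[str, Tuple[float, float]]:
    """domains[name][i][p] is the response of participant p to item i.

    Returns, per domain, (convergent ratio, discriminant ratio).
    """
    totals = {
        name: [sum(person) for person in zip(*items)]
        for name, items in domains.items()
    }
    ratios = {}
    for name, items in domains.items():
        convergent = discriminant = 0
        for item in items:
            own = spearman(item, totals[name])
            others = [spearman(item, totals[o]) for o in domains if o != name]
            convergent += own > threshold
            discriminant += all(own > r for r in others)
        ratios[name] = (convergent / len(items), discriminant / len(items))
    return ratios
```

A ratio of 1.0 for a domain means every item met the criterion; a domain in which 7 of 9 items met it would give about 0.78.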

Receiver operating characteristic (ROC) analysis was performed to evaluate the capacity of QUALEFFO-31 and SF-36 to differentiate patients with fractures or osteoporosis. Discriminative capacity was accepted for values significantly higher than 0.5. Values between 0.7 and 0.8 were considered a moderate test performance, values between 0.8 and 0.9 a good test performance, and values between 0.9 and 1.0 a very good test performance [29].
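To make the AUC interpretation concrete, the sketch below computes the area under the ROC curve as the probability that a randomly chosen patient from the positive group (for example, with a fracture) has a higher score than a randomly chosen patient from the negative group, with ties counted as one half. This pairwise formulation is equivalent to the empirical AUC but is not the SPSS routine used in the study; the function names are hypothetical, and the significance test against 0.5 mentioned in the text is not included.

```python
from typing import List


def auc(scores_positive: List[float], scores_negative: List[float]) -> float:
    """Share of (positive, negative) pairs in which the positive case scores higher."""
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(scores_positive) * len(scores_negative))


def describe_auc(value: float) -> str:
    """Apply the performance bands quoted in the text."""
    if value >= 0.9:
        return "very good"
    if value >= 0.8:
        return "good"
    if value >= 0.7:
        return "moderate"
    return "below the moderate band"
```

Because higher QUALEFFO-31 scores indicate worse quality of life, the fracture or osteoporosis group is passed as `scores_positive`; for SF-36, where higher scores indicate better health, the scores would need to be negated or the groups swapped.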

A p value of 0.05 or less was set as the threshold of statistical significance for all analyses.

Results

A total of 111 patients were evaluated. The patients were predominantly women (93.7%). Twenty-three patients (20.7%) had one or multiple osteoporotic vertebral fractures and 38% of the patients were osteoporotic. Table 1 lists the socio-demographic and clinical data of the patients.

Table 1 Socio-demographic and morphological characteristics of the patients

There was a floor effect in the pain domain of QUALEFFO-31: 19 patients (17.1%) had the lowest possible score of zero. There were no floor or ceiling effects in the other domains or in the total score.
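The floor effect calculation can be reproduced with a short sketch. The 15% cut-off is the one cited in the Discussion; the function name, the 0–100 score range used as defaults, and the illustrative score list are assumptions, and only the counts (19 of 111 patients at zero) come from the text.

```python
from typing import Dict, Sequence, Union


def floor_ceiling(
    scores: Sequence[float],
    lowest: float = 0.0,
    highest: float = 100.0,
    threshold: float = 0.15,
) -> Dict[str, Union[float, bool]]:
    """Share of respondents at the scale extremes and whether it exceeds the cut-off."""
    n = len(scores)
    floor_share = sum(s == lowest for s in scores) / n
    ceiling_share = sum(s == highest for s in scores) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }


# Illustrative data only: 19 patients at the minimum, 92 at an arbitrary mid-scale value.
pain_scores = [0.0] * 19 + [50.0] * 92
summary = floor_ceiling(pain_scores)
print(f"{summary['floor_share']:.1%}")  # share of patients with the lowest possible score
```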

For the QUALEFFO-31 domains, internal consistency was acceptable except for the mental function domain. ICC coefficients showed good retest reliability for all domains and for the total score. Table 2 lists the Cronbach alpha values and ICC coefficients.

Table 2 Cronbach alpha values and ICC coefficients for QUALEFFO-31

All items in the pain and physical function domains had convergent (rho coefficient = 0.43–0.89) and discriminant validity. In the mental function domain, all items except for the 2nd and 8th items had convergent validity (rho coefficient = 0.23–0.68), and all items except for the 1st and 5th items had discriminant validity. Table 3 lists the convergent and discriminant validity ratios of the domains.

Table 3 Convergent and discriminant validity ratios of QUALEFFO-31 domains

In the concurrent validity analysis, there were moderate to strong negative correlations between the QUALEFFO-31 and SF-36 domains with similar names (Table 4). The strongest correlation was between the QUALEFFO-31 total score and the SF-36 physical function domain (rho = 0.74 in absolute value).

Table 4 Correlation coefficients between QUALEFFO-31 and SF-36 domains (Spearman’s rho)

The EFA revealed a 3-factor structure. These 3 factors explained 62.8% of the total variance. All items that constitute the pain domain and items 1, 4, 6, 8, 10, and 16 of the physical function domain loaded on the first factor. Items 14, 17, and 18 of the physical function domain loaded on the third factor together with the mental function domain items. All the remaining items of the physical function domain loaded on the second factor. The EFA of QUALEFFO-31 is presented in detail in Table 5. We performed the CFA according to both the original factor structure and the item distribution determined in our EFA. The obtained CMIN/DF, GFI, CFI, and RMSEA values are summarized in Table 6.

Table 5 Exploratory factor analysis of QUALEFFO-31
Table 6 Confirmatory factor analysis of QUALEFFO-31

Based on the comparison between the groups, those with fractures had worse QUALEFFO-31 scores in all domains except for the mental domain compared to those without fractures. However, these differences were not statistically significant (p > 0.05). Those with osteoporosis had significantly lower QUALEFFO pain values compared to non-osteoporotic (osteopenia + normal BMD) patients (p < 0.05). There was no significant difference in the other domains between osteoporotic and non-osteoporotic patients (Table 7). Patients were therefore divided into BMD subcategories (osteoporosis, osteopenia, normal BMD, and subgroups formed according to the presence of fracture), and a subgroup analysis was carried out. However, there was no statistically significant difference between the subgroups in terms of QUALEFFO scores (p > 0.05). In addition, the QUALEFFO scores of the patients who received anti-osteoporotic treatment did not significantly differ from those who did not (p > 0.05).

Table 7 Comparison of QUALEFFO-31 scores between groups

Table 8 lists the area under the curve (AUC) values from the ROC analysis for the capacity of the QUALEFFO-31 domains to differentiate patients with fracture or osteoporosis. Although some values were above the threshold (0.5), neither QUALEFFO-31 nor SF-36 had a significant discriminative capacity for osteoporosis or fracture.

Table 8 QUALEFFO-31 domains and discriminative capacities for fracture and osteoporosis

Discussion

QUALEFFO is a disease-specific scale used to evaluate the quality of life of osteoporotic patients. This study has demonstrated that the most recent version of QUALEFFO-31, which has been shortened and updated, has preserved its psychometric characteristics following its adaptation to Turkish.

Ideally, the floor effect of a scale should be below 15% [30]. Based on the results of this study, there was a floor effect in the QUALEFFO-31 pain domain. Although this could be attributed to the inclusion of patients without fractures, the developers of the scale encountered a similar finding, which supports a scale attenuation effect [13].

In our study, the mental function domain had a low Cronbach alpha value (0.67). When mental domain item 2, “Do you tend to feel tired?”, was excluded, the alpha coefficient increased to 0.717. Furthermore, the convergent validity of this item was low (rho = 0.239). Because of cultural conversation habits, patients may have answered according to the times of day when they felt most tired rather than evaluating their overall vitality and energy level. Nevertheless, it seems appropriate to keep this item as part of the mental function domain because of its discriminant validity. Moreover, Kocyigit et al., Van Schoor et al., and Lai et al. reported the lowest values for the mental domain, with alpha values of 0.7, 0.72, and 0.72, respectively [4, 13, 15]. In this respect, our internal consistency results are in line with other studies.

In our study, all QUALEFFO-31 domains had good retest reliability [31]. The ICC values obtained in our study were slightly lower compared to those in the Spanish validation study (0.96–0.98) [14] and similar to those in the Chinese validation study (0.76–0.91) [15] and the Taiwan validation study (0.77–0.91) [10].

The domain with the lowest values in terms of convergent and discriminant validity was the mental domain. However, no item lacked both convergent and discriminant validity. As in the QUALEFFO-41 Turkish validation study (39th item), the 8th mental domain item, “Do you find it easy to make contact with people?”, may have been misunderstood and answered incorrectly because of cultural factors [4]. Kocyigit et al. reported rates varying between 89 and 100% for the convergent and discriminant validity ratios, with the lowest values in the mental domain [4]. Other studies reported values ranging from 72 to 100% [10, 13, 15]. Our results confirm that QUALEFFO-31 has a sufficient level of convergent and discriminant validity, which is consistent with the literature.

The EFA revealed a 3-factor structure in line with the original model. However, there were some differences in the distribution of the items compared to the original study. The physical function items that loaded on the first factor, which corresponds to the pain domain, ask about physical functions that may be restricted by pain, such as dressing, cleaning, washing dishes, carrying goods, bending forward, and gardening. Similarly, the physical function items that loaded on the third factor, which corresponds to the mental function domain, ask about aspects of physical function that may be restricted by compromised socialization, such as going out and visiting a cinema or friends. These relationships can explain the spread of physical function items to other factors. Although the CMIN/DF value of 3.349 was acceptable in the CFA performed according to the original model, the other fit indices did not indicate a very good fit. When we performed the CFA according to our model, the fit improved on all indices. These findings may indicate that some changes to the QUALEFFO-31 model would provide better construct validity. Further studies with more patients are needed to confirm these findings.

QUALEFFO is essentially a questionnaire created for patients who have osteoporosis with fractures. Like QUALEFFO-41, QUALEFFO-31 is a disease-specific questionnaire, and many studies have shown that it yields worse results in all its domains in patients with fractures [13, 15, 16]. Based on the comparison between groups in this study, those with fractures had worse QUALEFFO-31 values, except for the mental domain. In the ROC analyses for fracture, all domains had AUC values above the threshold value except for the mental domain. However, neither analysis yielded statistically significant results. This may have been caused by the low number of patients with fractures. Moreover, previous studies have shown that patients with osteoporosis have worse QUALEFFO values [14, 32, 33]. However, there was no significant difference between BMD categories in our study, and the ROC analysis did not show any discriminative value either. Osteoporosis is known to be a silent disease unless there is a fracture [34]. In their study using QUALEFFO-41, Romagnoli et al. reported that the questionnaire did not have any differential capacity in terms of BMD categories [35]. Our results indicate that QUALEFFO-31 has no differential capacity in terms of osteoporosis in Turkish patients.

Limitations of the study

The plan was to enroll 150 patients in the study over a 3-month enrollment period. However, because of the COVID-19 pandemic, the number of hospital admissions, except for mandatory cases, decreased significantly. Therefore, the duration of the study was extended to 1 year. However, the target number was still not reached, and the study was terminated early with 111 patients. Although the values of the patients with fractures were higher than those of the controls in the known-group and ROC analyses, the reduced sample size may be the reason why statistical significance was not reached.

Conclusions

The Turkish version of QUALEFFO-31 has sufficient reliability and validity. Nevertheless, improvements in the pain and mental function domains and some changes to the model may increase the psychometric capacity of the questionnaire. The questionnaire, which is potentially capable of differentiating patients with fractures, does not appear to have differential capacity in terms of osteoporosis in Turkish patients.