Introduction

Degenerative lumbar spinal stenosis (DLSS) is defined by diminished space for the neural and vascular elements in the central canal of the lumbar spine secondary to degenerative changes of the facet joints, ligaments, vertebrae, and intervertebral discs1,2. DLSS is a common disease in elderly patients and typically presents with neurogenic claudication symptoms including pain in the buttocks and lower extremities provoked by walking or extended standing and relieved by rest and bending forward3. The treatment options range from nonsurgical approaches such as analgesics, physiotherapy, and epidural corticosteroid injections to surgical methods.

In the past, a multitude of studies have assessed the effects of these treatment options for DLSS. To establish firm evidence-based clinical guidelines on the cost-effective use of treatment interventions, results from clinical trials need to be comparable. This is particularly important in systematic reviews and meta-analyses, where conclusions are based on the available studies4. However, many trials use different outcome measures, which complicates the comparison of trial results. Further, studies may use measures that were not validated in the DLSS population and may therefore fail to identify clinically relevant changes or differences in this patient population. Indeed, one study showed that the conclusion of a study may be strongly influenced by the outcome measure used and by the cut-off values chosen for clinically important improvement5. To date, no study has systematically assessed the outcome measures used in clinical studies for DLSS and their validation specifically for DLSS.

We performed a cross-sectional analysis of treatment studies for DLSS included in systematic reviews and meta-analyses published between 2006 and April 2021. After extracting the outcome measures for the domains of pain and disability, we assessed whether these instruments were validated specifically for DLSS and critically appraised the quality of the validation studies.

Methods

Study design and eligibility criteria

We performed a cross-sectional analysis of outcome measures for pain and disability in treatment studies for DLSS. We included randomized controlled trials (RCT) and observational studies (OS) that had previously been included in systematic reviews (SR) or meta-analyses (MA) published in the Cochrane library. This approach allowed us to include, for each treatment intervention, a complete set of studies that had previously been assessed for methodological validity. Studies of spinal stenosis caused by conditions other than degeneration (e.g. traumatic, congenital, spondylolisthesis) and other study designs were excluded. This study is not a systematic review; however, where applicable, reporting follows the recommendations of the Preferred Reporting Items for Systematic reviews and Meta-Analysis Protocols (PRISMA statement)6 and the Statement for Strengthening the Reporting of Observational Studies in Epidemiology (STROBE statement)7.

Search strategy

We searched for SR and MA assessing surgical and non-surgical treatments for DLSS published in the Cochrane library from its inception (1996) to April 2021. An update search conducted on June 21, 2022 did not identify additional SR or MA.

Search terms included “lumbar spinal stenosis” in the title, abstract, or keywords and MeSH term “spinal stenosis”.

Selection process

Two reviewers (DR and MW) independently screened the titles and abstracts for eligible SR and MA according to the pre-defined inclusion criteria. Subsequently, two reviewers (DR and FB) extracted all RCT and OS from the included SR and MA into an EndNote database for the analysis. The full texts of all RCT and OS were then reviewed for inclusion by DR and confirmed by FB. In cases of uncertain eligibility or discrepancies, studies were discussed between the two reviewers and resolved by consensus or by a third party (MW).

If necessary, authors of protocols for systematic reviews and meta-analyses were contacted for further information.

Data extraction process

The following information was systematically extracted by one reviewer (DR): Author, publication year, study design, treatment intervention, outcome measures for pain and disability, references for validation studies. A second reviewer confirmed the extracted information (FB). Subsequently, all cited validation studies were retrieved and read in full text.

Quality of validation study

Two reviewers (DR and MW) analyzed the methodological quality of the validation process using the COnsensus-based Standards for the selection of health status Measurement Instruments (COSMIN, https://www.cosmin.nl/tools/checklists-assessing-methodological-study-qualities, accessed on December 2, 2022) checklist8. The COSMIN checklist was developed to assess the methodological quality of studies on measurement properties of health-related patient reported outcomes. We extracted information on eight domains: content validity, internal consistency, construct validity, criterion validity, reliability, responsiveness, floor/ceiling effects, and interpretability.

Content validity: Was there a clear description of the measurement aim, the target population, the item selection, and the investigators or experts involved in item selection, and were widely accepted or appropriate methods and concepts used? Was the number of patients adequate (very good ≥ 50, adequate 30–49)?

Internal consistency: Is the scale or subscale unidimensional? Were factor analyses performed in an adequate sample (very good ≥ 100 patients, adequate 50–99) and was Cronbach’s alpha calculated per dimension (Cronbach’s alpha 0.70–0.95)?

Criterion validity: Was a correlation with the gold standard assessed (≥ 0.70)? Was the number of patients adequate (very good ≥ 50, adequate 30–49)?

Construct validity: Were hypotheses pre-specified, and did ≥ 75% of the results correspond with these hypotheses (target sample size for this (sub)group analysis ≥ 50 patients)?

Reliability: Were two independent measurements taken under similar conditions? Was a test–retest intraclass correlation coefficient (ICC) or weighted Kappa calculated (≥ 0.70, sample size ≥ 50 patients)?

Responsiveness: Can the proposed criterion be considered a reasonable gold standard? Was the ability to detect a clinically important change over time assessed (AUC ≥ 0.70 or Guyatt’s responsiveness ratio > 1.96)? Was the number of patients adequate (very good ≥ 50, adequate 30–49)?

Floor or ceiling effects: Was a floor or ceiling effect assessed and not detected (sample size ≥ 50 patients)?

Interpretability: Was the degree to which one can assign qualitative meaning to quantitative scores assessed (anchor-based method recommended to determine the minimal clinically important difference; sample size ≥ 50 patients)?
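
The sample-size and threshold criteria listed above can be expressed compactly in code. The following Python sketch is illustrative only: the function names, the label "inadequate" for samples below the adequate cut-off, and the treatment of the boundaries as inclusive are assumptions and not part of the COSMIN checklist or of the reviewers' actual workflow. The numeric cut-offs are those quoted in the domain descriptions.

```python
def rate_sample_size(n, very_good_min=50, adequate_min=30):
    """Classify a validation sample size.

    Defaults follow the cut-offs quoted for content validity, criterion
    validity, and responsiveness (very good >= 50, adequate 30-49).
    The label "inadequate" for smaller samples is an assumption.
    """
    if n >= very_good_min:
        return "very good"
    if n >= adequate_min:
        return "adequate"
    return "inadequate"


def cronbach_alpha_acceptable(alpha):
    """Internal consistency: alpha within 0.70-0.95 (bounds assumed inclusive)."""
    return 0.70 <= alpha <= 0.95


def meets_minimum(value, minimum=0.70):
    """Generic floor check used for the correlation with the gold standard,
    the ICC or weighted Kappa, and the AUC (all >= 0.70 in the text)."""
    return value >= minimum


# Hypothetical values, made up for illustration only.
print(rate_sample_size(42))            # adequate (default cut-offs)
print(rate_sample_size(80, 100, 50))   # adequate (internal consistency cut-offs)
print(cronbach_alpha_acceptable(0.97)) # False (above the 0.95 upper bound)
print(meets_minimum(0.72))             # True
```

Passing the cut-offs as parameters keeps one function usable for the domains with different sample-size requirements, such as internal consistency (≥ 100 very good, 50–99 adequate).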

Two reviewers (DR and MW) independently assessed each domain and rated it as fulfilled (+, defined as very good or adequately addressed), not fulfilled (-, doubtful or inadequate), not applicable (NA), or not reported (NR). Disagreements between the reviewers were discussed and resolved by consensus. In case no consensus could be reached, the study was discussed with a third reviewer (FB). All disagreements were resolved by consensus. Finally, a quality score was calculated ranging from 0 (no domain was fulfilled) to 8 points (all domains were fulfilled).
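
The quality score described above is a count of fulfilled domains. A minimal Python sketch of this calculation is given below. The data structure, the function name, and the example ratings are hypothetical and serve only to illustrate the scoring rule; the sketch assumes that ratings of "-", "NA", and "NR" all contribute zero points.

```python
DOMAINS = (
    "content validity",
    "internal consistency",
    "construct validity",
    "criterion validity",
    "reliability",
    "responsiveness",
    "floor/ceiling effects",
    "interpretability",
)

VALID_RATINGS = {"+", "-", "NA", "NR"}


def quality_score(ratings):
    """Return the number of domains rated '+' (0 to 8).

    `ratings` maps each of the eight domain names to one of
    '+', '-', 'NA', or 'NR'. Only '+' counts towards the score
    (assumption: '-', 'NA', and 'NR' all score zero).
    """
    missing = [d for d in DOMAINS if d not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    invalid = {d: r for d, r in ratings.items() if r not in VALID_RATINGS}
    if invalid:
        raise ValueError(f"invalid ratings: {invalid}")
    return sum(1 for d in DOMAINS if ratings[d] == "+")


# Hypothetical validation study, ratings made up for illustration.
example = {
    "content validity": "+",
    "internal consistency": "+",
    "construct validity": "-",
    "criterion validity": "NA",
    "reliability": "+",
    "responsiveness": "NR",
    "floor/ceiling effects": "NR",
    "interpretability": "-",
}
print(f"{quality_score(example)}/8")  # 3/8
```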

Outcome of interest

The primary outcomes were the outcome measures used in the domains of pain and disability.

Data synthesis

We summarized categorical variables with number and percentage and continuous data with mean and standard deviation. All analyses were conducted with the statistical software R (version 3.6.1).
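
For readers who wish to reproduce this type of descriptive summary, a minimal sketch follows. It is written in Python rather than R, uses made-up toy values, and is not the analysis code of this study; the function names are hypothetical. The sample standard deviation (n - 1 denominator) is assumed, which is also what R's `sd()` computes.

```python
from collections import Counter
from statistics import mean, stdev


def summarize_categorical(values):
    """Return {category: (count, percentage)} with percentages rounded to 0.1."""
    counts = Counter(values)
    total = len(values)
    return {k: (n, round(100 * n / total, 1)) for k, n in counts.items()}


def summarize_continuous(values):
    """Return (mean, sample standard deviation) of a list of numbers."""
    return mean(values), stdev(values)


# Toy data, invented for illustration only.
designs = ["RCT", "RCT", "OS", "RCT"]
ages = [60, 62, 64]

print(summarize_categorical(designs))  # {'RCT': (3, 75.0), 'OS': (1, 25.0)}
m, sd = summarize_continuous(ages)
print(f"{m:.1f} ± {sd:.1f}")           # 62.0 ± 2.0
```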

Results

Study selection

The literature search in the Cochrane library retrieved 31 eligible references. Twenty references met our inclusion criteria and were included in the study (systematic reviews n = 15, meta-analysis n = 3, combined systematic review and meta-analysis n = 2). Subsequently, a total of 256 primary studies were extracted for full-text assessment. For details see Table 1.

Table 1 Characteristics of included SR and/or MA (n = 20).

After full-text screening, 95 primary studies were included in the final analysis. One hundred and forty-two studies did not fulfill our inclusion criteria and were excluded. The main reason for exclusion was duplication (n = 94). The study selection process is depicted in Fig. 1.

Figure 1 Flow chart.

Characteristics of the included primary studies

The characteristics of the included primary studies are summarized in Table 2. Most of the studies were randomized controlled trials (n = 50, 48.5%) and prospective cohort studies (n = 34, 35.8%). Almost three quarters (73%) of the primary studies involved at least one surgical intervention. Studies were published between 1983 and 2016.

Table 2 Characteristics of the included studies.

The primary studies included a total of 7,878 participants with a median age of 63.5 ± 7.1 years (range 44–76.2 years). The median follow-up duration was 78.1 ± 81.3 weeks (range 1–480 weeks).

Table 3 summarizes the outcome measures used in the primary studies. In total, 242 outcome measures were identified. In the domain of pain, four outcome measures were identified. The Visual Analogue Scale (VAS, n = 69, 90%) and the Numeric Rating Scale (NRS, n = 9, 9%) were most commonly used. In the domain of disability, a total of 12 outcome measures were identified. The Oswestry Disability Index (ODI, n = 53, 47%) and various tests assessing walking tolerance (n = 34, 29%) were most frequently used (walking ability9,10,11, pain free walking12, walking distance13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37, walking test38, walking time39, walking < 15 minutes40, walking tolerance41).

Table 3 Outcome measures in the domain of pain and disability.

In the domain of pain and disability combined, the Zurich Claudication Questionnaire (ZCQ, n = 22, 47%) and the SF-36 (n = 15, 32%) were frequently applied.

Outcome measures and reference studies

In total, 45 primary studies (47.3%) provided a reference for at least one outcome measure. In the domain of pain references were provided for the VAS (n = 5) and the NRS (n = 2), respectively. In the domain of disability, the ODI (n = 22) and the Roland Morris Disability Questionnaire (RMQ, n = 8) were most frequently referenced. In the domain of pain and disability combined the ZCQ (n = 14) was commonly referenced.

For nine outcome measures (disability n = 7, pain and disability combined n = 2) a total of 14 validation studies specifically for a DLSS population were found. For the ZCQ (n = 4)42,43,44,45 and the ODI (n = 3)43,46,47 more than one validation study was identified. For details see Table 4.

Table 4 Summary and quality of validation studies.

Quality assessment of the validation studies

None of the validation studies assessed all predefined domains of the COSMIN checklist8 (Table 4). Twelve of the 14 included studies reached a quality score of 3/8 or less, indicating low methodological quality. None of the validation studies reached the maximum score (range 2/8–7/8). The two studies by Stucki et al.44,45 assessing the validation of the ZCQ in a DLSS population achieved the highest scores (6/8 and 7/8, respectively).

The Beaujon scoring system (BSS) and various tests assessing walking tolerance were tested in a DLSS population. However, the methodology of the validation study was not in agreement with the methodological items proposed for measurements of health-related patient reported outcomes8.

Discussion

Main findings

The results of this cross-sectional analysis indicate that the reporting of outcome measures in randomized clinical trials and observational studies in DLSS is insufficient. Less than half of the included primary studies provided a reference for at least one outcome measure in the domain of pain, disability, or combined pain and disability. A total of 14 validation studies for nine outcome measures were found. The quality assessment of the validation studies revealed low quality for the majority of the studies. Within the DLSS population, three validation studies were found for the ODI and four for the ZCQ. However, all three validation studies for the ODI scored unsatisfactorily in the quality assessment. Based on this study, the ZCQ represents the only disease-specific tool with adequate validation for assessing treatment response in DLSS.

Results in light of the literature

The findings of our study are in agreement with a systematic review and meta-analysis on outcome measures for neurogenic claudication48. The authors evaluated 15 separate walking outcome measures and concluded that walking outcome measures for patients with neurogenic claudication are lacking. The development of a measurement instrument involves testing validity and reliability with a defined target population49. Choosing a measurement instrument wisely can be challenging given the growing number of choices available. Meaningful use of a measurement instrument depends not only on the validity of the instrument itself, but also on the context in which it is used50. Web-based systems such as PROMIS have been developed from efforts to optimize and simplify the process of selecting an appropriate measurement instrument51. The stated goal is to provide well-constructed, generalizable, and clinically relevant endpoints for studies52. These systems facilitate the completion of questionnaires for subjects, as otherwise there would be a considerable administrative burden.

In 2006, the North American Spine Society (NASS) Compendium for the Assessment and Research of Spinal Disorders recommended the Quebec Back Pain Disability Scale, the Roland Morris Disability Questionnaire, and the Waddell Nonorganic Signs for lumbar pain as measurement tools53. In contrast to lumbar back pain, there are currently no specific recommendations for the use of measurement tools in DLSS54. However, measurement tools that are valid for patients with nonspecific back pain do not necessarily measure the relevant endpoints for patients with DLSS. The latter have a different clinical presentation with typical claudication symptoms. Consequently, depending on the conception and design of a questionnaire, clinical outcomes may vary significantly5. Measured symptoms can vary widely, as shown in a recently published study55. The comparison of measurement instruments in patients with DLSS showed a variability of 40–70% depending on cut-off and measurement instrument. The extent of this variation was large enough to lead to completely different interpretations of a study. In a recently published study56, the ZCQ was the most responsive tool to assess symptoms and function in DLSS, supporting the findings of the current analysis.

The use of non-validated, nonspecific measurement instruments in studies has an impact on future clinical decisions. Kimberlin et al.57 argue that although any outcome of a measurement instrument is only an approximation of the actual truth, the use of non-validated measurement instruments has the same effect on study quality as a poor study design or an insufficient number of patients. Our study shows that many of the measurement tools used have not been validated in DLSS patients and it is therefore unclear whether they represent what is relevant to patients.

The inclusion of a multitude of different outcomes in trials of the same intervention is not a new issue. For example, in their systematic review from 2017, Mayo-Wilson et al.58 identified variation in outcomes across reports of RCTs on the effect of gabapentin for treating neuropathic pain and of quetiapine for bipolar depression. The authors found that the RCTs included hundreds of outcomes and concluded that researchers may cherry-pick what they report from multiple sources of RCT information. This creates challenges for interpreting clinical trials and obstacles to comparing clinical trials in meta-analyses.

The development of a measurement instrument involves testing validity and reliability with a defined target population49. Choosing a measurement instrument wisely can be challenging given the growing number of choices available. In recent years, various efforts have been made to systematically assess the validity of measurement instruments59. Meaningful use of a measurement instrument depends not only on the validity of the instrument itself, but also on the context in which it is used50. Web-based systems such as PROMIS have been developed from efforts to optimize and simplify the process of selecting an appropriate measurement instrument60. The stated goal is to provide well-constructed, generalizable, and clinically relevant endpoints for studies.

Strength and limitations

To the best of our knowledge this is the first cross-sectional analysis of outcome measures used in randomized clinical trials and observational studies in DLSS. In addition, we conducted a validity check of the outcomes applying existing guidelines for conducting systematic literature reviews51.

As we focused on systematic reviews and meta-analyses, it is possible that individual studies were not identified in our analysis. However, we are confident that our methodology included the most relevant papers. The main limitation of this study is that this approach did not capture all validation studies conducted to date. An overview of all validation studies ever conducted in patients with DLSS would require a systematic review, which we did not aim to provide. Instead, by using complete sets of studies included in SR and MA, we assessed the quality of reporting of validation studies and the quality of the validation studies themselves. Thus, each study included in this analysis underwent two selection processes.

Implications for clinical research

In order to assess the effectiveness of treatment studies in patients with DLSS, valid and comparable measurement instruments are central. Our study shows that many different and partly unvalidated instruments are used. In addition, there is a lack of information on the minimal clinically important change of the respective measurement instruments. Researchers should systematically conduct high quality validation studies for the measurement instruments in DLSS patients. In addition, the patients’ perspective should be included in the selection of measurement instruments. Further validation studies of measurement instruments specific for DLSS patients with at least 50 patients and considering the quality criteria of Terwee et al.61 will help to quantify the symptoms relevant for DLSS patients and thus have a direct impact on the validity of future RCTs and OS.

Implications for clinical practice

Increasingly, patient-centered measurement instruments are recommended or required for measuring treatment outcome. Our study shows that the selection of adequately validated measurement instruments for DLSS patients is important and that many measurement instruments are not validated in this patient population. In particular, reliable and valid questionnaires specific to DLSS are helpful for everyday clinical practice, as clinical progress can be monitored and responses are less influenced by the treating individuals. For monitoring treatment response in DLSS, we believe that the ZCQ provides the most differentiated results. In particular, this questionnaire has the advantage of combining the assessment of pain, satisfaction, and disability.

Conclusion

Reporting of the validity of outcome measures was poor, and validation was adequate for only one outcome measure. In order to be able to compare results from clinical studies, outcome measures need to be validated in a disease-specific population and external validation studies should be cited adequately. For monitoring treatment response in DLSS, the use of the ZCQ is recommended.