Introduction

In February of 1896, at the physics laboratory of Dartmouth College, Edwin Brant Frost used what were then known as roentgen rays to capture an image of the healing ulna of his patient, Edward McCarthy [1]. The clinical potential of this novel technology was not lost on early observers. Silvanus P. Thompson (President of the Roentgen Society) said a year later:

‘Excepting only the introduction into surgery by Lord Lister of antiseptics, and the discovery of anaesthetics, no discovery in the present century has done so much for operative surgery as this of the roentgen rays’ [1].

Over the following 130 years of clinical practice, plain radiographs have remained foundational to the investigation of musculoskeletal injuries. The WHO estimates that 3.6 billion investigations using ionising radiation are performed globally each year, the majority of which are simple X-rays [2]. In the UK, more than 60% of emergency department attendances have a primary diagnosis relating to the musculoskeletal (MSK) system [3]. In total, 38.7% of all patients will receive at least one plain radiograph and in MSK injuries this rises to over 50% [4].

Despite being ubiquitous, the interpretation of skeletal radiographs is challenging, and errors can be of significant detriment to both patients and care providers. The interpretation of radiographs in a trauma setting is especially fraught, given high patient turnover and the frequent reliance on junior staff. Consequently, emergency departments are recognised as ‘high risk’ for diagnostic error [5]. Research reviewing UK medicolegal claims in skeletal radiology between 1995 and 2006 showed that the ‘great majority followed missed diagnoses of fractures following trauma’ [6].

Existing research has shown variable levels of performance in the initial interpretation of skeletal radiographs for trauma. Across all radiographs in the emergency department setting, an error rate of approximately 3% has been shown [7]. In the upper limb, estimates suggest incorrect assessment is made in around 8.5% of cases [7, 8].

There have so far been limited attempts to produce summary rates of reporting error in plain skeletal radiographs of lower limb trauma, despite a body of individual studies assessing this both in generality and by more specific anatomical site.

The aims of this study were to conduct a systematic review and meta-analysis of the existing literature to establish sensitivity, specificity, and diagnostic odds ratio for the initial interpretation of lower limb radiographs, including those of the anatomical subdivisions (foot, ankle, knee and femur).

Methods

Review protocol and search strategy

This systematic review was prospectively registered with the PROSPERO database; a copy of the review protocol can be found under registration number CRD42020197973.

In April of 2020, the PubMed MEDLINE, Embase, Cochrane Database of Systematic Reviews (CDSR) and Cochrane Central Register of Controlled Trials (CENTRAL) databases were scrutinised from 1990 to the present, using a search strategy developed with the aid of Imperial College Library Services. The full electronic search strategy is detailed in Fig. 1.

Fig. 1

Literature review process

Eligibility criteria

In accordance with the objectives of this study, eligibility criteria were developed by the authors to identify papers containing pertinent data for inclusion. These were as follows:

  • Written in the English language

  • Conducted during or after 1990

  • Original research, published in peer reviewed, academic journals (editorial letters, opinion pieces and expert reviews were excluded)

  • Reporting an initial assessment of plain radiographs of the lower limb, performed by identified members of staff or grade of staff and compared to a definitive assessment of findings

  • Investigation of subjects with a confirmed or suspected trauma and orthopaedic injury, as characterised by the WHO ICD 11

  • Radiographs included for review being of skeletally mature subjects

  • Conducted in active healthcare settings where diagnostic services are provided to a patient population

  • Outcomes reported with respect to accuracy, specificity or sensitivity of radiograph reporting

  • Outcomes reported with respect to specific anatomical site or regional anatomy

Study selection

An initial sample of 200 search results was reviewed for inclusion by the six reviewing authors (TY, CF, KR, GM, HJ, WH). Using the eligibility criteria against title and abstract, each author sorted these 200 results into ‘reject’ or ‘further review’ categories. Inter-reader reliability assessment was then performed to establish the degree of agreement amongst the authors on those articles meriting further review. Fleiss’ Multirater Kappa was calculated to be 0.640 (p < .005), conventionally taken to represent substantial agreement [9].
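As an illustration of the agreement statistic reported above, the following minimal Python sketch computes Fleiss' kappa from a table of rating counts. The function name `fleiss_kappa` and the example table are hypothetical and are not the review's screening data or the software the authors used; the sketch assumes every item is rated by the same number of raters.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who placed item i in category j.
    Assumes every item was rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Observed agreement for each item, then averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Agreement expected by chance, from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)

    if expected == 1.0:
        # Every rating fell in one category; kappa is undefined, so report
        # complete agreement by convention (an assumption of this sketch).
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical example: 4 abstracts, 6 reviewers,
# columns are ('reject', 'further review').
example = [[6, 0], [0, 6], [5, 1], [1, 5]]
print(round(fleiss_kappa(example), 3))
```

On the conventional scale cited in the text, values between 0.61 and 0.80 are read as substantial agreement, which is the band into which the reported 0.640 falls.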

Each author then individually assessed an equal share of the remaining results by title and abstract, again categorising as ‘reject’ or ‘further review’. These were combined with the reviewed results of the initial sample and further categorised on the basis of the anatomical region to which they related: lower limb, upper limb, pelvis, spine and thorax, skull and facial. Where an article included data pertinent to more than one anatomical region, it was duplicated, and a copy assigned to each.

TY and CF then reviewed the full text of all potentially eligible results categorised as lower limb against the aforementioned eligibility criteria. Where disparity arose, it was resolved by means of further review and joint assessment.

Data collection and assessment

A bespoke data extraction tool was developed by the authors; this was applied to all included studies. Variables recorded were radiograph reporting population, male/female % of radiograph subjects, recruitment methods to study, anatomical site identified, reporting accuracy/error rate %, specificity %, sensitivity % and qualitative outcome statement.

An assessment was made of methodological quality using the MINORS tool [10] and of risk of bias using a modified Cochrane RoB2 tool [11]. Where the authors initially made a divergent assessment of any study, a consensus evaluation was formed.

Summary and synthesis

The radiograph reporting populations, reporting accuracy and specificity/sensitivity were identified as the principal summary measures. Meta-analysis was then performed in order to produce summary estimates of specificity and sensitivity, including covariates by anatomical site, using HSROC and bivariate model analysis.
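To give a sense of what pooling across studies involves, the sketch below combines per-study sensitivities on the logit scale using fixed-effect inverse-variance weights. This is a deliberately simplified stand-in and not the bivariate or HSROC model used in this review, which models sensitivity and specificity jointly with between-study random effects. The function name `pool_logit_proportion`, the 0.5 continuity correction and the example counts are assumptions made for illustration only.

```python
import math


def pool_logit_proportion(studies):
    """Fixed-effect inverse-variance pooling of proportions on the logit scale.

    studies is a list of (events, total) pairs, for example
    (true positives, all fracture-positive cases) when pooling sensitivity.
    A 0.5 continuity correction is added to both cells of every study so
    that studies with zero misses can still be transformed.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for events, total in studies:
        hits = events + 0.5
        misses = (total - events) + 0.5
        logit = math.log(hits / misses)
        variance = 1.0 / hits + 1.0 / misses
        weight = 1.0 / variance
        weighted_sum += weight * logit
        weight_total += weight
    pooled_logit = weighted_sum / weight_total
    # Back-transform from the logit scale to a proportion.
    return 1.0 / (1.0 + math.exp(-pooled_logit))


# Hypothetical counts: (fractures detected, fractures present) per study.
hypothetical_studies = [(45, 50), (88, 100), (190, 200)]
print(round(pool_logit_proportion(hypothetical_studies), 3))
```

Larger studies receive more weight because their logit estimates have smaller variance. A full analysis would normally rely on dedicated statistical software for the bivariate model rather than a hand-rolled calculation of this kind.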

Results

Study selection and characteristics

After the removal of duplicates, a total of 3887 papers were identified for screening. Following abstract review, 89 articles were progressed to full-text review. A total of 23 articles were included for qualitative synthesis, of which 10 articles yielded data suitable for meta-analysis [12,13,14,15,16,17,18,19,20,21]. These 10 articles examined an aggregate of 3902 sets of radiographs, producing a total of 4709 radiograph interpretation episodes for meta-analysis (see Fig. 2).

Fig. 2

Literature review process

The specific anatomical areas examined by articles in the meta-analysis were foot (n = 3), ankle (n = 4), knee (n = 1) and femur (n = 2). Two studies examined multiple anatomical locations (see Table 1).

Table 1 Table of characteristics for all articles included in meta-analysis

The studies primarily involved the comparison of plain film radiology with an alternative form of imaging (n = 6). The remaining studies examined inter-reader diagnostic performance on plain film X-rays (n = 1), the value of additional X-ray views for diagnostic performance (n = 2), or both (n = 1). The seniority of the studied initial reporters ranged from post-graduate surgical and radiology trainees to senior orthopaedic surgeons, radiologists and emergency physicians.

There was some variation across the ten articles included in the meta-analysis, specifically regarding the definition of a ‘positive’ and ‘negative’ radiographic finding. One article [14] defined positive and negative findings as the presence or absence of any bony or soft tissue pathology. This included soft tissue injury, fractures, dislocations, osteomyelitis and osteoporosis. The other nine articles defined positive and negative findings as the presence or absence of a bony fracture [12, 13, 15,16,17,18,19,20,21]. However, two of these nine articles went further and required radiograph interpreters to correctly classify any fracture identified for their findings to be regarded as a ‘true’ positive. Utukuri et al. [12] required interpreters to specify if a calcaneal fracture was intra- or extra-articular. For proximal femur fractures, Riaz et al. [18] required radiograph interpreters to correctly specify the location and degree of fracture displacement.

Individual study results

Across all lower limb studies, sensitivity ranged from 0.59 to 0.97, and specificity from 0.66 to 1.00. Utukuri et al. [12] found the highest sensitivity in initial interpretation, with 0.97 achieved for radiographs of the foot. Ricci [21] found the lowest specificity, with only 0.65 achieved for lower limb radiographs (see Table 2).

Table 2 Individual study results forest plot

Synthesis of results

A bivariate model was used to conduct meta-analysis along with a hierarchical summary receiver operating characteristic (HSROC) curve for diagnostic performance across all lower limb plain radiographs (see Fig. 3).

Fig. 3

HSROC for all studies

The summary estimate of sensitivity across the included studies was 93.5%, with specificity of 89.7% and a false positive rate of 10.3%. Covariate analysis was also performed to assess specificity and sensitivity by lower limb anatomical subdivision; this was possible for all subdivisions apart from the knee where only a single included study was found (see Table 3).
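For readers less familiar with these measures, the short Python sketch below shows how sensitivity, specificity and the false positive rate follow from a two-by-two table of initial interpretation against the reference standard. The function name `diagnostic_accuracy` and the counts are hypothetical and are not taken from any included study.

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Accuracy measures from a 2x2 table.

    tp: fracture present, reported as present (true positive)
    fp: fracture absent, reported as present (false positive)
    fn: fracture present, reported as absent (false negative)
    tn: fracture absent, reported as absent (true negative)
    """
    sensitivity = tp / (tp + fn)  # share of true fractures detected
    specificity = tn / (tn + fp)  # share of normal films reported as normal
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # The false positive rate is the complement of specificity.
        "false_positive_rate": 1.0 - specificity,
        # The false negative rate is the complement of sensitivity.
        "false_negative_rate": 1.0 - sensitivity,
    }


# Hypothetical counts for illustration only.
result = diagnostic_accuracy(tp=90, fp=10, fn=10, tn=90)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

The same complement relationship links the summary figures reported here: a specificity of 89.7% corresponds to a false positive rate of 10.3%, and a sensitivity of 93.5% corresponds to a false negative rate of 6.5%.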

Table 3 Summary estimates

Summary sensitivity and specificity were both found to be highest for ankle radiographs, 98.1% and 94.6% respectively. Similarly, the initial interpretation of ankle radiographs had the highest diagnostic odds ratio (929.97).
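The diagnostic odds ratio can be related to sensitivity and specificity through the standard definition: the odds of a positive report when a fracture is present divided by the odds of a positive report when it is absent. The sketch below applies that definition to the rounded ankle estimates quoted above. The function name `diagnostic_odds_ratio` is hypothetical, and the result is only a rough check, since the figure in the text was presumably derived from the fitted model rather than from rounded percentages.

```python
def diagnostic_odds_ratio(sensitivity, specificity):
    """Diagnostic odds ratio from sensitivity and specificity (as fractions).

    DOR = [sens / (1 - sens)] / [(1 - spec) / spec]
    Undefined when sensitivity or specificity equals exactly 0 or 1.
    """
    odds_positive_if_diseased = sensitivity / (1.0 - sensitivity)
    odds_positive_if_healthy = (1.0 - specificity) / specificity
    return odds_positive_if_diseased / odds_positive_if_healthy


# Rounded ankle summary estimates from the text (98.1% and 94.6%).
ankle_dor = diagnostic_odds_ratio(0.981, 0.946)
print(round(ankle_dor, 1))
```

Using the rounded inputs gives a value of roughly 900, the same order of magnitude as the reported 929.97. When sensitivity is this close to 100%, small rounding differences in the inputs shift the ratio considerably, so exact agreement would not be expected.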

Risk of bias assessment

All studies included in meta-analysis were analysed using a modified Cochrane risk of bias tool. This qualitative tool assesses study risk of bias on seven separate criteria. One study was considered to be at high risk of bias due to scoring in more than four categories. Four studies were considered at moderate risk of bias due to scoring in three or more categories or scoring particularly strongly in one of two categories. Five studies scored in two or fewer categories and so were considered to have a low risk of bias (see Table 4).

Table 4 Modified Cochrane ‘Risk of Bias’ assessment tool

Methodological quality

The methodological quality of the ten articles identified for meta-analysis was assessed using the MINORS (methodological index for non-randomised studies) tool developed by Slim et al. [10]. The range of scores was 13–22 out of a possible 24 points. Articles generally scored highly (average score 16.9).

Nine (90%) of the studies lacked a prospective calculation of study size, and seven (70%) lacked an unbiased assessment of their endpoint (see Table 5). Conversely, the studies tended to have minimal losses to follow-up (80%) and to involve the prospective collection of data (70%).

Table 5 Table demonstrating study methodological quality as per MINORS assessment tool

Discussion

Key findings

This study finds that initial interpreters of lower limb plain radiographs for trauma achieve a relatively high degree of sensitivity (93.5%). It is difficult to quantify the rate at which healthcare systems are justified in accepting the failure to detect findings. Certainly, false negatives are likely to represent the most deleterious of these errors, as borne out by the evidence on litigation for missed fractures both in the UK [6, 22] and abroad [23, 24].

With more than one in twenty injuries missed at initial interpretation (a false negative rate of 6.5%), busy accident and emergency or trauma settings are likely to miss substantial numbers of injuries. This appears to support the necessity of safety-netting measures to mitigate the risk of reporting errors. In particular, virtual fracture clinic review [25] and out-of-hours teleradiology services [26] have been widely adopted across the UK and Europe. Alongside these existing methods, the development of novel technologies (such as artificial intelligence algorithms [27]) to supplement interpretation is evidence of a broadly accepted clinical need to improve this reporting.

The summary specificity of reporting (89.7%) was 3.8 percentage points lower than the summary sensitivity (93.5%), suggesting that initial interpreters were less able to identify true negative skeletal radiographs. This finding was commented upon by Utukuri et al. [12] and is also supported by a wider evidence base that shows increasing the seniority of interpreters has a greater benefit to specificity than sensitivity [28, 29]. This implies that some interpretation errors, particularly false negatives, represent a limitation of plain radiographs as a modality and so are not easily preventable. These findings also explain the conclusions of the qualitative synthesis, which highlighted the importance of corroborating radiograph interpretation with examination and clinical judgement to prevent fractures being ‘missed’ [7, 30,31,32].

Of the compared anatomical subdivisions, the diagnostic odds ratio was highest for ankle radiographs, followed by the foot and then the femur. The cause for this is not explored in this study; however, the frequency with which ankle injuries present to emergency and trauma care settings may mean initial interpreters are more practised in the review of these radiographs. The ankle is both the most commonly injured joint and the most frequently operated upon [33], with the estimated incidence of ankle fractures being as high as 187 per 100,000 people per annum [34].

Limitations

A generally favourable assessment of risk of bias and methodological quality was made of the included studies. However, weaknesses were noted in the lack of prospective study size calculation and of unbiased endpoint assessment. The extent to which these factors influence results is uncertain; however, a number of studies appear underpowered [12, 19, 20].

During study selection, a number of papers with large sample sizes were identified but lacked sufficient characterisation of data for inclusion in meta-analysis. Whilst these are targeted for use in future analysis, they emphasise the importance of reporting diagnostic accuracy in line with STARD 2015 [35] or similar relevant guidelines.

Conclusions

This study suggests that the initial interpretation of plain skeletal radiographs is performed with a relatively high degree of specificity and sensitivity. However, this still represents more than one in twenty true positives being missed on primary review. Safety-netting systems to guard against such misses are therefore essential, as is the development of novel means to improve the accuracy of initial interpretation.

Evidence is also found to support statistically significant variation in the accuracy of interpretation across anatomical subdivisions; radiographs of the ankle were shown to have the highest diagnostic odds ratio. The cause of this is uncertain and may reflect inherent difficulties present in certain radiographic views or anatomy, or simply greater interpreter familiarity with some radiographs. Further research is warranted to explore these factors.