Introduction

The novel coronavirus, SARS-CoV-2, and its associated disease, COVID-19, have presented a significant and urgent threat to public health while simultaneously disrupting healthcare systems. More than two years after the beginning of the pandemic, outbreaks continue to threaten to overwhelm healthcare systems, and viral variants continue to introduce uncertainty [1]. Fast and accurate diagnostic and prognostic capability helps to quickly determine which patients need to be isolated and informs the triage of patients. Reverse-transcription polymerase chain reaction (RT-PCR) is the current clinical standard for diagnosis of COVID-19; however, its low sensitivity often necessitates repeat testing [2], taking additional time. This has led to the suggestion that there is a role for radiology in diagnosing COVID-19.

Radiological professional bodies have generally recommended against the use of imaging for screening in COVID-19 but recognise its role in incidental findings and in disease staging. Early in the pandemic, the use of computed tomography (CT) for diagnosis and screening was discussed in the context of shortages of RT-PCR test kits and poor sensitivity [3]. In March 2020, a consensus report was released [4], endorsed by the Society of Thoracic Radiology, the American College of Radiology and the Radiological Society of North America (RSNA), recommending against the use of chest CT for screening due to a low negative predictive value, but also partly due to a lack of evidence early in the pandemic. The Royal Australian and New Zealand College of Radiologists released their advice in April 2020, which remains current, recommending against the use of chest radiography for screening but recommending CT for staging [5]. The report, however, stops short of recommending a severity scale. By June 2020, the World Health Organisation recommended the use of radiological imaging: (1) for diagnostic purposes in symptomatic patients when RT-PCR is not available, when it is available but results are delayed, or when RT-PCR is negative but there is high clinical suspicion of COVID-19; (2) for triage purposes when deciding whether to admit to hospital and/or an intensive care unit (ICU); and (3) for staging purposes when deciding on appropriate therapeutic management [6]. The most recent version of the Cochrane review on the topic suggests that CT and chest X-ray (CXR) are moderately sensitive and specific for the diagnosis of COVID-19, whereas ultrasound is sensitive but not specific [7]. This novel application of radiology has spurred interest in machine learning techniques to automate image interpretation tasks.

Many investigators have proposed techniques across a wide range of applications to automate image interpretation in imaging of COVID-19, including segmentation of COVID-19-related lesions (typically ground-glass opacities, GGOs), diagnosis, staging of current disease and prognosis of likely future disease progression. However, the field has inspired controversy. DeGrave et al. [8] demonstrated that combining data from multiple sources, in particular where data from different classes have different acquisition and pre-processing parameters, led to a significant bias that artificially improved the measured performance in many studies. Garcia Santa Cruz et al. [9] presented a review of public CXR datasets, concluding that the most popular datasets used in the literature were at high risk of introducing bias into reported results.

Many other reviews have been published on the topic; we summarise the seminal ones here. Shi et al. [10] presented a narrative review very early in the pandemic (published April 2020) of machine learning techniques for segmentation of COVID-19-related lesions and for diagnosis, staging and prognosis of COVID-19 using CXR and CT. However, this early review did not consider potential study bias in the included papers. Others have presented systematic reviews [11, 12] that, while following a more rigorous approach to inclusion, also failed to assess bias when evaluating results. Wynants et al. [13] presented a broadly scoped systematic review of prediction models in COVID-19, leveraging the prediction model risk of bias assessment tool (PROBAST) [14], and reported a high risk of bias across the field. Roberts et al. [15] presented a systematic review of machine learning techniques applied to CXR and CT imaging, published up to 3 October 2020, assessing bias using the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) [16], the Radiomics Quality Score (RQS) [17] and PROBAST [14], and reporting methodological and dataset trends. They used this to develop a set of recommendations for authors in the field.

In this review, we use similar techniques to those presented by Roberts et al. [15]. Rather than assessing papers against separate criteria (RQS and CLAIM), we assess all papers with CLAIM. We also aim to present a richer analysis of techniques and their performance, and to provide an update including publications up to 31 October 2021. Finally, we introduce an analysis of authors and institutions in the field, in the hope that it encourages and facilitates further collaboration.

Research questions:

  • Which techniques are most successful in differentiating COVID-19?

  • What are the clinical requirements driving the development of these tools? How would such techniques be implemented clinically?

  • Who is publishing in this field?

Methodology

Study selection

The inclusion criteria for the review are:

  1. Studies that aim to automatically (allowing for manual contouring as a preprocessing step, under the assumption that this could be automated) diagnose, stage or prognose COVID-19, or segment lesions associated with COVID-19; and

  2. Studies that use medical imaging or signals, including CXR, CT, ultrasound, magnetic resonance imaging (MRI) or electrocardiograph (ECG), as input to their model.

rscopus version 0.6.6 [18] was used to retrieve articles according to the search criteria outlined in Panel 1. The search was performed on 19 November 2021. Papers meeting the inclusion criteria that were identified during the investigation, but not returned by the search, were also included in the study.

Panel 1: Scopus search criteria

TITLE-ABS-KEY ( ( covid OR coronavirus ) AND ( ( chest W/5 xray ) OR “computed tomography” OR ultrasound OR “magnetic resonance” OR mr OR mri OR ecg OR electrocardiograph* ) AND ( diagnos* OR staging OR identif* OR response OR prognos* OR segment* ) AND ( learn* OR convolutional OR network OR radiomic*) )
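Purely as an illustration (the review itself used the rscopus R package), a roughly equivalent retrieval could be scripted in Python with pybliometrics; this assumes a configured Scopus API key, and the exact configuration step varies by package version.

```python
# Illustrative only: the review used rscopus (R); pybliometrics is shown here as a
# rough Python equivalent. It assumes a Scopus API key has already been configured
# for pybliometrics (the setup step differs between package versions).
import pandas as pd
from pybliometrics.scopus import ScopusSearch

QUERY = (
    'TITLE-ABS-KEY ( ( covid OR coronavirus ) '
    'AND ( ( chest W/5 xray ) OR "computed tomography" OR ultrasound '
    'OR "magnetic resonance" OR mr OR mri OR ecg OR electrocardiograph* ) '
    'AND ( diagnos* OR staging OR identif* OR response OR prognos* OR segment* ) '
    'AND ( learn* OR convolutional OR network OR radiomic* ) )'
)

search = ScopusSearch(QUERY)            # issues the query against the Scopus Search API
records = pd.DataFrame(search.results)  # one row per returned document
print(len(records), "records retrieved")
```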

Exclusion criteria were also imposed to eliminate studies that exhibited, or were likely to exhibit, a high risk of bias (the automatic filters in criteria 1–3 are sketched in code after the list):

  1. Studies from journals with a source normalized impact per paper (SNIP), as measured in 2021, of less than 1 were excluded. SNIP is a metric introduced by Scopus that measures contextual impact, normalising between fields with different citation rates. This process was manually checked by two of the authors, and journals that would otherwise have been eliminated but were reputable within their fields and likely to publish relevant studies were retained.

  2. Studies that were more than 90 days old and had not attracted any citations were excluded. This criterion was included to automatically filter articles that the scientific community has deemed uninteresting, under the assumption that, in such a fast-moving field, 90 days should be adequate to attract at least one citation.

  3. Studies with metadata indicating that they were Editorials, Reviews, Notes or Letters were excluded.

  4. Studies where application to COVID-19 was secondary and not the primary focus of the paper were excluded.

  5. Studies not meeting the minimum risk-of-bias assessment (see “Bias assessments” section) were excluded.
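As a minimal sketch of the automatic filters in criteria 1–3 (field names such as snip, cited_by and days_since_publication are hypothetical; the actual screening was performed on rscopus output in R):

```python
# Minimal sketch of the automatic exclusion filters (criteria 1-3).
# Field names (snip, cited_by, days_since_publication, doc_type, whitelisted)
# are hypothetical; the review applied these filters to rscopus output in R.
EXCLUDED_TYPES = {"Editorial", "Review", "Note", "Letter"}

def passes_automatic_filters(record: dict) -> bool:
    # Criterion 1: journal SNIP >= 1, unless manually whitelisted by the reviewers.
    if record["snip"] < 1 and not record.get("whitelisted", False):
        return False
    # Criterion 2: papers older than 90 days must have attracted at least one citation.
    if record["days_since_publication"] > 90 and record["cited_by"] == 0:
        return False
    # Criterion 3: drop editorials, reviews, notes and letters.
    if record["doc_type"] in EXCLUDED_TYPES:
        return False
    return True

records = [
    {"snip": 1.4, "cited_by": 3, "days_since_publication": 200,
     "doc_type": "Article", "whitelisted": False},
    {"snip": 0.6, "cited_by": 12, "days_since_publication": 300,
     "doc_type": "Article", "whitelisted": False},   # fails criterion 1
]
screened = [r for r in records if passes_automatic_filters(r)]
print(len(screened), "record(s) pass the automatic filters")
```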

Remaining studies were assigned amongst the reviewing authors; each study was reviewed by one author, who assessed it against the minimum risk-of-bias criteria and extracted data. Studies were not de-identified before analysis.

Bias assessments

Due to reports of a high risk-of-bias in the field [9, 13, 15], we include a bias assessment. Improper study design, data collection, data partitioning and statistical methods can lead to misleading reported results [14]. This commonly manifests as a positive bias because authors (rightly) attempt to improve the performance of their proposed techniques.

The CLAIM checklist was completed for all included papers [16]. Each of the 42 checklist items was scored as pass, fail or “not applicable”, with “not applicable” items not counting towards the failure count. The number of failures was used as a measure of bias. Similar to Roberts et al. [15], we imposed a subset of CLAIM, items 7, 9, 20, 21, 22, 25, 26 and 28, as a minimum risk-of-bias threshold; any papers that did not pass all subset items were excluded. CLAIM checklist reports from Roberts et al. [15] were merged and used where available to avoid duplication.
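The scoring logic can be summarised as follows; the item numbers are those named above, while the per-paper scores in the example are hypothetical:

```python
# Sketch of the CLAIM scoring used as a bias surrogate. Scores per item are
# "pass", "fail" or "na"; "na" items do not count towards the failure total.
# The minimum risk-of-bias subset follows the items named in the text.
MINIMUM_SUBSET = {7, 9, 20, 21, 22, 25, 26, 28}

def claim_failure_count(scores: dict) -> int:
    """Number of failed items out of the 42 CLAIM checklist items."""
    return sum(1 for score in scores.values() if score == "fail")

def meets_minimum_subset(scores: dict) -> bool:
    """A paper is retained only if none of the subset items is a failure."""
    return all(scores.get(item) != "fail" for item in MINIMUM_SUBSET)

# Hypothetical example: a paper failing item 26 (held-out evaluation) is excluded.
example = {item: "pass" for item in range(1, 43)}
example[26] = "fail"
print(claim_failure_count(example))   # 1
print(meets_minimum_subset(example))  # False -> excluded
```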

Extracted data

Methodological and performance results were collected per technique, where each study presents one or more techniques. When multiple techniques were introduced in a study, only the highest performing technique was surveyed, unless the techniques served different purposes (e.g., one study presenting both a segmentation and a diagnostic technique) or different contexts (e.g., different available clinical data to augment the image input) (Table 1).

Table 1 Data collected during survey

Analysis of studies

Accuracy and area under the curve (AUC) of the receiver operating characteristic (ROC), where reported, were used for performance comparison. Statistical significance was measured throughout this review using two-sided independent t-tests, with a significance threshold of p < 0.05. No adjustments were made for multiple comparisons.
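A minimal sketch of the comparison, using scipy with synthetic values rather than the reported data:

```python
# Two-sided independent t-test, as used throughout the review to compare
# reported performance between groups of techniques. Values are synthetic.
from scipy import stats

auc_group_a = [0.97, 0.95, 0.99, 0.93, 0.96]
auc_group_b = [0.90, 0.92, 0.88, 0.94, 0.91]

t_stat, p_value = stats.ttest_ind(auc_group_a, auc_group_b)  # two-sided by default
significant = p_value < 0.05                                 # significance threshold used in the review
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, significant = {significant}")
```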

Analysis of authors and publishers

Author, institution and publication metadata were extracted using rscopus 0.6.6 [18] and used to compute author h-indices. A co-author network was generated with tidygraph 1.2.0 [19] by linking authors that had published together, and the most central authors were identified using betweenness centrality.
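The construction can be illustrated as follows; the review used tidygraph in R, whereas this sketch uses networkx in Python with hypothetical author lists:

```python
# Illustrative co-author network: the review used tidygraph (R); networkx is
# shown here only to demonstrate the construction. Author lists are hypothetical.
from itertools import combinations
import networkx as nx

papers = [
    ["Author A", "Author B", "Author C"],
    ["Author B", "Author D"],
    ["Author E", "Author F"],
]

G = nx.Graph()
for authors in papers:
    # Link every pair of co-authors appearing on the same paper.
    G.add_edges_from(combinations(authors, 2))

# Betweenness centrality identifies the most "bridging" authors in the network.
centrality = nx.betweenness_centrality(G)
most_central = max(centrality, key=centrality.get)
print(most_central, centrality[most_central])
```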

Results

Of 1002 studies identified, 282 were assessed against the required subset of the CLAIM checklist for exclusion, after which 81 studies were included in the review (Fig. 1). A list of identified and included studies is available in Supplementary 1, Table S1, and the full set of identified studies and collected data is available in Supplementary 2. CLAIM 26, which pertains to the evaluation of the best-performing model, eliminated the most studies (Fig. 2, left); most papers failing this item did not evaluate against a separate test set after presenting multiple models. CLAIM 25, which requires an adequate description of hyperparameter selection, eliminated the next most. Only one in four papers met the inclusion criteria, and approximately one in four failed half or more of the required CLAIM subset (Fig. 2, right). From the 81 included studies, a total of 103 separate techniques were surveyed.

Fig. 1
figure 1

PRISMA flow diagram of search

Fig. 2
figure 2

Studies excluded for bias. The percentage of total studies that failed each of the required subset of the CLAIM checklist for inclusion (left), and a histogram of the number of failures (right), where only studies with 0 failures met the inclusion criteria

Bias

Remaining CLAIM failures in the included articles are depicted in Fig. 3 (left). The count of failures for each article was used as the risk-of-bias surrogate; a histogram over all papers is shown in Fig. 3 (right). The mean number of failures was 8.3 ± 3.9 (standard deviation).

Fig. 3
figure 3

CLAIM results of studies included: the number of included studies that failed each of the CLAIM items (left), and a histogram of the number of failures (right)

Methodologies

The majority of techniques, 58%, sought to solve a diagnosis task, attempting to distinguish COVID-19 from healthy patients and/or non-COVID-19 pneumonia (Fig. 4, left), versus 31% performing prognosis (techniques performing both are counted in both). Of the techniques attempting to solve a prognosis task, the majority used an objective prognostic outcome measure (46% progression and 16% survival) rather than matching a clinical assessment.

Fig. 4
figure 4

(Left) Machine learning tasks attempted to be solved by techniques. (Top Right) A breakdown of Diagnosis and Diagnosis & Prognosis approaches by diagnostic outcome variable classes. (Bottom Right) A breakdown of Prognosis and Diagnosis & Prognosis approaches by prognostic outcome variable. The inner ring represents the number of classes, or continuous for regression tasks, and the outer ring represents the derivation of the outcome variable. See Table 1 for definitions of derivations

Most papers used CT images, either in 3D or as 2D slices, as model input, followed by CXR and ultrasound (US) (Fig. 5, left). Only a small minority of papers included clinical features as input. Although MRI and ECG were explicitly included within the scope of the review, no techniques using these modalities were included: no MRI papers were identified, and none of the 3 identified ECG papers that progressed beyond screening met the inclusion criteria.

Fig. 5
figure 5

(Left) The distribution of modalities used for input to techniques. (Middle) The reported AUC and (Right) accuracy of techniques by modality. Only techniques reporting AUC or accuracy are included, respectively. Results of a two-sided independent t-test are given as ‘*’ for significance or ‘ns’ for no significance

The majority of papers used a deep learning approach; the most common deep learning models are listed in Fig. 6.

Fig. 6
figure 6

(Left) The distribution of techniques using traditional machine learning and radiomics approaches versus deep learning and (Right) the distribution of the most popular deep learning networks

Performance

Performance is only reported here for studies in which AUC or accuracy was described. The top-performing diagnostic and prognostic techniques are listed in Tables 2 and 3, respectively. Neither AUC (Fig. 7, left) nor accuracy (Fig. 7, right) significantly correlated with the number of CLAIM failures for diagnosis or prognosis. There were no statistically significant differences in performance between input modalities (Fig. 5, middle and right), although CXR appeared to provide a higher AUC than CT, and US appeared to provide a lower accuracy than CT and CXR. Deep learning approaches reported higher AUC (p = 0.04) and accuracy (p = 0.01), but no significant difference in bias was identified (Fig. 8).

Table 2 Union of top 5 performing diagnostic techniques by AUC and accuracy. Techniques performing binary classification between healthy and COVID-19 were excluded
Table 3 Union of top 5 performing prognostic techniques by AUC and accuracy
Fig. 7
figure 7

Performance of techniques, as measured by AUC (left) and accuracy (right), plotted against CLAIM failures. Hue represents tasks, as indicated in the legend. Dashed lines indicate the fitted regression line for each task, and shading indicates the 95% confidence interval. Each regression line was tested with a two-sided independent t-test against the null hypothesis of zero gradient; none reached significance

Fig. 8
figure 8

Comparison of (Left) AUC, (Middle) accuracy and (Right) number of CLAIM failures between techniques leveraging deep learning and those leveraging classical machine learning and radiomics approaches. Results of a two-sided independent t-test are represented as ‘*’ for significance or ‘ns’ for no significance

Authors

The country of residence of authors tended to correspond to the countries most affected by the pandemic in early 2020 (Fig. 9).

Fig. 9
figure 9

Number of articles published by author country. Articles with authors from multiple countries, indicated by hue, are counted once under each contributing country

A network analysis of connectivity between authors yielded 48 separate graphs of the 81 publications, depicted in Supplementary 1 Figure S1, and a subset in Fig. 10. The most productive research groups are summarised in Table 4.

Fig. 10
figure 10

Authorship graph, where nodes represent authors and edges represent co-authorship. Depicted are the 5 largest clusters

Table 4 20 most productive groups

Discussion

In this work, we present a systematic review of automated techniques for diagnosis, prognosis and segmentation of COVID-19 disease. Because the field has proven both popular and controversial, we used liberal exclusion criteria to reduce the number of lower-quality papers requiring manual review. In formulating the criteria, we assumed that impactful papers are likely to be published in highly cited publications and are likely to attract citations themselves. Studies published in journals with a SNIP below 1 were eliminated, which risks eliminating journals that are not ranked by Scopus. To reduce this risk, the list of eliminated journals was reviewed by all authors, and a consensus on non-indexed journals to include was reached. Further, studies that had been published for more than 90 days yet had not attracted any citations were eliminated, which risks eliminating unnoticed studies. Even after screening, 71% of papers were excluded during bias assessment (Figs. 1, 2), indicating that the majority of work in the field is at high risk of bias, including work published in reputable peer-reviewed publications.

Sources of bias

Datasets

Many studies use data from sources with minimal provenance and metadata, and often use data that was not intended for training diagnostic or prognostic tools. A number of datasets aggregate data from different sources, some of which may be aggregates themselves [9], and many studies aggregate several datasets, either to increase their training size or to provide an independent test set. However, this creates a complex set of participants and leads to a high risk that the same images are present in both the training and evaluation sets. Other datasets present a series of CT slices without metadata indicating which images belong to which participants, leading to a high risk that adjacent axial slices from one participant lie in both the training and evaluation sets. Any studies exhibiting these risks failed CLAIM 21.
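One simple safeguard when aggregating public datasets, sketched below with a hypothetical directory layout, is to hash image bytes and confirm that no exact duplicate appears in both the training and evaluation sets (this catches only byte-identical copies, not re-exported or re-compressed images):

```python
# Sketch: detect exact duplicate images shared between aggregated training and
# evaluation folders by hashing file contents. Paths are hypothetical, and only
# byte-identical copies are caught; re-encoded duplicates need perceptual hashing.
import hashlib
from pathlib import Path

def file_hashes(folder: str) -> dict:
    hashes = {}
    for path in Path(folder).rglob("*.png"):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        hashes.setdefault(digest, []).append(path)
    return hashes

train_hashes = file_hashes("data/train")
test_hashes = file_hashes("data/test")

duplicates = set(train_hashes) & set(test_hashes)
print(f"{len(duplicates)} images appear in both training and evaluation sets")
```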

Although it did not lead to exclusion in this review, some datasets also aggregate different classes from different sources. It has been established that this presents a high risk of bias, as networks are able to distinguish between classes using non-disease-related domain effects.

Data handling

Studies that did not split training and evaluation sets at the patient level also failed CLAIM 21. This mostly occurred in papers treating CT as 2D axial slices, some of which randomly allocated individual 2D slices between the training and evaluation sets. CLAIM 26 was responsible for the most failures (45%, Fig. 2), which often indicated a failure to hold out an evaluation set for use after model selection.
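A minimal sketch of a patient-level split, using scikit-learn's GroupShuffleSplit on synthetic data; the key point is that the groups argument keeps all slices from one patient on the same side of the split:

```python
# Patient-level train/evaluation split: grouping by patient ID prevents adjacent
# slices from the same scan appearing in both sets. Data here are synthetic.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

n_slices = 1000
X = np.random.rand(n_slices, 64, 64)            # e.g. 2D axial CT slices
y = np.random.randint(0, 2, size=n_slices)      # slice-level labels
patient_ids = np.random.randint(0, 100, size=n_slices)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

# No patient contributes slices to both sets.
assert set(patient_ids[train_idx]).isdisjoint(set(patient_ids[test_idx]))
```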

Description of methods

The remaining CLAIM checklist items in the subset, 7, 9, 20, 22, 25 and 28, each relate to adequately documenting methodology. This is important not only for reproducibility, which technical publications require to ensure the advancement of the field, but also because inadequate documentation can conceal bias. Machine learning requires attention to detail in implementation to prevent overfitting, data dredging or otherwise accidentally positively biasing results.

Study demographics

The majority (58%) of techniques sought to solve a diagnosis task. Although there has been limited need for diagnosis of COVID-19 using imaging, there is potential for faster analysis compared with RT-PCR, especially considering that consecutive negative RT-PCR tests are required for exclusion when the pre-test probability is high [39]. However, within this set, 38% only demonstrated an ability to differentiate COVID-19 from healthy individuals. Any clinically realistic scenario for deployment of such an algorithm would need to demonstrate an ability to aid in a differential diagnosis between similar diseases. Regardless, most professional bodies recommend the use of radiographic imaging in COVID-19 only for triage purposes [5, 6, 40], and it is therefore likely more impactful for investigators to explore prognostic techniques.

CT scanning was the most popular modality, likely due to the image quality of tomographic imaging and the availability of public datasets. The additional context a 3D image can give may also have motivated the use of the modality, although many techniques only considered 2D axial sections. Given the clinical context, and the fact that techniques are likely to be most useful during an outbreak, CXR may be more convenient and practical. For example, clinical practice dictates that imaging rooms require an hour between patients for cleaning, a requirement that can be obviated with portable CXR units that can be brought to the patient's room [41]. We therefore suggest that future investigations may be more impactful in delivering a technique using CXR data, especially as no significant performance differences were seen between CXR and CT (Fig. 5).

It has been proposed that ultrasound analysis for COVID-19 could be valuable in rural and remote regions, and as a tool to facilitate social distancing in urban regions [42]. This relatively niche requirement means that systems for automated analysis of ultrasound are likely to be less impactful, although this may be offset by the low cost of ultrasound and the potential to deploy systems to developing countries. Other modalities, including MRI and even ECG, were explicitly included in the scope of this review, but no papers met the inclusion criteria for either. MRI generally yields poor contrast within the lung and provides few benefits over CT in this application. Some studies investigating ECG remained after the screening process but were excluded either because they were not automated or because they did not meet the bias assessment requirements.

Study performance

Studies tended to report excellent diagnostic and prognostic performance based on imaging features. The top diagnostic techniques all reported AUC ≥ 0.98 and accuracy ≥ 96.8% (Table 2), while the top prognostic techniques reported AUC ≥ 0.97 and accuracy ≥ 85.7% (Table 3). Further, these results were relatively stable across the number of CLAIM failures (Fig. 7), providing some confidence that the top results are not dominated by biased studies. Notably, though, the top-performing prognostic techniques in Table 3 are binary classification tasks, which naturally yield higher metrics than tasks with more classes.

Observations

Data handling

Many studies used image storage formats that do not meet medical imaging standards: images may be stored at reduced bit depth, with lossy compression, or without the requisite metadata. If these traits are consistent between classes, they are less likely to lead to a positive bias in reported results but may lead to lower performance. Similarly, many CT studies reported using per-image intensity normalisation for pre-processing. For quantitative modalities such as CT, this discards information that the network then has to account for internally.
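To illustrate, the sketch below contrasts a fixed Hounsfield-unit window (which preserves the quantitative meaning of CT intensities) with per-image min-max normalisation (which does not); the window limits are typical lung-window values, not those of any reviewed study:

```python
# Sketch: fixed HU windowing preserves the quantitative scale of CT, whereas
# per-image min-max normalisation maps each scan onto its own arbitrary range.
# The lung window (-1000 to 400 HU) is a common convention, used here as an example.
import numpy as np

def window_hu(volume: np.ndarray, low: float = -1000.0, high: float = 400.0) -> np.ndarray:
    """Clip to a fixed HU window and scale to [0, 1]; identical HU map to identical values."""
    clipped = np.clip(volume, low, high)
    return (clipped - low) / (high - low)

def per_image_minmax(volume: np.ndarray) -> np.ndarray:
    """Per-image normalisation: the same tissue gets different values in different scans."""
    return (volume - volume.min()) / (volume.max() - volume.min())

scan_a = np.random.uniform(-1000, 200, size=(4, 4))  # synthetic HU values
scan_b = scan_a + 300                                # globally brighter copy of the same scan
print(np.allclose(window_hu(scan_a), window_hu(scan_b)))                # False: HU shift preserved
print(np.allclose(per_image_minmax(scan_a), per_image_minmax(scan_b)))  # True: HU shift erased
```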

Input data

Studies that presented techniques under identical conditions with and without clinical data reported superior performance with the clinical data [28, 43]. This may partly reflect reporting bias, but some combination of demographic, symptomatic and imaging data is likely to provide additional discriminative information about disease progression. Much of this information is relatively easily acquired, so there is little cost to including it.
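A minimal sketch of such a fusion on synthetic features; a real system would concatenate learned image embeddings or radiomic features with the clinical variables:

```python
# Sketch: concatenating imaging-derived features with simple clinical variables
# before a classifier. All features and labels here are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_patients = 200
image_features = rng.normal(size=(n_patients, 32))     # e.g. CNN embedding or radiomic features
clinical = np.column_stack([
    rng.integers(20, 90, n_patients),                   # age
    rng.integers(0, 2, n_patients),                     # sex
    rng.normal(37.5, 1.0, n_patients),                  # temperature
])
y = rng.integers(0, 2, n_patients)                      # outcome, e.g. ICU admission

combined = np.hstack([image_features, clinical])
score = cross_val_score(LogisticRegression(max_iter=1000), combined, y, scoring="roc_auc").mean()
print(f"cross-validated AUC (synthetic data): {score:.2f}")
```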

Ethics

The majority of studies presenting novel datasets reported details of their ethical approval. However, far fewer provided information on the consent given by participants, as required by CLAIM item 7. To be consistent with the analysis of Roberts et al., we ignored this requirement; however, we note that this is an area for improvement in the medical imaging literature. Further, no studies that sourced data from public datasets reported any ethical approval. The National Statement on Ethical Conduct in Human Research [44] defines human data to include data sourced from public datasets.

Clinical translation

Few of the reviewed papers realistically considered clinical deployment. As Roberts et al. [15] highlight, no developed systems are ready to be deployed clinically, one reason being the need to work with clinicians to ensure the developed algorithms are clinically relevant and implementable. This is highlighted by a review by Born et al. [45], who found that although 84% of clinical studies report the use of CT (with CXR comprising only 10% of studies), a much larger proportion of the AI papers focused on X-ray. The same paper also emphasises the need for additional stakeholder engagement, including patients, ethics committees, regulatory bodies, hospital administrators and clinicians. For clinical deployment, medical imaging software generally requires validation through randomised controlled trials, regulatory certification (generally the software would be developed within an ISO 13485 and IEC 62304 environment), and integration with existing clinical workflows (aligning with agreed standards for interoperability and upgradability, particularly the DICOM standard and required vendor tags).

Author demographics

We provide data on the authors (Fig. 9) and institutions (Table 4) publishing in the field as a landscape map for new authors. Most authors are located in China, the United States of America and Italy, and most of the most productive groups are in China. Collaboration between groups predominantly occurred within the same country, except for a cluster of collaboration between Italy and the United States (Fig. 10).

Review limitations

Studies were automatically filtered using the SNIP of the publishing journal and, for studies older than 90 days at the time of the search, the number of citations. This was required to keep the scope of the manually reviewed articles manageable. It risked omitting rigorous papers that had not yet attracted scientific interest or were published in less widely circulated or newer journals. We believe this risk is low enough that the results presented are generalisable to the field.

In this work, we collected studies primarily from the Scopus database. While Scopus and Web of Science are historically the most widely used databases in bibliometric analysis, their coverage is not complete. Notwithstanding, Scopus shares 99.11% of its indexed journals with Web of Science and 96.61% with Dimensions. For this reason, we believe the methods in this review were valid and fit for purpose.

In this work, we use CLAIM as a surrogate measure for bias. CLAIM provides prescriptive and objective criteria, well suited to enabling a range of reviewers to quickly and consistently assess a large number of papers. However, CLAIM is designed as a checklist of best practices rather than an assessment of bias, so the number of CLAIM failures should be interpreted by the reader as only an approximate measure of bias.

Conclusion

In this systematic review, we collected 1002 studies and included 82 in the analysis after screening, relevance and bias assessment. A 71% exclusion rate for bias, despite extensive screening, indicates a high risk of bias across the field. Publications commonly sought to solve tasks with lower potential clinical impact, focusing on diagnosis rather than prognosis, and on differentiating COVID-19 from controls rather than from other diseases that would plausibly appear in a differential diagnosis. Similarly, clinical considerations and deployment were seldom discussed. Medical imaging standards were also regularly not met, with data sourced online without provenance and in compressed formats. Nevertheless, studies reported excellent prognostic and diagnostic performance, and these results were robust across studies regardless of risk of bias or modality. Deep learning studies tended to report improved performance but did not exhibit a higher risk of bias compared with traditional machine learning approaches. We therefore conclude that the field has proven itself as a concept and that future work should focus on developing clinically useful and robust tools.