Introduction

Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a newly discovered coronavirus [1, 2]. In March 2020, the World Health Organization (WHO) declared the COVID-19 outbreak a pandemic. To date, more than 9.23 million cases have been reported across 188 countries and territories, resulting in more than 476,000 deaths [3]. Early and accurate screening of the infected population, together with isolation from the public, is an effective way to prevent and halt the spread of the virus. Currently, the gold standard method for diagnosing COVID-19 is real-time reverse transcription polymerase chain reaction (RT-PCR) [4]. The disadvantages of RT-PCR include its complexity and problems associated with its sensitivity, reproducibility, and specificity [5]. Moreover, the limited availability of test kits makes it challenging to provide sufficient diagnosis for every suspected patient in hyper-endemic regions or countries. Therefore, a faster, reliable, and automatic screening technique is urgently required.

In clinical practice, easily accessible imaging, such as chest X-ray (CXR), provides important assistance to clinicians in decision making. Compared to computed tomography (CT), the main advantages of CXR are that it enables fast screening of patients, is portable, and is easy to set up (it can even be set up in isolation rooms). However, the sensitivity and specificity (radiographic assessment accuracy) of CXR for diagnosing COVID-19 are low compared to CT, which is especially problematic for identifying early-stage COVID-19 patients with mild symptoms. Because the qualitative indicators can be subtle, this also leads to large intra- and inter-observer variability when radiologists read the collected data. Therefore, there is increased demand for computer-aided diagnostic methods to aid radiologists in decision making for improved management of COVID-19 disease.

In view of these advantages and motivated by the need for accurate and automatic interpretation of CXR images, a number of studies based on deep convolutional neural networks (CNNs) have shown quite promising results. Ozturk et al. [6] proposed a CNN architecture, termed DarkCovidNet, and achieved 87.02% three-class classification accuracy. The method was evaluated on 127 COVID-19, 500 healthy, and 500 pneumonia CXR scans, with the COVID-19 data obtained from 125 patients. Wang et al. [7] built a public dataset named COVIDx, comprising a total of 13,975 CXR images from 13,870 patient cases, and developed COVID-Net, a deep learning model. Their dataset had 358 COVID-19 images obtained from 266 patients, and their model achieved 93.3% overall accuracy in classifying normal, pneumonia, and COVID-19 scans. In [8], a ResNet-50 architecture was utilized to achieve a 96.23% overall accuracy in classifying four classes, where pneumonia was split into bacterial and viral pneumonia; however, only eight COVID-19 CXR images were used for testing. In [9], a 76.37% overall accuracy was reported on a dataset including 1583 normal, 4290 pneumonia, and 76 COVID-19 scans, with the COVID-19 data collected from 45 patients. In order to improve the performance of the proposed method, data augmentation was performed on the COVID-19 data, bringing the total COVID-19 data size to 1536; with data augmentation, the overall accuracy improved to 97.2%. In [10], contrast limited adaptive histogram equalization (CLAHE) was used to enhance the CXR data, and the authors proposed a depth-wise separable convolutional neural network (DSCNN) architecture. Evaluation was performed on 668 normal, 619 pneumonia, and 536 COVID-19 CXR scans, and the reported average multi-class accuracy was 96.43%; the number of patients in the COVID-19 dataset was not available. In [11], a stacked CNN architecture achieved an average accuracy of 92.74%. The evaluation dataset had 270 COVID-19 scans from 170 patients, 1139 normal scans from 1015 patients, and 1355 pneumonia scans from 583 patients. In [12], the reported multi-class average classification accuracy was 94.2%, evaluated on 5000 normal, 4600 pneumonia, and 738 COVID-19 CXR scans; the data were collected from various sources, and patient information was not specified. In [13], transfer learning was investigated for training the CNN architecture. The evaluation dataset included 224 COVID-19, 504 normal, and 700 pneumonia images, and a 93.48% average accuracy was reported for three-class classification; the average accuracy increased to 94.72% when viral pneumonia was included in the evaluation. In [14], the performance of three different, previously proposed, CNN architectures was evaluated for multi-class classification. With 2265 COVID-19 images, the study used the largest COVID-19 dataset reported so far; the average area under the curve (AUC) for classifying COVID-19 versus regular pneumonia was 0.73 [14].

Although numerous studies have shown the capability of CNNs to effectively identify COVID-19 from CXR images, none of these studies investigated local phase CXR image features as a multi-feature input to a CNN architecture for improved diagnosis of COVID-19 disease. Furthermore, with the exception of [7, 14], most of the previous work was evaluated on a limited number of COVID-19 CXR scans. In this work, we show how local phase CXR feature-based image enhancement improves the accuracy of CNN architectures for COVID-19 diagnosis. Specifically, we extract three different local phase CXR image features, which are combined into a multi-feature image, and we design a new CNN architecture for processing this multi-feature CXR data. We evaluate our proposed methods on a large-scale set of CXR images obtained from healthy subjects as well as subjects diagnosed with community-acquired pneumonia and COVID-19. Quantitative results show the usefulness of local phase image features for improved diagnosis of COVID-19 disease from CXR scans.

Material and methods

Our proposed method is designed for processing CXR images and consists of two main stages, as illustrated in Fig. 1: (1) we enhance the CXR images (\(\mathrm{CXR}(x,y)\)) using a local phase-based image processing method in order to obtain a multi-feature CXR image (\(\mathrm{MF}(x,y)\)), and (2) we classify \(\mathrm{CXR}(x,y)\) with a deep learning approach in which the multi-feature CXR images (\(\mathrm{MF}(x,y)\)), together with the original CXR data (\(\mathrm{CXR}(x,y)\)), are used to improve the classification performance. Next, we describe how these two major processes are achieved.

Fig. 1

Block diagram of the proposed framework for improved COVID-19 diagnosis from CXR

Fig. 2

Local phase enhancement of \(\mathrm{CXR}(x,y)\) images

Fig. 3

Our proposed multi-feature mid-level (left) and late-level (right) fusion architectures

Image enhancement

In order to enhance the collected CXR images, denoted as \(\mathrm{CXR}(x,y)\), we use local phase-based image analysis [15]. Three different \(\mathrm{CXR}(x,y)\) image phase features are extracted: (1) the local weighted mean phase angle (\(\mathrm{LwPA}(x,y)\)), (2) the \(\mathrm{LwPA}(x,y)\)-weighted local phase energy (\(\mathrm{LPE}(x,y)\)), and (3) the enhanced local energy attenuation image (\(\mathrm{ELEA}(x,y)\)). The \(\mathrm{LPE}(x,y)\) and \(\mathrm{LwPA}(x,y)\) image features are extracted using monogenic signal theory, where the monogenic signal image (\(\mathrm{CXR}_{M}(x,y)\)) is obtained by combining the band-pass-filtered \(\mathrm{CXR}(x,y)\) image, denoted as \(\mathrm{CXR}_{B}(x,y)\), with the Riesz-filtered components as:

$$\begin{aligned} \mathrm{CXR}_{M}(x,y)&= [\mathrm{CXR}_{M1},\mathrm{CXR}_{M2},\mathrm{CXR}_{M3}]\\ &=[\mathrm{CXR}_{B}(x,y),\ \mathrm{CXR}_{B}(x,y) * h_{1}(x,y),\\ &\qquad \mathrm{CXR}_{B}(x,y) * h_{2}(x,y)]. \end{aligned}$$

Here, \(h_{1}\) and \(h_{2}\) represent the vector-valued odd (Riesz) filters [16], and \(*\) denotes convolution. \(\alpha \)-scale space derivative quadrature filters (ASSD) are used for band-pass filtering due to their superior edge detection [17]. The \(\mathrm{LwPA}(x,y)\) image is calculated using:

$$\begin{aligned} \mathrm{LwPA}(x,y)=\arctan \left( \frac{\sum _{sc}\mathrm{CXR}_{M1}(x,y)}{\sqrt{\sum _{sc}\mathrm{CXR}_{M2}^{2}(x,y)+\sum _{sc}\mathrm{CXR}_{M3}^{2}(x,y)}} \right) . \end{aligned}$$

We do not employ noise compensation during the calculation of the \(\mathrm{LwPA}(x,y)\) image in order to preserve the important structural details of \(\mathrm{CXR}(x,y)\). The \(\mathrm{LPE}(x,y)\) image is obtained by averaging the phase sum of the response vectors over many scales using:

$$\begin{aligned}&\mathrm{LPE}(x,y)=\{\sum _{sc}|\mathrm{CXR}_{M1}(x,y)|\\&\quad -\sqrt{\mathrm{CXR}_{M2}^{2}(x,y)+\mathrm{CXR}_{M3}^{2}(x,y)}\}\times \mathrm{LwPA}(x,y). \end{aligned}$$

In the above equation, \(sc\) represents the number of scales. The \(\mathrm{LPE}(x,y)\) image captures the underlying tissue characteristics by accumulating the local energy of the image across several filter responses. The \(\mathrm{LPE}(x,y)\) image is then used to extract the third local phase image, \(\mathrm{ELEA}(x,y)\). This is achieved by feeding the \(\mathrm{LPE}(x,y)\) feature into an L1-norm-based contextual regularization method. The image model, a CXR image transmission map denoted \(\mathrm{CXR}_{A}(x,y)\), enhances the visibility of lung tissue features inside a local region and ensures that the mean intensity of the local region is less than the echogenicity of the lung tissue. The scattering and attenuation effects in the tissue are modeled as \(\mathrm{LPE}(x,y)=\mathrm{CXR}_{A}(x,y)\times \mathrm{ELEA}(x,y)+(1-\mathrm{CXR}_{A}(x,y))\rho \), where \(\rho \) is a constant representative of the echogenicity of the tissue. In order to calculate \(\mathrm{ELEA}(x,y)\), \(\mathrm{CXR}_{A}(x,y)\) is first estimated by minimizing the following objective function [15]:

$$\begin{aligned}&\frac{\lambda }{2}\parallel \mathrm{CXR}_{A}(x,y)-\mathrm{LPE}(x,y)\parallel ^2_{2}\\&\quad +\sum _{j\in \chi }\parallel W_{j}\circ (D_{j} * \mathrm{CXR}_{A}(x,y)) \parallel _{1}.\ \end{aligned}$$

In the above objective function, \(\circ \) represents element-wise multiplication, \(\chi \) is an index set, and \(*\) is the convolution operator. The first term measures the dependence of \(\mathrm{CXR}_{A}(x,y)\) on \(\mathrm{LPE}(x,y)\), and the second term models the contextual constraints of \(\mathrm{CXR}_{A}(x,y)\); the two terms are balanced by a regularization parameter \(\lambda \) [15]. \(D_{j}\) is calculated using a bank of high-order differential filters [18]. The filter bank enhances the CXR tissue features inside a local region while attenuating the image noise. \(W_{j}\) is a weighting matrix calculated using \(W_{j}(x,y)=\mathrm{exp}(-\mid D_{j}(x,y) * \mathrm{LPE}(x,y)\mid ^2)\). After estimating \(\mathrm{CXR}_{A}(x,y)\), the \(\mathrm{ELEA}(x,y)\) image is obtained using \(\mathrm{ELEA}(x,y)=[(\mathrm{LPE}(x,y)-\rho )/[\mathrm{max}(\mathrm{CXR}_{A}(x,y),\epsilon )]^\delta ]+\rho \), where \(\delta \) is related to the tissue attenuation coefficient \(\eta \) and \(\epsilon \) is a small constant used to avoid division by zero [15]. Combining these three local phase images as a three-channel input creates a new multi-feature image, denoted \(\mathrm{MF}(x,y)\). Qualitative results corresponding to the enhanced local phase images are displayed in Fig. 2; investigating Fig. 2, we can observe that the enhanced local phase images reveal lung features that are not visible in the original \(\mathrm{CXR}(x,y)\) images. Since local phase image processing is intensity-invariant, the enhancement results are not affected by intensity variations due to patient characteristics or X-ray machine acquisition settings. The multi-feature image \(\mathrm{MF}(x,y)\) and the original \(\mathrm{CXR}(x,y)\) image are used as inputs to our proposed deep learning architecture, which is explained in the next section. A simplified sketch of the enhancement pipeline is given below.
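The following is a minimal NumPy/SciPy sketch of this pipeline, included for illustration only: it substitutes a log-Gabor filter for the ASSD band-pass filters of [17] and replaces the L1-norm contextual regularization with a smoothed-\(\mathrm{LPE}\) proxy for the transmission map \(\mathrm{CXR}_{A}(x,y)\), so the function names, scale choices, and the value of \(\delta \) are assumptions rather than our exact implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def riesz_bandpass(image, wavelength):
    """Band-pass the image and apply the two Riesz (odd) filter components.

    A log-Gabor filter stands in for the ASSD band-pass filters of [17].
    """
    rows, cols = image.shape
    U, V = np.meshgrid(np.fft.fftfreq(cols), np.fft.fftfreq(rows))
    radius = np.sqrt(U ** 2 + V ** 2)
    radius[0, 0] = 1.0  # avoid log(0)/division by zero at the DC term
    log_gabor = np.exp(-np.log(radius * wavelength) ** 2 / (2 * np.log(0.55) ** 2))
    log_gabor[0, 0] = 0.0
    H1, H2 = 1j * U / radius, 1j * V / radius  # Riesz transfer functions (h1, h2)
    F = np.fft.fft2(image)
    m1 = np.real(np.fft.ifft2(F * log_gabor))       # even band-pass component CXR_M1
    m2 = np.real(np.fft.ifft2(F * log_gabor * H1))  # odd component CXR_M2
    m3 = np.real(np.fft.ifft2(F * log_gabor * H2))  # odd component CXR_M3
    return m1, m2, m3


def multi_feature(cxr, wavelengths=(16, 32), delta=0.85, eps=1e-4):
    """Return the 3-channel MF(x, y) image: [LwPA, LPE, ELEA]."""
    s1 = s2 = s3 = energy = 0.0
    for w in wavelengths:  # accumulate over sc scales
        m1, m2, m3 = riesz_bandpass(cxr, w)
        s1, s2, s3 = s1 + m1, s2 + m2, s3 + m3
        energy = energy + np.abs(m1) - np.sqrt(m2 ** 2 + m3 ** 2)
    # arctan2 form of the LwPA equation (even part over odd-part magnitude)
    lwpa = np.arctan2(s1, np.sqrt(s2 ** 2 + s3 ** 2))
    lpe = energy * lwpa                              # LwPA-weighted local phase energy
    lpe = (lpe - lpe.min()) / (np.ptp(lpe) + 1e-12)  # normalize to [0, 1]
    rho = lpe.mean()  # echogenicity constant, chosen as the mean of LPE (see "Results")
    # Stand-in for the L1-regularized transmission map CXR_A: a smoothed LPE.
    A = np.clip(gaussian_filter(lpe, sigma=15), eps, 1.0)
    elea = (lpe - rho) / (A ** delta) + rho  # delta value is an assumption here
    return np.stack([lwpa, lpe, elea], axis=-1)


# Example: enhance a random stand-in image (a real CXR would be loaded instead).
mf = multi_feature(np.random.rand(256, 256))
```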

Table 1 Data distribution of the evaluation dataset
Table 2 Distribution of fivefold cross-validation dataset split for training, validation, and testing for COVID-19 data only. Same split was also performed for Normal and Pneumonia datasets
Table 3 Data distribution of Test Dataset-2

Network architecture

Our proposed multi-feature CNN architecture consists of two identical convolutional network streams that process the \(\mathrm{CXR}(x,y)\) images and the corresponding \(\mathrm{MF}(x,y)\) images, respectively. Strategies for the optimal fusion of features from multimodal images are an active area of research. Generally, data are fused earlier when the image features are correlated and later when they are less correlated [19], and depending on the dataset, different fusion strategies outperform one another [20]. In [21], our group investigated early, mid-, and late-fusion operations in the context of bone segmentation from ultrasound data, where the late-fusion operation outperformed the others. In [22], the authors also used a late-fusion network for segmenting brain tumors from MRI data and outperformed other fusion operations. In this work, we design mid-fusion and late-fusion architectures (Fig. 3). We also investigated several fusion operations: sum, max, averaging, concatenation, and convolution fusion. Based on the performance of the fusion operations and fusion architectures in a preliminary experiment, we use the concatenation fusion operation in both of our architectures. We use the following network architectures as the encoder network: pretrained AlexNet [23], ResNet50 [24], SonoNet64 [25], XNet (Xception) [26], InceptionV4 (Inception-Resnet-V2) [27], and EfficientNetB4 [28]. Pretrained AlexNet [23] and ResNet50 [24] have been incorporated into various medical image analysis tasks [29]. SonoNet64 achieved excellent performance in both classification and localization tasks [25]. XNet (Xception) [26], InceptionV4 (Inception-Resnet-V2) [27], and EfficientNetB4 [28] were chosen due to their outstanding performance on recent medical data classification tasks as well as classification of COVID-19 from chest CT data [30, 31].
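For concreteness, the following PyTorch sketch shows the late-fusion variant with concatenation fusion, assuming torchvision's pretrained ResNet50 as the encoder for both streams; the class name and layer arrangement are illustrative, not our exact implementation.

```python
import torch
import torch.nn as nn
from torchvision import models


class LateFusionNet(nn.Module):
    """Two-stream late-fusion network with concatenation fusion."""

    def __init__(self, num_classes=3):
        super().__init__()

        def make_encoder():
            backbone = models.resnet50(weights="IMAGENET1K_V1")  # pretrained encoder
            return nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer

        self.cxr_stream = make_encoder()  # processes the original CXR(x, y)
        self.mf_stream = make_encoder()   # processes the multi-feature MF(x, y)
        self.classifier = nn.Linear(2 * 2048, num_classes)  # head on fused features

    def forward(self, cxr, mf):
        f_cxr = self.cxr_stream(cxr).flatten(1)  # (B, 2048)
        f_mf = self.mf_stream(mf).flatten(1)     # (B, 2048)
        fused = torch.cat([f_cxr, f_mf], dim=1)  # late fusion by concatenation
        return self.classifier(fused)


# Example forward pass with two random 3-channel inputs.
model = LateFusionNet()
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))  # (2, 3)
```

The mid-fusion variant would instead concatenate intermediate feature maps of the two streams and process them with shared deeper layers.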

Fig. 4

Top row: from left to right, \(\mathrm{CXR}(x,y)\) images of normal, pneumonia, and COVID-19 subjects. Bottom row: Grad-CAM images [33] obtained by the late-fusion ResNet50 architecture

Dataset

We use the following datasets to evaluate the performance of the proposed fusion network models: BIMCV [32], COVIDx [7], and COVID-CXNet [12]. COVID-19 CXR scans from the BIMCV [32] and COVIDx [7] datasets were combined to generate the 'Evaluation Dataset' (Table 1). For the Normal and Pneumonia classes, we randomly selected a subset of 2567 images (from 2567 subjects) from the evaluation dataset (Table 1). In total, 2567 images from each class (normal, pneumonia, COVID-19) were used during fivefold cross-validation. Table 2 shows the data split for the COVID-19 data only; the same split was also performed for the Normal and Pneumonia datasets. In order to provide additional testing of our proposed networks, we designed a new test dataset, which we call 'Test Dataset-2' (Table 3). The Normal and Pneumonia images that were not included in the 'Evaluation Dataset' were part of 'Test Dataset-2.' Furthermore, we included all the COVID-19 scans from COVID-CXNet [12].

In order to show the improvements achieved using our proposed multi-feature CNN architecture, we also trained the same CNN architectures using only \(\mathrm{MF}(x,y)\) or \(\mathrm{CXR}(x,y)\) images. We refer to these architectures as mono-feature CNNs. Quantitative performance was evaluated by calculating average accuracy, precision, recall, and F1-scores for each class [7, 9].
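As an illustration of how the per-class metrics can be computed, a minimal scikit-learn sketch follows; the toy labels and label ordering are assumptions, not our data.

```python
from sklearn.metrics import accuracy_score, classification_report

# Toy predictions; 0: normal, 1: pneumonia, 2: COVID-19 (label order assumed).
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]
print("accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred,
                            target_names=["normal", "pneumonia", "COVID-19"]))
```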

Results

The experiments were implemented in Python using the PyTorch framework. All models were trained using the stochastic gradient descent (SGD) optimizer with a cross-entropy loss function, a learning rate of 0.001 for the first epoch, a learning rate decay of 0.1 every 15 epochs, and mini-batches of size 16. For local phase image enhancement, we used \(sc=2\), and the rest of the ASSD filter parameters were kept the same as reported in [15]. For calculating \(\mathrm{ELEA}(x,y)\) images, we used \(\lambda =2\), \(\epsilon =0.0001\), \(\eta =0.85\), and \(\rho \), the constant related to tissue echogenicity, was chosen as the mean intensity value of \(\mathrm{LPE}(x,y)\). These values were determined empirically and kept constant during qualitative and quantitative analysis.
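A minimal PyTorch sketch of this training configuration is given below, reusing the hypothetical LateFusionNet from the architecture section and a toy stand-in dataset; the total epoch count is an assumption since it is not stated above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = LateFusionNet()  # the late-fusion sketch from the architecture section
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

# Toy stand-in for the (CXR, MF, label) training triples.
loader = DataLoader(
    TensorDataset(torch.randn(32, 3, 224, 224),
                  torch.randn(32, 3, 224, 224),
                  torch.randint(0, 3, (32,))),
    batch_size=16, shuffle=True)

for epoch in range(30):  # epoch count is an assumption
    for cxr, mf, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(cxr, mf), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # learning rate decays by a factor of 0.1 every 15 epochs
```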

Qualitative analysis: Gradient-weighted class activation mapping (Grad-CAM) [33] visualizations of normal, pneumonia, and COVID-19 cases are presented as qualitative results in Fig. 4, where the discriminative regions of interest localized in the normal, pneumonia, and COVID-19 data can be seen.
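For reference, Grad-CAM [33] can be reproduced in a few lines of PyTorch; the sketch below assumes the hypothetical LateFusionNet from the architecture section and targets the last residual block of the CXR stream.

```python
import torch
import torch.nn.functional as F

model.eval()
store = {}
target_layer = model.cxr_stream[7]  # "layer4", last residual block in the sketch above
target_layer.register_forward_hook(lambda m, i, o: store.update(act=o))
target_layer.register_full_backward_hook(lambda m, gi, go: store.update(grad=go[0]))

cxr, mf = torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)
logits = model(cxr, mf)
logits[0, logits.argmax()].backward()  # back-propagate the top-class score

weights = store["grad"].mean(dim=(2, 3), keepdim=True)  # global-average-pooled gradients
cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))  # weighted activations
cam = F.interpolate(cam, size=cxr.shape[-2:], mode="bilinear")   # upsample to CXR size
cam = cam / (cam.max() + 1e-8)  # normalize to [0, 1] for overlay visualization
```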

Quantitative analysis of the Evaluation Dataset: Table 4 shows the average accuracy of the fivefold cross-validation on the 'Evaluation Dataset' for the mono-feature CNN architectures as well as the proposed multi-feature CNN architectures. In most of the investigated network designs, the \(\mathrm{MF}(x,y)\)-based mono-feature CNNs outperform the \(\mathrm{CXR}(x,y)\)-based mono-feature CNNs. The best average accuracy is obtained with our proposed multi-feature ResNet50 [24] architecture. All multi-feature CNNs with mid- and late-fusion operations achieved a statistically significant difference in classification accuracy compared with the mono-feature CNNs that use the original \(\mathrm{CXR}(x,y)\) images as input (p<0.05, paired t-test at the 5% significance level). Except for SonoNet64 [25], XNet (Xception) [26], and InceptionV4 (Inception-Resnet-V2) [27], all multi-feature CNNs with the mid-fusion operation show a statistically significant difference in classification accuracy compared with the mono-feature CNNs that use \(\mathrm{MF}(x,y)\) images as input (p<0.05, paired t-test at the 5% significance level). We did not find any statistically significant difference in the average accuracy results between the mid-level and late-fusion networks (p>0.05, paired t-test at the 5% significance level). Figure 5 presents the confusion matrix results together with the average precision, recall, and F1-scores for all multi-feature late-fusion CNN architectures. One important observation from the presented results is that almost all of the investigated multi-feature networks achieved very high precision, recall, and F1-scores for the COVID-19 class, indicating that very few cases of the other types were misclassified as COVID-19.
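The significance tests above amount to paired t-tests over the fold-wise accuracies; a minimal SciPy sketch with hypothetical numbers (not the paper's values) is:

```python
from scipy.stats import ttest_rel

# Hypothetical fold-wise accuracies for two models being compared.
multi_feature_acc = [0.952, 0.961, 0.948, 0.955, 0.958]
mono_feature_acc = [0.921, 0.934, 0.918, 0.927, 0.930]
t_stat, p_value = ttest_rel(multi_feature_acc, mono_feature_acc)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # significant at the 5% level if p < 0.05
```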

Table 4 Mean overall accuracy after fivefold cross-validation on ‘Evaluation Data’ using mono-feature CNNs and multi-feature CNNs. Bold denotes the best results obtained
Fig. 5

Confusion matrix, and average precision, recall, and F1-scores obtained from fivefold cross-validation on ‘Evaluation Data’ using all multi-feature network models

Quantitative analysis of Test Dataset-2: The multi-feature ResNet50 provides the highest overall accuracy in Table 5, which is consistent with the quantitative result achieved on the 'Evaluation Dataset.' All multi-feature CNNs with the late-fusion operation achieved a statistically significant difference in classification accuracy compared with the mono-feature CNNs that use the original \(\mathrm{CXR}(x,y)\) images as input (p<0.05, paired t-test at the 5% significance level). Except for XNet (Xception) [26], all multi-feature CNNs with the mid-fusion operation also achieved a statistically significant difference compared with the mono-feature CNNs that use the original \(\mathrm{CXR}(x,y)\) images as input (p<0.05, paired t-test at the 5% significance level). Likewise, except for XNet (Xception) [26], all multi-feature CNNs with the mid-fusion operation show a statistically significant difference compared with the mono-feature CNNs that use \(\mathrm{MF}(x,y)\) images as input (p<0.05, paired t-test at the 5% significance level). Similar to the 'Evaluation Dataset' results, there was no statistically significant difference in the average accuracy results between the mid-level and late-fusion networks (p>0.05, paired t-test at the 5% significance level), except for the ResNet50 [24] and XNet (Xception) [26] architectures. The confusion matrix results, together with the average precision, recall, and F1-score values, for all evaluated multi-feature late-fusion CNN architectures are presented in Fig. 6. Similar to the results presented for the 'Evaluation Dataset,' high precision, recall, and F1-score values are obtained for the COVID-19 data.

Table 5 Mean overall accuracy after fivefold cross-validation on ‘Test Dataset-2’ using mono-feature CNNs and multi-feature CNNs. Bold denotes the best results obtained
Fig. 6

Confusion matrix, and average precision, recall, and F1-scores obtained from fivefold cross-validation on ‘Test Dataset-2’ using all multi-feature network models

Table 6 Comparison of proposed method with recent state-of-the-art methods for COVID-19 detection using CXR images

Discussion and conclusion

The development of new computer-aided diagnostic methods for robust and accurate diagnosis of COVID-19 disease from CXR scans is important for improved management of this pandemic. In order to provide a solution to this need, in this work, we presented a multi-feature deep learning model for the classification of CXR images into three classes: COVID-19, pneumonia, and normal healthy subjects. Our work was motivated by the need for an enhanced representation of CXR images to achieve improved diagnostic accuracy. To this end, we proposed a local phase-based CXR image enhancement method and showed that by using the enhanced CXR data, denoted as \(\mathrm{MF}(x,y)\), in conjunction with the original CXR data, the diagnostic accuracy of CNN architectures can be improved. Our proposed multi-feature CNN architectures were trained on a large dataset in terms of the number of COVID-19 CXR scans and achieved improved classification accuracy across all classes. One very encouraging result is that the proposed models show high precision, recall, and F1-scores on the COVID-19 class for both testing datasets; this will ensure few false-positive COVID-19 cases detected from CXR images and will help alleviate the burden on the healthcare system by reducing the number of CT scans performed. Finally, compared to previously reported results, our work achieves the highest three-class classification accuracy on a significantly larger COVID-19 dataset (Table 6). While the obtained results are very promising, more evaluation studies are required, specifically for diagnosing early-stage COVID-19 from CXR images. Our future work will involve the collection of CXR scans from early-stage or asymptomatic COVID-19 patients. We will also investigate the design of a CXR-based patient triaging system.