Introduction

Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is an infectious disease that had infected 385 million individuals and killed 5.7 million people globally as of 3rd February 2022 [1]. On March 11th, 2020, the World Health Organization (WHO) declared COVID-19 (the novel coronavirus disease) a global pandemic [2]. COVID-19 [3, 4] has proven to be worse in individuals with comorbidities such as coronary artery disease [3, 5], diabetes [6], atherosclerosis [7], fetal conditions [8], etc. [9,10,11]. It has also caused architectural distortion through interactions between alveolar and vascular changes [12] and has affected daily habits such as nutrition [13]. Pathology has shown that vaccine-induced immune thrombotic thrombocytopenia (VITT) can be triggered after immunization with the ChAdOx1 nCoV-19 vaccine [14]. It was also observed that adults who were born small, a condition known as intrauterine growth restriction (IUGR), are also likely to be affected by COVID-19 [8].

One of the gold standards for COVID-19 detection is the reverse transcription-polymerase chain reaction test, commonly known as the RT-PCR test. Nonetheless, the RT-PCR test takes time and has low sensitivity [15,16,17]. For this reason, image-based analysis using chest radiographs and computed tomography (CT) [18,19,20] is used to diagnose the disease in COVID-19 patients and serves as a reliable complement to RT-PCR [21]. In the general diagnosis of COVID-19 and in body imaging, CT has shown high sensitivity and reproducibility [20, 22,23,24]. The primary benefit of CT [25, 26] is its capacity to identify anomalies such as ground-glass opacity (GGO), consolidation, and other opacities [27,28,29] seen in COVID-19 patients [30,31,32,33,34,35].

Deep learning (DL) is a branch of artificial intelligence (AI) that employs deep layers to provide fully automatic feature extraction, classification, and segmentation of the input data [36, 37]. Our team has developed the COVLIAS system, which uses deep learning models for lung segmentation [38,39,40]. In these previous studies, only one cohort was used when applying cross-validation, leading to bias in the performance, since both the training and testing data were taken from the same CT machine, the same hospital settings, and the same geographical region [41,42,43]. To overcome this weakness, we introduce a multicentre study where training is conducted on one data set from Croatia and testing is conducted on another data set from Italy (or vice-versa). This so-called “Unseen AI” design is one of the innovations of the proposed study. Recently, “Unseen AI” has gained more visibility [38, 44].

Due to variations in COVID-19 lesions such as GGO, consolidations, and crazy paving, AI models for automated COVID-19 lung segmentation have produced poor predictions on Unseen CT data (see Fig. 1). This happens when the Hounsfield Units (HU) [45] of CT images are not consistent between the training and testing paradigms, which leads to over- and under-estimation of the predicted region. This can be prevented via normalization right before AI deployment [46, 47]. We embed such normalization in our AI framework automatically, which is another innovation besides the Unseen AI model design.

Fig. 1 Overlay of segmentation results (red) from the ResNet-SegNet HDL models trained without adjusting the HU level. The white arrow represents the region where the ResNet-SegNet HDL model under-estimates the lung region

Recent advances in deep learning, such as hybrid deep learning (HDL), have shown promising results [38,39,40, 48,49,50,51,52]. Using this premise, we hypothesize that HDL models are superior to solo DL (SDL) models for segmentation. In this study, we have designed nine SDL and HDL models that are trained and tested for COVID-19-based lung segmentation on multicentre databases. We further offer insight into how the nine AI models respond to COVID-19 data sets, which is another unique contribution of the proposed study. The analysis includes attributes such as (i) the size of the model, (ii) the number of layers in the AI architecture, (iii) the segmentation model utilized, and (iv) the encoder part of the AI model. These attributes can be used for a comparison between the nine AI models. Lastly, to prove the effectiveness of the AI models, we present performance evaluation using (i) Dice Similarity (DS), (ii) Jaccard Index (JI) [53], (iii) Bland–Altman plots (BA) [54, 55], (iv) Correlation coefficient (CC) plot [56, 57], and (v) Figure of Merit. Finally, as part of scientific validation, we compare the performance of COVLIAS 1.0-Unseen against MedSeg [58], a web-based lung segmentation tool.

Literature Survey

Artificial intelligence (AI) has been in existence for a while, especially in the field of medical imaging [59, 60]. AI can play a vital role in the investigation of CT and X-ray images, assisting in the detection of COVID-19 type and overcoming the shortage of expert workers. It started with machine learning moving into different applications: point-based models such as diabetes [61, 62], neonatal and infant mortality [63], and gene analysis [64], and image-based machine learning models such as carotid plaque classification [65,66,67,68,69], thyroid [70], liver [71], stroke [24], coronary [72], ovarian [73], prostate [74], skin cancer [75, 76], Wilson disease [77], ophthalmology [78], etc. The major challenge with these models is the feature extraction process, which is ad-hoc in nature and, therefore, very time-consuming [79]. It has recently been shown that this weakness is being overcome by deep learning (DL) models [59, 60].

Paluru et al. [80] proposed AnamNet, a hybrid of UNet and ENet, to segment COVID-19-based lesions using 4,300 images (from 69 patients, at a resolution of 512 × 512) [81]. The authors compared the model against ENet [82], UNet++ [83], SegNet, and LEDNet [84]. The DS for lesion detection was 0.77. Saood et al. [85] used a set of 100 images downscaled to 256 × 256 to compare two models, namely UNet and SegNet, and reported DS scores of 0.73 and 0.74, respectively. Cai et al. [86] established a tenfold CV protocol on 250 images from 99 patients and adopted the UNet model with a DS of 0.77. They also suggested a method for predicting the duration of an intensive care unit (ICU) stay. Suri et al. [40] benchmarked NIH [87] (a conventional model) against three AI models, namely, SegNet, VGG-SegNet, and ResNet-SegNet, using nearly 5000 CT scans from 72 patients at an image resolution of 768 × 768, and concluded that ResNet-SegNet was the best performing model. In an inter-variability study by Suri et al. [39], three models, namely, PSPNet, VGG-SegNet, and ResNet-SegNet, were used. The authors showed that HDL models outperformed SDL models by ~5% for all the performance evaluation metrics using 5000 CT slices (taken from 72 patients) at an image resolution of 768 × 768. A recent study by the same authors [38] compared the VGG-SegNet and ResNet-SegNet models of their COVLIAS 1.0 system against MedSeg. This study used HDL models and applied the standard Mann–Whitney, Paired t-Test, and Wilcoxon tests to demonstrate the system's stability.

Method and Methodology

Demographics and Data Acquisition

The proposed study utilizes two cohorts from different countries. The first dataset contains 72 adult Italian patients (approximately 5000 images, Fig. 2), of which 46 were male and the remainder female. A total of 60 patients tested positive by RT-PCR, with broncho-alveolar lavage [88] used in 12 individuals. This Italian cohort had an average GGO of 2.1, which was considered low. The second cohort consisted of 80 Croatian patients (approximately 5000 images, Fig. 3), of which 57 were male and the rest female. This cohort had a mean age of 66 and an average GGO of 4.1, which was considered high.

Fig. 2 Sample CT scans taken from raw CRO data sets

Fig. 3 Sample CT scans taken from raw ITA data sets

For the patients in the Italian cohort, CT data were acquired using Philips' automatic tube current modulation (Z-DOM), while Croatia's CT volumes were acquired using the FCT Speedia HD 64-detector MDCT scanner (Fujifilm Corporation, Tokyo, Japan, 2017). The exclusion criteria covered scans with metallic items or poor image quality, such as artifacts or blurriness induced by patient movement during scan execution [38].

AI Architectures Adapted

The proposed study uses a total of nine AI models, of which (i) PSPNet (see Supplemental A.1), (ii) SegNet, and (iii) UNet are SDL models and (iv) VGG-PSPNet (Fig. 4), (v) ResNet-PSPNet (Fig. 5), (vi) VGG-SegNet (see Supplemental A.2), (vii) ResNet-SegNet (see Supplemental A.3), (viii) VGG-UNet (Fig. 6), and (ix) ResNet-UNet (Fig. 7) are the HDL models. The difference between SDL and HDL is that the traditional backbone, or encoder part, of the SDL model is replaced with a new model such as VGG or ResNet. Recent findings by Suri et al. [39, 40, 48, 49, 89] show that employing HDL models over SDL models in the medical sector helps learn complicated imaging features rapidly and reliably. Building on this evidence that HDL outperforms SDL, we here introduce four new HDL models, namely, VGG-PSPNet, ResNet-PSPNet, VGG-UNet, and ResNet-UNet, for lung segmentation of COVID-19-based CT images.

Fig. 4 VGG-PSPNet architecture

Fig. 5 ResNet-PSPNet architecture

Fig. 6 VGG-UNet architecture

Fig. 7 ResNet-UNet architecture

UNet [90] was the first medical segmentation model that consisted of mainly two sections (i) encoder, where the model tries to learn the features in the images, and (ii) decoder, the part of the model that up-samples the image to produce the desired output like a segmented binary lung mask in this study. Another model used in this paper is SegNet [91], which transfers only the pooling indices from the compression (encoder) path to the expansion (decoder) path, thereby using low memory. The Pyramid Scene Parsing Network (PSPNet) [92] is a semantic segmentation network that considers the full context of an image using its pyramid pooling module. PSPNet extracts the feature map from an input image using a pretrained CNN and the dilated network technique. The size of the resulting feature map is 1/8 that of the input image. Finally, the collection of these features is used to generate the output binary mask.
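The memory saving attributed to SegNet above comes from passing only the positions of the pooled maxima to the decoder, rather than full feature maps. The following is a minimal NumPy sketch of that idea for a single 2D feature map with 2 × 2 pooling; the function names `max_pool_with_indices` and `max_unpool` are hypothetical, and the sketch is an illustration of the mechanism rather than the implementation used in this study.

```python
import numpy as np

def max_pool_with_indices(x):
    """2x2 max pooling (stride 2) on a 2D array with even height and width.

    Returns the pooled map and, for every 2x2 window, the flat position
    (0..3, row-major inside the window) of its maximum.
    """
    h, w = x.shape
    # Rearrange so that each 2x2 window becomes one row of 4 values.
    windows = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3)
    windows = windows.reshape(h // 2, w // 2, 4)
    return windows.max(axis=2), windows.argmax(axis=2)

def max_unpool(pooled, idx):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    ph, pw = pooled.shape
    windows = np.zeros((ph, pw, 4), dtype=pooled.dtype)
    rows, cols = np.indices((ph, pw))
    windows[rows, cols, idx] = pooled
    # Undo the window rearrangement to recover the original spatial layout.
    return windows.reshape(ph, pw, 2, 2).transpose(0, 2, 1, 3).reshape(ph * 2, pw * 2)

feature_map = np.array([[1, 2, 0, 0],
                        [3, 4, 0, 5],
                        [9, 0, 1, 1],
                        [0, 0, 1, 7]], dtype=float)
pooled, indices = max_pool_with_indices(feature_map)   # encoder side
sparse_upsampled = max_unpool(pooled, indices)         # decoder side
```

Only `indices` (one small integer per window) has to be kept for the decoder, which is why this scheme needs less memory than copying whole encoder feature maps.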

Residual networks (ResNet) [93] use "skip connections" together with "batch normalization" to train deep layers without sacrificing efficiency, permitting gradients to bypass a set number of layers. This addresses the vanishing gradient problem, and such a mechanism is not present in VGGNet [94]. The primary attributes of the AI models, such as the backbone used in the architecture, the number of layers in the training models, the total number of parameters in the architecture, and the final size of the trained models, are further discussed and compared in the discussion section.

Experimental Protocol

This study involves two datasets from different centers, each with ~5000 lung CT images of COVID-19 patients. We have utilized a fivefold cross-validation [95, 96] technique for training the AI models, with no overlap between the training and test sets. Training and testing performance was determined by the accuracy score computed between the binary output of the trained AI model and the gold standard [39, 40].
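A fivefold split without overlap can be sketched as follows. This is a minimal illustration that assumes the split is made at the image level with a fixed random seed; the text does not state whether images or patients were the splitting unit, and the helper name `five_fold_indices` is hypothetical.

```python
import numpy as np

def five_fold_indices(n_images, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; the k test folds are disjoint
    and together cover every image exactly once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_images)      # shuffle image indices once
    folds = np.array_split(order, k)       # k nearly equal, disjoint folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Example with roughly the cohort size quoted in the text (~5000 images).
for fold, (train_idx, test_idx) in enumerate(five_fold_indices(5000)):
    print(f"fold {fold}: {len(train_idx)} training images, {len(test_idx)} test images")
```

If several slices come from the same patient, grouping the indices by patient before splitting would be one way to keep a patient's slices from appearing on both sides of a fold.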

The accuracy of the system was computed using the standard protocol based on the true positive, true negative, false negative, and false positive counts. Finally, to assess the model's training during backpropagation, the cross-entropy loss function was employed. The plots of the accuracy and loss function can be seen in Figs. 8 and 9.
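The two quantities above can be written out as a short NumPy sketch. It assumes pixel-wise counting over a binary lung mask and the binary form of the cross-entropy loss (an assumption, since the text only says "cross-entropy"); the function names are hypothetical.

```python
import numpy as np

def pixel_accuracy(pred_mask, gt_mask):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN), counted per pixel."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    tp = np.sum(pred & gt)      # lung predicted, lung in gold standard
    tn = np.sum(~pred & ~gt)    # background predicted, background in gold standard
    fp = np.sum(pred & ~gt)     # lung predicted, background in gold standard
    fn = np.sum(~pred & gt)     # background predicted, lung in gold standard
    return float((tp + tn) / (tp + tn + fp + fn))

def binary_cross_entropy(prob, gt_mask, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a 0/1 mask."""
    p = np.clip(np.asarray(prob, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    y = np.asarray(gt_mask, dtype=float)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

gt = np.array([[1, 0], [0, 1]])
pred = np.array([[1, 0], [1, 1]])
print(pixel_accuracy(pred, gt))                        # 3 of 4 pixels agree
print(binary_cross_entropy(np.full((2, 2), 0.5), gt))  # equals ln(2) for p = 0.5
```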

Fig. 8 Accuracy and loss plot for the nine AI models for the training on the CRO dataset

Fig. 9 Accuracy and loss plot for the nine AI models for the training on the ITA dataset

Results and Performance Evaluations

Results

To prove our hypothesis that HDL models outperform SDL models, we present a comparison between (i) SDL and HDL models and (ii) models trained using high-GGO and low-GGO lung CT images. The accuracy and loss plots for the nine AI models on the CRO and ITA datasets are presented in Figs. 8 and 9. Using overlays (Figs. 10, 11, 12 and 13), we present a visual representation of the results from the AI models under four different scenarios. For Seen analysis, these are (i) train on Croatia data (CRO) and test on CRO, and (ii) train on Italy data (ITA) and test on ITA. For Unseen analysis, these are (iii) train on CRO and test on ITA, and (iv) train on ITA and test on CRO. This study makes use of two different datasets: (i) CRO, with ~5000 CT images of COVID-19 patients who are considered high-GGO patients, and (ii) ITA, with ~5000 COVID-19 CT images of patients regarded as low-GGO.

Fig. 10 Visual overlays (set 1) showing the AI (green) output against the GT (red) for Seen analysis

Fig. 11 Visual overlays (set 2) showing the AI (green) output against the GT (red) for Seen analysis

Fig. 12 Visual overlays (set 1) showing the AI (green) output against the GT (red) for Unseen analysis

Fig. 13 Visual overlays (set 2) showing the AI (green) output against the GT (red) for Unseen analysis

Performance Evaluation

This study presents (i) DS, (ii) JI, (iii) BA, (iv) CC plots, and (v) Figure of Merit (FoM) as part of the performance evaluation for nine AI models under Seen and Unseen settings. The cumulative frequency distribution (CFD) plots for DS and JI are presented in Figs. 14, 15, 16 and 17 at a threshold cutoff of 80%. Figures 18 and 19 show the BA plots with mean and standard deviation (SD) lines for the lung area estimated by the AI models against the ground truth tracings. Similarly, CC plots with a cutoff of 80% are displayed in Figs. 20 and 21. We present a summary, mean, SD, and percentage improvement for all nine AI models for DS, JI, and CC values in Tables 1, 2 and 3. Across the four Seen and Unseen scenarios, HDL improves on SDL by 1%, 3%, 1%, and 1% for the DS score, by 3%, 5%, 3%, and 2% for the JI score, and by 2%, 1%, 1%, and 6% for CC, thus supporting the hypothesis that, for COVID-19 lungs, HDL outperforms SDL. The standard deviation for all the AI models lies in the range of 0.01 to 0.06, which is considered stable because the variation appears only in the second decimal place.
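For reference, DS and JI between a predicted binary lung mask and the ground-truth mask follow the standard overlap definitions, DS = 2|A ∩ B| / (|A| + |B|) and JI = |A ∩ B| / |A ∪ B|. The sketch below is a minimal NumPy illustration; the function name and the convention of returning 1.0 when both masks are empty are assumptions, not details taken from the study.

```python
import numpy as np

def dice_and_jaccard(pred_mask, gt_mask):
    """Return (Dice similarity, Jaccard index) for two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    intersection = np.sum(pred & gt)
    union = np.sum(pred | gt)
    total = pred.sum() + gt.sum()
    if total == 0:
        # Both masks empty: treated here as perfect agreement (assumed convention).
        return 1.0, 1.0
    dice = 2.0 * intersection / total
    jaccard = intersection / union
    return float(dice), float(jaccard)

pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [0, 0]])
ds, ji = dice_and_jaccard(pred, gt)
print(ds, ji)
```

The two metrics are linked by JI = DS / (2 − DS), so JI is never larger than DS for the same pair of masks.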

Fig. 14 Cumulative frequency plot for Dice using Seen analysis

Fig. 15 Cumulative frequency plot for Dice using Unseen analysis

Fig. 16 Cumulative frequency plot for Jaccard using Seen analysis

Fig. 17 Cumulative frequency plot for Jaccard using Unseen analysis

Fig. 18 BA plot for Seen analysis

Fig. 19 BA plot for Unseen analysis

Fig. 20 CC plot for Seen analysis

Fig. 21 CC plot for Unseen analysis

Table 1 Dice Similarity table for the nine AI models
Table 2 Jaccard Index table for the nine AI models
Table 3 Correlation Coefficient (P < 0.0001) for the nine AI models

Scientific Validation

The results from the MedSeg tool were compared against the gold standard tracings of the two datasets used in the study. Figure 22 shows a cumulative frequency plot of DS for the lungs segmented using the MedSeg tool for the Italian and Croatian datasets used in COVLIAS. Similarly, Figs. 23 and 24 show the JI and CC plots of the results from MedSeg compared to the ground truth tracings of the two datasets, with ITA on the left and CRO on the right. The percentage difference between the DS, JI, and CC scores of the COVLIAS AI models and those of MedSeg is < 5%, thus supporting the applicability of the proposed AI models in the clinical domain. The mean and standard deviation of the lung area error are presented in Fig. 25 using the BA plot, which follows the same layout with ITA on the left and CRO on the right. For the determination of the system's error, Table 4 presents the Figure of Merit for the nine AI models under Seen and Unseen analysis. Finally, to demonstrate the reliability of the AI-based segmentation system COVLIAS, statistical tests such as the Mann–Whitney, Paired t-Test, and Wilcoxon tests are presented for Seen (Table 5) and Unseen (Table 6) analysis. MedCalc software (Ostend, Belgium) was used to carry out all the tests.
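The quantities behind the BA and CC plots can be sketched as follows. This is a minimal NumPy illustration that assumes one lung-area value per image for the AI output and for the ground truth, uses the conventional mean ± 1.96 SD limits of agreement, and takes CC to be the Pearson correlation coefficient; the function names and the example areas are hypothetical and do not come from the study data.

```python
import numpy as np

def bland_altman(area_ai, area_gt):
    """Return (mean difference, lower limit, upper limit) of agreement."""
    ai = np.asarray(area_ai, dtype=float)
    gt = np.asarray(area_gt, dtype=float)
    diff = ai - gt
    mean_diff = diff.mean()
    sd = diff.std(ddof=1)                 # sample standard deviation
    return float(mean_diff), float(mean_diff - 1.96 * sd), float(mean_diff + 1.96 * sd)

def correlation_coefficient(area_ai, area_gt):
    """Pearson correlation coefficient between AI and ground-truth areas."""
    return float(np.corrcoef(area_ai, area_gt)[0, 1])

# Hypothetical lung areas (arbitrary units), for illustration only.
area_ai = [10.0, 12.0, 14.0]
area_gt = [9.0, 12.0, 15.0]
print(bland_altman(area_ai, area_gt))
print(correlation_coefficient(area_ai, area_gt))
```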

Fig. 22 Cumulative frequency plot of DS for MedSeg for ITA (left) and CRO (right) data sets

Fig. 23 Cumulative frequency plot of JI for MedSeg for ITA data (left) and CRO data (right)

Fig. 24 CC plot for MedSeg vs. GT for ITA (left) and CRO (right)

Fig. 25 BA plot for MedSeg vs. GT for ITA (left) and CRO (right)

Table 4 The Figure of Merit for the nine AI models for Seen-AI vs. Unseen-AI
Table 5 Statistical tests for Seen-AI analysis on nine AI models
Table 6 Statistical tests for Unseen-AI analysis on nine AI models
Table 7 Nine AI architectures and their comparison

Discussion

The proposed study presented nine automated CT lung segmentation techniques in an AI framework using three SDL models, namely, (i) PSPNet, (ii) SegNet, and (iii) UNet, and six HDL models, namely, (iv) VGG-PSPNet, (v) VGG-SegNet, (vi) VGG-UNet, (vii) ResNet-PSPNet, (viii) ResNet-SegNet, and (ix) ResNet-UNet. To prove our hypothesis, we use automated HU adjustment with optimized values of (1600, -400) and train our AI models to predict on test data (Fig. 26). After HU adjustment, the percentage improvement in DS, JI, and CC is 1%, 3%, and 6% for Seen AI and ~4%, ~5%, and 6% for Unseen AI, respectively. We concluded that Unseen AI is possible using automated HU adjustment. Further, HDL was found to be superior to SDL (Tables 1, 2 and 3).
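One common way to apply an HU adjustment of this kind is intensity windowing. The sketch below is a minimal illustration that assumes the pair (1600, -400) denotes a window width of 1600 HU and a window level of -400 HU, and that the clipped values are then rescaled to [0, 1]; both assumptions, and the function name `apply_hu_window`, are illustrative and not confirmed by the text.

```python
import numpy as np

def apply_hu_window(hu_image, width=1600.0, level=-400.0):
    """Clip HU values to [level - width/2, level + width/2] and rescale to [0, 1]."""
    hu = np.asarray(hu_image, dtype=float)
    low = level - width / 2.0     # -1200 HU under the assumed window
    high = level + width / 2.0    #   400 HU under the assumed window
    clipped = np.clip(hu, low, high)
    return (clipped - low) / (high - low)

# Values below the window map to 0, above it to 1, and the level maps to 0.5.
sample = np.array([-2000.0, -1200.0, -400.0, 400.0, 1000.0])
print(apply_hu_window(sample))
```

Applying the same window to both the training and the test cohort puts images from different scanners on a common intensity scale before they reach the segmentation model.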

Fig. 26 Overlay of segmentation results from the ResNet-SegNet model trained without adjusting the HU level (red) and after adjusting the HU level (green). The white arrow represents the under-estimated region and the red arrows represent the same region estimated accurately by the ResNet-SegNet model

Fig. 27 Left: Number of NN layers. Right: Size of the final AI models used in COVLIAS 1.0

Comparison and Contrast of the Nine AI Models

The proposed study uses a total of nine AI architectures with three SDL (PSPNet, SegNet and UNet) and six HDL models (VGG-PSPNet, VGG-SegNet, VGG-UNet, ResNet-PSPNet, ResNet-SegNet, and ResNet-UNet). ResNet-PSPNet was the AI model with both the highest number of NN layers and the largest model size. The training for all the AI models was implemented on NVIDIA DGX V100 using Python [97], with multiple GPUs used to reduce the training time (Table 7 and Fig. 27).

Benchmarking

Table 8 shows the benchmarking table using CT imaging. Our proposed study (row #7) took 10,000 CT scans of 152 patients and implemented nine different models that consisted of three SDL models, namely, PSPNet, SegNet, and UNet, and six HDL models, namely, VGG-PSPNet, VGG-SegNet, VGG-UNet, ResNet-PSPNet, ResNet-SegNet, and ResNet-UNet. The four scenarios (CRO-CRO, ITA-ITA, CRO-ITA, and ITA-CRO) were evaluated for both SDL and HDL models.

Table 8 Benchmarking table

A Special note on Tissue Characterization

Lung segmentation can be considered a tissue characterization (TC) process, which has been attempted before using ML, such as in plaque TC [66, 98], lung TC [99], coronary artery disease characterization [100], liver TC [101], and cancer applications such as skin cancer [102] and ovarian cancer [103]. Other types of advanced TC can use hybrid models such as those in [24, 36, 51].

Strength, Weakness, and Extensions

This proposed study, COVLIAS 1.0-Unseen proves our two hypotheses, (i) contrast adjustment is vital for AI, and (ii) HDL is superior to SDL using nine models considering 5,000 CT scans. The system was validated against MedSeg and tested for reliability and stability.

It can also be noted that while training the AI model for COVID-19 infected lungs, it is necessary to adjust the HU levels to obtain accurate segmentation results. Although we used HU adjustment, the study can be extended in several ways: (i) adjusting the contrast, removing noise, and adjusting the window level [104]; (ii) multimodality cross-validation using, for example, ultrasound [105]; (iii) integrating more advanced image processing tools such as level sets [106], stochastic segmentation [107], and computer-aided diagnostic tools [108, 109] with AI models for lung segmentation; (iv) evaluating bias in the AI models, a topic of recent studies, using AP(ai)Bias (AtheroPoint, Roseville, CA, USA) and other competitive models [42]; and (v) CVD assessment of patients during CT imaging [110].

Conclusions

The proposed research compares and benchmarks three SDL models, namely, PSPNet, SegNet, and UNet, and six HDL models, namely, VGG-PSPNet, VGG-SegNet, VGG-UNet, ResNet-PSPNet, ResNet-SegNet, and ResNet-UNet, against MedSeg for CT lung segmentation. The multicentre CT data was collected from hospitals in Italy (ITA) with low-GGO and Croatia (CRO) with high-GGO, each with ~5000 COVID-19 images. These CT images were annotated by two trained, blinded senior radiologists, thus creating an inter-variable multicentre dataset. To prove our hypothesis, we use an automated Hounsfield Units (HU) adjustment methodology to train the AI models, leading to four combinations: two Unseen sets, train-CRO:test-ITA and train-ITA:test-CRO, and two Seen sets, train-CRO:test-CRO and train-ITA:test-ITA. To keep the test set unique for each fold, we adopted a five-fold cross-validation technique. Five types of performance metrics were used, namely, (i) DS, (ii) JI, (iii) BA plots, (iv) CC plots, and (v) Figure-of-Merit. For DS and JI, HDL (Unseen AI) > SDL (Unseen AI) by 4% and 5%, respectively. For CC, HDL (Unseen AI) > SDL (Unseen AI) by 6%. The COVLIAS-MedSeg difference was < 5%, thus supporting the hypothesis and making the system fit for clinical settings. Statistical tests such as the Paired t-Test, Mann–Whitney, and Wilcoxon tests were used to demonstrate the stability and reliability of the AI system.