1 Introduction

Coronavirus disease often referred to as Covid-19, is a submicroscopic pathogen resulting from worldwide pandemics. Coronavirus causes rapid bronchial distress disorder, a severe type of asthma. It is a kind of atypical, human-to-human transmissible pneumonia. Due to its enormous adverse influence on the healthcare of the general community, the Covid-19 pandemic is currently the most serious issue affecting our entire world. To fight Covid-19, governments and rulers imposed a variety of different strategies, policies, and lifestyles [38]. Science and technology had significantly impacted the implementation of these new ideas and techniques.

Reverse transcriptase-polymerase chain reaction (RT-PCR) [21] and enzyme-linked immunosorbent assay (ELISA) [63] are two of the most commonly used methods for detecting Covid-19 viruses. The most effective screening tool for finding Covid-19 patients is RT-PCR, which can locate the virus’s RNA from lower respiratory tract samples. The whole testing method for identifying pathogens using RT-PCR is manual and time-consuming, with a high risk of false negatives of 39–61%. In any event, significant clinical development leading to pneumonia, chest imaging studies are regularly conducted in suspected or confirmed Covid-19 patients, by the recommendation of WHO [43]. Antibodies, antigens, proteins, and glycoproteins are routinely measured in biological samples using an immunological test known as ELISA.

According to preliminary research, people with Covid-19 or pneumonia infection show anomalies in their chest radiographs. It was suggested that radiography examination might be used as a critical means of pneumonia-based disease screening in epidemic areas [4]. Radiography analysis is an excellent complement to the RT-PCR test and, in some instances, even provides a positive index. Accommodations for chest imaging are easily accessible in modern medical systems, even though radiographic images cannot simply and rapidly solve our purposes. There was a high demand for expert radiologists in this epidemic. The healthcare industry needs a solution to this problem. The field of computer vision and image processing could manage this problem with the aid of advanced tools and techniques [17]. Machine learning and deep learning were extremely promising solutions for managing these issues.

Artificial intelligence is widely being applied in medical imaging identification, and analysis [39]. The advantage of deep learning-based techniques such as CNN, RNN, LSTM-RNN, etc., in the field of computer-vision, outperformed the work of professional radiologists [29]. Those CNNs-based framework extract feature and classification prediction capabilities are quite a height, but due to a tremendous amount of data required for the training purpose, authors [6] used pre-trained models to save CPU power and calculation time.

The chest X-ray image is readily available due to its low-cost compared to other tests; the correct diagnosis from only these images is an immediate need for mankind. Since it is difficult to diagnose COVID-19 and pneumonia from X-ray images manually by medical practitioners, an automated system is required to correctly diagnose these diseases from the pool of normal, pneumonia, and COVID-19 x-ray images.

In this research, we experimented with one of the largest datasets for COVID-19 identification, comprised of 20738 images in three classes. Here, a hybrid deep learning architecture is designed by the stack-ensemble method and tuned lightweight version of the three standard deep learning frameworks and a custom CNN framework. The proposed architecture is developed to consider the issue of deploying the system in low-configuration devices to make the diagnosis faster with high performance.

The key contributions of the present work are as follows:

  1. 1.

    A novel stack ensemble deep learning framework namely LWSNet is proposed to segregate Covid-19, Pneumonia and Normal from x-ray imagery.

  2. 2.

    During stack ensemble process, truncated versions of MobileNetV2, VGG19 and InceptionResNetV2 were considered, making the final model lightweight for running at resource-constrained environments.

  3. 3.

    Three distinct public datasets were used to evaluate the performance of the proposed system.

2 Review of state-of-the-arts

It is essential to quarantine patients as soon as possible in order to control this infectious disease. Available resources are fast running out due to a constantly rising number of patients and lengthy treatment time [21]. Researchers from many disciplines and policymakers urgently need to develop a strategy to control the unwanted situation. We are attempting to concentrate on numerous computer vision areas to speed up the entire procedure. We explored some previously mentioned research areas on computer vision intelligent systems for automatic diagnosis. Researchers are using radiographic images to apply machine learning, and deep learning algorithms as the main categories to classify Covid-19 automatically and pneumonia disease [19, 20, 28, 55]. Other subcategories, such as multi-layer perceptron [62], ensemble [57], LSTM [22], fusion [53], and fuzzy [62], were applied for categorization. Those categories can implement different imaging modalities like X-ray, CT, etc.

2.1 Stack ensemble classification

The method of increasing the performance of the classifier by aggregating the already learned sub-models to tackle the same classification task is known as ensemble learning [36]. Each base learner takes a vote, and the meta-learner, a model that learns to improve the base learners’ predictions, receives the final prediction. Tang et al. [57] suggested ensemble learning can solve deep learning’s drawbacks by making predictions with several models rather than as a single model. Their experiment showed the results with good accuracy of 95%. Saha et al. [50] proposed a model that extracted deep features from X-ray images, then used an ensemble classifier. They obtained individual scores before implementing an ensemble classifier, which provides better accuracy from 1320 images. The highest accuracy, precision, recall, and F1-score were 98.91%, 100%, 97.28%, and 98.89%. Li et al. [31] combined ensemble with VGG16 as a base model, and they used cascade classifier from multiple training sets.

Annavarapu et al. [5] introduced a new ensemble technique for reducing the computationally learning cost. They named this approach the snapshot ensemble technique. Snapshot Ensembling’s adaptability with a wide range of network models and learning tasks was verified by its cyclic learning rate scheduling. This snapshot ensemble approach saved the local minima parameter and changed the model during runtime. They used ResNet50 as a base architecture. Chowdhury et al. [9] used the snapshot ensemble model, which is based on EfficientNet model. To identify and predict a critical section, they used the Grad-CAM model to visualize a CAP. They published their overall accuracy of 96.07% for Covid-19 detection.

Upadhyay et al. [60] implemented a fusion-based model to categorize three distinct classes. They used the handcrafted color-space method to collect specific features from X-ray images and then applied a stacked ensemble. Abdar et al. [1] proposed novel model named Uncertainty FuseNet. It is based on the fusion method using ensemble dropout currently. This model produced good results in noisy data.

Gifani et al. [52] implemented a pre-trained model using a total of 15 pre-trained architectures and fine-tuned the target classes. Among the 15 pre-trained architectures, the majority of voting was applied only on 5 architectures.

2.2 Deep transfer learning

Deep Transfer Learning is a method of deep learning where knowledge is transmitted from one model to the other. Zhu et al. [64] used traditional CNN and VGG16 net models. They optimized both the models and evaluated the predictions. But a weakness of their work was the selection of fewer chest images. Gupta et al. [18] worked with a five pre-trained integrated stack model called InstaCovNet-19. For boosting classification, various training and pre-processing techniques were applied. They benchmarked their model on other state-of-the-art approaches. Fan et al. [12] experimented with five pre-trained models and applied three optimizers using different learning rates. In addition, they used a 10-fold cross-validation approach. They achieved an average accuracy of 97%. Mohammadi et al. [35] discussed four popular pre-trained models with binary classification and obtained 99.1% accuracy using MobileNet for identifying Covid-19 disease from a chest X-ray. Niu et al. [42] proposed a new technique named distant domain transfer learning (DDTL). They used two models, namely reduced-size U-net Segmentation and Distant Feature Fusion. Their models worked on unstructured data and efficiently handled the variation in distribution during the training and testing data. Rezaee et al. [48] introduced a pre-trained deep learning architecture-based hybrid deep transfer learning technique. They utilized feature extraction, feature selection, and a support vector machine to classify. Other researchers presented their approach in aiming for pneumonia-based disease as an exception to these limited categories of deep learning and machine learning model. Saha et al. [49] used a pre-processing method to transform image data into an undirected graph such that simply the edges of the image are considered rather than the entire image. These networks showed impressive accuracy of 99% for the limited dataset. Another effective approach utilized by researchers for classification is the segmentation method. Munusamy et al. [41] proposed an architecture using segmentation of CT images and segregation on X-ray images. They used the U-net model for segmentation purposes and combined the location information. They compared their classification model with different pre-trained standard architecture, and overall accuracy of 99% was reported.

2.3 Lightweight CNN

In lightweight CNN, layers of the network are made reduced considering the system’s complexity, and accuracy trade-off [40]. Karakanis et al. [26] implemented a lightweight model in addition to a conditional generative adversarial network to develop synthetic images to replace the minimal data set. They began with a two-class dataset but later switched to a three-class dataset, which performed better. Paluru et al. [44] proposed 7.8 times fewer lightweight parameters using U-net architecture. They showed that their model performed well compared to the other six standard models. They claimed that their model could also run on a low configuration system. Ter-Sarkisov et al. [58] discussed lightweight segmentation models based on R-CNN masks with different standard architecture. The lightweight model isn’t limited to working with fixed images or structural data. Trivedi et al. [59] proposed a lightweight architecture based on MobileNet that can identify pneumonia from chest x-ray images. A benchmark dataset containing 5856 images in two classes was used. Additionally, they discussed the total training time for preparing the lightweight model and represented the accuracy at 97.09%.

The state-of-the-art research dealt with available datasets ignoring the performance over the conglomeration effects of these datasets. The system deployment issue on the resource-constraint devices is also a significant research gap to the best of our knowledge.

3 Proposed method

In this work, we considered deep learning-based models. Medical fields were significantly benefited from deep learning, including the detection of lung infection. In recent years, due to their impressive classification capabilities, deep learning methods were gained popularity on COVID-19. It was also observed that deep learning models outperform handcrafted feature-based models [14]. To train a deep learning model, a huge amount of data and a good amount of training time are required. So to abstain from these issues, here, transfer learning-based models were considered.

Transfer learning [15] is a machine training strategy where a system is built utilizing many training samples for a specific assignment and then used as the starting solution in another study. Rather than constructing the whole architecture initially, a pre-trained framework was used since this learning technique assures how an architecture trained for one problem may perform on another issue. As a result, the learning process is more resilient and adaptable. In addition, building a deep neural framework from scratch necessitates a high volume of samples and a significant quantity of cost time. Transfer learning allows one to focus on the beginning efforts rather than developing an entire deep architecture. Through the use of the transfer learning technique, three well-established deep learning architectures are transformed into lightweight models that are fine-tuned, as discussed in Section 3.1.

Amongst different deep learning frameworks [32], CNN is a very useful technique for processing spatial data. It is generally made up of three tiers: convolution, pooling, and dense. The design of such networks allows us to learn a wide range of complicated patterns that a basic neural network often fails to perform. CNN-based frameworks are a wide variety of uses, including self-driving cars, robots, surgical operations, etc. The convolutional layer transfers the presence of features observed in individual portions of the input images into a feature map. The process of creating a feature map from this layer is as follows:

$$ \nu_{m,n} = \sum\limits_{m=1}^{m}\sum\limits_{n=1}^{n}\rho_{({m - j,n - k})} * {x_{j,k}^{r}} $$
(1)

Where, \({x_{j,k}^{r}}\) represents the kernal for the jth,kth pixel in cth layer over the instance ρm,n and * represent the convolution operator.

$$ \nu_{m,n} = \lambda \left( \sum\limits_{m=1}^{m}\sum\limits_{n=1}^{n}\rho_{({m - j,n - k})} * {x_{j,k}^{r}}\right) $$
(2)

Here, λ denotes ReLU (rectified linear unit) activation function, which can be expressed as

$$ \lambda(z) = \max(0; z) $$
(3)

if(z < 0);Re(z) = 0otherwise,Re(z) = z,where,zdenotestheneuroninput.

The softmax activation function is utilized in the network’s last dense layer, can be represented as

$$ \chi(n)= \frac{e^{n}}{{{\sum}^{m}_{1}}{e^{n}}} $$
(4)

Here, n signifies the source vector, which has a length of m.

3.1 Proposed lightweight models

In this work, we proposed lightweight versions of existing deep CNN models: MobileNetV2, VGG19, and InceptionResNetV2, and designed a custom lightweight CNN (CLCNN) architecture. Developing lightweight stack ensemble models aims to deploy the system into resource-constraint devices. The number of parameters generated in the lightweight models is significantly less than the original counterpart.

3.1.1 Lightweight MobileNetV2

MobileNet-V2 is built on the principles of MobileNet-V1, which uses depth-wise separable convolution as a robust building component. They introduced an inverted residual block and a linear bottleneck framework in this version [51]. The original MovibleNetV2 architecture has fifty-three levels and 3.4 million parameters in its final version. In contrast, the proposed lightweight version consists of only the top twenty-five layers and utilizes 0.139331 M parameters. The structure and parameter details of this architecture are shown in Table 1.

Table 1 Layer-wise parameter details for the lightweight models

3.1.2 Lightweight VGG19

Simonyan and Zisserman [54] at the University of Oxford, UK, in early 2014, designed a CNN model named VGG network. VGG (Visual Geometry Group) was trained using the ImageNet ILSVRC dataset, consisting of more than 1 million pictures. According to these picture patterns, pictures are divided into 1000 categories and utilize more than 100 thousand images for training and 50 thousand images for validation. VGG-19 is a VGG variation with 19 densely linked layers that routinely outperform other state-of-the-art models. The model is made up of convolutional and fully connected layers that allow for enhanced feature extraction and the use of maxpooling instead of average pooling for downsampling, and modifying the linear unit (ReLU) as the activation function, selecting the largest value in the image area as the pooled value.

3.1.3 Lightweight InceptionResNetV2

The Inception-ResNet architecture [56] combines Inception block with residual module. In each Inception component, there is a filter expansion layer, which is used to scale up the depth of the filter bank to match the depth of the input. We made a lightweight version of this architecture by considering only the top twenty layers out of one hundred sixty-four. There are 0.2261 M parameters generated in this pre-trained lightweight architecture, whereas 55 M parameters are in the original architecture. In Table 1 the structure and parameter details of the lightweight InceptionResNetV2 architecture model are presented.

3.1.4 Custom lightweight CNN (CLCNN)

In this CLCNN framework (shown in Fig. 1), six convolution layers with 64, 32, 16, 16, 8, and 8 filters accompanied by two max-pool layers having a pooling size of 2 × 2 were considered. One max-pool layer is placed after the first two convolutions layers, and the second one is employed at the end of the fourth convolution layer. A dense/output layer of size 3 was used. In ablation study, we obtained the accuracies of 96.25%, 96.45%, 96.42%, and 96.35%, using 0.3, 0.4, 0.5, and 0.6 dropout values, respectively. Since a dropout of 0.4 yielded the best accuracy, we considered this value throughout the rest of the experiments. The layer-wise number of trainable parameters generated is shown in Table 1. The number of trainable parameters in original and lightweight versions is tabulated in Table 2.

Fig. 1
figure 1

CLCNN architecture for Covid-19 and pneumonia classification

Table 2 The comparison of number of trainable parameters (Million) between original and lightweight versions

3.2 Lightweight stack ensemble learning

The prediction outcome suffers from high variance, and generalization problems [24] occur due to noise in the training data and the unpredictability of the deep learning models. We considered a stack ensemble approach to boost efficiency. It uses deep learning architecture for non-linear integration of predictors to increase prediction accuracy while reducing training errors. Specifically, several ensemble techniques are available, but among all the technique stacking generalization, [61] method is most effective in terms of accuracy. Before obtaining the final prediction, passing information through one group of classifiers to other sets of classifiers is known as stacking generalization. The information in the classifier network originates from several subgroups of the training set. The original training set is split into numerous subgroups of training sets, a distinguishing property of stacking generalization. Each sub-training group is utilized to gather biased information on the dataset’s generalization behavior, which is then used to populate the classifier network. This study developed an ensemble-based stacked framework for improving classification by combining predictions of different models and obtaining the highest positive true predictive values.

Figure 2 depicts the architecture of the proposed lightweight stack ensemble learning framework. In this architecture, we considered lightweight CNN models as single models. Then the stacked ensemble was designed considering double stack, triple stack, and quadruple stack. The performance of the single models was evaluated after classification from the final predictor. The double stack consists of, MobileNetV2 & VGG19, MobileNetV2 & InceptionResNetV2, MobileNetV2 & CLCNN, VGG19 & InceptionResNetV2, VGG19 & CLCNN, and InceptionResNetV2 & CLCNN. Similarly, a triple stack consists of three base architecture such as InceptionResNetV2 & VGG19 & MobileNetV2, InceptionResNetV2 & VGG19 & CLCNN, and MobileNetV2 & VGG19 & CLCNN. This quadruple stack comprises MobileNetV2, VGG19, InceptionResNetV2, and CLCNN. This multi-stack prediction is fed to the meta-learner. In the meta-learner, three dense layers having sizes 512, 256, and 128 were deployed. The meta-learner predictions are fed to the final prediction block, which comprises a dense layer of dimension 3.

Fig. 2
figure 2

Four-step experiments of the proposed lightweight stack ensemble learning architecture

4 Experiment

The LWSNet is a classification problem to detect Covid-19 from chest X-ray images. Since the infection is unusual and has pneumonia-like characteristics, we included pneumonia images along with Covid-19 and normal images in our experiment.

4.1 Setup

We performed training and evaluation of our proposed system on a GPU machine consists of two core Intel(R) Core(TM) i5-10400 H CPU @ 2.60 GHz 2.59 GHz, 32 GB of RAM, and two NVIDIA Quadro RTX 5000. It runs on Windows 10 Pro OS version 20H2 with TensorFlow 2.0.0 installed for deep learning model training and inference, where cuDNN is enabled to speed up the training computation on a GPU device.

Table 3 Details of the databses used in the current experiment

4.2 Datasets

The datasets used in this study were obtained from three different public sources and contained three classes. As a result, since Covid-19 is a novel disease, only a limited number of benchmark datasets are available for studies. The first datasets were collected from Kaggle repositories, which contained 6432 samples divided into three classes. The second dataset was constructed using 50% of chest X-ray images from Kaggle repositories with three classes. This dataset contains 21165 chest X-ray images from four classes. The third dataset is mixed with two public repositories; one is the GitHub repository with Covid-19 images, which is constantly updated by a researcher named Cohen et al. [11] at the Montreal University, and the other datasets are taken from the Kaggle repository [37] with two classes: Normal and Pneumonia. These mixed datasets are also available in the Kaggle repository. In order to conduct further experiments, we prepared a combined dataset. Merge three datasets with their respective classes to create a combined dataset. Table 3 represents the number of images containing three categories: Covid-19, Pneumonia, and Normal, from four different datasets. In Fig. 4 the sample sizes using an 8:2 train-set of three datasets are shown. The sample images of three classes corresponding to the three datasets are presented in Fig. 3.

Fig. 3
figure 3

Organize the three datasets Dataset 1, Dataset 2, and Dataset 3 with their corresponding classes of images in three columns: (a) Covid-19, (b) normal, and (c) pneumonia

4.3 Evaluation protocol

To evaluate the performance of our LWSNet, we used different evaluation metrics since accuracy is not sufficient [13] to justify the performance of the proposed framework, especially in disease detection cases. The classification of diseases is based directly on the number of True Positives, False Negatives, True Negatives, and False Positives. Using these values, the following matrices are used to evaluate model performance from different perspectives:

Sensitivity

Sensitivity is a measure that evaluates the number of patients who was detected as positive in a scenario when the patient is genuinely effected.

Specificity

Specificity is a measure that evaluates the number of patients who was not been detected as negative in a scenario when the patient is genuinely not effected.

F1-Score

Statistics use F1 scores to determine how accurate a test is when analyzing classification data. During the computation of the F1 score, the precision and recall of the test are both considered. The F1 score is the harmonic average of precision and recall, where 1 is the best and 0 is the worst score.

Precision

Precision is the positive prediction value for the corresponding diseases. A predicted value of this disease is calculated based on true and false positives.

Negative predictive value

The negative predicted value is calculated based on the true, and false negative values.

Miss rate

Miss rate refers to the result of a test which suggests that a person does not have a specific disease when, in reality, that person does in fact, have that disease.

Fall-out

One of the most significant risks of receiving a false positive result is when the person is cohorted with other patients who are suffering with Covid-19 and is thus exposed to the virus.

False discovery rate

When making multiple comparisons, the false discovery rate is a means of conceptualizing the rate of type I error in the null hypothesis testing that is often used. Statistical approaches for regulating false discovery rates are intended to keep the predicted percentage of rejected null hypotheses) from exceeding a certain threshold.

False omission rate

It is The statistical technique, which is the complement of negative predictive value, is employed in multiple hypothesis testing to account for numerous comparisons. It calculates the percentage of false negatives that are wrongfully rejected.

Training regime

The datasets are of different sizes, so the images were resized into dimensions of 224 × 224 as a pre-processing step. The datasets were split into 8:2 train-test sets. The reason for the 8:2 division is that it was observed that the 8:2 train-test set gave better results compared to other ratios [64]. By this split, there are 5144 trains and 1288 tests for dataset 1. For dataset 2, 6061 images were considered in training, while 1515 images were for testing. Similarly, 5384 images were used in training, and the rest, 1346, were kept for testing dataset 3. To show the robustness of LWSNet, we combined all three datasets with their respective classes. In this combination, there are 16592 trained images and 4146 test images. At the beginning of the experiment, the learning rate was set to 0.001. The performance was evaluated at the interval of 0.001 learning rate. A dropout value was set at 0.5 throughout our experiment. Accordingly, the initial and secondary momentum exponential decay rates were fixed to 0.9 and 0.999. The epsilon level was set to 0.0000001. The value of the AMSGrad Boolean optimization parameter variable was set to false.

4.4 Results & analysis

Several levels of experiments were conducted to build the LWSNet model. The experiment was performed on single as well as double, triple, and quadruple stacking experiments. A total of 60 experiments were conducted during the experimentation period. To demonstrate the performance of LWSNet, we used different statistical approaches.

Table 4 Accuracy and loss values for the three datasets and the combination of three datasets in the quadruple stack model for diverse learning rates

To test the underfit and overfit of single architectures, the accuracies and corresponding losses were presented in Fig. 4. It is seen that using mobileNetV2 training and validation loss were almost null, whereas, in InceptionResNetV2, there is a validation loss of 0.17%. Similarly, VGG19 and Custom CNN loss were generated at 0.08 and 0.27. In Table 4 the accuracies and losses corresponding to the learning rates are presented for the quadruple stack model on four datasets. Changing the learning rate is very significant in building a better DL model. We observed that accuracy gradually increased and loss decreased when learning rates were decreased.

Fig. 4
figure 4

Training accuracy and training loss, validation accuracy and validation loss for individual lightweight models on dataset 1

The results of individual lightweight architecture: MobileNetV2 (Mbl), VGG-19 (Vgg), InceptionResNetV2 (Incp) and Custom lightweight CNN (CLCNN) are depicted in Fig. 5, also represent other three stacking experiment results. In the four different individual lightweight models, the average accuracy of the datasets 1, 2, 3, and combined is 96.22%, 93.95%, 94.52%, and 95.84%, respectively. The average values of MobileNetV2, VGG-19, InceptionResNetV2, and CLCNN for those four datasets are 94.00%, 94.48%, 95.74%, and 95.39%, respectively.

Fig. 5
figure 5

Consolidated accuracy graph to single architecture to quad stack architecture. The individual model’s abbreviation within brakets is MobileNetV2(Mbl), VGG19(Vgg), InceptionResNetV2(Incp), and custom lightweight CNN (CLCNN) and ‘+’ sign indicate that models are concatenated for ensemble stacking

In this experiment, the batch size of 32, initial learning rate 0.001, optimizer RMSprop, activation function ReLu, and epoch size 100 were considered. We utilized these parameters for training four individual architectures. The results of four datasets showed that our CLCNN architecture performed well as compared to the pre-trained lightweight architecture. In the CLCNN architecture, only 42,899 trainable parameters were used, which were 2.24, 4.27, and 8.64 times less than mobileNetV2, InceptionResNetV2, and VGG19, respectively. The parameters are tabulated in Table 2. Stacking approaches were divided into three phases according to the proposed lightweight stack ensemble architecture. The stacking process in the first phase was done by selecting two lightweight models among the 4 variants. Using these selection criteria, a total of six combinations are there in a dataset. As a result, using the double-stack ensemble technique, a total of 24 experiments were carried out on four datasets. Compared to individual models, in the double-stack ensemble model, the accuracies improved by 0.32%, 1.54%, 1.14%, and 1.22% for dataset 1, dataset 2, dataset 3, combined, respectively. For dataset 2, the double-stacked architecture performed better than the single-lightweight architecture in terms of average accuracy. In the second step, we took three distinct models and combined them into a single stack. Three lightweight models were selected from four types of the triple stack model. The average accuracy gained by the triple stack model compared with the double stack was 0.88%, 0.68%, 0.36%, and 0.51% for datasets 1, 2, 3, and combined, respectively.

Testing was conducted further by changing the number of epochs from 50 to 300 with 50 epoch intervals considering the batch size of 32. But, the system’s performance didn’t increase for the increasing number of an epoch. Considering 100 epochs, we experimented again by changing the batch size by 64, 128, and 256. For 128 batch size, the quadruple stack block returns the highest accuracies of 98.54%, 96.50%, 97.41%, and 98.10%, on dataset 1, 2, 3 and combined, respectively.

The accuracy value of a system cannot be the only measure of its performance. We further calculated other statistical measures to check the architecture performance. We used nine statistical metrics to evaluate the LWSNet architecture’s efficiency in terms of correctly and incorrectly identified X-ray images with their respective diseases. From the Table 5 we observed that false positive rate is 0.0009, 0.0181, 0.0026 and 0.0010 for Covid-19 classes in Dataset 1, 2, 3, and Combined, respectively.

Table 5 Quad stack performance evaluation for different datasets

4.5 Error analysis

It was observed from Table 4 that changing the learning rates 0.001, 1.00e-06, and 1.00e-05, the lowest error rate of 1.46%, 2.59%, 3.49%, and 1.90% were generated for a quadruple stack using dataset 1, 2, 3, and combined, respectively. The correctly and incorrectly classified samples for each respective dataset can be understood easily from the confusion matrices as shown in Fig. 6. It is observed that for dataset 1 in Fig. 6(a) the lowest error was generated, i.e., 26 samples out of 1285 sample size were misclassified. As a result of this sample size, one instance of Covid-19 was misclassified as pneumonia, and one instance of normal was misinterpreted as Covid-19, while three instances of pneumonia were wrongly classified as Covid-19. Similarly, in the combined dataset, 22 samples were misclassified to detect Covid-19, while 30 samples were misclassified to detect pneumonia (Fig. 6(d)).

Fig. 6
figure 6

Confusion matrix of quad stack architecture of (a) Dataset 1, (b) Dataset 2, (c) Dataset 3, and (d) Combined dataset

In Fig. 7, the misclassified instances of images are shown. The possible reasons of misclassification are noisy, blur, and opaque images.

Fig. 7
figure 7

The first column represents the target class, and misclassified samples of the target class are represented by the second and third columns. Here, the first row and the first column represent the target class of Covid-19. In the same row, the second and third columns represent misclassified samples that are normal and pneumonia, similarly in the second and third rows

4.6 Comparative study

We compared LWSNet with the standard lightweight deep learning models for dataset 1, 2, 3, and combined dataset. The results indicate that the proposed LWSNet model is most effective in four datasets, particularly in the combined dataset. The accuracy of LSWNet was improved by 3.77%, 2.07%, 2.88%, and 2.09% when compared with lightweight MobileNet, InceptionResNetV2, and CLCNN in the combined dataset, respectively. In Table 6, along with accuracy, precision, sensitivity, and f1-score, the inference time for a batch size of 128 is also presented for single models and LSWNet.

Table 6 Evaluation metrics of lightweight deep learning models with LWSNet on given X-ray images datasets

The proposed architecture was compared with other state-of-the-art architecture. In Table 7, it is seen that the accuracies were improved by 4.54%, 0.93%, and 0.28% and the F1-score on dataset 1 were improved by 4.48%, 1.12%, and 1.48% in comparison with the article of Gour et al. [16], Abdar et al. [1], and Jain et al. [23], respectively. Also, comparing the techniques [3, 27, 33, 46], with this proposed method for dataset 2, the accuracies got improved by 0.73%, 5.86%, 1.39%, and 0.50%. But, compared with the work of Aggarwal et al. [2] a loss of 0.50% of accuracy was also observed. Using the same dataset, the f1-scores of our method comparing with Rahman [46] and Lafraxocite [33] were improved by 1.11% and 1.97%, respectively. For dataset 3, the f1-scores in our technique were improved by 0.48%, 5.49%, 0.66%, and 4.46%, and the accuracies were gained by 0.40%, 6.91%, 8.41%, and 0.98% comparing the results of the articles of [7, 8, 30, 34], respectively.

Table 7 Comparison with same public X-ray images dataset by using different evaluation metrics

5 Conclusion

In this work, we presented LWSNet, a lightweight stack ensemble architecture to segregate Covid and non-Covid pneumonia. We explored truncated versions of three state-of-the-art networks, namely MobileNetV2, VGG19, InceptionResNetV2, and CLCNN architecture for the said problem. Further, the stack ensemble technique was employed on these four tailors-made models. Three different stacking techniques, namely double, triple and quadruple stacking, were experimented. Among all, the quadruple stack ensemble produced the highest accuracy of 97.28%, 96.50%, 97.41%, and 98.54%, using datasets 1, 2, 3, and combined, respectively, which outperform the state-of-the-art.

Our plan for the future is threefold: (i) the experiments will be carried out on other radiological images such as CT and MRI, (ii) different advanced deep learning architectures such as generative adversarial network, attention-based encoder-decoder, zero-shot learning, etc., will also be explored, and (iii) fusion of deep and handcrafted features, and clinical information (upon availability) will also be evaluated.