Introduction

Pneumonia is a common respiratory infection caused by multiple types of bacteria, viruses, and fungi. It is a leading cause of morbidity and mortality worldwide, particularly among infants under the age of five and the elderly. According to the WHO1, there were 1.4 million pneumonia-related fatalities among children under five in 2018. Chest X-ray imaging is commonly used to diagnose pneumonia, as it can reveal important signs such as increased lung opacity and consolidation. However, chest X-rays (CXRs) can be difficult to interpret because pneumonia findings can be subtle and overlap with those of other lung diseases. Rapid and accurate diagnosis of pneumonia is essential for expediting treatment and improving patient outcomes. Interpreting radiological images, such as chest X-rays or CT scans, requires specialized training and can be time-consuming. In recent years, there has been significant interest in developing machine-learning models that assist physicians in diagnosing pneumonia from chest X-ray images. These techniques have shown promising results and may improve the efficacy and accuracy of pneumonia diagnosis.

Deep learning (DL)2,3,4,5 has been utilized to detect pneumonia6,7,8,9,10 by training a convolutional neural network (CNN) on a dataset of chest X-ray images. As shown in Fig. 1, the CNN can learn to recognize patterns and features associated with pneumonia, such as clouded lung areas. The model can then be used to classify new X-ray images as normal or pneumonia. Multiple studies11,12,13,14 have demonstrated the efficacy of this method in detecting pneumonia with a high degree of accuracy. An attention mechanism in DL15,16,17,18,19,20,21 is a technique used in neural networks to selectively focus on certain portions of an input rather than processing the entire input equally. In image detection and classification, attention mechanisms can be utilized to concentrate the network's attention on the regions of an image that are most important for making a classification decision. This can help the network improve its accuracy and reduce its computational requirements. Vision Transformer (ViT) models are a variant of the Transformer architecture22,23,24,25,26, which was originally designed for NLP applications. They have been adapted for image classification by treating an image as a sequence of patches that are processed by the transformer's attention mechanism. ViT models have matched or outperformed state-of-the-art (SOTA) techniques on a broad variety of image classification tasks, making them excellent candidates for the pneumonia diagnosis task.

Figure 1
figure 1

A sample CXR (normal and pneumonia) image.

Motivation

Using a Vision Transformer architecture for pneumonia detection from CXRs is motivated by the need for timely detection of this severe respiratory disease. Globally, pneumonia is one of the leading causes of mortality, and early diagnosis and treatment are crucial for improved patient outcomes. Traditional evaluation of CXRs to diagnose pneumonia is time-consuming and requires specialized medical knowledge, which can lead to diagnostic errors and treatment delays. In response to these challenges, DL techniques such as CNNs and RNNs have been developed to automate the detection of pneumonia from CXRs. However, these methods can be inadequate for analyzing complex medical images. The ViT architecture has demonstrated exceptional efficacy in a variety of vision tasks, including image classification and object detection. It is a viable candidate for pneumonia detection from CXRs because it can extract both global and local image features. Utilizing the power of self-attention mechanisms, ViT is able to effectively capture complex patterns and relationships in X-ray images, resulting in improved pneumonia detection accuracy and reliability. Therefore, the goal of utilizing the ViT architecture for pneumonia detection from CXRs is to overcome the limitations of conventional methods and improve the precision and efficacy of DL models for medical imaging analysis. Vision Transformer architectures differ substantially from CNN architectures. Transformer-based architectures were initially designed for sequence-to-sequence tasks in natural language processing, such as machine translation, text summarization, language modeling, and sentiment analysis. These architectures have been adapted into the Vision Transformer architecture so that they are suitable for image classification and analysis.

The contributions of this work are summarized as follows.

  • In this investigation, we propose a ViT-based architecture for pneumonia detection in CXR. This architecture will be designed to effectively manage the large and complex medical images that are typical in CXR and will be capable of detecting pneumonia with precision.

  • We will compare the accuracy of the proposed ViT architecture with that of existing DL techniques. This will provide a thorough analysis of the benefits and drawbacks of our proposed approach compared to existing methodologies.

  • We will evaluate the efficacy of the proposed ViT architecture using a CXR dataset that is publicly available. This will entail training and testing the model using a set of performance metrics, including accuracy, recall, precision, and F1 score, to measure its performance.

We will present the proposed ViT architecture's performance evaluation findings and analysis. This will include a discussion of any limitations of the proposed model and recommendations for improving its efficacy through future work.

Organization of the paper

The rest of the paper is structured as follows: Sect. 2 discusses the background and working principle of the proposed architecture and other variants of the Vision Transformer architecture. Section 3 presents recent applications and a review of related studies. Section 4 describes the dataset characteristics and the proposed architecture. Section 5 discusses the experiment specifications, results, and prospects of the Vision Transformer architecture, followed by Sect. 6, which presents the conclusion.

Background and methodology

In this section, the paper builds the foundation for the proposed architecture.

Transformer architecture

The transformer architecture is a neural network27 designed for natural language processing tasks such as language translation, language modeling, and text summarization. The central concept of the transformer architecture is the self-attention mechanism, which assesses the relative relevance of the various words or sub-phrases in a given input. This is achieved by computing a "query," "key," and "value" for each word or sub-phrase and then taking a weighted sum of the values, with weights based on the similarity between the query and the keys. Additionally, the transformer architecture utilizes a multi-head attention mechanism28,29,30 to attend to various input positions simultaneously. In addition to the self-attention mechanism31, a feed-forward neural network processes the output of the self-attention layer to produce a richer representation in the Transformer model. The architecture also uses positional encoding to convey the position of each element in the input sequence.
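As a concrete illustration, the scaled dot-product attention described above can be written in a few lines of PyTorch. This is a minimal sketch with illustrative tensor shapes, not code taken from any reference implementation.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Weighted sum of the values, weighted by query-key similarity."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # query-key similarity
    weights = F.softmax(scores, dim=-1)                  # attention weights
    return weights @ V                                   # weighted sum of the values

# Example: a sequence of 10 tokens with 64-dimensional queries, keys, and values.
Q = K = V = torch.randn(1, 10, 64)
out = scaled_dot_product_attention(Q, K, V)              # shape (1, 10, 64)
```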

Vision transformer derived from generic transformer architecture

The Vision Transformer replaces the original transformer's self-attention mechanism with a spatial attention mechanism32 designed to handle the two-dimensional grid structure of images. This enables the model to analyze and comprehend the spatial relationships between different image regions, making it an effective architecture for image classification and computer vision tasks. Images are processed through the Transformer model, which consists of spatial attention and a feed-forward neural network. The spatial attention mechanism applies attention to the image pixels, and the feed-forward neural network is applied to the output of the attention mechanism. In addition, this model uses a patch-based strategy in which an image is divided into smaller segments and the model learns to focus on each patch separately. This allows the model to extract granular features and improve its accuracy.

Working principle of Vision Transformer

The fundamental concept of a ViT is the self-attention mechanism, which exploits both global and local features by focusing on distinct portions of the image. The self-attention mechanism is implemented by stacking multi-head self-attention layers, known as transformer blocks. Each patch is converted into a corresponding 1-D vector and passed to the transformer. The transformer then uses self-attention to learn the relationships between the various regions, and the resulting representation is fed into a feed-forward neural network to make a prediction. Because the self-attention mechanism is not constrained by the spatial resolution of the input, one of the main advantages of ViT is its ability to handle images of arbitrary size. The model can be trained on large images, such as high-resolution medical images, without downsampling or cropping. Additionally, recent variants such as DeiT33,34, Swin-T35,36, and ReViT37 improve performance, reduce the number of parameters and computational cost, and make the architecture more efficient and scalable for practical applications.
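The workflow above can be summarized in a minimal PyTorch sketch: patches are embedded as 1-D vectors, a class token and positional encodings are added, transformer blocks are applied, and a classification head makes the prediction. All layer sizes below are illustrative assumptions, not the configuration used later in this paper.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal ViT-style classifier: patches -> 1-D embeddings -> transformer blocks -> prediction."""
    def __init__(self, img_size=224, patch=16, d_model=192, depth=4, heads=3, classes=2):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # A Conv2d with stride == kernel size flattens and linearly projects each patch.
        self.embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(d_model, classes)

    def forward(self, x):                                # x: (B, 3, H, W)
        x = self.embed(x).flatten(2).transpose(1, 2)     # (B, N, D) patch embeddings
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos        # prepend class token, add positions
        x = self.encoder(x)                              # stacked self-attention blocks
        return self.head(x[:, 0])                        # classify from the class token

logits = MiniViT()(torch.randn(2, 3, 224, 224))          # (2, 2) normal vs. pneumonia scores
```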

Self-attention mechanism in Vision Transformer for image detection and classification

A Vision Transformer38,39 is a neural network that processes visual information using self-attention mechanisms. Similar to how the Transformer architecture is used in natural language processing (NLP), ViT employs attention mechanisms to weigh specific parts of an image in order to make accurate predictions. These networks excel at image classification and object detection.

Self attention techniques

Self-attention15 is a technique that enables a model to selectively concentrate its processing on particular regions of an image. In the context of images, self-attention is typically applied to feature maps generated by a CNN. Self-attention allows the model to determine the relative importance of various image regions by computing a set of attention weights for each region. These attention weights can then be applied to the feature maps before they are passed to the remainder of the network. There are numerous ways to incorporate self-attention into image models. A common technique is a multi-head self-attention mechanism, in which the model computes multiple sets of attention weights for various regions of the image and then combines them. This allows the model to consider the entire image when making a prediction rather than just the features of a specific region. Another approach is to use a transformer-based model in which the self-attention mechanism attends to various image regions when forming a prediction. The transformer-based model is trained to understand the relationships between multiple image regions and makes predictions based on this information.

Self-attention in DL for image processing can be categorized into two main modules: channel attention and spatial attention.

Spatial attention networks

In contrast to conventional CNNs, which process entire images and extract features from them, spatial attention networks32,40 process only particular regions of an image. This is accomplished by incorporating an attention mechanism that learns to weigh various image regions based on their significance to the current task. By selectively attending to the relevant areas of an image, spatial attention networks can achieve greater accuracy and efficiency when performing tasks such as image captioning, object detection, and visual question answering. In addition, the attention mechanism improves the interpretability of these networks by highlighting the regions of the image that the network is concentrating on for a given task.

Channel attention

Channel attention41,42 refers to a mechanism's ability to selectively focus on particular channels of the feature maps. Typically, this is carried out by computing a set of attention weights for each channel of the feature maps. These attention weights are then applied to the channels before they are passed through the remainder of the network, allowing the model to concentrate its prediction on the most informative channels. The combination of channel and spatial attention allows the model to predict using both spatial information (the location of the relevant region within the image) and channel information (the features extracted by the CNN). This results in models that are more robust and generalize better to unseen images.
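The two attention types can be sketched as small PyTorch modules in the style of squeeze-and-excitation and CBAM; this is an illustrative formulation, not necessarily the exact one used in the cited works.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Per-channel weights computed from globally pooled features (SE-style)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                              # x: (B, C, H, W) CNN feature maps
        w = self.fc(x.mean(dim=(2, 3)))                # (B, C) channel weights
        return x * w[:, :, None, None]                 # re-weight informative channels

class SpatialAttention(nn.Module):
    """Per-location weights computed from channel-pooled features (CBAM-style)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.conv(pooled))    # highlight important regions
```

Applying ChannelAttention followed by SpatialAttention to CNN feature maps gives the combined channel-plus-spatial behaviour described above.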

Variants of Vision Transformer

Several customizations of ViT have been explored to improve its performance or to fit particular applications. The main customization methods include the following.

Patch size

The ViT architecture linearly embeds fixed-size patches of the input image, and the patch size affects model performance. Larger patches capture global context but lose fine-grained details, while smaller patches may fail to capture global context. An optimal patch size is therefore chosen empirically to balance the two.

Positional encoding

ViT incorporates spatial information into the model via learnable positional encodings. These encodings help the model understand the placement of image patches. ViT performance can be improved with sine/cosine, spatial, or learned positional encodings.
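As a brief illustration, fixed sine/cosine encodings can be generated as below, while learned encodings are simply trainable parameters; the dimensions here are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(n_positions, d_model):
    """Fixed sine/cosine positional encodings (one alternative to learned ones)."""
    pos = torch.arange(n_positions).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(n_positions, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe                                            # (n_positions, d_model)

pe = sinusoidal_positions(197, 768)                      # e.g. 196 patches + 1 class token
learned_pe = nn.Parameter(torch.zeros(1, 197, 768))      # learned alternative, as in the original ViT
```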

Architectural variations

To improve ViT, researchers have tried several architectural variations. The Pyramid Vision Transformer (PVT) is a hierarchical modification that captures multi-scale information. The Convolutional Vision Transformer (ConvViT) combines self-attention and convolutional layers to exploit both local and global information.

Training methods

ViT performance and convergence have been improved using various training methods, including data augmentation, regularization (dropout, weight decay), and advanced optimization algorithms (Adam, RMSprop). Pre-training on ImageNet and transfer learning43,44 have also been used to initialize ViT models.

Hybrid models

Hybrid designs integrate convolutional neural networks (CNNs) and Vision Transformers (ViTs) for tasks such as pneumonia detection in chest X-ray images. A CNN is first used as the feature extractor, with its fully connected layers removed while its convolutional and pooling layers are retained. The CNN-generated feature maps are then divided into non-overlapping patches, and each patch is converted into a high-dimensional embedding vector. These embeddings, which represent local characteristics, are fed into the ViT model to capture global dependencies and contextual information across the entire image. A classification head is appended to the ViT output for the final prediction. The entire hybrid model, comprising the CNN feature extractor and the ViT, is trained end to end on labeled data, with fine-tuning strategies tailored to the specific dataset and computational resources available. This approach maximizes the extraction of both local and global information, optimizing performance for complex image analysis tasks. In short, the transformer processes CNN-extracted features, combining the strengths of CNNs (local feature extraction) and transformers (global context modeling). The Pyramid Vision Transformer (PVT) captures multi-scale information hierarchically, processing features at varying resolutions across multiple stages so that the model effectively captures local and global information. The Convolutional Vision Transformer (ConvViT) combines the self-attention mechanism with convolutional layers: self-attention models the global context, while convolutional layers capture local patterns, improving the model's handling of local and global information.
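A hedged sketch of such a hybrid is given below: a ResNet-18 backbone (an illustrative choice, not the specific CNN used in the cited hybrids) supplies local feature maps whose spatial locations serve as patch embeddings for a transformer encoder.

```python
import torch
import torch.nn as nn
from torchvision import models

class HybridCNNViT(nn.Module):
    """CNN feature extractor + transformer encoder + classification head."""
    def __init__(self, d_model=256, heads=4, depth=2, classes=2):
        super().__init__()
        resnet = models.resnet18(weights=None)                          # pre-trained weights could be loaded instead
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])    # keep conv/pooling, drop avgpool + fc
        self.proj = nn.Conv2d(512, d_model, kernel_size=1)              # patch-embedding projection
        layer = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(d_model, classes)

    def forward(self, x):                                # x: (B, 3, H, W)
        f = self.proj(self.backbone(x))                  # (B, D, h, w) local features
        tokens = f.flatten(2).transpose(1, 2)            # (B, h*w, D) sequence of "patches"
        tokens = self.encoder(tokens)                    # global dependencies via self-attention
        return self.head(tokens.mean(dim=1))             # pooled representation -> prediction
```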

Attention mechanism

The ViT architecture relies on attention mechanisms, and customizations of the attention mechanism include Long-Range Arena (LRA) attention, axial attention, and shifted attention. LRA attention efficiently handles long-range dependencies in the input image, helping the model capture global context even when patches are far apart.

Axial attention captures dependencies along the image axes (rows and columns), while shifted attention modifies self-attention to capture shifted or offset patch dependencies, helping the model manage spatial transformations of the data.

To achieve state-of-the-art performance and improved convergence, researchers have experimented with the following pre-trained Vision Transformer architectures.

DeiT (data-efficient image transformers)

DeiT34 uses self-attention mechanisms and patch-based processing to match or outperform CNNs on image tasks with less labeled training data. Self-attention computes attention weights over smaller image patches to efficiently capture long-range relationships and grasp the global context. The models are pre-trained on large datasets to learn general visual representations and then fine-tuned on smaller, task-specific datasets. Visual characteristics and hierarchical representations help the model transfer pre-trained knowledge to the target task, while dropout and data augmentation increase generalization. Data-efficient image transformers thus use self-attention, patch-based processing, pre-training, fine-tuning, transfer learning, and regularization to perform well on image tasks with limited labeled data.
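In practice, a pre-trained DeiT checkpoint can be loaded and adapted to a two-class chest X-ray problem with a few lines; the snippet below assumes the timm library and is a fine-tuning sketch rather than the exact setup used in this paper.

```python
import timm
import torch.nn as nn

# Load DeiT-Base (patch size 16, 224x224 input) pre-trained on ImageNet.
model = timm.create_model('deit_base_patch16_224', pretrained=True)
# Replace the classification head for binary normal-vs-pneumonia prediction.
model.head = nn.Linear(model.head.in_features, 2)
```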

Swin-T

The Swin Transformer36,45, a newer architecture for image understanding, blends Transformers with ideas from CNNs. It converts the input image into non-overlapping patches that are processed by transformer layers. What makes the Swin Transformer unique is its hierarchical architecture, which organizes transformer layers into stages: lower stages process patch-level information, whereas later stages capture broader contextual information. This hierarchy efficiently captures local and global dependencies in the image, and shifted-window operations help the model capture spatial relationships across windows. By combining the Transformers' self-attention mechanism with CNN-like efficient processing, the Swin Transformer achieves state-of-the-art results on image classification, object detection, and semantic segmentation with fewer computational resources than other transformer-based models.

ReViT

The Resizable Vision Transformer (Resizable-ViT)37 allows the ViT architecture to accommodate inputs of different resolutions. Traditional ViT models require fixed-size inputs, which can limit their adaptability in real-world applications with varied image sizes. Resizable-ViT addresses this problem with "token shifting" and "layer dropping." Token shifting scales the input image and adapts the position and token embeddings to the new resolution. Layer dropping skips architectural layers depending on the input resolution, reducing the computational cost for smaller inputs. By dynamically adapting to input sizes, Resizable-ViT efficiently processes images of varied resolutions while performing well on image recognition tasks.

All of these variants have been shown to enhance the performance and efficiency of Vision Transformers and have been applied to a variety of tasks, including image recognition, object detection, and medical imaging, with SOTA results.

Recent applications of Vision Transformer architecture

The Vision Transformer (ViT) has attracted great interest for computer vision tasks due to its capacity to process images with high precision and efficiency, and several recent developments have extended the architecture. The DeiT model, which enhances the training of ViT models using data augmentation and distillation techniques, is one of the most significant innovations. Another is the Swin Transformer, which employs hierarchical representations to improve the performance of ViT models on large-scale image datasets. Recent Vision Transformer research has centered on a variety of applications, including the following.

Object detection and instance segmentation

The ViT architecture is promising for object detection and instance segmentation because it possesses several essential characteristics that make it suitable for these tasks. First, the self-attention mechanisms in ViT enable the model to learn global relationships between various image components, which can be used to identify and localize objects. Second, ViT can be trained on large datasets with many labeled examples, which is essential because these tasks require a large amount of data to learn the complex patterns involved. Finally, ViT can be fine-tuned for specific object detection or instance segmentation tasks46, allowing it to achieve high accuracy by adapting to the requirements of these tasks.

Dense predictions

Dense prediction is the task of predicting a pixel-wise output for an input image, such as semantic segmentation, where each pixel is designated as a specific object or background. The input image is divided into a series of non-overlapping segments for dense prediction, which are then flattened and fed into the ViT architecture. Self-attention allows ViT to capture spatial information across these regions, and the output is reshaped into a grid corresponding to the original image. One of the benefits of employing ViT for dense prediction is that it can learn to distinguish between objects of varying sizes and shapes without explicit object proposals or region-based features. ViT attends to all regions in the input image and learns to weigh their contributions based on their significance for the output. In addition, ViT can be trained end-to-end on large-scale datasets like ImageNet to acquire general features that can be transferred to downstream tasks such as dense prediction. This makes ViT an attractive design for dense prediction in situations with limited labeled data.

Self-supervised learning

ViT can also be used for self-supervised learning47,48, which requires no human annotations. Self-supervised learning teaches the model meaningful representations of the input data that can be used for classification, detection, and segmentation. One method is to train the model on a pretext task49, which allows the model to learn key characteristics from the input data. A common pretext task uses data augmentation to generate multiple views of the same image and trains the ViT model to predict which views match. Contrastive learning teaches the ViT model to distinguish between similar and distinct images: two images are supplied to the model, which is trained to predict whether or not they originate from the same image. In both cases, the ViT model learns features that are invariant to viewpoint, illumination, and other factors that affect the appearance of the input data. These learned features can be used to initialize supervised model weights or refined for downstream tasks.
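A SimCLR-style contrastive (NT-Xent) loss is one common way to implement the view-matching pretext task described above; the sketch below is illustrative and not the specific objective of the cited works.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Contrastive loss for two augmented views of the same batch (each of shape (B, D))."""
    B = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2B, D) unit-norm embeddings
    sim = z @ z.t() / temperature                         # (2B, 2B) cosine similarities
    sim.fill_diagonal_(float('-inf'))                     # exclude self-similarity
    # The positive pair for sample i is the other augmented view of the same image.
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)

# Usage (hypothetical): z1, z2 = vit(view1), vit(view2); loss = nt_xent_loss(z1, z2)
```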

Multi-modal learning

Recent research50 has examined the use of transformer-based architectures for multimodal unsupervised learning from raw video, audio, and text. Using self-supervised learning techniques, the idea is to build a transformer-based architecture capable of handling multiple modalities and of predicting the next frame, audio segment, or text token given the current one.

Efficient ViT architectures

Recent efforts have been made to make Vision Transformer models more efficient in terms of computation time and memory consumption. Researchers have proposed several architectures, such as the Separable Vision Transformer (SepViT)51 and the Reversible Vision Transformer (RViT)37,52, that achieve comparable or superior performance to conventional ViT models while being more efficient. SepViT blocks employ separable convolutions rather than conventional convolutions, reducing the cost of the self-attention mechanism, the most computationally expensive component of ViT; separable convolutions split conventional convolutions into depthwise and pointwise convolutions, requiring fewer parameters and computations. RViT augments the ViT design with reversible residual blocks, which can recreate input features from output features, making gradient calculation during backpropagation more memory-efficient. Reversible blocks therefore enable larger models under limited memory.

Explainable AI

ViT can be utilized in Explainable AI33 to provide insight into how an image classification decision is made. By using attention maps generated by ViT, it is possible to visualize which aspects of an image are most crucial to the classification decision. This information can be used to clarify the model's decision when communicating with humans.

In Table 1, the article summarizes recent contributions made for a range of tasks using Vision Transformer architecture.

Table 1 Insight into related recent research.

Material and methods

Dataset characteristics

In this investigation, we used a publicly available chest X-ray (CXR) dataset from Kaggle57,58, which has also been utilized in numerous other investigations. The dataset consists of three sections: train, test, and validation, each containing subfolders for pneumonia and normal CXRs. There are 5863 X-ray images in total, as shown in Table 2. The X-ray images in the dataset were acquired at the Women and Children's Medical Center in Guangzhou from children aged one to five as part of their routine medical examinations. To ensure quality, the X-ray images were screened by specialists to remove low-resolution or unreadable images. The remaining images were then evaluated by two physician specialists, with any discrepancies resolved by a third specialist. This procedure was performed so that an AI system could be trained to make precise diagnoses. 80% of the dataset has been allocated to the training set, 10% to the test set, and 10% to the validation set, as shown in Table 3.

Table 2 Class distribution of the dataset.
Table 3 Partitioning of training, testing, and validation datasets.

Proposed architecture

The proposed architecture uses patch embeddings, positional encodings, several Transformer encoder layers, self-attention, feed-forward neural networks, and a classification head to classify and analyze images, as shown in Fig. 2.

Figure 2
figure 2

The proposed system design architecture.

Input embedding

It requires reshaping the input image into patches, as shown in Fig. 3, and applying a linear transformation to obtain the embeddings. Let the input image be denoted \(X\in R^{(H\times {\text{W}}\times {\text{C}})}\), where H, W, and C represent the height, width, and number of channels, respectively. Each patch has a dimension of P × P, and there are N patches in total. The input embedding can then be represented as \(E\in R^{(N\times {\text{D}})}\), where D is the dimension of the embeddings.

Figure 3
figure 3

Dataset input image in the form of smaller patches.

Positional encoding

The input embeddings are augmented with positional information to capture the relative and absolute positions of the patches. The positional encoding matrix \(P\in R^{(N\times {\text{D}})}\) is added element-wise to the input embeddings E.

Transformer encoder

Each layer of the Transformer encoder consists of a multi-head self-attention mechanism and a position-wise feed-forward network, as shown in Fig. 4.

  1. (a)

    Multi-head self-attention: The attention weights between the input embeddings are computed by the multi-head self-attention mechanism. It entails three linear transformations, Query (Q), Key (K), and Value (V), with Q, K, and \(V\in R^{(N\times {\text{D}})}\). The output of the self-attention mechanism is the weighted sum of the values, using the attention weights calculated by Eq. (1).

    $$Attention\left(Q,K,V\right)=softMax\left(\left(Q{K}^{T}\right)/\surd \left(Dh\right)\right)V$$
    (1)

    where Dh represents the dimension of each attention head.

  2. (b)

    Position-wise feed-forward network: The position-wise feed-forward network applies two linear transformations separated by a nonlinear activation function (such as ReLU). Let the output of the attention mechanism be denoted \(A\in R^{(N\times {\text{D}})}\). The position-wise feed-forward network is then given by Eq. (2).

    $$FFN\left(A\right)={\text{max}}\left(0,A\times W1+b1\right)\times {\text{W}}2+{\text{b}}2$$
    (2)

    where \(W1\in R^{(D\times {\text{dFFN}})},b1\in R^{(1\times {\text{dFFN}})},W2\in R^{(dFFN\times {\text{D}})},b2\in R^{(1\times {\text{D}})}\).

Figure 4
figure 4

Internal design of a transformer encoder.

These two sub-layers are applied to the input sequence and combined through residual connections to generate the encoder layer's output. The process is repeated multiple times to form a stack of encoder layers, where each encoder layer builds upon the representation learned by the preceding one, enabling the model to learn increasingly complex and generalized representations of the input sequence.
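A compact PyTorch rendering of one such encoder layer, combining the attention of Eq. (1) with the feed-forward network of Eq. (2), is shown below. The post-attention layer-normalization and residual placement are common conventions assumed here, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Multi-head self-attention (Eq. 1) followed by a position-wise FFN (Eq. 2)."""
    def __init__(self, d_model=768, n_heads=12, d_ffn=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn),   # A W1 + b1
                                 nn.ReLU(),                    # max(0, .)
                                 nn.Linear(d_ffn, d_model))    # . W2 + b2
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                     # x: (batch, N patches, D)
        a, _ = self.attn(x, x, x)             # softmax(Q K^T / sqrt(Dh)) V
        x = self.norm1(x + a)                 # residual connection
        x = self.norm2(x + self.ffn(x))       # residual connection
        return x
```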

Classification layer

This layer utilizes the encoder layers' output to predict pneumonia's presence or absence. This prediction may be made using a fully connected or convolutional layer.

Loss function

This component evaluates the model's performance based on the predicted and actual labels. Binary cross-entropy loss is a common choice for this task.

Ethical standards

No human participants were involved in the study. The dataset is publicly available on the Internet.

Result and discussions

Performance indicators

Various evaluation metrics are used to measure the effectiveness of machine learning models, each with its own benefits and drawbacks. The most prevalent metrics include the following.

Accuracy

This is one of the most commonly used metrics for evaluating a model and is defined as the proportion of correct predictions to the total number of predictions made by the model. It is evaluated using Eq. (3).

$${\text{Accuracy}}=\frac{(\mathrm{True\, Positives }+\mathrm{ True\, Negatives}) }{(\mathrm{True\, Positives }+\mathrm{ True\, Negatives }+\mathrm{ False\, Positives }+\mathrm{ False\, Negatives})}$$
(3)

Precision

Higher-precision classifiers produce fewer false positives. High precision reduces the likelihood of misclassifying negative instances as positive, which is important in applications where false positives have severe consequences. Precision is calculated by Eq. (4).

$${\text{Precision}}=\frac{\mathrm{True\, Positives}}{(\mathrm{True\, Positives }+\mathrm{ False\, Positives})}$$
(4)

Recall (sensitivity or true positive rate)

Classifiers with higher recall produce fewer false negatives, capturing more of the positive cases, whereas a classifier with lower recall misses more positive cases. Recall is determined by Eq. (5).

$${\text{Recall}}=\frac{\mathrm{True\, Positives}}{(\mathrm{True\, Positives }+\mathrm{ False\, Negatives})}$$
(5)

F1 Score

The F1 score is the harmonic mean of precision and recall, balancing the two, and is calculated using Eq. (6).

$${\text{F1\,Score}}=\frac{2\times ({\text{Precision}}\times {\text{Recall}})}{({\text{Precision}}+{\text{Recall}})}$$
(6)

ROC curve

ROC curves evaluate binary classification models by showing how well the model separates positive and negative cases across classification thresholds. The shape and position of the ROC curve indicate the model's discriminative ability, illustrating the trade-off between the true positive rate and the false positive rate as the classification threshold changes. A larger area under the curve (AUC) indicates better discrimination and model performance.

Confusion matrix

The confusion matrix tabulates classification model performance by comparing predicted labels to true labels and showing the different classification outcomes. True positives (TP) and true negatives (TN) are cases that were predicted correctly, while false positives (FP) and false negatives (FN) are misclassified cases. From these values, performance metrics including accuracy, precision, recall, and F1 score can be derived, as shown below.
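For reference, Eqs. (3) to (6) can be computed directly from the four confusion-matrix counts; the helper below is a simple sketch of that calculation.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive accuracy, precision, recall, and F1 (Eqs. 3-6) from confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```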

Model’s training

To demonstrate our proposed architecture, we experimented with a benchmark dataset of CXR images for binary classification, one of the most frequently downloaded datasets on Kaggle. The experiments were run on a Windows server with an i5 CPU, 2 GB GPU, and 8 GB RAM, with Python 3.7, Anaconda/3, and CUDA/10 installed. In addition, the Python libraries PyTorch, OpenCV, matplotlib, os, math, and NumPy were used. During training, the data is partitioned into batches, and the model's parameters are updated based on each batch's average loss. The batch size dictates the number of samples utilized during each update step: a larger batch size can speed up training but may require additional memory. CrossEntropyLoss was chosen as the experiment's loss function; during training, the model minimizes this loss, which computes the negative log-likelihood of the predicted class probabilities given the actual labels. The Adam optimizer was used to adapt the learning rate for each parameter based on estimates of the first and second moments of the gradients. PyTorch was used for the implementation, and training was conducted in a GPU environment. The learning rate determines how much the model parameters are updated with each optimizer iteration, and a multiplicative learning-rate factor is used to adjust the learning rate at each epoch, enabling more granular control during training and helping the model converge to a better solution. Table 4 lists the experiment's hyperparameter settings. The novelty of our work lies in the application of the Vision Transformer (ViT), specifically utilizing the DEIT_Base_Patch16_224 pre-trained weights, to the domain of medical imaging for pneumonia detection. While ViT has shown promise in various fields, its adaptation to medical imaging, especially chest X-ray analysis, is relatively unexplored. Our approach capitalizes on ViT's ability to capture intricate spatial relationships in images, offering advantages over traditional methods. We demonstrate improved performance and potential for enhanced pneumonia detection accuracy, marking a significant contribution to the field of medical image analysis.

Table 4 Hyper-parameter setting used in the experiment.

A model's performance depends on these hyperparameters and others. To enhance model performance, selecting hyperparameter values requires careful analysis and experimentation. For optimal performance, hyperparameters must be explored and fine-tuned based on task, dataset, and model architecture.
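The training procedure described above can be summarized in a short PyTorch sketch. The dataset path, batch size, learning rate, and epoch count below are illustrative assumptions; the actual values are those reported in Table 4.

```python
import timm
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_ds = datasets.ImageFolder('chest_xray/train', transform=tfm)        # hypothetical dataset path
train_dl = DataLoader(train_ds, batch_size=16, shuffle=True)

model = timm.create_model('deit_base_patch16_224', pretrained=True)       # DEIT_Base_Patch16_224 weights
model.head = nn.Linear(model.head.in_features, 2)                         # normal vs. pneumonia
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)  # multiplicative LR factor

for epoch in range(10):
    model.train()
    for images, labels in train_dl:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)    # cross-entropy between predictions and labels
        loss.backward()
        optimizer.step()
    scheduler.step()                               # decay the learning rate each epoch
```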

Performance evaluation

The model's train-validation accuracy versus epoch curve shows its learning and generalization. If training accuracy increases while validation accuracy plateaus or falls, this indicates overfitting; convergence and high accuracy for both curves indicate effective learning and generalization. Similarly, the train-validation loss versus epoch curve reflects model optimization: the model fits the data better as training and validation loss decrease, whereas a decreasing training loss accompanied by an increasing validation loss indicates overfitting. Convergence and low loss for both curves suggest error minimization and good generalization.

Table 5 presents the performance delivered by the proposed approach, and Figs. 5 and 6 show the relationship between accuracy and epoch and between loss and epoch, respectively. Figures 5 and 6 show that during training, validation accuracy gradually improves along with test accuracy, reaching 97.61%, and the other performance indicators also show strong results.

Table 5 Performance delivered by the proposed model.
Figure 5
figure 5

Accuracy variation vs epoch curve.

Figure 6
figure 6

Loss vs epoch curve.

Confidence intervals test

This is a statistical tool used to estimate the range within which a performance metric, such as accuracy, sensitivity, or specificity, is likely to lie. It provides a range of values that likely contains the true value of the parameter, together with a level of confidence.

The confidence interval (CI) is calculated using the formula described in Eq. (7).

$$\mathrm{Accuracy CI}=\mathrm{Accuracy }\pm {\text{Z}}\times \sqrt{\frac{{\text{Accuracy}}\times (1-{\text{Accuracy}})}{\mathrm{sample\, size}}}$$
(7)

Z is the z-score corresponding to the desired confidence level. For example, for a 95% confidence level, the Z-score is approximately 1.96.
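A quick sketch of this calculation, assuming the 586 test samples reported later, is shown below; rounding and the exact sample size used can shift the bounds slightly.

```python
import math

def accuracy_confidence_interval(accuracy, n, z=1.96):
    """Confidence interval for accuracy using Eq. (7) (normal approximation)."""
    margin = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - margin, accuracy + margin

print(accuracy_confidence_interval(0.9761, 586))   # roughly (0.964, 0.988)
```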

Interpretation

With the accuracy reported as 97.61% and a 95% confidence level, the confidence interval is between 96.2 and 98.9%. This means we can be 95% confident that the true accuracy of our proposed model lies within this range.

Matthews correlation coefficient (MCC)

The Matthews correlation coefficient (MCC) is a measure used in machine learning to evaluate the quality of binary classification. The formula for MCC is described in Eq. (8).

$$MCC=\frac{TP\times TN-FP\times FN}{\sqrt{\left(TP+FP\right)\left(TP+FN\right)\left(TN+FP\right)(TN+FN)}}$$
(8)

From the confusion matrix on the test data.

\(\begin{aligned} & {\text{TP}} = 152,\;{\text{TN}} = 420,\;{\text{FP}} = 6,\;{\text{FN}} = 8 \\ & {\text{MCC}} \approx 0.9396 \\ \end{aligned}\)

The Matthews correlation coefficient (MCC) typically ranges from − 1 to + 1:

  • + 1 indicates a perfect prediction,

  • 0 suggests a random prediction,

  • − 1 indicates a total disagreement between prediction and observation.

In this case, an MCC of approximately 0.9396 indicates a very strong positive correlation between the predicted and actual classifications. This suggests an excellent classification performance for the model used.
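The reported value can be checked directly from the confusion-matrix counts using Eq. (8):

```python
import math

tp, tn, fp, fn = 152, 420, 6, 8     # counts from the test-set confusion matrix (Fig. 7)
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(round(mcc, 4))                 # 0.9396
```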

The confusion matrix in Fig. 7 shows that out of 586 samples in the test data, our proposed model produced 152 TP, 420 TN, 6 FP, and 8 FN cases, corresponding to a test accuracy of 97.61%. The variation of precision and recall is shown in Figs. 8 and 9, which indicate that recall converges after 15 epochs while precision converges after 35 epochs. The ROC curve of the suggested architecture, depicted in Fig. 10, yields an AUC value of 0.96, denoting the capability of our proposed model to identify the presence or absence of pneumonia. A precision-recall value of 0.94, depicted in Fig. 11, suggests that the model accurately predicts positive instances while capturing a substantial proportion of the true positive instances, although the precise interpretation may differ depending on the application domain and the particular objectives of the classification task.

Figure 7
figure 7

Confusion matrix based on test data for the proposed model.

Figure 8
figure 8

Model precision versus epochs.

Figure 9
figure 9

Model recall versus epochs.

Figure 10
figure 10

ROC curve with AUC 0.96 of proposed work.

Figure 11
figure 11

Precision–recall curve of the proposed method.

Discussion

Table 6 presents the performance of pre-trained CNN architectures with all hyper-parameter values kept the same to enable a comparison on the same dataset. It shows that the Vision Transformer architecture offers a substantial improvement over all other architectures. The proposed architecture achieves an accuracy of 97.61% and an AUC of 0.96, but this performance comes at the cost of training time, which was somewhat longer than that of the other architectures.

Table 6 Performance evaluation relative to other architectures utilizing the same dataset.

Research prospects in Vision Transformer

Vision Transformer (ViT) architecture research prospects for image classification hold tremendous potential for advancing the field. Future research can concentrate on enhancing the performance of ViT models by optimizing their architecture, refining training strategies, and investigating novel techniques to improve precision, robustness, and efficiency. In addition, efforts can be focused on developing interpretability methodologies for ViT models, allowing for a better comprehension of their decision-making process. It is possible to investigate efficient training and inference methods to reduce computational complexity and accelerate model deployment. Adapting ViT to scenarios with limited data using semi-supervised and few-shot learning techniques will increase its applicability. In addition, domain-specific extensions, hybrid architectures that combine ViT with other models, and real-world deployments will contribute to the advancement and practical application of ViT in image classification tasks.

Conclusion

The article conducts a thorough analysis of a Vision Transformer (ViT) framework for pneumonia detection in chest X-rays. ViTs' ability to analyze complex image relationships is showcased, demonstrating superior performance over traditional CNNs and other advanced techniques. ViTs excel in capturing global context, spatial relations, and handling variable image resolutions, leading to accurate pneumonia detection. The study aims to assess this method's effectiveness by comparing it to state-of-the-art models on a diverse CXR dataset. The results reveal ViT's superiority with an accuracy of 97.61%, sensitivity of 95%, and specificity of 98%. In conclusion, the ViT-based approach holds promise for early pneumonia detection in CXRs, offering substantial development potential in this field. However, limitations include data scarcity and the need for real-world validation. Future directions encompass enhancing interpretability, addressing model robustness, and conducting clinical trials for practical deployment.