Introduction

Chest X-ray remains the most commonly used modality for diagnosing various thoracic diseases. It is inexpensive, and the equipment is easy to install, making it an excellent choice for wide deployment in developing or resource-poor regions of the world, where radiology services are highly insufficient. Since the global coronavirus disease 2019 (COVID-19) outbreak in 2020, chest X-ray has become a critical imaging application for disease screening worldwide [1]. However, the surge in chest X-rays during the pandemic produced a massive increase in imaging data, dramatically overloading frontline radiologists. Driven by this medical demand, many artificial intelligence (AI) diagnostic models, such as convolutional neural networks (CNNs), have been established and have played an essential role in combatting the pandemic [2,3,4]. They achieved good performance in COVID-19 detection, even comparable with that of radiologists [2,3,4].

However, these AI-based diagnostic models generally have two shortcomings in clinical practice. 1. Lack of independent multi-label classification capability. Although most AI models perform well in diagnosing a single disease or lesion (e.g., pneumonia or not, with or without lung nodule), real-world imaging diagnosis is usually a multi-label classification task, or so-called “One Check, Many Findings” [5, 6]. Coexisting diseases are common in real-world scenarios; for example, a typical chest X-ray can reveal more than one disease (e.g., pulmonary infiltration and cardiomegaly). However, multi-disease diagnosis of chest X-rays can be challenging for AI models because of the more complex patterns that may be present in the images. Two solutions can therefore be considered: (1) combining multiple pre-trained single-disease models, the approach taken by many AI platforms to achieve an ostensible multi-disease diagnosis, which markedly increases computing resources; or (2) establishing an independent multi-label AI diagnostic model with state-of-the-art deep learning methods, which can effectively reduce computing consumption and accelerate diagnosis. 2. Multi-label long-tail distribution issue. Real-world image samples usually present a long-tailed distribution. Typical negative samples (no findings) constitute the majority head category; in contrast, most disease samples fall into the tail categories and can only be collected in small amounts [7,8,9]. This imbalance makes model training seriously overfit the negative samples and ignore the disease features in positive samples, rendering training ineffective. Moreover, because the degree of imbalance varies among labels, accurate classification becomes even more difficult. For this reason, manual data curation or sharing of tail-class data to ensure intra- and inter-class balance is commonly applied in radiological model training [5, 9,10,11]. However, while a balanced distribution of disease classes may benefit model training, it may not accurately reflect the real-world data distribution and may compromise the model's generalization to subpopulations [6, 12].

Hence, utilizing a dataset that accurately represents the real-world distribution of diseases, even if it entails class imbalance and multi-disease diagnosis, could offer greater benefits. If effective solutions can be found for the challenges of multi-label classification and long-tailed distribution, an optimal multi-label diagnostic model for chest X-rays can be established, enabling radiologists to make more accurate diagnoses and improving examination efficiency globally. So far, most studies have focused on AI diagnosis with only 3 to 6 multi-label categories [7]. Although two previous studies explored multi-label classification of 8 and 13 diseases, both reported limited CNN performance, with the lowest area under the receiver operating characteristic curve (AUROC) of only 0.6 [7, 8]. In this study, we aim to achieve fourteen-disease classification in a long-tail dataset of chest X-rays, which has rarely been attempted. To improve AI diagnostic performance as the number of labels increases, we adopted three strategies: first, improving the algorithms (e.g., self-attention, channel attention) to strengthen learning ability [13, 14]; second, choosing or designing an appropriate loss (e.g., reweighting, focal loss) to focus learning more on tail and hard samples [8, 15]; and third, using various tricks to promote model convergence and prevent overfitting (e.g., transfer learning, data augmentation) [7, 16, 17].

Methods

Dataset

The enhanced-version ChestX-ray14 public dataset (link: https://www.kaggle.com/datasets/nih-chest-xrays/data; National Institutes of Health Clinical Center, Bethesda, USA) was used as a real-world dataset because the ultimate goal of this study is to train a model that generalizes well to new, unseen data in real-world scenarios. The dataset has undergone privacy-preserving preprocessing and is released under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication, which waives copyright interest in scientific work and dedicates it to the worldwide public domain. It contains 112,120 consecutive frontal-view chest X-rays spanning 1992 to 2015 and includes 14 disease labels mined from the related radiological reports using a variety of Natural Language Processing (NLP) techniques. The disease labels are “infiltration”, “atelectasis”, “effusion”, “nodule”, “pneumothorax”, “mass”, “consolidation”, “pleural_thickening”, “cardiomegaly”, “emphysema”, “fibrosis”, “edema”, “pneumonia”, and “hernia”. The whole dataset includes 60,361 negative chest X-rays (“No findings”), and 20,796 images contain two or more disease labels (range: 2–9). The distribution of disease labels is markedly long-tailed, with disease proportions ranging from 0.2% (“hernia”) to 17.7% (“infiltration”). We randomly divided the entire dataset into train, validation, and test sets at ratios of 0.7, 0.1, and 0.2 for model training, validation, and testing, respectively. The details of the dataset are shown in Table 1.
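As a minimal sketch of the multi-hot label encoding and the random 0.7/0.1/0.2 split described above (the file name and "Finding Labels" column follow the public dataset's released CSV; the seed and variable names are illustrative assumptions, not the authors' released code):

```python
import numpy as np
import pandas as pd

# Data_Entry_2017.csv ships with ChestX-ray14; the "Finding Labels" column
# stores pipe-separated labels such as "Edema|Effusion|Infiltration".
df = pd.read_csv("Data_Entry_2017.csv")

LABELS = ["Infiltration", "Atelectasis", "Effusion", "Nodule", "Pneumothorax",
          "Mass", "Consolidation", "Pleural_Thickening", "Cardiomegaly",
          "Emphysema", "Fibrosis", "Edema", "Pneumonia", "Hernia"]

# Multi-hot encode the 14 labels; "No Finding" rows become all-zero vectors.
for label in LABELS:
    df[label] = df["Finding Labels"].str.contains(label).astype(int)

# Random 0.7 / 0.1 / 0.2 split over images (the seed is illustrative).
rng = np.random.default_rng(42)
idx = rng.permutation(len(df))
n_train, n_val = int(0.7 * len(df)), int(0.1 * len(df))
train_df = df.iloc[idx[:n_train]]
val_df = df.iloc[idx[n_train:n_train + n_val]]
test_df = df.iloc[idx[n_train + n_val:]]
```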

Table 1 Summary of the dataset

Networks and Hyperparameters

Baseline

We chose ResNet50 with weighted binary cross-entropy loss (\({L}_{WBCE}\)) as the baseline, which showed the best performance in eight-label chest X-ray diagnosis in a previous study [8]. As one of the most widely used baselines in deep-learning studies, its main contribution is addressing the degradation problem. By establishing “shortcut connections,” or so-called “residual connections,” ResNet allows the original information of shallow layers to be transmitted directly to deeper layers, which mitigates the vanishing-gradient issue caused by excessive network depth. In addition, by introducing the Bottleneck structure, ResNet first performs dimensionality reduction through a 1 × 1 convolution, followed by a larger-kernel convolution, to reduce the computational cost of a direct large-kernel convolution [18]. The ResNet50 adopted in this study comprises 16 Bottleneck residual blocks (Fig. 1A). After the Bottleneck blocks, the output is average-pooled and flattened; the result then passes through a fully connected (FC) layer to calculate the diagnosis probabilities. For the multi-label classification in our dataset, a final layer of 14 sigmoid activation units was added to output the predicted probability of each disease. To improve learning on the long-tail dataset, \({L}_{WBCE}\) was applied following the recommendations of a previous study on the same dataset [8].
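A minimal sketch of this baseline head, assuming the torchvision ResNet50 backbone (variable names are illustrative; the authors' released code is at the GitHub link in the Methods):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

NUM_LABELS = 14

# ResNet50 backbone with its 1000-way ImageNet head replaced by 14 logits,
# one per disease; pre-trained weights enable transfer learning.
model = resnet50(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, NUM_LABELS)

x = torch.randn(2, 3, 224, 224)     # dummy mini-batch
probs = torch.sigmoid(model(x))     # independent per-label probabilities
```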

Fig. 1
figure 1

Schematic diagram of the diagnostic networks in this study. A ResNet50; B EfficientNet-b5; C CoAtNet-0-rw

Study Networks

To enhance the accuracy of multi-disease diagnosis, state-of-the-art (SOTA) deep learning networks can be utilized, incorporating loss functions and activation functions that address class imbalance and enhance model interpretability. These networks, such as EfficientNet and CoAtNet, consistently integrate channel attention or Transformer modules to improve their ability to automatically learn the intricate features of image data [19,20,21,22]. The channel attention and Transformer modules allow the network to concentrate on the crucial parts of an input image (such as the lungs and mediastinum) and effectively capture complex relationships among different elements, making them particularly beneficial for multi-disease diagnosis in chest X-rays. These SOTA networks have shown exceptional performance in various visual tasks. In this study, we utilized representative SOTA CNN and CNN + Transformer hybrid networks, EfficientNet and CoAtNet, which have been widely used for natural image classification in recent years [19,20,21,22]. They were chosen for their demonstrated efficacy and advanced representation-learning abilities, which are crucial for improved multi-disease diagnosis in chest X-rays. Additionally, these networks can be fine-tuned for specific tasks by adjusting the last layers to fit the target data, reducing the need for extensive reimplementation. To keep the parameter counts of the models similar, we chose the lightweight EfficientNet-b5 and CoAtNet-0-rw as our backbones. The same final layer of 14 sigmoid activation units was added for multi-label classification. The architectures of these two networks are described below:

EfficientNet-b5

EfficientNet has been one of the most successful CNNs in recent years [17]. It achieved outstanding performance on ImageNet with a dramatic reduction in computing consumption compared with previous CNNs. Overall, EfficientNet carefully balances network depth, width, and resolution, an essential breakthrough. In addition to the same residual connections as ResNet, another major contribution of EfficientNet is the joint application of depth-wise separable convolution and the channel attention mechanism named squeeze-and-excitation (SE) [14, 23]. Without adding much computation, depth-wise separable convolution allows larger numbers of input and output channels, benefiting feature extraction [23]. It first increases the number of channels through a 1 × 1 convolution; then, the large-kernel convolution is performed separately on each channel, producing an output with the same channel count; finally, the number of channels is reduced back to the input size by another 1 × 1 convolution, creating an “Inverted Bottleneck” structure. Compared with a conventional convolution, the computing consumption is only \(\frac{1}{N_{\text{kernel count}}}+\frac{1}{D_{\text{kernel size}}^{2}}\) of the former. As channel attention, SE helps exploit contextual information among different channels [14]. First, the global spatial information is squeezed into a channel descriptor using global average pooling; then, after two FC layers with a following sigmoid activation, channel-wise dependencies are fully captured as one scalar per channel; finally, channel-wise multiplication is performed between these scalars and the feature map. Equations (1)–(3) of SE are as follows:

$$\mathrm{squeeze\;operation}:\;z=\frac{1}{H\times W}{\sum}_{i=1}^{H}{\sum}_{j=1}^{W}F(i,j)$$
(1)
$$\mathrm{excitation\;operation}:\;s=\sigma ({W}_{2}\delta \left({W}_{1}z\right))$$
(2)
$$\mathrm{scaling\;operation}:\;\widetilde{F}=s\cdot F$$
(3)

where \(H\) and \(W\) indicate the height and width of the feature map \(F\); \(\sigma\) is the sigmoid function; \({W}_{2}\delta \left({W}_{1}z\right)\) denotes the output of two FC layers with an intermediate ReLU activation \(\delta\); and \(\widetilde{F}\) is the rescaled feature map. The depth-wise separable convolution and SE together make up a module named MBConv [14, 23]. The EfficientNet-b5 used in this study has a total of 39 MBConv modules. Like ResNet50, the final output passes through a sigmoid activation layer to obtain the multi-label diagnosis probabilities (Fig. 1B).
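A minimal PyTorch sketch of an SE block implementing Eqs. (1)–(3) (the reduction ratio of 4 is an illustrative assumption; EfficientNet uses its own per-stage settings):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: global pooling -> two FC layers -> channel scaling."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # W1
        self.fc2 = nn.Linear(channels // reduction, channels)  # W2

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f.shape
        z = f.mean(dim=(2, 3))                                 # Eq. (1): squeeze over H x W
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # Eq. (2): excitation
        return f * s.view(b, c, 1, 1)                          # Eq. (3): channel-wise scaling

feat = torch.randn(2, 64, 28, 28)
out = SEBlock(64)(feat)    # same shape as the input
```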

CoAtNet-0-rw

Although CNNs are still the predominant networks in computer vision, the Transformer has shown powerful performance potential since its introduction [6]. Compared with CNNs, the Transformer’s most significant advantages are its larger parameter capacity and global receptive field. On large-scale datasets, Transformers can also achieve SOTA performance, even better than CNNs [21, 22]. However, on datasets with limited sample sizes, such as various medical imaging datasets, CNNs still perform better than Transformers owing to their strong inductive bias [19, 24]. CoAtNet was designed as a CNN + Transformer hybrid architecture that integrates the benefits of local and global receptive fields [19, 21], combining the advantages of EfficientNet, the Transformer, and ResNet: it involves MBConv modules, self-attention, and residual connections. In addition, to better merge CNN and Transformer, the network integrates static convolution-kernel parameters into the original self-attention equations, known as relative attention, achieving three advantages: translation invariance, adaptive input weighting, and a global receptive field [19]. The equation of relative attention (4) is as follows:

$${y}_{i}={\sum}_{j\in \mathcal{G}}\frac{\exp \left({x}_{i}^{T}{x}_{j}+{w}_{i-j}\right)}{\sum_{k\in \mathcal{G}}\exp \left({x}_{i}^{T}{x}_{k}+{w}_{i-k}\right)}\,{x}_{j}$$
(4)

where \(\mathcal{G}\) indicates the global spatial space and \((i, j)\) a position pair. \({w}_{i-j}\) is a trainable scalar indexed only by the relative position \(i-j\); it acts as a static convolution kernel added to the pairwise dot-product attention for all \((i, j)\) pairs. The CoAtNet-0-rw used in this study is a lightweight network with 5 MBConv modules and 9 Transformer modules. After passing through a sigmoid activation layer, the multi-label diagnosis probabilities are exported (Fig. 1C).
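A simplified single-head sketch of Eq. (4) over a 1-D token sequence (the real CoAtNet uses 2-D relative positions, multiple heads, and learned query/key/value projections; names here are illustrative):

```python
import torch
import torch.nn as nn

class RelativeAttention1D(nn.Module):
    """Eq. (4): dot-product attention plus a static relative-position bias w."""
    def __init__(self, seq_len: int):
        super().__init__()
        # One trainable scalar per relative offset i - j in [-(L-1), L-1].
        self.rel_bias = nn.Parameter(torch.zeros(2 * seq_len - 1))
        self.seq_len = seq_len

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        logits = x @ x.transpose(1, 2)                          # pairwise x_i^T x_j
        idx = torch.arange(self.seq_len)
        rel = idx[:, None] - idx[None, :] + self.seq_len - 1    # map i - j to >= 0
        logits = logits + self.rel_bias[rel]                    # add w_{i-j}
        attn = logits.softmax(dim=-1)
        return attn @ x                                         # weighted sum over x_j

y = RelativeAttention1D(16)(torch.randn(2, 16, 32))
```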

Other hyperparameter settings were kept the same (Table 2), including a batch size of 150 (130 for EfficientNet-b5 owing to graphics processing unit (GPU) memory limitations), 100 training epochs, the Adam optimizer, and a learning rate (lr) of 5.0e-05. All models were trained on the same cloud GPU platform (gpuhub.com/home). The hardware configuration included four Nvidia RTX 3090 24 GB GPUs, a 60-core Intel(R) Xeon(R) Platinum 8358P central processing unit (CPU), and 360 GB of random access memory (RAM). Training was carried out using PyTorch distributed parallel computing. All code has been released at https://github.com/KiwisFraggle/CoAtNet_NIH.
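The corresponding optimizer setup under the stated hyperparameters, as a sketch (the DistributedDataParallel wrapping is our assumption about the multi-GPU parallelization; see the released code for the authors' exact configuration):

```python
import torch
from torchvision.models import resnet50

EPOCHS = 100
BATCH_SIZE = 150          # 130 for EfficientNet-b5 (GPU memory limit)
LR = 5.0e-05

model = resnet50(weights="IMAGENET1K_V1")   # or EfficientNet-b5 / CoAtNet-0-rw
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
# For multi-GPU training, the model would additionally be wrapped in
# torch.nn.parallel.DistributedDataParallel after torch.distributed is
# initialized (e.g., via torchrun) -- our assumption, not confirmed by the text.
```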

Table 2 Comparisons of different models in our study

Loss

This study used two loss strategies (Fig. 2) for backpropagation:

Fig. 2
figure 2

The impact of long-tail data distribution on classification and the proposed solution in this study. A For tail samples (yellow dots), it is difficult for the model to learn a valid classification when using the classic binary cross-entropy loss. B \({L}_{WBCE}\) increases the weight of tail samples and reduces the weight of head samples (blue dots) through reweighting, which can effectively enhance the learning of tail categories. C Our designed loss (\({L}_{ours}\)) simultaneously increases the weight of tail data and reduces the contribution of easy head samples, which may help improve the classification ability of the model when training on a long-tail dataset

Weighted Binary Cross-Entropy Loss

In the multi-label classification task of our dataset, \({L}_{WBCE}\) can adjust for the long-tail distribution (“rebalancing”) and promote learning of the tail data [8]. It also remains flexible when facing different long-tail distributions among labels: the rebalancing weight is adjusted per label to obtain an optimal multi-label diagnostic effect [8]. The specific formulas of \({L}_{WBCE}\) (5–7) are as follows:

$$\begin{aligned}{L}_{WBCE}=&-{\sum }_{c=1}^{m}\left({\sum }_{i=1}^{n}{w}_{pos,c}\,{y}_{pos,i}\ln ({p}_{pos,i})\right.\\&\left.+{\sum }_{i=1}^{n}\left(1-{y}_{pos,i}\right)\ln (1-{p}_{pos,i})\right)\end{aligned}$$
(5)
$${w}_{pos,c}=\frac{Negative\;sample\;count}{Positive\;sample\;count}$$
(6)
$${p}_{pos,i}=\sigma ({z}_{i})$$
(7)

where \({y}_{pos}\) and \({p}_{pos}\) indicate the positive label and the positive prediction probability calculated from the sigmoid of the output \({z}_{i}\), respectively; \({w}_{pos,c}\) is the weight calculated as the negative sample count over the positive sample count of class \(c\), so that the rarer positive (tail) samples are upweighted; \(c\) and \(i\) index the label class and the sample, respectively.
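A minimal implementation of Eqs. (5)–(7), assuming multi-hot targets of shape (batch, num_labels); variable names and the example counts are illustrative:

```python
import torch

def weighted_bce_loss(logits: torch.Tensor, targets: torch.Tensor,
                      pos_weight: torch.Tensor) -> torch.Tensor:
    """Eqs. (5)-(7): per-class weighted binary cross-entropy.

    logits, targets: (batch, num_labels); pos_weight: (num_labels,),
    computed per class as negative_count / positive_count on the train set.
    """
    p = torch.sigmoid(logits).clamp(1e-7, 1 - 1e-7)           # Eq. (7)
    loss = -(pos_weight * targets * torch.log(p)
             + (1 - targets) * torch.log(1 - p))              # Eq. (5)
    return loss.sum()

# Example: per-class weights from label counts (Eq. (6)); counts illustrative.
pos_counts = torch.tensor([19894., 11559.])
neg_counts = torch.tensor([92226., 100561.])
w = neg_counts / pos_counts
loss = weighted_bce_loss(torch.randn(4, 2),
                         torch.randint(0, 2, (4, 2)).float(), w)
```

In practice, `torch.nn.BCEWithLogitsLoss(pos_weight=w)` computes the same weighted objective in a numerically more stable form.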

Our Designed Loss

Considering the lack of hard-sample classification capacity, we designed a novel reweighted loss on the basis of \({L}_{WBCE}\) by additionally introducing an exponential decay factor for easy negative samples and a nonlinear shifting probability that reduces the contribution of negative samples, which makes training focus not only on positive but also on hard samples [8, 15, 25]. Our loss retains the capacity to make accurate individual adjustments for each label distribution with a different imbalance level. The specific formulas (8–12) are as follows:

$${L}_{ours}={L}_{+}+{L}_{-}$$
(8)
$${L}_{+}=-{\sum}_{c=1}^{m}{\sum }_{i=1}^{n}{y}_{pos,i}\,\alpha\, {w}_{pos,c}(1-{p}_{pos,i})\ln ({p}_{pos,i})$$
(9)
$${L}_{-}=-{\sum }_{c=1}^{m}{\sum }_{i=1}^{n}\left(1-{y}_{pos,i}\right){\max \left({p}_{pos,i}-{p}_{shift}, 0\right)}^{\gamma }\ln \left[1-\max \left({p}_{pos,i}-{p}_{shift}, 0\right)\right]$$
(10)
$${w}_{pos,c}=\frac{Negative\;sample\;count}{Positive\;sample\;count}$$
(11)
$${p}_{pos,i}=\sigma ({z}_{i})$$
(12)

Similar to \({L}_{WBCE}\), \({p}_{pos,i}\) indicates the prediction probability calculated from the sigmoid of the output \({z}_{i}\), and \({w}_{pos,c}\) is the weight calculated as the negative sample count over the positive sample count. \(\alpha\) is a balancing coefficient that adjusts the initial balance at the start of training and was set to 0.2 in our study; \(\gamma\) is an exponential modulating factor on negative samples, set to 4.0; and \({p}_{shift}\), the nonlinear shifting probability, was set to 0.05.
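A sketch of Eqs. (8)–(12) in PyTorch with the stated settings (α = 0.2, γ = 4.0, p_shift = 0.05); the indicator handling of positive versus negative samples follows our reading of the equations above, and variable names are illustrative:

```python
import torch

def designed_loss(logits: torch.Tensor, targets: torch.Tensor,
                  pos_weight: torch.Tensor, alpha: float = 0.2,
                  gamma: float = 4.0, p_shift: float = 0.05) -> torch.Tensor:
    """Eqs. (8)-(12): reweighted positive focus + shifted, decayed negative term."""
    p = torch.sigmoid(logits).clamp(1e-7, 1 - 1e-7)            # Eq. (12)
    # Eq. (9): positive term, reweighted by alpha * w, emphasized while p is low
    l_pos = -alpha * pos_weight * targets * (1 - p) * torch.log(p)
    # Eq. (10): negative term; probabilities below p_shift contribute nothing,
    # and the gamma exponent decays the remaining easy negatives
    p_m = (p - p_shift).clamp(min=0)
    l_neg = -(1 - targets) * p_m.pow(gamma) * torch.log((1 - p_m).clamp(min=1e-7))
    return (l_pos + l_neg).sum()                               # Eq. (8)

w = torch.tensor([4.6, 8.7])    # Eq. (11): neg/pos ratio per class, illustrative
loss = designed_loss(torch.randn(4, 2),
                     torch.randint(0, 2, (4, 2)).float(), w)
```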

Other Training Tricks

In previous attempts, we encountered non-convergence in model training, which was also reported in an earlier study [8]. To address this issue and strengthen the model’s robustness, we applied several additional tricks. First, previous studies suggested that transfer learning combined with data augmentation could improve model accuracy and generalization for multi-label classification tasks [7, 8]; therefore, we employed pre-trained weights obtained from ImageNet training as the initialization. Second, we adopted an autoaugment policy, an optimal augmentation strategy learned via reinforcement learning [17]. This policy compiles 25 augmentations, including random rotation, shear, sharpness, etc., and can significantly improve model accuracy and decrease the error rate on various datasets [17]. In addition, random horizontal flip and patch erasing were applied. Third, we used a learning rate scheduler, OneCycleLR, consisting of 20 warmup epochs of increasing lr followed by 80 epochs of cosine decay, which helps prevent the training from becoming trapped in a local minimum [26]. Fourth, weight decay (λ = 0.01) and dropout (probability 0.5) before the final FC layer were applied to inhibit overfitting [27, 28]. Finally, an Exponential Moving Average (EMA) of the weights with a decay of 0.9997 was maintained to improve the accuracy and robustness of the model [29], and automatic mixed precision training was applied to reduce memory consumption and accelerate training [30].
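A sketch of how these tricks compose in PyTorch (the OneCycleLR percentage follows the 20-epoch warmup / 80-epoch cosine decay described above; the stand-in model, steps-per-epoch value, and manual EMA loop are illustrative assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn
from torchvision import transforms
from torchvision.transforms import AutoAugment, AutoAugmentPolicy

# Data augmentation: autoaugment + random horizontal flip + patch erasing
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    AutoAugment(AutoAugmentPolicy.IMAGENET),
    transforms.ToTensor(),
    transforms.RandomErasing(),
])

# Stand-in model with dropout before the final FC layer; weight decay = 0.01
model = nn.Sequential(nn.Flatten(), nn.Dropout(p=0.5),
                      nn.Linear(224 * 224 * 3, 14))
optimizer = torch.optim.Adam(model.parameters(), lr=5.0e-05, weight_decay=0.01)

# OneCycleLR: 20% warmup (20 of 100 epochs), then cosine decay
steps_per_epoch = 500    # len(train_loader) in practice
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=5.0e-05, epochs=100,
    steps_per_epoch=steps_per_epoch, pct_start=0.2, anneal_strategy="cos")

# Exponential moving average of the weights (decay = 0.9997)
ema_params = [p.detach().clone() for p in model.parameters()]

scaler = torch.cuda.amp.GradScaler()    # automatic mixed precision

def train_step(images, targets, criterion):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(images), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    with torch.no_grad():               # EMA update after each step
        for ema_p, p in zip(ema_params, model.parameters()):
            ema_p.mul_(0.9997).add_(p.detach(), alpha=1 - 0.9997)
```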

Performance Evaluations and Statistical Analysis

The statistical analysis was conducted using IBM SPSS Statistics software (version 26, IBM, New York, USA). Quantitative data are presented as mean ± standard error with 95% confidence intervals. Because the dataset is highly imbalanced, an epoch with increased accuracy can be achieved simply by classifying samples as the head class (“No findings”), even while the loss keeps rising. Thus, the best model in this study was determined not by maximizing classification accuracy but by minimizing the loss on the validation set; this model selection strategy also counteracts overfitting [31]. To compare the performance of different models, we evaluated the overall and per-label AUROC, accuracy, macro precision, macro recall, and macro F1-score, which were compared between models or labels using repeated-measures analysis of variance (ANOVA) and AUROC comparison analysis with Bonferroni adjustments [7, 8]. In addition, to explore whether the classification capacity of the model is related to positive sample size, Pearson correlation tests were performed between AUROC and positive sample ratio. A two-tailed p value less than 0.05 was considered statistically significant.
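The per-label and macro-averaged metrics can be computed as in the following sketch (scikit-learn is our assumed tooling for the metric computations, the 0.5 decision threshold and the macro-mean definition of the overall AUROC are illustrative assumptions; the statistical tests themselves were run in SPSS):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, precision_score, recall_score

# y_true, y_prob: (num_samples, 14) multi-hot labels and sigmoid outputs
y_true = np.random.randint(0, 2, size=(1000, 14))
y_prob = np.random.rand(1000, 14)

per_label_auroc = [roc_auc_score(y_true[:, c], y_prob[:, c]) for c in range(14)]
overall_auroc = float(np.mean(per_label_auroc))

y_pred = (y_prob >= 0.5).astype(int)    # illustrative decision threshold
macro_p = precision_score(y_true, y_pred, average="macro", zero_division=0)
macro_r = recall_score(y_true, y_pred, average="macro", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
```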

Lesion Localization and Visual Interpretations

After model training was completed, we selected the model with the highest overall AUROC; we then used group-score-weighted class activation mapping (Group-CAM) to localize lesions and aid visual interpretation. Compared with the commonly used randomized input sampling for explanation (RISE) or gradient-weighted class activation mapping (Grad-CAM), Group-CAM is more convincing and less noisy [32].
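For orientation, a bare-bones class-activation sketch in the Grad-CAM style (Group-CAM itself additionally groups the activation maps and scores each group with masked forward passes [32]; the target layer, model, and class index here are illustrative):

```python
import torch
from torchvision.models import resnet50

model = resnet50(weights="IMAGENET1K_V1").eval()
feats, grads = {}, {}
layer = model.layer4    # last convolutional stage

layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224, requires_grad=True)
score = model(x)[0, 0]                  # logit of one target class
score.backward()

weights = grads["a"].mean(dim=(2, 3), keepdim=True)   # channel importance
cam = torch.relu((weights * feats["a"]).sum(dim=1))   # (1, 7, 7) heatmap
cam = cam / cam.max().clamp(min=1e-8)                 # normalize to [0, 1]
```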

Results

Comparisons of AUROCs among the different models are summarized in Table 3. First, after adding the multiple tricks mentioned above, ResNet50 + LWBCE showed a significantly higher AUROC on multi-label classification than the result of a previous study (p = 0.006); in particular, the AUROC of the “Mass” label increased from the reported 0.561 to 0.819 [8]. Second, both SOTA networks, CoAtNet-0-rw and EfficientNet-b5, presented higher overall AUROCs than ResNet50 (0.826/0.822 vs. 0.811, respectively) when using the same LWBCE, although without significant differences. After applying Lours, both CoAtNet-0-rw and EfficientNet-b5 achieved significantly higher AUROCs than ResNet50 + LWBCE and ResNet50 + Lours (p ≤ 0.037 each), with CoAtNet-0-rw + Lours presenting the highest overall AUROC of 0.842. However, different losses barely affected the performance of ResNet50, unlike CoAtNet-0-rw and EfficientNet-b5. In addition, the AUROC did not show any significant correlation with the positive sample ratio of the labels, regardless of which model was applied (p > 0.05).

Table 3 Comparisons of AUROCs among different models

In addition, CoAtNet-0-rw + Lours showed the highest overall accuracy (0.257), macro precision (0.57), macro recall (0.76), and macro F1-score (0.57) compared with the other models. Consistent with the AUROC comparisons, the macro F1-score of CoAtNet-0-rw + Lours was significantly higher than those of ResNet50 + LWBCE and ResNet50 + Lours (p = 0.010 and 0.002, respectively). Moreover, the macro F1-scores of EfficientNet-b5 + Lours and CoAtNet-0-rw + LWBCE were also significantly higher than that of the baseline ResNet50 + LWBCE (p = 0.041 and 0.002, respectively). The details are summarized in Table 4.

Table 4 Other performance evaluations among different models

Furthermore, although CoAtNet-0-rw + Lours showed the best overall performance, its AUROC differed significantly among disease labels, ranging from 0.705 to 0.890 (Fig. 3), with significant differences between some labels such as emphysema vs. edema (0.939 vs. 0.912, p < 0.001) and cardiomegaly vs. effusion (0.914 vs. 0.889, p < 0.001) (Table 5). Heatmap visualization showed that, for most disease labels (e.g., atelectasis, edema, effusion), the network paid close attention to the corresponding lesion areas and made accurate diagnoses (Fig. 4). However, in some cases, such as pneumothorax, the model did not focus on the lesion area but on the drainage catheter used to treat the disease (Fig. 5). Meanwhile, we also noticed that some disease labels in the ChestX-ray14 dataset were inaccurate (Fig. 5).

Fig. 3
figure 3

AUROC curves of different label identification by CoAtNet-0-rw + Lours

Table 5 Comparisons of AUROC among different labels identified by CoAtNet-0-rw + Lours
Fig. 4
figure 4

An exemplary illustration of accurately predicted cases (left, original images; right, Group-CAM heatmaps). A A case (Image Index: 00000761_010.png, label: Atelectasis) showing correct attention to atelectasis in the left lower lung. B A case (Image Index: 00012834_049.png, label: Edema|Effusion|Infiltration) showing correct attention to diffuse edema, effusion, and infiltration in both lungs. C A case (Image Index: 00012834_049.png, label: Consolidation) showing correct attention to extensive consolidation in the right lower lung. D A case (Image Index: 00014849_011.png, label: Fibrosis) showing correct attention to fibrosis in both lungs. E A case (Image Index: 00009658_002.png, label: Atelectasis|Mass|Pleural_Thickening) showing correct attention to a mass with atelectasis and peripheral pleural thickening in the right upper lung. F A case (Image Index: 00002935_000.png, label: Emphysema) showing correct attention to emphysema in both lungs. Abbreviation: Group-CAM, group-score-weighted class activation mapping

Fig. 5
figure 5

Exemplary illustrations with incorrect attention (left, original images; right, Group-CAM heatmaps). A A case (Image Index: 00028948_001.png, label: Cardiomegaly|Hernia|Mass) was predicted as only “Cardiomegaly,” which was correctly focused in the heatmap; however, the labels “hernia” and “mass” could not be identified by our professional radiologist from the original image (left). B A case (Image Index: 00003285_001.png, label: Nodule) with a correct prediction in which the small lung nodules (white arrows) in the right lung were nonetheless ignored. C A case (Image Index: 00000631_004.png, label: Pneumothorax) with a correct prediction in which the attention fell around the thoracic drainage catheter (white arrows) rather than the right pneumothorax area (red dash-line circle). D A case (Image Index: 00014234_000.png, label: Pneumonia) with incorrect attention on the diaphragm; moreover, the original image (left) was identified as having no “pneumonia” by our professional radiologist. Abbreviation: Group-CAM, group-score-weighted class activation mapping

Discussion

In this study, we tackled a 14-label AI diagnosis task on a real-world long-tail dataset. To enhance the model’s performance, we applied SOTA backbones, a customized loss function (Lours), and several techniques such as transfer learning and joint data augmentation. Our experiments revealed that CoAtNet-0-rw + Lours achieved the highest overall AUROC and macro F1-score, significantly outperforming the baseline ResNet50 + LWBCE (AUROC: 0.842 vs. 0.811, p = 0.037; macro F1-score: 0.57 vs. 0.51, p = 0.010). In addition, the AUROCs of CoAtNet-0-rw + Lours varied widely across disease labels (0.705 to 0.890), but no significant correlations were found between the AUROC values and the corresponding positive sample ratios (p ≥ 0.058).

Chest X-rays remain one of the most widely used and cost-effective medical examinations, despite advancements in pulmonary computed tomography (CT) technology. However, AI diagnosis of chest X-rays presents a greater challenge than CT because it is a fine-grained classification problem [8]. The difficulty stems from the need for the model to learn and distinguish very delicate details, such as slight variations in shape, texture, or patterns among classes. In contrast to CT, these details can be hard to detect on chest X-rays because they often manifest as subtle changes in grayscale or size and lack apparent morphological and color differences between lesions and lung tissue [33]. As a result, even trained eyes may struggle to distinguish between labels such as nodules and masses, or infiltrations and edema. Additionally, the class imbalance problem discussed earlier exacerbates the challenge of fine-grained classification; without proper AI techniques, the result can be a biased model that performs well on common diseases but poorly on rare ones.

In contrast to previous studies, we did not use the conventional hierarchical multi-label method, which relies strongly on human cognition [34]. Instead, we utilized several advanced AI techniques and training tricks, such as depth-wise separable convolution, self-attention, joint data augmentation, class weighting, and tail-sample focusing, to address multi-label imbalance. As a result, we achieved better multi-label diagnostic performance for all 14 diseases than reported in previous studies [7, 8]; these techniques yielded higher overall and per-label AUROCs and macro F1-scores compared with the baseline. Regarding network structures, previous studies have argued that the Transformer structure facilitates higher-level cognition through global receptive fields [21, 22, 35]. Without the aid of large-scale pretraining and datasets, Transformer-based networks were shown to be inferior to CNNs in end-to-end tasks because CNNs have locality learning strategies and thus a stronger inductive bias [19, 24]. In our experiments, however, we found that the Transformer can catch up with or even surpass the powerful EfficientNet after fusion with CNNs. CoAtNet, whose Transformer modules follow the prior MBConv modules, has the adaptive learning ability to process long-range image information or lesions covering large regions and can thereby obtain better performance [19]. Another advantage of CoAtNet over pure CNNs (e.g., EfficientNet) lies in its transfer-learning capability: because of its global attention, the parameters of a Transformer block can be applied directly to 3-dimensional data (e.g., CT) with the same structure, allowing training similar to that on the 2-dimensional chest X-rays in this study. In addition, the Transformer has a significantly larger parameter capacity than CNNs and is more suitable for larger data sizes and more complicated data distributions [33, 35]. Therefore, although EfficientNet has an advantage in floating-point operations (FLOPs) at only a mildly lower AUROC, CoAtNet holds the advantage for future and broader applications.

To further improve the model’s performance, we designed a novel loss (Lours) for training. Theoretically, this loss integrates the advantages of reweighting, hard-sample focusing, and a nonlinear shifting probability that reduces the contribution of negative and easy samples [8, 15, 25]. Moreover, it can more accurately adjust for the long-tail differences between labels, which may explain why the AUROCs of the disease labels did not show any significant correlation with the corresponding positive sample ratios. Our results demonstrate that training with Lours improved performance, and CoAtNet-0-rw + Lours achieved the highest overall AUROC and macro F1-score, both significantly higher than the ResNet50 + LWBCE baseline. While accuracy is not a reliable evaluation metric for imbalanced data, as it can be inflated by over-predicting the majority class (“No findings”), it is worth noting that CoAtNet-0-rw + Lours also achieved the highest accuracy (0.257) among all models. However, this study unexpectedly found little effect of Lours on ResNet50. We speculate that a deep network without an attention mechanism (e.g., channel attention or self-attention) may be insensitive to our designed loss, which merits further exploration.

Regarding limitations, this study still faces some remaining challenges: (1) Despite efforts to improve classification, the results still show a low overall macro F1-score when diagnosing multiple diseases with varying degrees of long-tail label distribution. Data availability remains a significant challenge in chest X-ray research; for a spectrum of more than ten diseases, the sample size of the ChestX-ray14 dataset is still insufficient. (2) The current labeling process, which relies primarily on automated radiology report labelers, leads to potential mislabeling and nonuniformity in the disease spectra of different published chest X-ray datasets (Fig. 5), which negatively affects model performance and the ability to combine datasets effectively [36, 37]. (3) This study did not account for multiple images from the same patient or for the very small sample sizes of some classes in the validation and test sets arising from random dataset division. To ensure a comparable experimental setup, we used the same data preprocessing and dataset division as previous studies [7, 8], enabling a focused evaluation and comparison of the proposed deep-learning models for multi-disease diagnosis against those previous networks and setups [7, 8].

In the future, we propose several potential strategies to further increase the accuracy of AI in chest X-ray diagnosis: (1) Federated learning with a standard labeling system based on a robust NLP labeling tool for chest X-rays. Federated learning aids in collecting more disease samples from various medical centers and allows model parameters to be shared without transferring the original data, addressing the ethical and legal challenges regarding medical privacy that arise when creating a widely accessible public dataset [38]. (2) Implementing multi-modal and cross-modal AI models for comprehensive diagnosis. The routine diagnostic process involves comprehensive analysis of a patient’s medical history, laboratory results, and chest X-rays before reaching a final diagnosis, highlighting the importance of using multiple sources of information to improve classification accuracy [39]. (3) Utilizing contrastive learning to obtain more accurate representations of the data. This study used pre-trained weights from ImageNet, which may somewhat limit the model’s performance on medical datasets. Contrastive learning offers a better self-supervised alternative by using radiology reports as supervision without additional labeling, enabling more accurate training of the model’s backbone [33].

Conclusions

This study demonstrated improved performance in the multi-disease diagnosis of chest X-rays on a long-tailed dataset using a pretrained CNN + Transformer hybrid network, CoAtNet-0-rw. However, the limited sample sizes of some diseases and potential inaccuracies in labeling may have limited the diagnostic capability of the model. To enhance performance, establishing uniform labeling criteria for chest X-rays to facilitate federated learning, incorporating multi-modal diagnostic information in training, and adopting contrastive learning techniques hold potential to further improve the model in the future.