Introduction

X-ray computed tomography (XCT) is a powerful technique for studying lithium-ion batteries (LIBs), since its nondestructive 3D imaging across multiple length scales provides quantitative and qualitative metrics for characterizing their complex microstructure1. The effects of microstructural properties on the electrochemical performance of the battery can therefore be investigated, allowing the optimization of the electrode design. However, extracting reliable microstructural properties from 3D tomographic images is not straightforward and usually requires a pre-processing step followed by a segmentation process2,3. Tomographic images contain many artifacts that degrade both the 3D reconstruction and the image quality, which in turn affect the analysis when evaluating areas and phases; these artifacts also cause ambiguity when different people process the same image. For instance, ring artifacts, which result from non-uniform detector sensitivity and are emphasized by reconstruction, can be reduced with flat-field correction, but a residual effect usually remains4. Center-of-rotation errors and cone-beam errors degrade the quality of the tomographic reconstruction and introduce blurring into the image5. Several mechanisms can introduce different types of noise into the X-ray image acquisition process, such as statistical noise caused by fluctuations of the raw X-ray signal, structural noise created by variations in the detector structure and in the response of its elements, and scattering noise from scattered X-ray photons that induce a spurious signal and add extra noise to the image6.

Hence, in X-ray computed tomography, image treatment is required to reduce the uncertainty about the 3D studied object. Image pre-processing is usually applied to improve the image quality, for example denoising, deblurring, and ring artifact removal, and involves both image feature enhancement (histogram equalization, normalization, brightness and contrast adjustment) and distortion reduction. Segmentation then assigns pixels to groups based on their values, with the aim of attributing the various regions of the image to the different phases of the material.

Beyond the analysis itself, the pre-processing and segmentation procedures are both non-trivial steps that have a significant impact on any subsequent image analyses, such as the calculation of porosity, tortuosity, and the surface area of a specific phase. Due to the lack of a non-distorted reference image, the quality of the pre-processing is typically assessed by subjective visual inspection. However, image pre-processing has a profound impact on subsequent segmentation performance. Schluter et al.7 showed that, with suitable image enhancement prior to segmentation, segmentation algorithms became more robust and less prone to operator bias. They analyzed the segmentation accuracy of images before and after pre-processing and pointed out that distortion leads to poor segmentation. In addition, the absence of ground truth makes it difficult to assess the quality of the segmentation. Pietsch et al.8 proved that subjective judgment is not a reliable standard for selecting the binarization criteria in the segmentation procedure, leading to uncertainties in the results. Therefore, a numerical metric that assesses the quality of images is needed to guide the pre-processing step (setting parameters, selecting filters) so that the subsequent segmentation step can yield highly reliable quantitative analysis.

Image quality assessment (IQA) aims to predict the perceptual quality of a distorted image. However, the human vision system (HVS)9 needs a reference to quantify the discrepancy, comparing the distorted image either directly with the original undistorted image or implicitly with a hallucinated scene in mind10. Assessing image quality through a crowd of people is time-consuming and labor-intensive. Moreover, owing to different cultures and living environments, people sometimes judge the same picture differently. For tomographic images in particular, experienced and inexperienced observers may give totally different scores. It is therefore complicated to assess the quality of tomographic images objectively.

To avoid discrepancies caused by cognitive bias and to provide robust professional estimation, several machine-assisted IQA methods have been proposed in recent decades. They are generally divided into three categories: (a) full-reference image quality assessment (FR-IQA), which evaluates the distorted image by comparing it to the reference image and measuring the difference11,12; (b) reduced-reference image quality assessment (RR-IQA), which measures image quality using only part of the reference image13,14; and (c) no-reference image quality assessment (NR-IQA), which requires little or no reference information and estimates image quality directly from the distorted image15,16.

The conventional metrics used for FR-IQA and RR-IQA are the peak signal-to-noise ratio (PSNR) and the root mean square error (RMSE), which compare the intensity of distorted images to that of the reference images without considering the HVS. By considering luminance, contrast, and structural information, SSIM11 used average pooling to compute a score from a similarity map. Building on SSIM, MS-SSIM17 compared the distorted image to the reference image at multiple scales. FSIM18 leveraged phase congruency and gradient magnitude features to derive a quality score, while GMSD19 considered only the image gradient as the criterion feature. Besides the gradient, MDSI20 utilized chromaticity similarity and deviation pooling to imitate the HVS and achieved better results.

Although the above methods can serve as indicators, reference images are not always available in real-world situations. Hence, NR-IQA methods have recently attracted extensive attention, though they remain challenging owing to the lack of reference information. Early NR-IQA methods mainly focused on specific types of distortion, such as noise21, contrast change22, blur23, and ring artifacts24,25. Since the distortion types of the images are unknown in real scenarios, these methods are impractical compared to general methods26,27, which require no prior information about the distortion types.

With the development of deep neural networks (DNNs), deep learning methods have been exploited for NR-IQA28,29 without any prerequisites and have demonstrated superior prediction performance. Le et al.30 first proposed a shallow CNN to estimate the quality score of natural images. Ke et al.31 introduced a deep learning-based image quality index for blind image quality assessment that was more efficient and robust. Instead of a multi-stage method, Sebastian et al.32 presented an end-to-end neural network that regresses the quality score by jointly learning local quality and local weights. Rather than feeding the whole image to the network, Simone et al.33 cropped the image into patches, estimated their scores separately, and finally merged them, which is better suited to cases with insufficient training data. However, the lack of training data remained a crucial obstacle for the aforementioned methods. To overcome this limitation, Xialei et al.34 implemented data augmentation by generating artificially distorted images and then trained a Siamese network (RankIQA) to regress the quality scores. Kwan-Yee et al.10 combined a generative neural network that generates the reference images with a convolutional neural network that regresses the quality score from the discrepancy. Hancheng et al.35 developed a meta-learning36 method to estimate the quality score of images with new distortions, which addressed the generalization problem of IQA.

Although many IQA methods have been proposed and have achieved excellent results, most of them focus on natural images and require a huge number of annotated labels, which is impractical for X-ray tomography images. For example, FR-IQA methods need a reference image for each estimation of a distorted image, which implies a high demand for annotations. The NR-IQA methods developed so far require less data than FR-IQA methods but still need relatively large amounts (hundreds of annotations) to avoid overfitting. Besides, the existing open-source datasets37,38,39 of tomographic images of battery electrodes are not designed for the IQA task, i.e., they lack various distortion types and corresponding scores. Therefore, a lightweight NR-IQA method that requires less annotated data and is robust enough to transfer among different X-ray tomography images is urgently needed.

The main contributions of this work are summarized as follows:

  • A no-reference tomographic image quality assessment (TIQA) method is proposed for tomographic images, which requires only dozens of annotated images for training and achieves superior results.

  • A data generation method is developed that imitates human observers to automatically label the distorted images, addressing the insufficient-data problem. Benefiting from this data generation, our TIQA method requires only one-fifth as many images as other NR-IQA methods.

  • The correlation between image quality score and segmentation results is studied to guide the pre-processing step.

The remainder of the paper is organized as follows: In section “Results”, we show the results of our data generation method and TIQA method. Moreover, the segmentation result and the link between quality score and segmentation performance are demonstrated in this section as well. In section “Discussion”, we summarize the results and emphasize the features of our method. We also propose several potential applications and future directions of our method. In section “Methods”, we introduce our dataset and the experiment details.

Results

Data generation results

As shown in Fig. 1, the first step of our approach is to generate the data required for the subsequent training of the score prediction network; this generation step addresses the problem of insufficient data. The detailed workflow of the data generation process is illustrated in Fig. 2.

Fig. 1: Pipeline of our TIQA method.

It is composed of two modules: data generation and score prediction. In score prediction, (1) is the self-supervised learning stage that ranks the images and (2) is the fine-tuning procedure that regresses the ranks to a score in a fixed range.

Fig. 2: Detailed structure of data generation.

The observers are FR-IQA methods. \(y\) is the human annotation, \(y_i\) is the predicted score of the ith observer, and \(\bar y\) is the average of the \(y_i\).

Firstly, the original image is resized and cropped to a fixed size of 224 × 224 pixels. Notably, to verify whether the resize operation affects the image quality scores, we compare the annotations on the images before and after this operation. The comparison results (Fig. S1) confirm that these pre-processing operations do not affect the image quality. Next, three types of distortion (ring artifact, blur, noise) that are commonly present in X-ray tomographic images are added to generate distorted images (more generated images can be found in Fig. S2). Finally, the label projection step systematically produces the annotations of the distorted images by comparing the HVS features of the original and distorted images using different FR-IQA metrics (more details are given in the Methods section).
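As an illustration, the following minimal sketch reproduces this resize-and-crop step with scikit-image; we read the description as scaling the shorter side to 255 pixels, and the function name prepare_image is ours.

```python
import numpy as np
from skimage.transform import resize

def prepare_image(image, short_side=255, crop=224, rng=None):
    """Scale the shorter side to `short_side` pixels while preserving
    the aspect ratio, then randomly crop a `crop` x `crop` region."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    scale = short_side / min(h, w)
    resized = resize(image, (round(h * scale), round(w * scale)),
                     preserve_range=True, anti_aliasing=True)
    top = int(rng.integers(0, resized.shape[0] - crop + 1))
    left = int(rng.integers(0, resized.shape[1] - crop + 1))
    return resized[top:top + crop, left:left + crop]
```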

To validate our data generation method, we consider two criteria to quantify the correlation between the generated results from the label projection step and the corresponding labels from the survey: Pearson’s linear correlation coefficient (PLCC) and Spearman’s rank-ordered correlation coefficient (SROCC). As presented in Fig. 3 and Fig. S3, for all types of distortion, the generated scores correlate positively with the annotations. The correlation is especially high for images with noise or blur. As for the ring artifact, the results demonstrate that existing general FR-IQA metrics cannot handle this type of distortion well.

Fig. 3: Evaluation results of label projection module.

It shows the data augmentation performance for three types of distortion. a, b Demonstrate the correlation between predicted scores and human-annotated scores for the three types of distorted images. c–e Illustrate the quantitative values of the predicted scores and human labels for images with blur, noise, and ring artifact. The red boxes represent the confident human annotations.

Score prediction results

In the image quality score prediction procedure, as shown in Fig. 1, the network was first trained to rank the images according to their distortion levels. Then, based on this prior “ranking” knowledge, it was fine-tuned to regress the order information to a comprehensive quality score representing the image quality in the range of 1 (worst) to 5 (best). In this work, we take EfficientNet40 as the feature extractor instead of the VGG41 used in RankIQA34, because it has fewer parameters (about 9 million, compared to VGG’s roughly 138 million trainable parameters), which makes it easier to converge and less prone to overfitting.

To validate the model, we predicted the quality scores of 56 images and compared the results with human annotations, as presented in Fig. 4c–f. The results for images with different types of distortion were evaluated separately, which allows us to observe the performance of the model for each distortion. Taken together, these results indicate that our predicted results correlate with the human-labeled scores, demonstrating that our method is able to imitate the HVS for IQA. Interestingly, for blurred images, it performs excellently on both the relative order and the absolute score.

Fig. 4: Image quality prediction results of two image volumes.

a Shows the box plot of the prediction results, while (b) shows the tested image volumes. Both are from the same material, but the volume Image Stack 01 has better quality than the volume Image Stack 02. The figure on the right shows the predicted quality scores of the two volumes. The line in each box is the mean of all scores. Quantitative comparison among different methods: c–e show the results of assessing images with ring artifact, noise, and blur distortion, respectively. The last panel (f) illustrates the results of the different methods over all types of distortion.

We also apply our method to two X-ray tomographic image volumes to check the consistency of the results. As demonstrated in Fig. 4a, b, we generate two image volumes of different quality, each containing 594 slices with a size of 720 × 720. The images of higher quality are generated by enhancing the boundaries through a segmentation algorithm. In the box plot, the high-quality volume (purple) achieves a higher score while the lower-quality one (cyan) receives a lower score. A more detailed comparison of different pre-processing methods can be found in Fig. S4. The results show that our method enables quantitative comparison of different pre-processing methods, allowing selection of the most appropriate one. Moreover, from the variance of these two boxes, we can conclude that our method performs stably, because the spread among confident scores is small (less than 0.05) compared to the variation among human annotations. Besides selecting pre-processing methods, our method could also help to tune filter parameters under the guidance of quality scores.

To demonstrate the advantage of our method, we compare it to other outstanding NR-IQA methods using two quantitative metrics (SROCC and PLCC); the full table is shown in Table S5. Here we present two of them, BRISQUE26 and RankIQA34, in Fig. 4f. Overall, our method excels in assessing the quality of tomography images, since it yields the highest correlation score among the three methods. In terms of distortion types, our method outperforms BRISQUE for all three distortions. Compared with RankIQA, our method achieves better results for images with ring artifact and noise, and comparable performance for images with blur distortion. Besides, to avoid the shortcomings of correlation-based metrics, we also adopt Krasula’s metric42; the results are shown in Fig. S6. All three analyses show that the TIQA method achieves a higher AUC than the others.

Segmentation evaluation method

The TIQA method provides an efficient tool for selecting the best-quality image among pre-processing methods, and insight into how distortion affects segmentation accuracy can further guide the pre-processing step. Therefore, additional experiments were conducted to inspect the relation between image quality and segmentation accuracy. We implemented a CNN based on D-LinkNet43 to predict semantic segmentation results and compared them with the TIQA results to explore the influence of the distortions.

As presented in Fig. 5, a CNN for segmentation is trained on X-ray tomography images with annotated segmentation ground truth before making predictions. The uncertainty map is generated by calculating the entropy44 of the probability of each pixel belonging to the different classes. It represents the uncertainty when the network assigns a phase (φi) to each pixel. High uncertainty is displayed as a red pixel, while low uncertainty is displayed as a white pixel. From the uncertainty map, we can see that uncertainty is high at the phase boundaries and low in the bulk, which shows that the network tends to produce fuzzy boundaries. The segmentation results are obtained by binarizing the probability map. Here, only two classes are considered, but the segmentation process can be extended to multiple classes.

Fig. 5: Pipeline of the segmentation evaluation procedure.

In the uncertainty map, red areas indicate high uncertainty and white areas indicate low uncertainty. In the IoU map, the red, black, and green areas represent true positives, true negatives, and false negatives, respectively.

Relation between IQA results and segmentation accuracy

In addition to the uncertainty map, the F1 score, calculated from the confusion matrix, is also used to quantify the segmentation accuracy, and the correlation between the TIQA score and the segmentation accuracy is investigated. We select an original image and its corresponding images with different types of distortion as the data for both IQA and segmentation (see Fig. 6). The results clearly show that distortion affects both the image quality and the segmentation performance: distorted images receive lower quality scores and lower F1 scores, i.e., lower segmentation accuracy. The uncertainty map clearly presents the influence of distortion on the final segmentation results. Among the three types of distortion, noise causes a large number of uncertain points in the uncertainty map, shown as red points in Fig. 6. Although blur distortion seems to cause little uncertainty, it leads to vague boundaries and misclassification as well as a large reduction in the HVS-based image quality score.

Fig. 6: Results of differently distorted images evaluated by TIQA and segmentation.

The F1 score ranges from 0 (worst) to 1 (best).

Moreover, the quantitative evaluation results of TIQA and segmentation accuracy are shown in Table 1. From the SROCC and PLCC, we can see that the quality scores predicted by our approach correlate well with the segmentation accuracy. The TIQA scores follow a trend similar to the F1 scores, especially for images with ring artifact and blur distortion.

Table 1 Quantitative results of the correlation between predicted quality score and segmentation accuracy.

To inspect the impact of distortion on classification results at the pixel level, we calculate the point correlation, pixel by pixel, between the predicted segmentation and the ground truth. As shown in Fig. 7, the colored lines (with distortion) correlate positively with the black line (without distortion). Nevertheless, the segmentation shows different sensitivity to specific types of distortion. For example, the point-correlation line of the segmentation with noise does not converge to the reference line, and its fluctuation indicates the serious impact of noise on the segmentation results. Additionally, the third panel in Fig. 7 illustrates that, as the distortion level increases, the IQA score decreases quickly while the segmentation accuracy remains stable, which implies that the network can detect very small differences in pixel values and classify the pixels into different categories based on this distinction. Owing to the limitations of the HVS, people cannot distinguish such small variations in pixel intensity, as shown by the results for the blurred images in Fig. 7.

Fig. 7: Point correlation between predicted segmentation and ground truth for black phase.

a–c Set out the correlation results for images with ring artifact, noise, and blur. The color bar indicates the distortion level, from slight to severe. The solid line shows the point correlation in the X direction. The number at the end of each line is the image quality score.

In summary, from the image quality score produced by our method, especially for images with blur and ring artifact distortions, as described in Table 1, we can infer the corresponding segmentation performance without actually running the segmentation. This greatly reduces the time needed to choose an appropriate pre-processing algorithm to improve the image quality and achieve better segmentation accuracy.

Discussion

Tomography images are widely used for analyzing battery microstructure. However, the essential image pre-processing procedure is, in most cases, observer-dependent45. This observer-dependence can lead to dispersion and uncertainty in the segmentation process, which may in turn produce unreliable results that deteriorate the subsequent quantitative analysis, especially when the segmentation involves a supervised training procedure (inaccurate ground truth). We believe, however, that observer-dependence can be reduced or eliminated by an appropriate pre-processing step that delivers an image of good quality according to the HVS. Hence, a trustworthy metric that can assess image quality like a human observer and guide the pre-processing procedure supports a dependable post-processing workflow (segmentation and analysis).

In this paper, we propose a quantitative metric, denoted TIQA, for X-ray tomographic image quality assessment. Moreover, we address the lack-of-data issue for X-ray tomographic images through a data generation process. Overall, our approach performs well and outperforms the other two IQA methods (BRISQUE and RankIQA) on X-ray tomographic images, given only a few annotations for training. It is worth noting that, although we strive to reduce the demand for training annotations, a small number of labels is still required, so our method cannot be considered a totally blind IQA method.

The correlation between our metric and the segmentation performance has also been explored. The qualitative and quantitative evaluation results show that the segmentation performance is associated with the predicted quality score, which is itself related to subjective human annotations. This correlation suggests that the uncertainties and variations of segmentation results can be reduced by applying pre-processing algorithms that improve the image quality.

Regarding the idea of using a neural network to evaluate IQA results, we use a method similar to that of Samuel et al.46, who investigated the effect of image quality on DNN results by applying differently distorted images to the same network, but we considered more types of distortion. Instead of focusing on the image classification problem, which assigns an image to one category, we analyzed the impact of distortion on the image segmentation problem, which is concerned with pixel classification. Taking advantage of the uncertainty map and IoU map, the influence of distortion can be clearly visualized.

In conclusion, this work provides a quantitative IQA metric, grounded in subjective human opinion, to guide the pre-processing step so that observer-dependence can be alleviated or removed from the pre-processing and segmentation steps. It greatly reduces the tedious work of selecting good images and facilitates the automation of X-ray tomography image analysis. In addition, it provides a more reliable assessment of pre-processed images, which avoids conflicts between different human observers and promises improved segmentation analysis.

However, some limitations remain. Undistorted images are not well evaluated by our method owing to the lack of images of excellent quality. Although our approach does not need hundreds of images for training, the image quality estimates could still be improved with a larger dataset. These limitations can be addressed with the contribution of the community through the sharing of open-source X-ray tomographic data, such as Tomobank37,38,39.

Interestingly, given the demand for automatic analysis of tomography images, our TIQA method can be extended to improve image quality using different image processing methods. For example, by constructing a teacher-student model, our method (teacher) could teach a distortion-removal network (student) to eliminate distortions automatically. This would greatly relieve the burden on human observers and reduce the impact of distortion on segmentation. In addition, the image quality assessment could be extended to object-oriented assessment: for instance, by learning object information, the network could judge whether the materials inside the battery are damaged or not.

Methods

Dataset creation

We collected 40 8-bit images from 11 different types of batteries at different resolutions. All the images were rescaled to the same size of 224 × 224. To avoid deformation, we first resized the original image so that its width or height equaled 255 pixels while preserving the aspect ratio, then randomly cropped a region of 224 × 224 pixels. We also kept six original images to analyze the impact of the downsampling operation on the image quality score. To expand the dataset, we applied different algorithms with different parameters to generate images with different types of distortion. For example, we generated several rings with different radii and intensities based on the original images and added them to the images to imitate ring artifact distortion. For blur and noise, we used the methods implemented in scikit-image47. Similar to Hanne’s method48, we manually set the parameter values controlling the distortion amount so that the visual quality of the distorted images varied, corresponding to an expected rating from 1 (terrible) to 5 (excellent). The distortion parameter values were chosen based on a small set of images and applied identically to the remaining images in our database.
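The exact ring generation procedure is not specified beyond the description above; the sketch below is one plausible reading, assuming additive concentric rings on a 2D grayscale image, with blur and noise taken from scikit-image47 as stated. Function names and default parameter values are illustrative.

```python
import numpy as np
from skimage.filters import gaussian
from skimage.util import random_noise

def add_ring_artifact(image, n_rings=5, max_intensity=0.1, rng=None):
    """Superimpose concentric rings of random radii and intensities
    on a 2D grayscale image to imitate ring artifact distortion."""
    rng = rng or np.random.default_rng()
    h, w = image.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)      # radius of every pixel
    out = image.astype(float).copy()
    for _ in range(n_rings):
        radius = rng.uniform(0, r.max())
        intensity = rng.uniform(-max_intensity, max_intensity)
        out[np.abs(r - radius) < 1.0] += intensity * 255
    return np.clip(out, 0, 255)

def add_blur(image, sigma=2.0):
    """Gaussian blur from scikit-image."""
    return gaussian(image, sigma=sigma, preserve_range=True)

def add_noise(image, var=0.01):
    """Additive Gaussian noise from scikit-image (expects [0, 1])."""
    return random_noise(image / 255.0, mode="gaussian", var=var) * 255.0
```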

We conducted two surveys to collect subjective image quality scores, distributing them to different people, including both beginners and experts in this field, who annotated each image on five levels: terrible, bad, average, good, and excellent. For each image, we collected annotations from 15 people and took the average as its quality score.

Data generation

As illustrated in Fig. 2, the preprocessed images were regarded as reference images. Several distortion filters, including noise, blur, and ring artifact, were then applied to the reference images to generate the distorted images. The parameter values of the filters were varied, as shown in Table S7, to create different levels of distortion. For label projection, we used five FR-IQA evaluators, mimicking human observers, to compute the difference between a reference image and a distorted image and assign a score to the distorted image. Because the score range of each evaluator varies, we normalized and rescaled the scores to the same range. Finally, we averaged the resulting scores and took the average as the generated label.
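The sketch below illustrates the label projection idea with two stand-in observers (SSIM and PSNR from scikit-image) in place of the five FR-IQA evaluators, and min-max rescaling as one plausible normalization; names and defaults are illustrative.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

# Stand-in "observers": each maps a (reference, distorted) pair to a
# raw quality score. The paper uses five FR-IQA metrics.
observers = [
    lambda ref, dist: structural_similarity(ref, dist, data_range=255),
    lambda ref, dist: peak_signal_noise_ratio(ref, dist, data_range=255),
]

def project_labels(reference, distorted_images, lo=1.0, hi=5.0):
    """Score each distorted image with every observer, rescale each
    observer's scores to [lo, hi], then average across observers."""
    raw = np.array([[obs(reference, d) for d in distorted_images]
                    for obs in observers])            # (n_obs, n_img)
    mins = raw.min(axis=1, keepdims=True)
    maxs = raw.max(axis=1, keepdims=True)
    scaled = lo + (hi - lo) * (raw - mins) / np.maximum(maxs - mins, 1e-12)
    return scaled.mean(axis=0)                        # generated labels
```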

Score prediction

As shown in Fig. 1, we took the EfficientNet network as the feature extractor and changed the last three layers to output a score for each input image. Between the dense layers, we added dropout49 to avoid overfitting. Instead of training the network from scratch, we transferred the weights from a model pre-trained on ImageNet50 to reduce the convergence time51. The input image size was fixed at 224 × 224 × 3 and the corresponding output was a single score of shape 1 × 1.
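A minimal Keras sketch of such a network is given below, assuming the EfficientNetB0 variant; the exact layers replaced are not specified in the text, so the regression head shown here is illustrative.

```python
import tensorflow as tf

def build_score_network(input_shape=(224, 224, 3), dropout_rate=0.5):
    """EfficientNet backbone with ImageNet weights and a small
    regression head that outputs a single quality score."""
    backbone = tf.keras.applications.EfficientNetB0(
        include_top=False, weights="imagenet", input_shape=input_shape)
    x = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
    x = tf.keras.layers.Dropout(dropout_rate)(x)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    x = tf.keras.layers.Dropout(dropout_rate)(x)
    score = tf.keras.layers.Dense(1)(x)  # shape (batch, 1)
    return tf.keras.Model(backbone.input, score)
```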

We built image pairs by picking an original image and generating distorted versions at different levels. The image with the lower level of distortion was regarded as better than the one with the higher level. Taking advantage of this generated ranking information, the network could order the images by quality. The corresponding rank loss52 function is

$$L\left( \hat{y}_i, \hat{y}_j \right) = \max\left( 0, m + \hat{y}_i - \hat{y}_j \right)$$
(1)

where \(\hat y_i\) and \(\hat y_j\) are the prediction results of a pair of images, and m, set to 6 in our experiments, is a margin that controls the minimum distance between the scores of the positive image pair.
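As a sketch, Eq. (1) can be transcribed directly as a TensorFlow loss, assuming the pairs are arranged so that \(\hat y_j\) corresponds to the better (less distorted) image:

```python
import tensorflow as tf

def rank_loss(y_hat_i, y_hat_j, margin=6.0):
    """Hinge rank loss of Eq. (1). The loss is zero only when the
    better image (j) outscores the worse one (i) by at least `margin`."""
    return tf.reduce_mean(tf.maximum(0.0, margin + y_hat_i - y_hat_j))
```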

After the image ordering process, the human annotations and the generated machine labels were fed into the network to regress the output score to a fixed range using the mean squared error (MSE) loss function.

Training and testing parameters

In the score prediction module, we used 32 original images, which were expanded to 512 images (without labels) by data generation, for training the ranking. The initial learning rate was set to 3e-5 and decayed after several iterations. The network was trained for 30 epochs, iterating over the whole dataset in each epoch. The dropout rate was set to 0.5 to avoid overfitting. The Adam53 optimizer was used to optimize the rank loss.

After training the ranking, the model was fine-tuned in the score regression step. The training dataset contained 29 images of size 224 × 224 × 3 with corresponding labels in the range from 1 to 5. The data generation method was also used to expand this training dataset to 464 images with generated annotations, which were then fed into the network for regression with the MSE loss. The network was trained for 20 epochs with an initial learning rate of 5e-5, which decayed every 4 epochs. The dropout rate was 0.5 during training. For testing, a total of 56 images were evaluated against the corresponding human annotations. All the experiments were conducted in Python with the TensorFlow54 library on a Tesla K80 GPU.

Evaluation metrics

The PLCC (Pearson’s linear correlation coefficient) is the linear correlation coefficient between the predicted scores and the human-labeled scores. It measures the prediction accuracy of an IQA metric, i.e., the ability of the metric to predict the subjective scores with low error. The PLCC is calculated as follows:

$$\mathrm{PLCC} = \frac{\sum_{i = 1}^{M_d} \left( \hat{y}_i - \hat{y}_{avg} \right)\left( y_i - y_{avg} \right)}{\left( \sum_{i = 1}^{M_d} \left( \hat{y}_i - \hat{y}_{avg} \right)^2 \right)^{\frac{1}{2}} \left( \sum_{i = 1}^{M_d} \left( y_i - y_{avg} \right)^2 \right)^{\frac{1}{2}}}$$
(2)

where \(\hat y_i\) and \(y_i\) are the predicted score and the human-labeled score of the ith image in a dataset of size \(M_d\), respectively, and \(\hat y_{avg}\) and \(y_{avg}\) are the averages of the predicted scores and human-labeled scores, respectively.

The SROCC (Spearman’s rank-ordered correlation coefficient) is the rank correlation coefficient between the predicted scores and the labeled scores; it measures the monotonicity of the prediction performance, i.e., the extent to which the predicted scores agree with the relative magnitude of the labels. The SROCC is calculated via the following equation:

$$\mathrm{SROCC} = 1 - \frac{6\sum_{i = 1}^{M_d} d_i^2}{M_d\left( M_d^2 - 1 \right)}$$
(3)

where \(d_i\) is the difference between the rank of the ith image in the prediction results and in the labels.
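Both coefficients are available in SciPy, which offers a quick check of Eqs. (2) and (3); the scores below are toy values for illustration only:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

predicted = np.array([4.2, 3.1, 2.5, 4.8, 1.9])  # toy predicted scores
labeled = np.array([4.0, 3.3, 2.2, 5.0, 1.5])    # toy human labels

plcc, _ = pearsonr(predicted, labeled)    # Eq. (2)
srocc, _ = spearmanr(predicted, labeled)  # Eq. (3)
print(f"PLCC = {plcc:.3f}, SROCC = {srocc:.3f}")
```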

Segmentation-based evaluation method

To inspect the effect of distortion on segmentation accuracy, we applied D-LinkNet43, an encoder-decoder network connected by dilated convolutions55, to tomography image segmentation. It segments the image into two classes and produces a probability map indicating the probability of each pixel belonging to each class. The classification result is then generated by thresholding the probability map. The network was trained for 200 epochs on 110 images with segmentation labels. The input images and labels had a size of 1024 × 1024 and were normalized to the range of 0 to 1 before being input to the network. The initial learning rate was 1e-4 and decayed to one-fifth of its previous value after a fixed number of steps. The optimizer was Adam, and the binary cross-entropy loss was used to measure the difference between prediction and ground truth.

In the testing procedure, the output of the network was used to generate the uncertainty map. We used the entropy function44 to calculate the uncertainty, which is defined as follows:

$$H\left[ y \mid x, X, Y \right] = -\sum_c p\left( y = c \mid x, X, Y \right) \log p\left( y = c \mid x, X, Y \right)$$
(4)

where x is the test image, y is the predicted class, X and Y are the images and labels used in training, and c is the class index.
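For the two-class case used here, Eq. (4) reduces to the binary entropy of the foreground probability. A NumPy sketch of the uncertainty map and the thresholding step (the threshold value is illustrative):

```python
import numpy as np

def uncertainty_map(prob, eps=1e-12):
    """Pixel-wise binary entropy of the foreground probability map
    (Eq. 4 with two classes); high values mark uncertain pixels,
    typically along the phase boundaries."""
    p = np.clip(prob, eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def binarize(prob, threshold=0.5):
    """Threshold the probability map to obtain the segmentation."""
    return (prob >= threshold).astype(np.uint8)
```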

The IoU (intersection over union) and F1 scores were used to measure the segmentation performance. The IoU is the area of overlap between the predicted segmentation and the ground truth divided by the area of their union. It ranges from 0 to 1, with 0 signifying no overlap and 1 indicating perfect overlap. Differently from the IoU, the F1 score is calculated by:

$$\mathrm{F1} = \frac{2 \times \mathrm{overlap}}{\mathrm{total\;pixels}}$$
(5)

where total pixels is the combined number of pixels in the segmentation result and the ground truth.
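For binary masks, both measures follow from simple pixel counts (note that the F1, or Dice, score equals 2·IoU/(1 + IoU)); a minimal sketch:

```python
import numpy as np

def iou_f1(pred, gt):
    """IoU and F1 (Dice) scores for two binary masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    overlap = np.logical_and(pred, gt).sum()   # true positives
    union = np.logical_or(pred, gt).sum()
    total = pred.sum() + gt.sum()              # pixels in both masks
    iou = overlap / union if union else 1.0
    f1 = 2 * overlap / total if total else 1.0  # Eq. (5)
    return iou, f1
```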