Introduction

The insurance business has grown rapidly in recent years, as more people insure their lives and property to control the risk of extensive repair costs after an accident damages a car or other property. Car insurance is a major line of the insurance business; it is mandatory for cars that have not yet been fully paid off. A crucial process in the operation of a car insurance company is car damage evaluation, an intricate task that requires evaluators with comprehensive experience and skill in handling car damage. Evaluators base their work on evidence of the accident, e.g., video recorded by the car's camera, photos of the damage taken with mobile phones and log data from IoT devices, such as telematics [1, 2]. They must also present their damage evaluation to several parties and estimate the repair cost. This process not only takes a long time but is also prone to human error, fatigue and bias. Insurance companies want to make this process more accurate without hiring many highly paid damage evaluators.

New technology has made computers more powerful: machine learning enables a computer to learn from big data and provide clues for decision makers, and computer vision enables a computer to recognize objects in an image or a video clip, which is directly applicable to the insurance business. Edge computing pushes heavy computation tasks, e.g., artificial intelligence, computer vision and complex algorithms, from centralized computing to the edge of the network, i.e., a front-end device such as a smartphone, which benefits from improved privacy, reliability and lower network latency [3,4,5]. With it, evaluators can use a smartphone to capture complete views of a car and analyze the captured image or video in real time to evaluate damage and estimate the repair cost instantly. Since any insurance company requires photos of damage to an insured car or property as evidence, we applied these new technologies to automate some steps of damage evaluation from photos of the damaged car: (1) identification of car parts; (2) identification of damaged parts; (3) damage evaluation for each part; and (4) repair cost estimation. These steps are illustrated in the schematic diagram in Fig. 1.

Fig. 1 Car damage evaluation steps
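To make the pipeline concrete, the sketch below chains the four steps for a single photo. All four helper functions are hypothetical placeholders for trained models and pricing tables, not components released with this work.

```python
from dataclasses import dataclass

@dataclass
class Part:
    label: str    # e.g., "front_bumper"
    mask: object  # instance mask produced by the segmentation model

@dataclass
class PartDamage:
    label: str
    severity: float     # 0.0 (intact) .. 1.0 (destroyed)
    repair_cost: float

# Placeholders: in a real system these would be trained models / pricing tables.
def segment_car_parts(image) -> list[Part]: ...                       # step 1
def is_damaged(image, part: Part) -> bool: ...                        # step 2
def estimate_severity(image, part: Part) -> float: ...                # step 3
def estimate_repair_cost(label: str, severity: float) -> float: ...   # step 4

def evaluate(image) -> list[PartDamage]:
    """Chain the four steps of Fig. 1 for one photo of a car."""
    report = []
    for part in segment_car_parts(image):
        if is_damaged(image, part):
            severity = estimate_severity(image, part)
            cost = estimate_repair_cost(part.label, severity)
            report.append(PartDamage(part.label, severity, cost))
    return report
```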

Here, we used image segmentation to automatically identify car parts. Image segmentation is similar to object detection: it detects where an object is located in an image, but adds recognition of the context of the object. The essential difference between the two techniques is that image segmentation works at the pixel level, whereas object detection works at the level of bounding boxes around objects. Image segmentation can be either semantic segmentation, in which identical objects in the image are considered the same object, or instance segmentation, in which identical objects are recognized as different instances. We used instance segmentation, since we wanted to differentiate instances of the same object: for example, some car parts come in a left and right pair, and instance segmentation enabled us to differentiate between the two members of the pair. A literature review showed that papers on car part segmentation are still limited and that no standards or criteria for this process have been established. Therefore, we tested a set of state-of-the-art deep-learning algorithms on a self-developed car part data set, containing images annotated with descriptions of the objects in them. Our contributions are:

  1. Development of an extensive car part data set—annotated images of car parts from multiple viewpoints—some taken from the Internet and some taken by our team.

  2. Comparative evaluation, in terms of mean average precision, of Mask R-CNN (the baseline technique) with a ResNet backbone against four state-of-the-art instance segmentation algorithms—the top four algorithms reported by paperswithcode.com [6].

  3. Robustness testing, in terms of mPC and rPC, of models from the four state-of-the-art instance segmentation algorithms and the baseline model against real weather elements and lighting conditions in the photos.

The rest of this paper is arranged as follows: the second section briefly describes related works; the third section briefly describes the tested algorithms; the fourth section discusses the experimental setup and the data sets; the fifth section reports and discusses results, and the final section concludes.

Related works

Edge computing has emerged to push computation capability closer to end-devices; it can improve response times and reduce the required network bandwidth. With a combination of front-end devices, edge nodes and cloud computing, many applications that use machine learning and computer vision techniques have been deployed successfully. Many researchers have designed their algorithms to operate fully on front-end devices to enhance system efficiency. Velichko [7] proposed a lightweight neural network algorithm called "LogNNet", which used filters based on logistic mapping for image recognition tasks and can be employed in low-memory devices. Howard et al. [8] and Sandler et al. [9] developed MobileNets and MobileNetV2, efficient lightweight Convolutional Neural Network (CNN) models designed to work on mobile devices. Tuli et al. [10] developed an object detection framework, EdgeLens, that integrated IoT, fog and cloud computing.

Applications of instance segmentation have included detection of individual humans in an image based on their posture. For example, Zhang et al. [11] presented an instance segmentation method for human detection based on a human pose skeleton. It recognized the context of a posture even when another human in the image was nearby or overlapping, a capability that differentiated it from other instance segmentation algorithms, e.g., Mask R-CNN [12]. Other instance segmentation applications include identification of biological objects in an image: Yi et al. [13] presented an instance segmentation method for biological objects that worked on heat map images.

Several new instance segmentation algorithms have been proposed recently. For instance, CenterMask [14] did not use a bounding box but a spatial attention-guided mask, which differed from algorithms that use a fully connected layer, e.g., Mask R-CNN. In addition, it used a fully convolutional one-stage object detector (FCOS) [15], rather than Faster R-CNN [16], for the object detection task, resulting in higher detection accuracy on both still images and video frames. In another example, Wang et al. [17] developed "SOLO", a one-step instance segmentation algorithm that did not use bounding boxes for object detection but instead divided an image into a grid of cells and detected the object of interest in each cell. It used a semantic category branch to determine the semantic category and an instance mask branch to determine the instance category. SOLO was later improved into SOLOv2 [18], in which mask learning was based on dynamic convolution: no weights or parameters in the model were fixed, so the feature maps could adapt to various kinds of input. The model had two mask branches: a mask kernel branch, for learning the convolution kernel, and a mask feature branch, for learning convolved features. Matrix Non-Maximum Suppression was used to reduce processing time, which was shorter than that of any other tested algorithm.

Recently, one-stage instance segmentation methods, which do not have separate branches for performing different functions—e.g., PolarMask [19], RDSNet [20] and YOLACT++ [21]—have gained more attention from researchers than two-stage methods. A two-stage method performs object detection first, then constructs a mask branch to predict each mask in a bounding box; examples are Mask R-CNN, PANet [22] and Mask Scoring R-CNN [23]. Chen et al. [24] proposed BlendMask, with an improved FCOS object detector, adding a blender module to an attention map. The blender module included both high- and low-resolution masks in every bounding box mask, enabling the model to predict masks more accurately and rapidly than Mask R-CNN or other two-stage algorithms.

Among studies on car part segmentation, Lu et al. [25] presented a semantic segmentation method for car parts, based on landmark assignment and the boundaries of each part. They used a graphical model to find relationships between car parts, then segmentation by weighted aggregation (SWA) [26] to pair nearby landmarks, then a Segment Appearance Consistency (SAC) technique to connect segments of nearby landmarks at every level of a hierarchical segmentation and to determine whether the same segment was represented at every hierarchical level. The outcome was a group of pixels that could classify various car parts. Nevertheless, in SAC and hierarchical segmentation, the meaning of a car part differed between levels; in other words, SAC, after only one round of SWA, was not able to segment all car parts in an image. Singh et al. [27] built a system to detect different car parts and localize their damage; however, the algorithms used in their system—Mask R-CNN, PANet and an ensemble model based on Mask R-CNN and PANet—did not perform well: the mAP was lower than 0.5 for all algorithms. Dhieb et al. [28] used Inception-ResNetV2 to classify damage severity level and to localize and detect part damage. Patil et al. [29] and Dwivedi et al. [30] used various CNN models to classify car part damage, but these works focused on only a small set of car parts.

A website, paperswithcode.com, ranked all instance segmentation methods and identified the state-of-the-art ones [6]. They were benchmarked on various data sets, e.g., PASCAL VOC [31] and the Common Objects in Context (COCO) Challenge [32]. Since we needed the best model for instance segmentation of car parts, we selected algorithms evaluated on the large COCO test-dev task data set, which has a large number of categories, using Mask R-CNN with ResNet as the baseline. The evaluated methods were the top four, as ranked by paperswithcode.com on 30/09/2019, that also used ResNet as the backbone: HTC [33], CBNet [34], PANet [22] and GCNet [35]. These algorithms are briefly described in the next section.

Methodology

The top-ranked algorithms from paperswithcode.com, as of 30/09/2019, are briefly described here.

Mask region-based convolutional neural network (Mask R-CNN)

The Mask R-CNN instance segmentation algorithm [12] was a development of Faster R-CNN [16]. Faster R-CNN was only able to detect where a target object was in an image and recognize it, whereas Mask R-CNN also performed instance segmentation. Mask R-CNN had two main parts: (1) a backbone that extracted features from an image with a Residual Neural Network (ResNet), a CNN 50–101 layers deep [36], in combination with a Feature Pyramid Network (FPN) [37], and (2) a head that constructed a bounding box around a Region of Interest (ROI) and predicted the type of object in the box. The additional step of Mask R-CNN over Faster R-CNN constructed a mask for each ROI. In Mask R-CNN, after the backbone extracted features from an image, these features were input into a Region Proposal Network (RPN), which constructed anchor boxes of various sizes containing an object of interest and passed them to an ROI extractor, which extracted the features of each ROI. Each ROI map was forwarded to fully connected layers consisting of two parallel components: the original components of Faster R-CNN for predicting bounding boxes and classifying objects of interest, and an additional component for predicting a mask in each bounding box. The flowchart of Mask R-CNN is illustrated in Fig. 2. Mask R-CNN was ranked number five by paperswithcode.com.

Fig. 2 Mask R-CNN
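As an illustration of running an off-the-shelf Mask R-CNN with a ResNet-50-FPN backbone, the sketch below uses torchvision (≥ 0.13) weights pretrained on COCO; our experiments used MMDetection instead, and the image path here is hypothetical.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Mask R-CNN with a ResNet-50-FPN backbone, pretrained on COCO.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("car.jpg").convert("RGB"))  # hypothetical file
with torch.no_grad():
    pred = model([image])[0]  # one dict per input image

# Keep confident detections; each has a box, a label and a soft mask.
keep = pred["scores"] > 0.5
boxes = pred["boxes"][keep]        # (N, 4) boxes in xyxy pixel coordinates
masks = pred["masks"][keep] > 0.5  # (N, 1, H, W) binarized instance masks
labels = pred["labels"][keep]      # COCO category ids
```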

Global context network (GCNet)

GCNet [35] had a structure similar to Mask R-CNN, as can be seen in Fig. 2, but the ResNet-FPN backbone was augmented with a global context (GC) block (Fig. 3). The Non-local Network (NLNet), on which part of the block was based, solved the long-range dependency issue of deep neural networks [38]; it worked in combination with a Squeeze-and-Excitation Network (SENet) to find the relationships between the channels of each feature [39]. GCNet was as effective as NLNet but computed faster, because it used fewer convolution and operation layers than NLNet. It was ranked number four by paperswithcode.com.

Fig. 3 Global context (GC) block. The feature map has size, \(C \times H \times W\)—channel number C, height H and width W. \(\otimes \) denotes matrix multiplication and \(\oplus \) denotes broadcast element-wise addition. r is the bottleneck ratio and C/r denotes the hidden representation dimension of the bottleneck
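The following is a minimal PyTorch sketch of the GC block in Fig. 3, written for illustration rather than taken from the GCNet code base; the default bottleneck ratio r = 1/16 is an assumption.

```python
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    """Sketch of a GC block: context modeling + bottleneck transform (Fig. 3)."""

    def __init__(self, channels: int, ratio: float = 1 / 16):
        super().__init__()
        hidden = max(1, int(channels * ratio))  # C/r hidden dimension
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)  # attention weights
        self.transform = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.size()
        # Context modeling: softmax attention over all H*W positions,
        # then matrix multiplication with the flattened feature map.
        weights = self.attn(x).view(b, 1, h * w).softmax(dim=-1)  # B x 1 x HW
        context = torch.bmm(x.view(b, c, h * w),
                            weights.transpose(1, 2))              # B x C x 1
        context = context.view(b, c, 1, 1)
        # Bottleneck transform + broadcast element-wise addition.
        return x + self.transform(context)
```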

Path aggregation network (PANet)

PANet was developed by Liu et al. [22]. It had a structure similar to Mask R-CNN, as shown in Fig. 4, but the RPN and ROI extractor were replaced by bottom-up path augmentation and adaptive feature pooling components. The bottom-up path augmentation component took the input from the previous stage and processed it together with the output of each FPN layer to generate feature maps, which better mixed high- and low-level features. Then, the adaptive feature pooling component processed the feature maps from every layer and concatenated its outputs, which were sent to the head component, consisting of several fully connected layers, to detect objects, construct masks and bounding boxes and classify the detected objects. Because of these processes, PANet was highly accurate: it was able to take advantage of all levels of each feature map, from low- to high-level features. PANet was ranked number three by paperswithcode.com.

Fig. 4 PANet: A FPN backbone, B bottom-up path augmentation, C adaptive feature pooling, D box branch, E fully connected fusion

Cascade mask R-CNN with composite backbone network (CBNet)

This method combined Cascade Mask R-CNN [40] and the Composite Backbone Network [34]. First, CBNet improved the feature extraction step, using a number of connected backbones, called assistant backbones. Each connected backbone extracted some features and sent a feature map to the next backbone, which extracted further features and sent a new feature map onward, and so on. The last backbone, called the 'lead backbone', generated the final feature map, which was consecutively concatenated with the features extracted by all previous backbones in the chain. Because of this repeated extraction, low- and high-level features were mixed more effectively than in Mask R-CNN. Second, Cascade Mask R-CNN, whose head was modified from that of Mask R-CNN, improved prediction accuracy: the bounding box head of a previous branch was forwarded to the ROI extractor of the next branch, as illustrated in Fig. 5. This method was ranked number two by paperswithcode.com.

Fig. 5 Cascade Mask R-CNN with CBNet. The composite backbone—a combination of assistant backbones and a lead backbone—helped improve prediction accuracy

Hybrid task cascade for instance segmentation (HTC)

This algorithm was developed by Chen et al. [33] to improve the efficiency of the instance segmentation task. In this algorithm, the bounding box head, mask head and ROI extractor were interleaved in a cascade, as illustrated in Fig. 6, so the bounding box prediction and mask prediction tasks proceeded in parallel instead of independently. A multi-stage mask branch technique was introduced, which took the mask from the previous branch into account when generating the mask in the current branch, to improve information flow. Lastly, a semantic mask branch was connected to the head of every mask to help the model better understand the context of the information in every mask. All of these features improved the information flow in every task. This method was ranked first by paperswithcode.com.

Fig. 6 Hybrid task cascade for instance segmentation

Experimental framework

Data set

The data set contained 500 images of sedans, pickups and sport utility vehicles (SUVs), collected from the Internet and taken in public parking spaces. The vehicles were photographed from multiple views—front, back and angled views. Car identification numbers were blurred to hide individual vehicle details. Each image was annotated with instance masks and bounding boxes for 18 categories: back_bumper, back_glass, back_left_door, back_left_light, back_right_door, back_right_light, front_bumper, front_glass, front_left_door, front_left_light, front_right_door, front_right_light, hood, left_mirror, right_mirror, trunk, tailgate (of trucks and SUVs) and wheel (wheel and tire). The number of instances per category is shown in Fig. 7 and examples of the images in the data set are in Fig. 8. The DSMLR Car Part data set contains images and annotations in COCO Challenge format and is available for download at https://github.com/dsmlr/Car-Parts-Segmentation.
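Since the annotations are in COCO Challenge format, they can be inspected with pycocotools; a minimal sketch is below, assuming the repository's annotation file sits at the path shown (check the repository layout, as it may differ).

```python
from pycocotools.coco import COCO

# Assumed path after cloning the repository; verify against the repo layout.
coco = COCO("Car-Parts-Segmentation/trainingset/annotations.json")

cats = coco.loadCats(coco.getCatIds())
print([c["name"] for c in cats])       # the car part category names

img_id = coco.getImgIds()[0]
for ann in coco.loadAnns(coco.getAnnIds(imgIds=[img_id])):
    mask = coco.annToMask(ann)         # (H, W) binary instance mask
    bbox = ann["bbox"]                 # [x, y, width, height]
```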

Fig. 7 Number of annotated instances per category for the DSMLR Car Part data set

Fig. 8 Samples of paired images and instance masks from the DSMLR Car Part data set: a sedan, b pickup and c sport utility vehicle (SUV)

Experimental procedures and settings

We evaluated five algorithms—Mask R-CNN [12], HTC [33], CBNet [34], PANet [22] and GCNet [35], each with ResNet-50 and ResNet-101 backbones—in terms of correctness and robustness on the car part data set. The algorithms were implemented with the MMDetection toolbox [41]. The experimental steps were as follows. First, we resized all input images to \(1024 \times 1024\) pixels, maintaining the aspect ratio by zero-padding. Next, we randomly partitioned the car part data set into a training set (80% of the data) and a test set (20%). Then, since it was necessary to determine the best number of training epochs for every evaluated algorithm, we ran a five-fold cross-validation, training for 200 epochs on each fold; the best number of epochs for each algorithm was the number that gave the lowest average five-fold validation loss. Validation loss was computed from five losses: (1) classification loss, (2) bounding box loss, (3) segmentation (mask) loss, (4) RPN classification loss and (5) RPN bounding box loss. Losses (4) and (5) were calculated by the cross-entropy loss embedded in the RPN. Next, we used Stochastic Gradient Descent (SGD) to find the optimal parameters, setting the learning rate to 0.02 and the weight decay to 0.0001. The optimal models were trained on the training set for the optimal number of epochs. The experiment was run five times with different random splits.
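For reference, a fragment of an MMDetection-style (v2.x) configuration matching these settings might look as follows; the base config name and the momentum value are illustrative assumptions, not our released configuration, and the two pipeline steps would sit inside the full train_pipeline list.

```python
# Hypothetical base config for one of the evaluated algorithms.
_base_ = './mask_rcnn_r50_fpn.py'

# SGD with lr = 0.02 and weight decay = 0.0001 as in the text;
# momentum = 0.9 is an assumed default.
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
runner = dict(type='EpochBasedRunner', max_epochs=200)  # CV upper bound

# Resize to 1024x1024 keeping the aspect ratio, then zero-pad.
resize_and_pad = [
    dict(type='Resize', img_scale=(1024, 1024), keep_ratio=True),
    dict(type='Pad', size=(1024, 1024)),
]
```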

Furthermore, we evaluated the robustness of each algorithm on the semantic segmentation and object detection tasks with corrupted data, simulating four real weather and lighting conditions, i.e., snow, frost, fog and ambient light, at five severity levels. The corrupted examples were generated by the methods described by Hendrycks and Dietterich [42] (visualized in Fig. 10).
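One publicly available implementation of these corruptions is the imagecorruptions package distributed with the robustness benchmark code; a minimal sketch, with a hypothetical input image, is below.

```python
import numpy as np
from PIL import Image
from imagecorruptions import corrupt  # pip install imagecorruptions

image = np.array(Image.open("car.jpg").convert("RGB"))  # hypothetical file

# The four corruption types used here, each at severity levels 1..5.
for name in ["snow", "frost", "fog", "brightness"]:
    for severity in range(1, 6):
        corrupted = corrupt(image, corruption_name=name, severity=severity)
        Image.fromarray(corrupted).save(f"car_{name}_{severity}.png")
```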

Performance evaluation

Correctness

Each algorithm was evaluated by average precision (AP), based on the COCO Challenge, an established evaluation method for object detection tasks. AP was calculated from the Intersection over Union (IoU) of each object of interest. IoU was calculated by

$$\begin{aligned} \mathrm{IoU} = \frac{\mathrm{Area~of~Overlap}}{\mathrm{Area~of~Union}}. \end{aligned}$$
(1)

A model was considered to have successfully detected an object if the IoU was equal to or higher than an assigned threshold: AP\(_{50}\) and AP\(_{75}\) denote the AP at IoU thresholds of 0.50 and 0.75, respectively. The mean average precision (mAP), as defined by the COCO Challenge, is the AP averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05, computed as:

$$\begin{aligned} \mathrm{mAP} = \frac{1}{10}\sum _{t \in \{0.50,\, 0.55,\, \ldots ,\, 0.95\}} \mathrm{AP}_{t}. \end{aligned}$$
(2)

Since car parts come in different sizes, we also evaluated AP across car part scales: AP\(_\mathrm{S}\) for small parts, with an area smaller than \(32^2\) pixels; AP\(_\mathrm{M}\) for medium parts, with an area between \(32^2\) and \(96^2\) pixels; and AP\(_\mathrm{L}\) for large parts, with an area greater than \(96^2\) pixels. Note that AP in the COCO Challenge is reported as a percentage.
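A minimal sketch of Eq. (1) and the COCO-style averaging in Eq. (2) is shown below; `average_precision` is a placeholder for the full matching and precision-recall computation, not a real library call.

```python
import numpy as np

def box_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two boxes in [x1, y1, x2, y2] format (Eq. 1)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

print(box_iou(np.array([0, 0, 100, 100.0]),
              np.array([50, 50, 150, 150.0])))  # 2500 / 17500 ~ 0.143

def average_precision(iou_threshold: float) -> float:
    """Placeholder for the full COCO matching / precision-recall computation."""
    raise NotImplementedError

# COCO mAP (Eq. 2): average AP over IoU thresholds 0.50, 0.55, ..., 0.95.
# m_ap = np.mean([average_precision(t) for t in np.arange(0.50, 1.00, 0.05)])
```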

Robustness

Robustness was measured using two metrics: mean performance under corruption (mPC) and relative performance under corruption (rPC) [43].

mPC is calculated by

$$\begin{aligned} \mathrm{mPC} = \frac{1}{N_c}\sum ^{N_c}_{c=1}\frac{1}{N_s}\sum ^{N_s}_{s=1}P_{c,s}, \end{aligned}$$
(3)

where \(N_c = 4\) is the number of corruption types and \(N_s = 5\) the number of severity levels (as set in this work), and \(P_{c,s}\) is the performance measured on test data corrupted with corruption type c at severity level s. Although several metrics could be used for P, in this work P was calculated using mAP. A higher mPC indicates a more robust algorithm.

rPC measures the relative degradation of performance on corrupted data compared to the original data. It was calculated by

$$\begin{aligned} \mathrm{rPC} = \frac{\mathrm{mPC}}{P_\mathrm{original}}, \end{aligned}$$
(4)

where \(P_\mathrm{original}\) is the performance of the algorithm on the original data, i.e., the mAP on the original data, and \(\mathrm{rPC} \in [0,1]\). rPC = 1 represents 'perfect' robustness, while 0 represents negligible robustness.
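The two robustness metrics reduce to simple averaging over the corruption grid; a minimal sketch with hypothetical mAP scores is below.

```python
import numpy as np

# Hypothetical mAP scores: rows = 4 corruption types, columns = 5 severity
# levels, plus the mAP obtained on clean (uncorrupted) test data.
p = np.array([
    [30.1, 27.5, 22.0, 18.3, 14.9],   # snow
    [29.4, 25.8, 21.1, 17.6, 13.2],   # frost
    [45.0, 43.2, 41.0, 38.5, 35.1],   # fog
    [50.2, 48.9, 47.1, 45.0, 42.3],   # brightness
])
p_original = 54.3

mpc = p.mean()           # Eq. (3): mean over corruptions and severities
rpc = mpc / p_original   # Eq. (4): fraction of clean performance retained
print(f"mPC = {mpc:.1f}, rPC = {rpc:.3f}")
```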

Experimental results and discussion

In this section, several comparisons were made and discussed:

  1. We compared overall algorithm performance on two tasks—object detection and semantic segmentation.

  2. We discussed robustness under potential real weather elements and lighting conditions.

  3. We further discussed performance and robustness when left- and right-side parts were grouped under one label.

Overall performance of object detection and semantic segmentation tasks

The performance of all the algorithms is shown in Table 1, which includes mAP and AP at different thresholds. It can be seen that HTC with the ResNet-101 encoder achieved the best mAP, at 54.3, in object detection. In addition, it worked best on small and medium car parts, with AP\(_\mathrm{S}\) and AP\(_\mathrm{M}\) at 35.6 and 52.0, respectively. It was followed by HTC with the ResNet-50 encoder, with mAP at 54.1. On the stricter AP\(_{75}\) metric, HTC with ResNet-101 came in second at 62.4, while HTC with ResNet-50 was the best contender at 63.6; HTC with ResNet-50 also performed best on large car parts, with AP\(_\mathrm{L}\) at 61.1. Surprisingly, Mask R-CNN with ResNet-50—the baseline—scored highest on AP\(_{50}\), at 77.0, but it did not perform well on the stricter metric. This was because Mask R-CNN tried to detect, classify and segment the car parts with low-level features, whereas the other algorithms used global or high-level features for the segmentation task. On the other hand, Mask R-CNN with the ResNet-101 encoder achieved the highest mAP, at 55.4, in the semantic segmentation task, as well as the highest score on the strictest metric, AP\(_{75}\), at 65.2, on a par with HTC with the ResNet-50 encoder. Here, HTC with the ResNet-50 encoder secured the second best mAP, at 55.2, a small difference from Mask R-CNN with the ResNet-101 encoder. It also worked best on large car parts, with AP\(_\mathrm{L}\) at 63.6. In addition, PANet performed best on small car parts, yielding AP\(_\mathrm{S}\) at 38.5.

In terms of performance related to car part size, the models performed better on large parts, followed by medium and small parts. The average AP\(_\mathrm{L}\), AP\(_\mathrm{M}\) and AP\(_\mathrm{S}\) across all the models in the object detection task were 55.3, 46.9 and 32.1, respectively; in semantic segmentation, they were 59.2, 48.7 and 33.2, respectively. Larger parts led to better performance. Figure 9 shows samples of object detection and semantic segmentation by the models with ResNet-50 and ResNet-101 encoders.

Table 1 Overall model performance on object detection and semantic segmentation tasks
Fig. 9 Sample of object detection and semantic segmentation results: a ResNet-50 encoder and b ResNet-101 encoder

To determine which combination of model and encoder achieved the best overall performance, we used Kendall's coefficient of concordance (W) to measure agreement between evaluation metrics. We ranked the 10 candidate models (5 models, each with 2 encoders) on 12 performance metrics (2 tasks with 6 metrics each) and reported the sum of ranks of each candidate model in Table 1, which yields the overall ranking of the candidate models. Next, we calculated W by

$$\begin{aligned} W = \frac{12\left( \sum _{i=1}^{k}R_{i}^2\right) -3k^{2}n(n+1)^{2}}{k^{2}n(n^{2}-1)-k\sum _{j=1}^{k}(T_j)}, \end{aligned}$$
(5)

where n is the number of candidate models, \(R_i\) is the sum of ranks for the i-th candidate, k is the number of performance metrics and \(T_j\) is a correction factor for tied ranks in the j-th metric (see [44] for details). Here, \(n = 10\) and \(k = 12\), giving \(W = 0.5079\), which is transformed into a \(\chi ^2\) value for significance testing against the null hypothesis of no agreement,

$$\begin{aligned} \chi ^2 = k(n-1)W. \end{aligned}$$
(6)

Thus, \(\chi ^2 = 54.8350\), which leads to \(p<0.01\) for 9 degrees of freedom, so we rejected the null hypothesis. Therefore, we confirmed that HTC with ResNet-50 and HTC with ResNet-101 ranked first and second, respectively.
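A minimal sketch of Eqs. (5) and (6), assuming a \(k \times n\) array of metric scores in which higher is better, is shown below.

```python
import numpy as np
from scipy.stats import rankdata, chi2

def kendalls_w(scores: np.ndarray) -> tuple[float, float]:
    """Kendall's W with tie correction (Eq. 5) and its chi-square p value (Eq. 6).

    scores: (k, n) array -- k metrics (judges) scoring n candidate models.
    """
    k, n = scores.shape
    ranks = np.vstack([rankdata(-row) for row in scores])  # rank 1 = best
    r = ranks.sum(axis=0)                                  # R_i per model

    # Tie correction: T_j = sum(t^3 - t) over tied groups within metric j.
    t = np.array([np.sum(np.unique(row, return_counts=True)[1] ** 3
                         - np.unique(row, return_counts=True)[1])
                  for row in ranks])

    w = (12 * np.sum(r ** 2) - 3 * k ** 2 * n * (n + 1) ** 2) / (
        k ** 2 * n * (n ** 2 - 1) - k * np.sum(t))
    p = chi2.sf(k * (n - 1) * w, df=n - 1)   # Eq. (6) significance test
    return w, p

# Example with hypothetical scores: 12 metrics x 10 models.
rng = np.random.default_rng(0)
w, p = kendalls_w(rng.random((12, 10)))
```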

Robustness

In this section, the models from the previous subsection were evaluated further. They were tested on modified test data, covering the set of real weather elements and lighting conditions at different severity levels shown in Fig. 10. The overall robustness test results, broken down by type of corruption for the object detection and semantic segmentation tasks, are shown in Table 2. GCNet with the ResNet-50 encoder was the best contender: it achieved the highest robustness, based on rPC, in object detection, at 64.8%, and in semantic segmentation, at 64.4%. It yielded the best mPC under all weather conditions for both object detection and semantic segmentation, except brightness changes in the object detection task. The worst was clearly CBNet with the ResNet-50 encoder, as it retained only 48.1% and 47.3% of its performance in the object detection and semantic segmentation tasks, respectively. HTC with ResNet-101 achieved the highest mAP in the object detection task on normal-condition images and, although it retained only 53.2% of its performance when the images were corrupted, its mPC was still ranked second, at 28.9, after GCNet with ResNet-50. Moreover, HTC with ResNet-101 obtained the highest mPC under brightness changes, at 42.4. The same applied to the semantic segmentation task: HTC with ResNet-101 ranked second in overall performance, with mPC = 29.3, similar to Mask R-CNN with ResNet-101. We also found that snow and frost degraded performance most for all algorithms, reducing it to less than 50% of the uncorrupted performance in both tasks. However, the algorithms tolerated changes in brightness and fog well: they retained 79.3% (light changes) and 63.0% (fog) of their performance in object detection, and 78.4% and 63.0% in semantic segmentation.

Fig. 10 Images in real environments and varied lighting conditions

Table 2 Performance of each method for object detection and semantic segmentation, including a robustness test with challenging real environments

Merging left- and right-side car part as one label

After evaluating overall performance and robustness, we ran an error analysis to find ways to improve the task. We found that the algorithms often confused left- and right-side parts, e.g., predicting a left-side part as the corresponding right-side part or vice versa. Therefore, we created a new version of the data set, in which a single label was assigned to the left and right sides of a part. Then we fine-tuned each model, pre-trained on the original labels, for 100 epochs—other settings remained the same.
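A minimal sketch of this relabeling on a COCO-format annotation file is shown below; the merge map and file paths are our illustrative assumptions, not the exact mapping released with the data set.

```python
import json

# Illustrative merge map over the category names listed above.
MERGE = {"front_left_door": "front_door", "front_right_door": "front_door",
         "back_left_door": "back_door", "back_right_door": "back_door",
         "front_left_light": "front_light", "front_right_light": "front_light",
         "back_left_light": "back_light", "back_right_light": "back_light",
         "left_mirror": "mirror", "right_mirror": "mirror"}

with open("annotations.json") as f:          # hypothetical input path
    coco = json.load(f)

merged = sorted({MERGE.get(c["name"], c["name"]) for c in coco["categories"]})
new_id = {name: i + 1 for i, name in enumerate(merged)}
old_to_new = {c["id"]: new_id[MERGE.get(c["name"], c["name"])]
              for c in coco["categories"]}

coco["categories"] = [{"id": i, "name": n} for n, i in new_id.items()]
for ann in coco["annotations"]:              # remap every annotation
    ann["category_id"] = old_to_new[ann["category_id"]]

with open("annotations_merged.json", "w") as f:
    json.dump(coco, f)
```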

Table 3 shows the overall performance on both object detection and semantic segmentation with the left- and right-side part labels merged. All performances were higher than when left- and right-side parts were labeled separately (Table 1): mAP increased by 5.76% for object detection and 5.27% for semantic segmentation, averaged over all models. The table shows that HTC with the ResNet-101 encoder yielded the highest mAP, at 59.4, followed by HTC with the ResNet-50 encoder, with mAP at 59.1, in object detection. HTC with ResNet-101 performed best on large car parts, with the highest AP\(_\mathrm{L}\) at 65.4, while HTC with the ResNet-50 encoder achieved the best performance on small and medium car parts, with AP\(_\mathrm{S}\) at 34.5 and AP\(_\mathrm{M}\) at 53.5. In addition, HTC with ResNet-50 was the best contender on the strictest metric, with AP\(_{75}\) at 68.6. Although Mask R-CNN with ResNet-50 received the highest AP\(_{50}\) score, it was still worse than HTC with ResNet-50 or ResNet-101 on the strictest metric. In semantic segmentation, HTC with ResNet-101 also ranked first, with mAP at 60.1, followed by Mask R-CNN with ResNet-50 or ResNet-101. Mask R-CNN apparently performed well in semantic segmentation, with the highest AP\(_{50}\) at 81.9, AP\(_{75}\) at 71.3 and AP\(_\mathrm{M}\) at 55.2. Again, we used Kendall's coefficient of concordance (W) to evaluate agreement between the metrics. The overall performance ranking changed: Mask R-CNN with ResNet-50 now ranked first, followed by HTC with ResNet-50 and ResNet-101. The rankings in the table are significant at \(p<0.01\) for 9 degrees of freedom (W = 0.4905 and \(\chi ^2\) = 52.9691).

Table 3 Overall performance of the selected models on object detection and semantic segmentation tasks in the merged car part sides scenario

We also evaluated the robustness of the algorithms in the merged-sides scenario on both tasks, as shown in Table 4. The overall picture was much the same as when the left and right sides were considered separately: GCNet was still the most robust algorithm, while the worst was CBNet. Moreover, snow and frost were still the corruption conditions that impacted the algorithms most.

Table 4 Performance of each method for object detection and semantic segmentation on merged part sides, including a robustness test with different real environments

Conclusion

Computer network technology and end-devices are becoming more powerful, and the car insurance business is growing rapidly, so an automated system for damage evaluation has become necessary. In this work, we described an automatic, image-based car part identification system built with deep learning techniques. We compared the performance of several state-of-the-art deep learning algorithms on a part segmentation task, using a car part data set created for this work, which is now publicly available. Our experiments showed that HTC was the best model, followed by Mask R-CNN and GCNet, in both object detection and semantic segmentation tasks in normal weather conditions. We also evaluated algorithm robustness in real environmental and lighting conditions, simulating conditions that would occur in the field when a photo is taken with a smartphone. GCNet was the most robust model, as it achieved the best overall performance under corrupted conditions, except varying brightness. Edge computing has now become practical enough to overcome the limitations of end-devices, enabling such models to operate on the end-device and providing a solution for real-time image analysis.

In future work, we will focus on developing a lighter-weight model for semantic segmentation to ease the load on the end-device without sacrificing accuracy or robustness. We also aim to extend the work to detect, localize and estimate the severity of damage to different parts.