Introduction

The insurance business has grown rapidly in recent years, as more people insure their lives and property to control the risk of extensive repair costs after an accident damages a car or other property. Car insurance is a major line of the insurance business; it is mandatory for cars that have not yet been fully paid off. A crucial process in the operation of a car insurance company is car damage evaluation, an intricate task that requires evaluators with comprehensive experience and skill in handling car damage. Evaluators base their work on evidence of the accident, e.g., video recorded by the car's camera, photos of the damage taken with mobile phones and log data from IoT devices, such as telematics [1, 2]. They must also present their damage evaluation to several parties and estimate the repair cost. This process not only takes a long time but is also prone to human error, fatigue and bias. Insurance companies want to make this process more accurate without hiring many highly paid damage evaluators.

New technology has made computers more powerful: machine learning enables a computer to learn from big data and provide clues for decision makers, and computer vision enables a computer to recognize objects in an image or a video clip, which is directly applicable to the insurance business. Edge computing pushes heavy computation tasks, e.g., artificial intelligence, computer vision and complex algorithms, from centralized computing to the edge of the network, i.e., a front-end device such as a smartphone, which benefits from improved privacy, reliability and lower network latency [3,4,5]. With it, evaluators can use a smartphone to capture complete views of a car and analyze the captured image or video in real time to evaluate damage and estimate the repair cost instantly. Since any insurance company requires photos of damage to an insured car or property as evidence, we applied these new technologies to automate some steps of damage evaluation from photos of the damaged car: (1) identification of car parts; (2) identification of damaged parts; (3) damage evaluation for each part; and (4) repair cost estimation. These steps are illustrated in the schematic diagram in Fig. 1.

Fig. 1 Car damage evaluation steps
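To make the pipeline concrete, the sketch below chains the four steps for a single photo. All four helper functions are hypothetical placeholders for trained models and pricing tables, not components released with this work.

```python
from dataclasses import dataclass

@dataclass
class Part:
    label: str    # e.g., "front_bumper"
    mask: object  # instance mask produced by the segmentation model

@dataclass
class PartDamage:
    label: str
    severity: float     # 0.0 (intact) .. 1.0 (destroyed)
    repair_cost: float

# Placeholders: in a real system these would be trained models / pricing tables.
def segment_car_parts(image) -> list[Part]: ...                       # step 1
def is_damaged(image, part: Part) -> bool: ...                        # step 2
def estimate_severity(image, part: Part) -> float: ...                # step 3
def estimate_repair_cost(label: str, severity: float) -> float: ...   # step 4

def evaluate(image) -> list[PartDamage]:
    """Chain the four steps of Fig. 1 for one photo of a car."""
    report = []
    for part in segment_car_parts(image):
        if is_damaged(image, part):
            severity = estimate_severity(image, part)
            cost = estimate_repair_cost(part.label, severity)
            report.append(PartDamage(part.label, severity, cost))
    return report
```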

Here, we used image segmentation to automatically identify car parts. Image segmentation is similar to object detection: it detects where an object is located in an image, but adds recognition of the context of the object. The essential difference between the two techniques is that image segmentation works at the pixel level, whereas object detection works at the level of bounding boxes around objects. Image segmentation can be either semantic segmentation, in which identical objects in the image are considered the same object, or instance segmentation, in which identical objects are recognized as different instances. We used instance segmentation, since we wanted to differentiate instances of the same object: for example, some car parts come in a left and right pair, and instance segmentation enabled us to differentiate between the two members of the pair. A literature review showed that papers on car part segmentation are still limited and that no standards or criteria for this process have been established. Therefore, we tested a set of state-of-the-art deep-learning algorithms on a self-developed car part data set, containing images annotated with descriptions of the objects in them. Our contributions are:

  1. Development of an extensive car part data set—annotated images of car parts from multiple viewpoints—some taken from the Internet and some taken by our team.

  2. Comparative evaluation, in terms of mean average precision, of Mask R-CNN (the baseline technique) with a ResNet backbone against four state-of-the-art instance segmentation algorithms—the top four algorithms reported by paperswithcode.com [6].

  3. Robustness testing, in terms of mPC and rPC, of models from the four state-of-the-art instance segmentation algorithms and the baseline model against real weather elements and lighting conditions in the photos.

The rest of this paper is arranged as follows: the second section briefly describes related works; the third section briefly describes the tested algorithms; the fourth section discusses the experimental setup and the data sets; the fifth section reports and discusses results, and the final section concludes.

Related works

Edge computing has emerged to push computation capability closer to end-devices; it can improve response times and reduce the required network bandwidth. With a combination of front-end devices, edge nodes and cloud computing, many applications that use machine learning and computer vision techniques have been deployed successfully. Many researchers have designed their algorithms to operate fully on front-end devices to enhance system efficiency. Velichko [7] proposed a lightweight neural network algorithm called "LogNNet", which used filters based on logistic mapping for image recognition tasks and can be employed in low-memory devices. Howard et al. [8] and Sandler et al. [9] developed MobileNets and MobileNetV2, efficient lightweight Convolutional Neural Network (CNN) models designed to work on mobile devices. Tuli et al. [10] developed an object detection framework, EdgeLens, that integrated IoT, fog and cloud computing.

Applications of instance segmentation have included detection of individual humans in an image based on their posture. For example, Zhang et al. [11] presented an instance segmentation method for human detection based on a human pose skeleton. It recognized the context of a posture even when another human in the image was nearby or overlapping, a capability that differentiated it from other instance segmentation algorithms, e.g., Mask R-CNN [12]. Other instance segmentation applications include identification of biological objects in an image: Yi et al. [13] presented an instance segmentation method for biological objects that worked on heat map images.

Several new instance segmentation algorithms have been proposed recently. For instance, CenterMask [14] did not use a bounding box but a spatial attention-guided mask, which differed from algorithms that use a fully connected layer, e.g., Mask R-CNN. In addition, it used a fully convolutional one-stage object detector (FCOS) [15], rather than Faster R-CNN [16], for the object detection task, resulting in higher detection accuracy on both still images and video frames. In another example, Wang et al. [17] developed "SOLO", a one-step instance segmentation algorithm that did not use bounding boxes for object detection but instead divided an image into a grid of cells and detected the object of interest in each cell. It used a semantic category branch to determine the semantic category and an instance mask branch to determine the instance category. SOLO was later improved into SOLOv2 [18], in which mask learning was based on dynamic convolution: no weights or parameters in the model were fixed, so the feature maps could adapt to various kinds of input. The model had two mask branches: a mask kernel branch, for learning the convolution kernel, and a mask feature branch, for learning convolved features. Matrix Non-Maximum Suppression was used to reduce processing time, which was shorter than that of any other tested algorithm.

Recently, one-stage instance segmentation methods, which do not have separate branches for performing different functions—e.g., PolarMask [19], RDSNet [20] and YOLACT++ [21]—have gained more attention from researchers than two-stage methods. A two-stage method performs object detection first, then constructs a mask branch to predict each mask in a bounding box; examples are Mask R-CNN, PANet [22] and Mask Scoring R-CNN [23]. Chen et al. [24] proposed BlendMask, with an improved FCOS object detector, adding a blender module to an attention map. The blender module included both high- and low-resolution masks in every bounding box mask, enabling the model to predict masks more accurately and rapidly than Mask R-CNN or other two-stage algorithms.

Among studies on car part segmentation, Lu et al. [25] presented a semantic segmentation method for car parts, based on landmark assignment and the boundaries of each part. They used a graphical model to find relationships between car parts, then segmentation by weighted aggregation (SWA) [26] to pair nearby landmarks, then a Segment Appearance Consistency (SAC) technique to connect segments of nearby landmarks at every level of a hierarchical segmentation and to determine whether the same segment was represented at every hierarchical level. The outcome was a group of pixels that could classify various car parts. Nevertheless, in SAC and hierarchical segmentation, the meaning of a car part differed between levels; in other words, SAC, after only one round of SWA, was not able to segment all car parts in an image. Singh et al. [27] built a system to detect different car parts and localize their damage; however, the algorithms used in their system—Mask R-CNN, PANet and an ensemble model based on Mask R-CNN and PANet—did not perform well: the mAP was lower than 0.5 for all algorithms. Dhieb et al. [28] used Inception-ResNetV2 to classify damage severity level and to localize and detect part damage. Patil et al. [29] and Dwivedi et al. [30] used various CNN models to classify car part damage, but these works focused on only a small set of car parts.

A website, paperswithcode.com, ranked all instance segmentation methods and identified the state-of-the-art ones [6]. They were benchmarked on various data sets, e.g., PASCAL VOC [31] and the Common Objects in Context (COCO) Challenge [32]. Since we needed the best model for instance segmentation of car parts, we selected algorithms evaluated on the large COCO test-dev task data set, which has a large number of categories, using Mask R-CNN with ResNet as the baseline. The evaluated methods were the top four, as ranked by paperswithcode.com on 30/09/2019, that also used ResNet as the backbone: HTC [33], CBNet [34], PANet [22] and GCNet [35]. These algorithms are briefly described in the next section.

Methodology

The top-ranked algorithms from paperswithcode.com, as of 30/09/2019, are briefly described here.

Mask region-based convolutional neural network (Mask R-CNN)

The Mask R-CNN instance segmentation algorithm [12] was a development of Faster R-CNN [16]. Faster R-CNN was only able to detect where a target object was in an image and recognize it, whereas Mask R-CNN also performed instance segmentation. Mask R-CNN had two main parts: (1) a backbone that extracted features from an image with a Residual Neural Network (ResNet), a CNN 50–101 layers deep [36], in combination with a Feature Pyramid Network (FPN) [37], and (2) a head that constructed a bounding box around a Region of Interest (ROI) and predicted the type of object in the box. The additional step of Mask R-CNN over Faster R-CNN constructed a mask for each ROI. In Mask R-CNN, after the backbone extracted features from an image, these features were input into a Region Proposal Network (RPN), which constructed anchor boxes of various sizes containing an object of interest and passed them to an ROI extractor, which extracted the features of each ROI. Each ROI map was forwarded to fully connected layers consisting of two parallel components: the original components of Faster R-CNN for predicting bounding boxes and classifying objects of interest, and an additional component for predicting a mask in each bounding box. The flowchart of Mask R-CNN is illustrated in Fig. 2. Mask R-CNN was ranked number five by paperswithcode.com.

Fig. 2 Mask R-CNN
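As an illustration of running an off-the-shelf Mask R-CNN with a ResNet-50-FPN backbone, the sketch below uses torchvision (≥ 0.13) weights pretrained on COCO; our experiments used MMDetection instead, and the image path here is hypothetical.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Mask R-CNN with a ResNet-50-FPN backbone, pretrained on COCO.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("car.jpg").convert("RGB"))  # hypothetical file
with torch.no_grad():
    pred = model([image])[0]  # one dict per input image

# Keep confident detections; each has a box, a label and a soft mask.
keep = pred["scores"] > 0.5
boxes = pred["boxes"][keep]        # (N, 4) boxes in xyxy pixel coordinates
masks = pred["masks"][keep] > 0.5  # (N, 1, H, W) binarized instance masks
labels = pred["labels"][keep]      # COCO category ids
```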

Global context network (GCNet)

GCNet [35] had a structure similar to Mask R-CNN, as can be seen in Fig. 2, but the ResNet-FPN backbone was augmented with a global context (GC) block (Fig. 3). The Non-local Network (NLNet), on which part of the block was based, solved the long-range dependency issue of deep neural networks [38]; it worked in combination with a Squeeze-and-Excitation Network (SENet) to find the relationships between the channels of each feature [39]. GCNet was as effective as NLNet but computed faster, because it used fewer convolution and operation layers than NLNet. It was ranked number four by paperswithcode.com.

Fig. 3 Global context (GC) block. The feature map has size, \(C \times H \times W\)—channel number C, height H and width W. \(\otimes \) denotes matrix multiplication and \(\oplus \) denotes broadcast element-wise addition. r is the bottleneck ratio and C/r denotes the hidden representation dimension of the bottleneck
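The following is a minimal PyTorch sketch of the GC block in Fig. 3, written for illustration rather than taken from the GCNet code base; the default bottleneck ratio r = 1/16 is an assumption.

```python
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    """Sketch of a GC block: context modeling + bottleneck transform (Fig. 3)."""

    def __init__(self, channels: int, ratio: float = 1 / 16):
        super().__init__()
        hidden = max(1, int(channels * ratio))  # C/r hidden dimension
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)  # attention weights
        self.transform = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.size()
        # Context modeling: softmax attention over all H*W positions,
        # then matrix multiplication with the flattened feature map.
        weights = self.attn(x).view(b, 1, h * w).softmax(dim=-1)  # B x 1 x HW
        context = torch.bmm(x.view(b, c, h * w),
                            weights.transpose(1, 2))              # B x C x 1
        context = context.view(b, c, 1, 1)
        # Bottleneck transform + broadcast element-wise addition.
        return x + self.transform(context)
```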

Path aggregation network (PANet)

PANet was developed by Liu et al. [22]. It had a structure similar to Mask R-CNN, as shown in Fig. 4, but the RPN and ROI extractor were replaced by bottom-up path augmentation and adaptive feature pooling components. The bottom-up path augmentation component took the input from the previous stage and processed it together with the output of each FPN layer to generate feature maps, which better mixed high- and low-level features. Then, the adaptive feature pooling component processed the feature maps from every layer and concatenated its outputs, which were sent to the head component, consisting of several fully connected layers, to detect objects, construct masks and bounding boxes and classify the detected objects. Because of these processes, PANet was highly accurate: it was able to take advantage of all levels of each feature map, from low- to high-level features. PANet was ranked number three by paperswithcode.com.

Fig. 4 PANet: A FPN backbone, B bottom-up path augmentation, C adaptive feature pooling, D box branch, E fully connected fusion

Cascade mask R-CNN with composite backbone network (CBNet)

This method combined Cascade Mask R-CNN [40] and the Composite Backbone Network [34]. First, CBNet improved the feature extraction step, using a number of connected backbones, called assistant backbones. Each connected backbone extracted some features and sent a feature map to the next backbone, which extracted further features and sent a new feature map onward, and so on. The last backbone, called the 'lead backbone', generated the final feature map, which was consecutively concatenated with the features extracted by all previous backbones in the chain. Because of this repeated extraction, low- and high-level features were mixed more effectively than in Mask R-CNN. Second, Cascade Mask R-CNN, whose head was modified from that of Mask R-CNN, improved prediction accuracy: the bounding box head of a previous branch was forwarded to the ROI extractor of the next branch, as illustrated in Fig. 5. This method was ranked number two by paperswithcode.com.

Fig. 5 Cascade Mask R-CNN with CBNet. The composite backbone—a combination of assistant backbones and a lead backbone—helped improve prediction accuracy

Hybrid task cascade for instance segmentation (HTC)

This algorithm was developed by Chen et al. [33] to improve the efficiency of the instance segmentation task. In this algorithm, the bounding box head, mask head and ROI extractor were interleaved in a cascade, as illustrated in Fig. 6, so the bounding box prediction and mask prediction tasks proceeded in parallel instead of independently. A multi-stage mask branch technique was introduced, which took the mask from the previous branch into account when generating the mask in the current branch, to improve information flow. Lastly, a semantic mask branch was connected to the head of every mask to help the model better understand the context of the information in every mask. All of these features improved the information flow in every task. This method was ranked first by paperswithcode.com.

Fig. 6 Hybrid task cascade for instance segmentation

Experimental framework

Data set

The data set contained 500 images of sedans, pickups and sport utility vehicles (SUVs), collected from the Internet and taken in public parking spaces. The vehicles were photographed from multiple views—front, back and angled views. Car identification numbers were blurred to hide individual vehicle details. Each image was annotated with instance masks and bounding boxes for 18 categories: back_bumper, back_glass, back_left_door, back_left_light, back_right_door, back_right_light, front_bumper, front_glass, front_left_door, front_left_light, front_right_door, front_right_light, hood, left_mirror, right_mirror, trunk, tailgate (of trucks and SUVs) and wheel (wheel and tire). The number of instances per category is shown in Fig. 7 and examples of the images in the data set are in Fig. 8. The DSMLR Car Part data set contains images and annotations in COCO Challenge format and is available for download at https://github.com/dsmlr/Car-Parts-Segmentation.
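Since the annotations are in COCO Challenge format, they can be inspected with pycocotools; a minimal sketch is below, assuming the repository's annotation file sits at the path shown (check the repository layout, as it may differ).

```python
from pycocotools.coco import COCO

# Assumed path after cloning the repository; verify against the repo layout.
coco = COCO("Car-Parts-Segmentation/trainingset/annotations.json")

cats = coco.loadCats(coco.getCatIds())
print([c["name"] for c in cats])       # the car part category names

img_id = coco.getImgIds()[0]
for ann in coco.loadAnns(coco.getAnnIds(imgIds=[img_id])):
    mask = coco.annToMask(ann)         # (H, W) binary instance mask
    bbox = ann["bbox"]                 # [x, y, width, height]
```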

Fig. 7 Number of annotated instances per category for the DSMLR Car Part data set

Fig. 8 Samples of paired images and instance masks from the DSMLR Car Part data set: a sedan, b pickup and c sport utility vehicle (SUV)

Experimental procedures and settings

We evaluated five algorithms—Mask R-CNN [12], HTC [33], CBNet [34], PANet [22] and GCNet [35], each with ResNet-50 and ResNet-101 backbones—in terms of correctness and robustness on the car part data set. The algorithms were implemented with the MMDetection toolbox [41]. The experimental steps were as follows. First, we resized all input images to \(1024 \times 1024\) pixels, maintaining the aspect ratio by zero-padding. Next, we randomly partitioned the car part data set into a training set (80% of the data) and a test set (20%). Then, since it was necessary to determine the best number of training epochs for every evaluated algorithm, we ran a five-fold cross-validation, training for 200 epochs on each fold; the best number of epochs for each algorithm was the number that gave the lowest average five-fold validation loss. Validation loss was computed from five losses: (1) classification loss, (2) bounding box loss, (3) segmentation (mask) loss, (4) RPN classification loss and (5) RPN bounding box loss. Losses (4) and (5) were calculated by the cross-entropy loss embedded in the RPN. Next, we used Stochastic Gradient Descent (SGD) to find the optimal parameters, setting the learning rate to 0.02 and the weight decay to 0.0001. The optimal models were trained on the training set for the optimal number of epochs. The experiment was run five times with different random splits.
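For reference, a fragment of an MMDetection-style (v2.x) configuration matching these settings might look as follows; the base config name and the momentum value are illustrative assumptions, not our released configuration, and the two pipeline steps would sit inside the full train_pipeline list.

```python
# Hypothetical base config for one of the evaluated algorithms.
_base_ = './mask_rcnn_r50_fpn.py'

# SGD with lr = 0.02 and weight decay = 0.0001 as in the text;
# momentum = 0.9 is an assumed default.
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
runner = dict(type='EpochBasedRunner', max_epochs=200)  # CV upper bound

# Resize to 1024x1024 keeping the aspect ratio, then zero-pad.
resize_and_pad = [
    dict(type='Resize', img_scale=(1024, 1024), keep_ratio=True),
    dict(type='Pad', size=(1024, 1024)),
]
```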

Furthermore, we evaluated the robustness of each algorithm on the semantic segmentation and object detection tasks with corrupted data, simulating four real weather and lighting conditions, i.e., snow, frost, fog and ambient light, at five severity levels. The corrupted examples were generated by the methods described by Hendrycks and Dietterich [42] (visualized in Fig. 10).
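One publicly available implementation of these corruptions is the imagecorruptions package distributed with the robustness benchmark code; a minimal sketch, with a hypothetical input image, is below.

```python
import numpy as np
from PIL import Image
from imagecorruptions import corrupt  # pip install imagecorruptions

image = np.array(Image.open("car.jpg").convert("RGB"))  # hypothetical file

# The four corruption types used here, each at severity levels 1..5.
for name in ["snow", "frost", "fog", "brightness"]:
    for severity in range(1, 6):
        corrupted = corrupt(image, corruption_name=name, severity=severity)
        Image.fromarray(corrupted).save(f"car_{name}_{severity}.png")
```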

Performance evaluation

Correctness

Each algorithm was evaluated by average precision (AP), based on the COCO Challenge, an established evaluation method for object detection tasks. AP was calculated from the Intersection over Union (IoU) of each object of interest. IoU was calculated by

$$\begin{aligned} \mathrm{IoU} = \frac{\mathrm{Area~of~Overlap}}{\mathrm{Area~of~Union}}. \end{aligned}$$
(1)

A model was considered to have successfully detected an object if the IoU was equal to or higher than an assigned threshold: AP\(_{50}\) and AP\(_{75}\) denote the AP at IoU thresholds of 0.50 and 0.75, respectively. The mean average precision (mAP), as defined by the COCO Challenge, is the AP averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05, computed as:

$$\begin{aligned} \mathrm{mAP} = \frac{1}{10}\sum _{t \in \{0.50,\, 0.55,\, \ldots ,\, 0.95\}} \mathrm{AP}_{t}. \end{aligned}$$
(2)

Since car parts come in different sizes, we also evaluated AP across car part scales: AP\(_\mathrm{S}\) for small parts, with an area smaller than \(32^2\) pixels; AP\(_\mathrm{M}\) for medium parts, with an area between \(32^2\) and \(96^2\) pixels; and AP\(_\mathrm{L}\) for large parts, with an area greater than \(96^2\) pixels. Note that AP in the COCO Challenge is reported as a percentage.
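A minimal sketch of Eq. (1) and the COCO-style averaging in Eq. (2) is shown below; `average_precision` is a placeholder for the full matching and precision-recall computation, not a real library call.

```python
import numpy as np

def box_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two boxes in [x1, y1, x2, y2] format (Eq. 1)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

print(box_iou(np.array([0, 0, 100, 100.0]),
              np.array([50, 50, 150, 150.0])))  # 2500 / 17500 ~ 0.143

def average_precision(iou_threshold: float) -> float:
    """Placeholder for the full COCO matching / precision-recall computation."""
    raise NotImplementedError

# COCO mAP (Eq. 2): average AP over IoU thresholds 0.50, 0.55, ..., 0.95.
# m_ap = np.mean([average_precision(t) for t in np.arange(0.50, 1.00, 0.05)])
```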

Robustness

Robustness was measured using two metrics: mean performance under corruption (mPC) and relative performance under corruption (rPC) [43].

mPC is calculated by

$$\begin{aligned} \mathrm{mPC} = \frac{1}{N_c}\sum ^{N_c}_{c=1}\frac{1}{N_s}\sum ^{N_s}_{s=1}P_{c,s}, \end{aligned}$$
(3)

where \(N_c = 4\) is the number of corruption types and \(N_s = 5\) the number of severity levels (as set in this work), and \(P_{c,s}\) is the performance measured on test data corrupted with corruption type c at severity level s. Although several metrics could be used for P, in this work P was calculated using mAP. A higher mPC indicates a more robust algorithm.

rPC measures the relative degradation of performance on corrupted data compared to the original data. It was calculated by

$$\begin{aligned} \mathrm{rPC} = \frac{\mathrm{mPC}}{P_\mathrm{original}}, \end{aligned}$$
(4)

where \(P_\mathrm{original}\) is the performance of the algorithm on the original data, i.e., the mAP on the original data, and \(\mathrm{rPC} \in [0,1]\). rPC = 1 represents 'perfect' robustness, while 0 represents negligible robustness.
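The two robustness metrics reduce to simple averaging over the corruption grid; a minimal sketch with hypothetical mAP scores is below.

```python
import numpy as np

# Hypothetical mAP scores: rows = 4 corruption types, columns = 5 severity
# levels, plus the mAP obtained on clean (uncorrupted) test data.
p = np.array([
    [30.1, 27.5, 22.0, 18.3, 14.9],   # snow
    [29.4, 25.8, 21.1, 17.6, 13.2],   # frost
    [45.0, 43.2, 41.0, 38.5, 35.1],   # fog
    [50.2, 48.9, 47.1, 45.0, 42.3],   # brightness
])
p_original = 54.3

mpc = p.mean()           # Eq. (3): mean over corruptions and severities
rpc = mpc / p_original   # Eq. (4): fraction of clean performance retained
print(f"mPC = {mpc:.1f}, rPC = {rpc:.3f}")
```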

Experimental results and discussion

In this section, several comparisons were made and discussed:

  1. We compared overall algorithm performance on two tasks—object detection and semantic segmentation.

  2. We discussed robustness under potential real weather elements and lighting conditions.

  3. We further discussed performance and robustness when left- and right-side parts were grouped under one label.

Overall performance of object detection and semantic segmentation tasks

The performance of all the algorithms is shown in Table 1, which includes mAP and AP at different thresholds. It can be seen that HTC with the ResNet-101 encoder achieved the best mAP, at 54.3, in object detection. In addition, it worked best on small and medium car parts, with AP\(_\mathrm{S}\) and AP\(_\mathrm{M}\) at 35.6 and 52.0, respectively. It was followed by HTC with the ResNet-50 encoder, with mAP at 54.1. On the stricter AP\(_{75}\) metric, HTC with ResNet-101 came in second at 62.4, while HTC with ResNet-50 was the best contender at 63.6; HTC with ResNet-50 also performed best on large car parts, with AP\(_\mathrm{L}\) at 61.1. Surprisingly, Mask R-CNN with ResNet-50—the baseline—scored highest on AP\(_{50}\), at 77.0, but it did not perform well on the stricter metric. This was because Mask R-CNN tried to detect, classify and segment the car parts with low-level features, whereas the other algorithms used global or high-level features for the segmentation task. On the other hand, Mask R-CNN with the ResNet-101 encoder achieved the highest mAP, at 55.4, in the semantic segmentation task, as well as the highest score on the strictest metric, AP\(_{75}\), at 65.2, on a par with HTC with the ResNet-50 encoder. Here, HTC with the ResNet-50 encoder secured the second best mAP, at 55.2, a small difference from Mask R-CNN with the ResNet-101 encoder. It also worked best on large car parts, with AP\(_\mathrm{L}\) at 63.6. In addition, PANet performed best on small car parts, yielding AP\(_\mathrm{S}\) at 38.5.

In terms of performance related to car part size, the models performed better on large parts, followed by medium and small parts. The average AP\(_\mathrm{L}\), AP\(_\mathrm{M}\) and AP\(_\mathrm{S}\) across all the models in the object detection task were 55.3, 46.9 and 32.1, respectively; in semantic segmentation, they were 59.2, 48.7 and 33.2, respectively. Larger parts led to better performance. Figure 9 shows samples of object detection and semantic segmentation by the models with ResNet-50 and ResNet-101 encoders.

Table 1 Overall model performance on object detection and semantic segmentation tasks
Fig. 9 Sample of object detection and semantic segmentation results: a ResNet-50 encoder and b ResNet-101 encoder

To determine which combination of model and encoder achieved the best overall performance, we used Kendall's coefficient of concordance (W) to measure agreement between evaluation metrics. We ranked the 10 candidate models (5 models, each with 2 encoders) on 12 performance metrics (2 tasks with 6 metrics each) and reported the sum of ranks of each candidate model in Table 1, which yields the overall ranking of the candidate models. Next, we calculated W by

$$\begin{aligned} W = \frac{12\left( \sum _{i=1}^{k}R_{i}^2\right) -3k^{2}n(n+1)^{2}}{k^{2}n(n^{2}-1)-k\sum _{j=1}^{k}(T_j)}, \end{aligned}$$
(5)

where n is the number of candidate models, \(R_i\) is the sum of ranks for the i-th candidate, k is the number of performance metrics and \(T_j\) is a correction factor for tied ranks in the j-th metric (see [44] for details). Here, \(n = 10\) and \(k = 12\), giving \(W = 0.5079\), which is transformed into a \(\chi ^2\) value for significance testing against the null hypothesis of no agreement,

$$\begin{aligned} \chi ^2 = k(n-1)W. \end{aligned}$$
(6)

Thus, \(\chi ^2 = 54.8350\), which leads to \(p<0.01\) for 9 degrees of freedom, so we rejected the null hypothesis. Therefore, we confirmed that HTC with ResNet-50 and HTC with ResNet-101 ranked first and second, respectively.
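A minimal sketch of Eqs. (5) and (6), assuming a \(k \times n\) array of metric scores in which higher is better, is shown below.

```python
import numpy as np
from scipy.stats import rankdata, chi2

def kendalls_w(scores: np.ndarray) -> tuple[float, float]:
    """Kendall's W with tie correction (Eq. 5) and its chi-square p value (Eq. 6).

    scores: (k, n) array -- k metrics (judges) scoring n candidate models.
    """
    k, n = scores.shape
    ranks = np.vstack([rankdata(-row) for row in scores])  # rank 1 = best
    r = ranks.sum(axis=0)                                  # R_i per model

    # Tie correction: T_j = sum(t^3 - t) over tied groups within metric j.
    t = np.array([np.sum(np.unique(row, return_counts=True)[1] ** 3
                         - np.unique(row, return_counts=True)[1])
                  for row in ranks])

    w = (12 * np.sum(r ** 2) - 3 * k ** 2 * n * (n + 1) ** 2) / (
        k ** 2 * n * (n ** 2 - 1) - k * np.sum(t))
    p = chi2.sf(k * (n - 1) * w, df=n - 1)   # Eq. (6) significance test
    return w, p

# Example with hypothetical scores: 12 metrics x 10 models.
rng = np.random.default_rng(0)
w, p = kendalls_w(rng.random((12, 10)))
```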

Robustness

In this section, the models from the previous subsection were evaluated further. They were tested on modified test data, covering the set of real weather elements and lighting conditions at different severity levels shown in Fig. 10. The overall robustness test results, broken down by type of corruption for the object detection and semantic segmentation tasks, are shown in Table 2. GCNet with the ResNet-50 encoder was the best contender: it achieved the highest robustness, based on rPC, in object detection, at 64.8%, and in semantic segmentation, at 64.4%. It yielded the best mPC under all weather conditions for both object detection and semantic segmentation, except brightness changes in the object detection task. The worst was clearly CBNet with the ResNet-50 encoder, as it retained only 48.1% and 47.3% of its performance in the object detection and semantic segmentation tasks, respectively. HTC with ResNet-101 achieved the highest mAP in the object detection task on normal-condition images and, although it retained only 53.2% of its performance when the images were corrupted, its mPC was still ranked second, at 28.9, after GCNet with ResNet-50. Moreover, HTC with ResNet-101 obtained the highest mPC under brightness changes, at 42.4. The same applied to the semantic segmentation task: HTC with ResNet-101 ranked second in overall performance, with mPC = 29.3, similar to Mask R-CNN with ResNet-101. We also found that snow and frost degraded performance most for all algorithms, reducing it to less than 50% of the uncorrupted performance in both tasks. However, the algorithms tolerated changes in brightness and fog well: they retained 79.3% (light changes) and 63.0% (fog) of their performance in object detection, and 78.4% and 63.0% in semantic segmentation.

Fig. 10 Images in real environments and varied lighting conditions

Table 2 Performance of each method for object detection and semantic segmentation, including a robustness test with challenging real environments

Merging left- and right-side car part as one label

After evaluating overall performance and robustness, we ran an error analysis to find ways to improve the task. We found that the algorithms often confused left- and right-side parts, e.g., predicting a left-side part as the corresponding right-side part or vice versa. Therefore, we created a new version of the data set, in which a single label was assigned to the left and right sides of a part. Then we fine-tuned each model, pre-trained on the original labels, for 100 epochs—other settings remained the same.
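A minimal sketch of this relabeling on a COCO-format annotation file is shown below; the merge map and file paths are our illustrative assumptions, not the exact mapping released with the data set.

```python
import json

# Illustrative merge map over the category names listed above.
MERGE = {"front_left_door": "front_door", "front_right_door": "front_door",
         "back_left_door": "back_door", "back_right_door": "back_door",
         "front_left_light": "front_light", "front_right_light": "front_light",
         "back_left_light": "back_light", "back_right_light": "back_light",
         "left_mirror": "mirror", "right_mirror": "mirror"}

with open("annotations.json") as f:          # hypothetical input path
    coco = json.load(f)

merged = sorted({MERGE.get(c["name"], c["name"]) for c in coco["categories"]})
new_id = {name: i + 1 for i, name in enumerate(merged)}
old_to_new = {c["id"]: new_id[MERGE.get(c["name"], c["name"])]
              for c in coco["categories"]}

coco["categories"] = [{"id": i, "name": n} for n, i in new_id.items()]
for ann in coco["annotations"]:              # remap every annotation
    ann["category_id"] = old_to_new[ann["category_id"]]

with open("annotations_merged.json", "w") as f:
    json.dump(coco, f)
```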

Table 3 shows the overall performance on both object detection and semantic segmentation with the left- and right-side part labels merged. All performances were higher than when left- and right-side parts were labeled separately (Table 1): mAP increased by 5.76% for object detection and 5.27% for semantic segmentation, averaged over all models. The table shows that HTC with the ResNet-101 encoder yielded the highest mAP, at 59.4, followed by HTC with the ResNet-50 encoder, with mAP at 59.1, in object detection. HTC with ResNet-101 performed best on large car parts, with the highest AP\(_\mathrm{L}\) at 65.4, while HTC with the ResNet-50 encoder achieved the best performance on small and medium car parts, with AP\(_\mathrm{S}\) at 34.5 and AP\(_\mathrm{M}\) at 53.5. In addition, HTC with ResNet-50 was the best contender on the strictest metric, with AP\(_{75}\) at 68.6. Although Mask R-CNN with ResNet-50 received the highest AP\(_{50}\) score, it was still worse than HTC with ResNet-50 or ResNet-101 on the strictest metric. In semantic segmentation, HTC with ResNet-101 also ranked first, with mAP at 60.1, followed by Mask R-CNN with ResNet-50 or ResNet-101. Mask R-CNN apparently performed well in semantic segmentation, with the highest AP\(_{50}\) at 81.9, AP\(_{75}\) at 71.3 and AP\(_\mathrm{M}\) at 55.2. Again, we used Kendall's coefficient of concordance (W) to evaluate agreement between the metrics. The overall performance ranking changed: Mask R-CNN with ResNet-50 now ranked first, followed by HTC with ResNet-50 and ResNet-101. The rankings in the table are significant at \(p<0.01\) for 9 degrees of freedom (W = 0.4905 and \(\chi ^2\) = 52.9691).

Table 3 Overall performance of the selected models on object detection and semantic segmentation tasks in the merged car part sides scenario

We also evaluated the robustness of the algorithms in the merged-sides scenario on both tasks, as shown in Table 4. The overall picture was much the same as when the left and right sides were considered separately: GCNet was still the most robust algorithm, while the worst was CBNet. Moreover, snow and frost were still the corruption conditions that impacted the algorithms most.

Table 4 Performance of each method for object detection and semantic segmentation on merged part sides, including a robustness test with different real environments

Conclusion

Computer network technology and end-devices are becoming more powerful, and the car insurance business is growing rapidly, so an automated system for damage evaluation has become necessary. In this work, we described an automatic, image-based car part identification system built with deep learning techniques. We compared the performance of several state-of-the-art deep learning algorithms on a part segmentation task, using a car part data set created for this work, which is now publicly available. Our experiments showed that HTC was the best model, followed by Mask R-CNN and GCNet, in both object detection and semantic segmentation tasks in normal weather conditions. We also evaluated algorithm robustness in real environmental and lighting conditions, simulating conditions that would occur in the field when a photo is taken with a smartphone. GCNet was the most robust model, as it achieved the best overall performance under corrupted conditions, except varying brightness. Edge computing has now become practical enough to overcome the limitations of end-devices, enabling such models to operate on the end-device and providing a solution for real-time image analysis.

In future work, we will focus on developing a lighter-weight model for semantic segmentation to ease the load on the end-device without sacrificing accuracy or robustness. We also aim to extend the work to detect, localize and estimate the severity of damage to different parts.