1 Introduction

Detecting objects in a scene has proved to be a very difficult task. In recent years it has been investigated for a variety of applications, such as face detection, self-driving cars, medical disease detection, video surveillance, and natural disaster protection. Convolutional neural networks (CNNs) are at the heart of state-of-the-art object detection methods, where they are used to extract features. Several CNNs are available, for instance AlexNet, VGGNet, and ResNet. These networks are mainly used for the object classification task and have been evaluated on widely used benchmarks and datasets such as ImageNet (Fig. 1). In image classification (or image recognition), the classifier assigns a single category to each image and gives the probability of matching that class. In object detection, by contrast, the model must recognize several objects in a single image and provide the coordinates that locate each of them. Object detection is therefore generally more difficult than image classification.

Fig. 1. Examples of images from the ImageNet 2012 dataset.

Traditional object detection models tend to use methods such as Haar-like features [1], Histograms of Oriented Gradients (HOG) [2], and the Scale-Invariant Feature Transform (SIFT) [3] to extract features from the image. These approaches rely on features designed manually according to our own understanding of the problem. More recently, it has proved more effective to let the machine learn the features itself, and this is where convolutional neural networks took over, achieving impressive successes [4, 5]. The paper is structured as follows. First, we review the leading state-of-the-art convolutional neural network architectures used in object detection. We then introduce the image datasets used to compare the networks, along with the experiments. Finally, we report the results of each architecture when used with state-of-the-art object detection models.

2 Convolutional Neural Network Backbones

The CNN architectures covered in this article were not selected at random, but according to their popularity and performance in different state-of-the-art object detection models.

2.1 AlexNet

In 2012, Krizhevsky et al. [4] developed a convolutional neural network composed of 8 layers, of which 5 are convolutional and 3 are fully connected. The network is called AlexNet. In comparison to LeNet-5 [6], AlexNet has more layers and contains around 60 million parameters. AlexNet popularized Rectified Linear Units (ReLUs) as activations in deep CNNs, replacing the sigmoid and tanh activations, to add non-linearity. AlexNet is used in object detection models such as R-CNN [7] and HyperNet [8].
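As an illustration of why ReLUs are attractive as activations, the following minimal Python sketch compares the gradient of ReLU with that of the sigmoid. It uses only the standard library, and the function names are ours rather than taken from [4]. For inputs of large magnitude the sigmoid gradient is nearly zero, whereas the ReLU gradient stays at 1 for any positive input.

```python
import math


def relu(x: float) -> float:
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU (taken as 0 at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0


def sigmoid(x: float) -> float:
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# Compare the two gradients for a few sample inputs.
for x in (-10.0, 0.5, 10.0):
    print(f"x={x:+.1f}  relu'={relu_grad(x):.1f}  sigmoid'={sigmoid_grad(x):.6f}")
```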

2.2 VGG-16

In 2014, a network called VGG-16 [9] was released, composed of 13 convolutional and 3 fully connected layers with ReLU activations. VGG-16 has more layers than AlexNet and uses smaller 3 × 3 convolutional filters, together with 2 × 2 max-pooling windows. It has about 138 million parameters. A deeper version, VGG-19, is also available. VGG-16 is one of the most widely used architectures in object detection and has achieved strong performance; it is used, for instance, in Fast R-CNN [10], Faster R-CNN [11], HyperNet [8], RON384 [12], SSD [13] and RefineDet [14].
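The figure of 138 million parameters can be checked by hand. The sketch below assumes the standard VGG-16 configuration: 3 × 3 convolutions with biases, a 224 × 224 input reduced to a 7 × 7 × 512 feature map before the first fully connected layer, and 1000 output classes. The helper names are illustrative.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights plus biases of a k x k convolution."""
    return c_out * (c_in * k * k + 1)


def fc_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_out * (n_in + 1)


# (in_channels, out_channels) of the 13 convolutional layers (assumed standard layout)
VGG16_CONVS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]
# (inputs, outputs) of the 3 fully connected layers
VGG16_FCS = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

conv_total = sum(conv_params(i, o, 3) for i, o in VGG16_CONVS)
fc_total = sum(fc_params(i, o) for i, o in VGG16_FCS)
print(f"convolutional: {conv_total:,}")
print(f"fully connected: {fc_total:,}")
print(f"total: {conv_total + fc_total:,}")
```

Under these assumptions the total is 138,357,544 parameters, of which roughly 89% sit in the three fully connected layers.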

2.3 GoogLeNet

Also called Inception V1, GoogLeNet [15] is a small network developed by Szegedy et al. in 2014. Its design differs from that of VGGNet and AlexNet: it introduces the inception block, which embeds multi-scale convolutional transformations by combining filters of sizes 1 × 1, 3 × 3 and 5 × 5. It employs 1 × 1 convolutions inside the network to reduce dimensionality, and it uses global average pooling instead of fully connected layers. The network is made of 22 layers with about 5 million parameters. GoogLeNet is mainly used in the YOLO [16] object detection model.
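To see why the 1 × 1 convolution matters, the sketch below counts multiply-accumulate operations (MACs) for a 5 × 5 convolution with and without a preceding 1 × 1 reduction. The feature-map size and channel counts (28 × 28, 192 input channels reduced to 16, then 32 output channels) are illustrative assumptions, not values taken from this paper, and stride 1 with size-preserving padding is assumed.

```python
def conv_macs(h: int, w: int, c_in: int, c_out: int, k: int) -> int:
    """MACs of a k x k convolution producing an h x w x c_out output."""
    return h * w * c_out * k * k * c_in


H = W = 28                            # assumed feature-map size
C_IN, C_REDUCED, C_OUT = 192, 16, 32  # assumed channel counts

# 5 x 5 convolution applied directly to all 192 input channels
direct = conv_macs(H, W, C_IN, C_OUT, 5)

# 1 x 1 reduction to 16 channels, followed by the 5 x 5 convolution
reduced = conv_macs(H, W, C_IN, C_REDUCED, 1) + conv_macs(H, W, C_REDUCED, C_OUT, 5)

print(f"direct 5x5:   {direct:,} MACs")
print(f"1x1 then 5x5: {reduced:,} MACs")
print(f"ratio:        {direct / reduced:.1f}")
```

With these numbers the cost falls from about 120 million to about 12.4 million MACs, close to a tenfold reduction.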

2.4 ResNets

Convolutional neural networks have become deeper and deeper, but as layers are added the accuracy saturates and then degrades rapidly. To solve this issue, He et al. developed ResNets [5] in 2015, which are based on residual (skip) connections. They also use Batch Normalization [17]. ResNets mainly consist of convolutional and identity blocks. There are many variants, for instance ResNet-34, ResNet-50 with about 25.6 million parameters, ResNet-101 with about 44 million parameters, and the deeper ResNet-152 with 152 layers. ResNet-50 and ResNet-101 are widely used in object detection models: ResNet-50 is used in frameworks such as BlitzNet [18] and RetinaNet [19], while ResNet-101 is used in Faster R-CNN [5], R-FCN [20], and CoupleNet [21], among others.
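The idea of a skip connection can be sketched in a few lines of plain Python. The names residual_block, zero_branch and half_branch are hypothetical and used only for illustration; a real ResNet block applies convolutions, Batch Normalization and ReLU inside the residual branch.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, branch: Callable[[Vector], Vector]) -> Vector:
    """Return branch(x) + x: the branch output plus the skip connection."""
    fx = branch(x)
    if len(fx) != len(x):
        raise ValueError("identity shortcut requires matching shapes")
    return [f + xi for f, xi in zip(fx, x)]


def zero_branch(x: Vector) -> Vector:
    """A toy branch that outputs all zeros."""
    return [0.0] * len(x)


def half_branch(x: Vector) -> Vector:
    """A toy branch that scales its input by 0.5."""
    return [0.5 * xi for xi in x]


x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_branch))  # the block reduces to the identity mapping
print(residual_block(x, half_branch))
```

When the branch outputs zero, the block reduces to the identity mapping, so an added block can in principle leave the representation unchanged instead of degrading it.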

2.5 Inception-ResNet-V2

In 2016, Szegedy et al. published Inception-ResNet-V2 [22], a CNN that combines the Inception and ResNet architectures in a hybrid approach: residual connections replace the filter concatenation stage of the Inception blocks. Inception-ResNet-V2 is 164 layers deep and has about 55 million parameters. The Inception-ResNet models reach higher accuracy in fewer training epochs. Inception-ResNet-V2 is used in the Faster R-CNN G-RMI [23] and Faster R-CNN with TDM [24] object detection models.

2.6 DarkNet-19

DarkNet-19 [27] was developed to be both small and efficient. It builds on many previous ideas, such as the Darknet reference network, Network In Network [25], Inception [15, 26] and Batch Normalization [17]. It uses convolutional layers instead of fully connected layers and is composed of 19 convolutional and 5 max-pooling layers. It uses mostly 3 × 3 convolutional kernels, with 1 × 1 kernels between them to reduce the number of parameters. DarkNet-19 is used in YOLOv2 [27].

3 Data, Experiments and Results

3.1 Data

To assess the CNNs mentioned above, we used several common datasets in the fields of classification and object detection. First, we used the ImageNet database [28], one of the largest databases available today, which contains more than 14 million images from different categories. We used ImageNet to calculate Top-1 and Top-5 accuracy rates. We then used Pascal VOC [29] (2007 and 2012) and the Common Objects in Context (COCO) [30] dataset for object detection.
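For reference, Top-1 and Top-5 accuracy can be computed as in the following sketch. The function name top_k_accuracy and the toy scores are assumptions made for illustration. A prediction counts as correct when the true label is among the k highest-scoring classes.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true label is among the k best-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label in top_k:
            hits += 1
    return hits / len(labels)


# Three samples, four classes (toy values).
scores = [
    [0.10, 0.60, 0.20, 0.10],  # best class 1
    [0.50, 0.10, 0.30, 0.10],  # best class 0, runner-up class 2
    [0.30, 0.15, 0.05, 0.50],  # best class 3, runner-up class 0
]
labels = [1, 2, 1]

print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=2))
```

In this toy example only the first sample is correct at k = 1, and the second becomes correct at k = 2; on ImageNet the same computation is run over 1000 classes with k = 1 and k = 5.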

3.2 Experiments and Results

In this section, we evaluate the CNNs mentioned in this paper, along with the object detection models based on them, on the different datasets and benchmarks. In Table 1, with the exception of DarkNet-19, all experiments are carried out with PyTorch, an open-source machine learning framework, on an Nvidia T4 GPU. The input resolution is 224 × 224 for all networks except Inception-ResNet-V2, for which it is 299 × 299. To evaluate the computational complexity of each network, we count Multiply-And-Accumulate (MAC) operations; one MAC can be considered as two separate floating-point operations (FLOPs) [31]. In Table 2 and Table 3, the detectors are trained on Pascal VOC07 trainval and Pascal VOC12 trainval. The +S suffix means that the model is also trained for segmentation with extra annotations. For Table 4, the models are trained on the MS COCO trainval35k set.
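The relation between MACs and FLOPs can be made concrete with a short sketch. The helper names and the two example layers are assumptions chosen for illustration; biases, activations and pooling are ignored.

```python
def conv_layer_macs(h_out: int, w_out: int, c_in: int, c_out: int, k: int) -> int:
    """MACs of a k x k convolution: one MAC per weight per output position."""
    return h_out * w_out * c_out * c_in * k * k


def fc_layer_macs(n_in: int, n_out: int) -> int:
    """MACs of a fully connected layer: one MAC per weight."""
    return n_in * n_out


def macs_to_flops(macs: int) -> int:
    """Each MAC is counted as one multiplication plus one addition."""
    return 2 * macs


# A 3 x 3 convolution with 64 input and 64 output channels on a 224 x 224 map.
conv = conv_layer_macs(224, 224, 64, 64, 3)
# A fully connected layer mapping 4096 features to 1000 classes.
fc = fc_layer_macs(4096, 1000)

print(f"conv: {conv:,} MACs = {macs_to_flops(conv):,} FLOPs")
print(f"fc:   {fc:,} MACs = {macs_to_flops(fc):,} FLOPs")
```

Summing such per-layer counts over a whole network gives a complexity figure of the kind reported per network; the example also shows that a single early convolution on a large feature map can cost far more than a large fully connected layer.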

Table 1. Network performance on ImageNet (1-crop accuracy rates).
Table 2. Comparative results on Pascal VOC 2007 test set (%).
Table 3. Comparative results on Pascal VOC 2012 test set (%).
Table 4. MS COCO test-dev 2015 detection results (%).

4 Discussion

In Table 1, we note that Inception-ResNet-V2 achieved a Top-1 accuracy of 80.3% and a Top-5 accuracy of 95.1%, higher than all other networks. Both ResNet-50 and ResNet-101 perform almost as well as Inception-ResNet-V2. AlexNet reached 56.5% and 79.09% in Top-1 and Top-5 respectively, while the remaining networks achieved results similar to one another. Table 1 also shows that architectures based on the residual concept achieve better accuracy with far fewer parameters than other architectures. For example, ResNet-50 has about 25.6 million parameters, ResNet-101 around 44 million and Inception-ResNet-V2 almost 55 million, whereas VGG-16 has more than 138 million. Although AlexNet and Inception-ResNet-V2 have a similar number of parameters, both the accuracy and the number of MACs are much lower for AlexNet.

Table 2 shows that, in object detection, the best-performing networks are VGG and ResNets. CoupleNet with ResNet-101, BlitzNet512 with ResNet-50 and PFPNet-R512 with VGG-16 achieved an accuracy of 82.7%, 81.5% and 82.3% respectively on the Pascal VOC 2007 test set. On Pascal VOC 2012, Table 3 indicates that PFPNet-R512 with VGG-16 and CoupleNet with ResNet-101 achieved an accuracy of 80.3% and 80.4% respectively, while YOLOv1 with GoogLeNet achieved only 57.9%. On MS COCO, the models based on VGG-16, ResNet-101 and Inception-ResNet-V2 achieved 57.6%, 57.5% and 55.5% respectively for mAP@.5, and 35.2%, 36.4% and 34.7% for mAP@[.5,.95], whereas YOLOv2 with DarkNet-19 produced a mAP@.5 of 44% and a mAP@[.5,.95] of 21.6%.

According to these results, the networks contributing to the highest performance in object detection are VGG-16, the ResNet family and Inception-ResNet-V2, which combines the Inception and ResNet networks. This explains the wide use of these architectures in different object detection models.
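The two COCO metrics quoted above differ only in the intersection-over-union (IoU) thresholds at which a detection is accepted as correct: mAP@.5 uses the single threshold 0.5, whereas mAP@[.5,.95] averages over ten thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below assumes boxes given as (x_min, y_min, x_max, y_max) and uses hypothetical helper names. It covers only the matching criterion, not the ranking of detections by confidence or the precision-recall computation that a full mAP evaluation requires.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def area(box: Box) -> float:
    """Area of a box, zero if the box is degenerate."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


# Thresholds 0.50, 0.55, ..., 0.95 used by mAP@[.5,.95].
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def accepted_thresholds(pred: Box, truth: Box) -> List[float]:
    """Thresholds at which the predicted box would count as a match."""
    overlap = iou(pred, truth)
    return [t for t in COCO_IOU_THRESHOLDS if overlap >= t]


truth = (0.0, 0.0, 10.0, 10.0)
pred = (2.0, 0.0, 12.0, 10.0)  # same size, shifted 2 units to the right
print(f"IoU = {iou(pred, truth):.3f}")
print("matches at thresholds:", accepted_thresholds(pred, truth))
```

In this example the prediction is a match at IoU 0.5 but is rejected at the stricter thresholds, which illustrates why a detector's mAP@[.5,.95] is lower than its mAP@.5.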

Table 5 shows the main features that each architecture added to improve performance.

Table 5. Main added features in the CNNs.

5 Conclusion

In this paper, we studied the state-of-the-art CNNs for object detection. We devoted the study to networks that have achieved remarkable performance. We outlined the datasets used for testing the CNNs as well as the object detection models. We compared those networks and models on multiple benchmarks and datasets. We report that the application of convolutional neural networks in object detection has given impressive state-of-the-art results.