Abstract

Like traditional object detectors, modern object detectors comprise two major parts: a feature extractor and a feature classifier. Deeper and wider convolutional architectures are currently adopted as the feature extractor. However, many notable object detection systems such as Fast/Faster RCNN only consider simple fully connected layers as the feature classifier. In this paper, we show that elaborately designing deep convolutional networks (ConvNets) of various depths for feature classification is beneficial for detection performance, especially when fully convolutional architectures are used. In addition, this paper demonstrates how to employ fully convolutional architectures in Fast/Faster RCNN. Experimental results show that a classifier based on convolutional layers is more effective for object detection than one based on fully connected layers, and that better detection performance can be achieved by employing deeper ConvNets as the feature classifier.

1. Introduction

Like traditional object detectors, modern object detectors [1, 2] comprise two major parts: a feature extractor and a feature classifier. These two parts are considered mutually independent in traditional object detectors, whereas they form a unified process in modern object detectors. The feature extractor in traditional object detection methods is usually a hand-engineered descriptor, such as SIFT [3] or HOG [4], while the feature classifier is usually a linear SVM [5], a nonlinear boosted classifier [6], or an additive kernel SVM [7]. In contrast, deep ConvNets now dominate the feature extractor of modern object detectors in various application scenarios [8-11]. Aside from being capable of representing higher-level semantics, ConvNets are also more robust to scale variance and thus facilitate recognition from features computed on a single input scale.

The successful RCNN [12] method applies high-capacity convolutional neural networks to extract a fixed-length feature vector from each region, which is then fed to a set of class-specific linear SVMs. It first pretrains the network with supervision for image classification, where data are abundant, and then fine-tunes the network for detection, where data are scarce. In fact, it can only be considered a hybrid of traditional detectors and deep ConvNets: although its feature extractor is replaced by pretrained deep ConvNets, its classifier is still a traditional model, namely a set of class-specific linear SVMs. SPPnet [13] is also a hybrid model, using convolutional layers to extract full-image features followed by a set of class-specific binary linear SVMs as in RCNN. The difference is that the spatial pyramid pooling layer proposed by SPPnet enables feature extraction from arbitrary windows of the deep convolutional feature maps.

Fast RCNN [14] and Faster RCNN [15] further evolve the object detection pipeline. Following the pioneering RCNN, Fast/Faster RCNN uses convolutional layers, initialized with discriminative pretraining for ImageNet [16] classification, to extract region-independent features, followed by a regionwise multilayer perceptron (MLP) for classification. Besides, they jointly optimize a softmax classifier and bounding-box regressors, rather than training a softmax classifier, SVMs, and regressors in three separate stages. Nevertheless, this strategy of ending with an MLP classifier is a memory hog.

Based on a thorough study of the regionwise feature classifier in Fast/Faster RCNN, our main work and contributions include the following three aspects. Firstly, we prove that the input size of the prevailing fully convolutional architectures must satisfy a certain condition due to the concatenation of convolutional and pooling layers. Secondly, based on a detailed analysis of these fully convolutional architectures, we show how to employ recent state-of-the-art image classification networks, such as ResNet [17] and various versions of GoogleNet [18, 19], which are fully convolutional by design, in Fast/Faster RCNN detection systems. Finally, we adopt the idea of skip connection, analogous to a hybrid of PVANET [20] and FPN [21], that combines several intermediate outputs. Consequently, low-level visual features and high-level semantic features can be taken into account at the same time.

In the remainder of this paper, we derive a general formula for accurately designing the input size of various fully convolutional networks in which convolutional layers and pooling layers are concatenated (Section 2) and propose an efficient skip-connection architecture stemming from PVANET and FPN (Section 3). Finally, we provide extensive experimental results on the VOC2007 benchmark for Fast/Faster RCNN detection systems employing various fully convolutional networks with and without the skip-connection architecture, along with detailed settings for training and testing (Section 4).

2. Deriving the Condition of the Input Size Based on Fully Convolutional Architectures

Beginning with LeNet-5 [22], convolutional neural networks have typically had a standard structure: stacked convolutional layers, optionally followed by local response normalization and a pooling layer, ending with two 4096-d fully connected (fc) layers. Variants of this basic design are prevalent in the image classification literature and have achieved the best results on MNIST [23], CIFAR [24], and, most notably, the ILSVRC competition. For larger datasets such as ImageNet, the recent trend has been to increase the depth and width of CNN architectures that are fully convolutional by design, whose fully connected layers are replaced by a global average pooling layer, while using dropout [25] to deal with overfitting and batch normalization [26] to accelerate deep network training by reducing internal covariate shift. Meanwhile, a recent pooling technique called Mean-Max Pooling, introduced in DPN, can improve the performance of a well-trained CNN in the testing phase without any noticeable computational overhead.

These prevailing fully convolutional networks, such as ResNet and GoogleNet and their updated versions, have an effective stride of $2^5 = 32$ pixels, the same as the ZF [27]/VGG [28] networks. In other words, the effective stride doubles at each stage. The difference lies in how the feature map size is halved: ResNet/GoogleNet specially designs a residual/inception building block, whereas ZF/VGG only adopts a max pooling layer. As we know, the Faster RCNN system can take an image of any size as input. The ability to feed images of varying sizes is owed not only to the proposed RoI pooling layer but also to the architecture of ConvNets such as ZF/VGG, which are stacks of max pooling layers and convolutional layers.

Table 1 illustrates the detailed architectures of various CNN networks. As can be seen, a special inception/residual block is designed to reduce the feature map size in the fully convolutional networks. Furthermore, the parameters of these special reduction blocks are precisely designed to enable the alignment of feature map sizes in the concatenation layer. If we want to employ the fully convolutional networks in the Fast/Faster RCNN system, the input size of these various ConvNets must satisfy a certain condition. Taking the Inception_v3 architecture as an example, we calculate each feature map size according to the model parameters of the convolutional and pooling layers. Based on the CAFFE [29] framework, the output size of each convolutional and pooling layer can be calculated precisely by the following two formulas:
$$o_{\mathrm{conv}} = \left\lfloor \frac{i + 2p - k}{s} \right\rfloor + 1, \qquad o_{\mathrm{pool}} = \left\lceil \frac{i + 2p - k}{s} \right\rceil + 1, \tag{1}$$
where $i$ and $o$ denote the input and output feature map sizes, $k$ the kernel size, $p$ the padding, $s$ the stride, and $\lfloor\cdot\rfloor$ and $\lceil\cdot\rceil$ are denoted as the floor and ceil functions, respectively.
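As a quick illustration of (1): for a 3 × 3 kernel with stride 2 and no padding, an input of size 35 gives $\lfloor(35-3)/2\rfloor + 1 = 17$ from a convolutional layer and $\lceil(35-3)/2\rceil + 1 = 17$ from a pooling layer, so the two branches align, whereas an input of size 36 gives 17 and 18, respectively, and the concatenation would fail.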

The detailed Inception_v3 architecture is defined in Table 2 according to the original paper. Because the output of the pooling layer is concatenated with the outputs of the convolutional layers at the end of the inception block, these outputs must have the same feature map size. We can derive a general formula for accurately designing the input size of Inception_v3 by a reverse-inference method, which proceeds in two stages.

In consideration of the alignment issue of the feature map size in an inception block, we first calculate the input size of the reduction module, a special inception block whose detailed parameters can be found in Table 2. For one reduction module, suppose the output size is $y$, a known positive integer, and the input size is $x$, which is unknown. Then we can establish a link between the output size and the input size by (1):
$$\left\lfloor \frac{x + 2p - k}{s} \right\rfloor + 1 = y, \qquad \left\lceil \frac{x + 2p - k}{s} \right\rceil + 1 = y, \qquad x, y \in \mathbb{N}^{+},$$
where $\mathbb{N}^{+}$ is denoted as the set of positive integers. Substituting the reduction-module parameters ($k = 3$, $s = 2$, $p = 0$), we can obtain the solution of the equation set; that is, $x = 2y + 1$. Assuming that Inception_v3 has $n$ reduction modules, we can easily derive the relationship between the output size $y_n$ of the final reduction module and the input size $x_1$ of the first reduction module, which is $x_1 = 2^{n}(y_n + 1) - 1$. The detailed derivation process is shown in Figure 1.
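As a quick check of this relationship, with $n = 2$ reduction modules and a final output size of $y_2 = 8$, the formula gives $x_1 = 2^{2}(8 + 1) - 1 = 35$, matching the familiar 35 → 17 → 8 spatial progression of the standard 299-pixel Inception_v3 input.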

The remaining part of Inception_v3, which lies in front of the inception blocks, consists only of stacked convolutional and pooling layers. Similarly, we can derive the relationship between the input size and the output size of this part by (1).

Finally, we can obtain a general formula for the input size $S$ of the Inception_v3 architecture, which is as follows:
$$S = 2^{n+3}(y_n + 1) + 11,$$
where $y_n$ is the output size of the final reduction module and $n$ is the number of reduction modules.

The detailed input size of each layer for Inception_v3 is shown in Table 2, where $n$ equals 2. It is worth noting that the dilation of each layer is set to 1, which is omitted in Table 2. In the same way, we can derive the corresponding formula for the input size of other ConvNets. From this we can see that the input sizes of these fully convolutional networks must satisfy certain conditions when they are employed in the Fast/Faster RCNN system.
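To make the condition concrete, the following Python sketch (not the authors' code) forward-simulates the size arithmetic of an Inception_v3-style network with (1). The stem and reduction-module parameters follow the publicly available Inception_v3 definition and stand in for those of Table 2; the stride-1 inception blocks preserve spatial size and are therefore omitted.

import math

def conv_out(i, k, s, p=0):
    # CAFFE convolution output size: floor((i + 2p - k) / s) + 1
    return math.floor((i + 2 * p - k) / s) + 1

def pool_out(i, k, s, p=0):
    # CAFFE pooling output size: ceil((i + 2p - k) / s) + 1
    return math.ceil((i + 2 * p - k) / s) + 1

def check_input(size, num_reductions=2):
    """Return (aligned, final_size) for an Inception_v3-style network."""
    # Stem: conv3x3/2, conv3x3/1, conv3x3/1 pad 1, pool3x3/2,
    #       conv1x1/1, conv3x3/1, pool3x3/2.
    size = conv_out(size, 3, 2)
    size = conv_out(size, 3, 1)
    size = conv_out(size, 3, 1, p=1)
    size = pool_out(size, 3, 2)
    size = conv_out(size, 1, 1)
    size = conv_out(size, 3, 1)
    size = pool_out(size, 3, 2)
    # Each reduction module concatenates a stride-2 3x3 convolution branch
    # with a stride-2 3x3 max-pooling branch, which must agree in size.
    for _ in range(num_reductions):
        c, q = conv_out(size, 3, 2), pool_out(size, 3, 2)
        if c != q:
            return False, size
        size = c
    return True, size

for s in (299, 587, 600):
    print(s, check_input(s))
# 299 -> (True, 8), 587 -> (True, 17), 600 -> (False, 36)

Under these assumed parameters, input sizes such as 299 and 587 keep the two branches aligned through both reduction modules, whereas 600, the default shorter edge in Faster RCNN, fails at the second one.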

3. Skip-Layer Connections

Many studies have shown that multiscale representation and its combination are effective in many recent deep learning tasks. In essence, multiscale representation is a skip-layer connection method that combines fine-grained details with highly abstracted information in the feature extraction layers, which benefits the subsequent region proposal and classification networks. Our skip-layer connection architecture is essentially a combination of the observations from PVANET and FPN.

Our design combines the last layer and two intermediate layers whose feature map scales are two and four times that of the last layer, respectively. The backbone ConvNet computes a feature hierarchy consisting of feature maps at five scales with a scaling step of 2. The feature maps of some layers have the same scale, and we say these layers are in the same network stage. As we know, most ConvNets have five network stages with strides of $2^1$, $2^2$, $2^3$, $2^4$, and $2^5$ pixels with respect to the input image. Our multiscale features are obtained by combining the three network stages whose strides are $2^3$, $2^4$, and $2^5$ pixels. Besides, the output of the last layer of each stage is chosen as our reference set of feature maps.

There are four steps to obtain our multiscale features, as shown in Figure 2. To combine multilevel maps at the same resolution, different sampling strategies are carried out for different stages: the stage with stride $2^3$ is downscaled by 3 × 3 max pooling with stride 2, while the stage with stride $2^5$ is upscaled by channelwise deconvolution whose weights are fixed as bilinear interpolation. In order to merge the features of the three stages by elementwise addition, a 1 × 1 convolutional layer is used to adjust their channel dimensions to a fixed size, which is set to 256 in our experiments. At last, a convolutional layer is appended on the merged feature map to generate the final feature map and to reduce the aliasing effect of upsampling.
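A minimal PyTorch sketch of these four steps follows (not the authors' implementation; the channel arguments, the 4 × 4 deconvolution kernel, and the 3 × 3 smoothing kernel here are assumptions made for illustration):

import torch.nn as nn
import torch.nn.functional as F

class SkipCombine(nn.Module):
    """Combine the stages with strides 2^3, 2^4, and 2^5 into one feature map."""

    def __init__(self, c3, c4, c5, out_channels=256):
        super().__init__()
        # Channelwise (grouped) deconvolution, whose weights are meant to be
        # fixed to bilinear interpolation, upscales the stride-2^5 stage by 2.
        self.up5 = nn.ConvTranspose2d(c5, c5, kernel_size=4, stride=2,
                                      padding=1, groups=c5, bias=False)
        self.up5.weight.requires_grad = False  # bilinear initialization omitted here
        # 1x1 convolutions adjust each stage to a common channel dimension.
        self.lat3 = nn.Conv2d(c3, out_channels, kernel_size=1)
        self.lat4 = nn.Conv2d(c4, out_channels, kernel_size=1)
        self.lat5 = nn.Conv2d(c5, out_channels, kernel_size=1)
        # A convolution on the merged map reduces the aliasing effect of upsampling.
        self.smooth = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, f3, f4, f5):
        # f3, f4, f5: outputs of the stages with strides 2^3, 2^4, 2^5.
        down3 = self.lat3(F.max_pool2d(f3, kernel_size=3, stride=2, padding=1))
        same4 = self.lat4(f4)
        up5 = self.lat5(self.up5(f5))
        # Merge by elementwise addition (spatial sizes are assumed to match).
        return self.smooth(down3 + same4 + up5)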

4. Experiments

4.1. Experimental Setup

As we know, the Faster RCNN system includes two stages, as shown in Figure 3. The first stage, called the region proposal network (RPN), processes an input of arbitrary size with shared convolutional layers known as the feature extractor. A small network is then slid over the features output by the last shared convolutional layer within a spatial window. Each sliding window is mapped to a lower-dimensional feature, which is fed into two sibling convolutional layers for box regression and box classification, respectively. The second stage is a detection network for which Fast RCNN is adopted. The proposals generated by the RPN and the shared convolutional features are fed into the RoI pooling layer, followed by the remaining layers of the backbone ConvNet, in order to predict a class and a class-specific box refinement for each proposal.
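This small RPN head can be sketched as follows (a minimal PyTorch illustration, not the authors' implementation; the 512-d intermediate feature and 9 anchors per location follow common Faster RCNN settings):

import torch.nn as nn

class RPNHead(nn.Module):
    """Small network slid over the shared convolutional feature map."""

    def __init__(self, in_channels, mid_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        # Two sibling 1x1 convolutional layers.
        self.cls = nn.Conv2d(mid_channels, num_anchors * 2, kernel_size=1)  # object vs. background scores
        self.reg = nn.Conv2d(mid_channels, num_anchors * 4, kernel_size=1)  # box regression deltas

    def forward(self, x):
        t = self.relu(self.conv(x))
        return self.cls(t), self.reg(t)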

We perform all our experiments on the 20-category PASCAL VOC2007 detection dataset. Our code is based on the official Faster RCNN code written in MATLAB, which also includes a reimplementation of Fast RCNN. As is common practice, all network backbones are pretrained on the 1000-class ImageNet dataset and then fine-tuned on the detection dataset. We investigate the pretrained GoogleNet, Inception_v2, Inception_v3, and ResNet-50 models that are publicly available. From the above analyses, any recent network can similarly be used as a backbone ConvNet in Fast/Faster RCNN, provided that the issue of the input size is solved.

All parameters related to Fast/Faster RCNN were set as in the original work, except that the shorter edge of each input image was resized to 587. What is noteworthy is that the last max pooling layer of ZF/VGG is replaced by a RoI pooling layer in the original Fast/Faster RCNN, which leads to an effective output stride of $2^4$ instead of $2^5$. To put all ConvNets on an equal footing, we slightly modify the original models so that the last network stage has stride 1 instead of 2. Furthermore, atrous [30] convolution is used in the last network stage to compensate for the reduced stride. Except for GoogleNet, batch normalization, whose parameters are frozen to those estimated during ImageNet pretraining, is used after the convolutional layers. All experiments were performed on an Intel i7-6700K CPU and an NVIDIA GTX1080 GPU.
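A minimal PyTorch sketch of these two modifications follows (the channel count is a placeholder and this is not the authors' code): a stride-2 3 × 3 convolution in the last network stage is changed to stride 1 with dilation 2 to compensate for the reduced stride, and batch normalization is frozen to its ImageNet-pretrained statistics.

import torch.nn as nn

# Last-stage 3x3 convolution: stride 2 -> stride 1, with atrous (dilated) kernels
# so that the receptive field is preserved despite the reduced stride.
original = nn.Conv2d(1024, 1024, kernel_size=3, stride=2, padding=1, bias=False)
modified = nn.Conv2d(1024, 1024, kernel_size=3, stride=1, padding=2, dilation=2, bias=False)

# Batch normalization frozen to the statistics estimated during pretraining.
bn = nn.BatchNorm2d(1024)
bn.eval()                      # keep the running mean/variance fixed
for p in bn.parameters():      # do not update the affine parameters
    p.requires_grad = False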

4.2. Experiments on ConvNets for Fast/Faster RCNN System

RoI pooling is used to pool regionwise features from the shared convolutional feature maps. It generates a fixed-size feature map for each proposal and replaces the last pooling layer when ZF/VGG is the backbone network, in which case the remaining fc layers are typically used as the regionwise classifier. Differing from ZF and VGG, the other ConvNets replace the fc layers with a global average pooling layer. Therefore, we must choose where to insert the RoI pooling layer for these ConvNets. Based on our observation, we insert the RoI pooling layer at the last network stage to ensure an effective output stride of $2^4$. Besides, we set the output size of RoI pooling to 7 × 7, as in VGG, in all cases. Because three layers are included in the last network stage, we perform experiments on different layers of various depths for the Fast/Faster RCNN system and obtain some exciting results, shown in Table 3. According to the experimental results, we make the argument that a deeper convolution-based regionwise classifier is more effective but more time-consuming and is in general orthogonal to more powerful and deeper feature maps. Moreover, we also perform the original Fast RCNN experiments based on ZF/VGG and obtain detection mAPs of 58.9 and 68.2, respectively, which are significantly improved upon by Inception_v3 and ResNet_50. Besides, the results also indicate that the fully convolutional networks have a smaller final model size because they lack fully connected layers.
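For illustration, pooling a regionwise feature at this insertion point can be sketched with torchvision's RoI pooling as follows (dummy tensors and placeholder channel counts; the spatial scale of 1/16 corresponds to the effective output stride of $2^4$):

import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 1024, 37, 50)            # shared convolutional feature map
rois = torch.tensor([[0., 48., 48., 304., 208.]])  # (batch_idx, x1, y1, x2, y2) in image coordinates
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([1, 1024, 7, 7])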

As discussed above, our skip-layer connection is used in the feature extraction stage by combining the three network stages whose strides are $2^3$, $2^4$, and $2^5$ pixels. All the stages of the fully convolutional networks are then considered as the feature extractor. According to the previous experimental results, the final detection mAP is much lower than the others when the RoI pooling layer is followed only by the global average pooling layer. Consequently, we replace the global average pooling layer with two 1024-d fc layers. They are randomly initialized by the Xavier method because no pretrained fc layers are available. Each fc layer is followed by a ReLU layer and a dropout layer with a dropout ratio of 0.25. Similarly, we perform experiments on different backbone ConvNets with our skip-connection layers for the Fast RCNN system. The experimental results are shown in Table 4, which indicates that our skip-connection architecture is fairly effective and achieves almost identical performance to the third convolution-based regionwise classifier. Obviously, the fc layers account for most of the final model size compared to the convolutional layers. Besides, we expect that the performance can be improved further by using more fc layers.
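A minimal sketch of this replacement head under the stated settings (PyTorch, with a hypothetical input dimension that depends on the backbone and the RoI output size):

import torch.nn as nn

def make_head(feat_dim, hidden=1024, drop=0.25):
    # Two 1024-d fc layers, each followed by ReLU and dropout (ratio 0.25),
    # applied to the flattened RoI-pooled features.
    head = nn.Sequential(
        nn.Linear(feat_dim, hidden), nn.ReLU(inplace=True), nn.Dropout(drop),
        nn.Linear(hidden, hidden), nn.ReLU(inplace=True), nn.Dropout(drop),
    )
    for m in head:
        if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)  # Xavier initialization (no pretrained fc weights)
            nn.init.zeros_(m.bias)
    return head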

In fact, Faster RCNN innovatively merges the proposed RPN and Fast RCNN into a single network by sharing their convolutional features. Therefore, we only perform experiments on the best case above for Faster RCNN. Due to a lack of time, we experiment only on ResNet_50 and obtain a detection mAP of 73.1, which is boosted by almost two percentage points in comparison with Fast RCNN. What is more, we try to run some experiments on the convolution-based regionwise classifier where our skip-layer connection is used in the feature extraction stage by combining the three network stages whose strides are $2^3$, $2^4$, and $2^5$ pixels. Regrettably, the obtained detection mAP is fairly low, only 62.4 and 65.9 for Inception_v2 and ResNet_50, respectively. Therefore, we put forward the argument that combining higher-level features is more effective.

5. Conclusions

In this paper, we have presented how to use the prevailing fully convolutional architectures in notable object detection systems such as Fast/Faster RCNN. Specifically, we have derived a general formula for accurately designing the input size of various fully convolutional networks in which convolutional and pooling layers with strides greater than 1 are concatenated, and we have proposed an efficient skip-connection architecture to accelerate the training process. It is worth noting that our experiments are performed only on the VOC2007 set. We strongly believe that the results can clearly be boosted by a large margin by using more training data. We believe that our theoretical analysis and experiments can provide insights into how to employ other CNN architectures in single-stage or two-stage object detection systems. Besides, in future work we will leverage the Faster RCNN whose backbone ConvNet is replaced with ResNet_50 to detect small objects in optical remote sensing images by accurately modifying the strides.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.