Fast crowd density estimation with convolutional neural networks

https://doi.org/10.1016/j.engappai.2015.04.006

Abstract

As an effective means of crowd control and management, crowd density estimation is an important research topic in artificial intelligence applications. Because existing methods struggle to meet the accuracy and speed requirements of engineering applications, we propose to estimate crowd density with an optimized convolutional neural network (ConvNet). The contributions are twofold. First, the convolutional neural network is introduced to crowd density estimation for the first time, and estimation is significantly accelerated by removing some network connections, based on the observation that similar feature maps exist. Second, a cascade of two ConvNet classifiers is designed, which improves both accuracy and speed. The method is tested on three data sets: PETS_2009, a Subway image sequence and a ground truth image sequence. Experiments confirm that the method performs well against state-of-the-art works on the same data sets.

Introduction

Nowadays intelligent surveillance is widely used in fire detection, vehicle identification, crowd management and so on. Crowd density estimation, a crucial part of crowd management, has become increasingly important because of the occurrence of stampedes. In general, to estimate crowd density, crowd features need to be designed first, and then a classifier needs to be trained to discriminate the crowd density from these features. In the past few decades, many efforts have been made to find both a better crowd feature description and a more efficient classifier (Davies et al., 1995, Marana et al., 1997, Marana et al., 1998, Xiaohua et al., 2006, Ma et al., 2008, Ma et al., 2010, Su et al., 2010, Zhou et al., 2012).

Davies et al. (1995) proposed to estimate crowds using image processing: background removal and edge detection for stationary crowd estimation, and optical flow computation for the estimation of crowd motion. Their method can be used in some practical circumstances, except in situations where the crowd density is high. To deal with high density crowds, Marana et al. (1997) introduced texture information based on the gray level dependence matrix (GLDM); the features were fed to a self-organizing neural network, which was responsible for the crowd density estimation. However, their work achieves only 81.88% accuracy on average. To enhance the estimation performance, the support vector machine (SVM) has been generally adopted (Xiaohua et al., 2006, Ma et al., 2010, Zhou et al., 2012). However, because the features used are not designed for specific practical environments, these SVM-based methods also cannot obtain satisfactory results in applications.

Meanwhile, the artificial neural network, another popular and efficient approach, has been widely used because it discriminates features more precisely, in a manner similar to human neural networks. Recently, the convolutional neural network (ConvNet), which is derived from the work of Hubel and Wiesel (1962), has become a research focus again for its three key ideas: local receptive fields, shared weights and spatial sub-sampling. With these, a ConvNet can ensure some degree of shift, scale and distortion invariance (Chen et al., 2010). Besides, a ConvNet can directly process images and learn features, because the weights of its multiple convolutional layers are trained by the back-propagation algorithm. Owing to these advantages, ConvNets have been widely applied in different fields such as document recognition (LeCun et al., 1998), traffic sign recognition (Sermanet and LeCun, 2011) and face detection (Garcia and Delakis, 2002, Chen et al., 2010).
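The three ideas named above can be illustrated with a minimal sketch in plain Python. It is not the network used in this paper: the function names, the 2x2 kernel and the toy image are assumptions chosen only for illustration, and average pooling is used as one common form of sub-sampling.

```python
def convolve2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding).

    As in most ConvNet implementations, the kernel is not flipped,
    so this is strictly a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Local receptive field: only a kh x kw patch feeds this unit.
            # Shared weights: the same kernel is reused at every position.
            s = 0.0
            for i in range(kh):
                for j in range(kw):
                    s += image[r + i][c + j] * kernel[i][j]
            row.append(s)
        out.append(row)
    return out


def subsample2x2(feature_map):
    """Spatial sub-sampling: average each non-overlapping 2x2 block."""
    out = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[0]) - 1, 2):
            total = (feature_map[r][c] + feature_map[r][c + 1]
                     + feature_map[r + 1][c] + feature_map[r + 1][c + 1])
            row.append(total / 4.0)
        out.append(row)
    return out


# Toy 5x5 image with a vertical edge between columns 1 and 2 (hypothetical data).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-picked kernel that responds to vertical edges (an assumption, not learned).
edge_kernel = [[1, -1], [1, -1]]

feature_map = convolve2d_valid(image, edge_kernel)  # 4x4 feature map
pooled = subsample2x2(feature_map)                  # 2x2 after sub-sampling
```

In a trained ConvNet the kernel values are learned by back-propagation rather than fixed by hand, and several such kernels produce several feature maps per layer.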

The standard ConvNet is made up of multiple layers. As the layers go up, the distortion goes down, but some details of the image will probably be lost, because only the feature maps in the last sub-sampling layer are fed to the classifier. To get better shift, scale and distortion invariance, the multi-stage ConvNet of Sermanet and LeCun (2011) and Sermanet et al. (2012) is adopted. Besides, with practical engineering applications in mind, some important improvements are made to speed and accuracy. The main contributions are twofold. (a) The structure of the multi-stage convolutional network is optimized by integrating similar feature maps, so as to decrease the computation cost and accelerate both the training and detection stages. (b) Inspired by the boosting algorithm proposed in Chen et al. (2010), a cascade of two classifiers is designed. The first one picks out samples that are obviously misclassified, and the second one, which is trained especially on hard samples, reclassifies those rejected samples. With this strategy, accuracy is improved because hard samples are handled more precisely.
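The cascade in (b) can be sketched as follows. This is only an illustration under stated assumptions: the rejection rule (a confidence threshold on the first classifier's top score), the value 0.6, and the stand-in classifiers `toy_first` and `toy_second` are hypothetical and are not taken from the paper.

```python
def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


def cascade_predict(sample, first_stage, second_stage, reject_threshold=0.6):
    """Classify with the first stage; pass rejected samples to the second.

    first_stage, second_stage: callables returning a list of class scores.
    reject_threshold: hypothetical confidence cut-off (an assumption).
    Returns (predicted_label, name_of_stage_that_decided).
    """
    scores = first_stage(sample)
    label = argmax(scores)
    if scores[label] >= reject_threshold:
        return label, "first"
    # Low confidence: treat the sample as hard and reclassify it.
    hard_scores = second_stage(sample)
    return argmax(hard_scores), "second"


# Stand-in classifiers that just read precomputed scores (hypothetical).
def toy_first(sample):
    return sample["first_scores"]


def toy_second(sample):
    return sample["second_scores"]


easy = {"first_scores": [0.05, 0.90, 0.05], "second_scores": [0.10, 0.80, 0.10]}
hard = {"first_scores": [0.40, 0.35, 0.25], "second_scores": [0.10, 0.20, 0.70]}

easy_result = cascade_predict(easy, toy_first, toy_second)
hard_result = cascade_predict(hard, toy_first, toy_second)
```

Because most samples are settled by the first stage, the second classifier runs only on the rejected ones, which is one way a cascade can help speed as well as accuracy.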

The rest of this paper is organized as follows. In Section 2, we formally define our problem and introduce the basic concepts of ConvNets. Section 3 provides a detailed explanation of our improvements to ConvNets, which are specifically required for engineering applications. Section 4 presents the experimental results. Finally, conclusions are given in Section 5.

Section snippets

General procedure for crowd estimation

Crowd density estimation is mainly used in public places such as train stations, subway stations and squares. In video of these places, people often overlap when it is crowded, and sometimes only their heads are visible. Estimating crowd density by detecting and counting individuals is therefore infeasible (Davies et al., 1995). Through years of research, a general procedure for crowd estimation has been established. Fig. 1 shows the steps of crowd density

Optimizing connections

The multi-stage ConvNet increases the number of features in the final classifier as well as the number of connections. This seriously increases the computation time at both the training and detection stages, which hinders its use in engineering applications. Inspired by the observation that there exist some redundant connections between two similar feature maps, we propose an optimization method, whose efficiency is confirmed in experiments, to remove these extra connections based on similarity matrix M(i,j) to
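As a rough illustration of pruning by a similarity matrix M(i,j), the sketch below drops feature maps that are near-duplicates of ones already kept. The excerpt does not specify how M(i,j) is computed or how connections are selected, so the cosine similarity, the 0.95 threshold, the greedy keep-first rule and all function names here are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two flat vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(feature_maps):
    """M[i][j] = similarity between feature maps i and j (assumed: cosine)."""
    flat = [[v for row in fm for v in row] for fm in feature_maps]
    n = len(flat)
    return [[cosine_similarity(flat[i], flat[j]) for j in range(n)]
            for i in range(n)]


def select_maps_to_keep(M, threshold=0.95):
    """Greedily keep map i unless it is too similar to a map already kept."""
    keep = []
    for i in range(len(M)):
        if all(M[i][k] < threshold for k in keep):
            keep.append(i)
    return keep


# Three toy 2x2 feature maps (hypothetical values).
maps = [
    [[1, 2], [3, 4]],
    [[2, 4], [6, 8]],    # a scaled copy of map 0, so nearly redundant
    [[4, -3], [2, -1]],  # orthogonal to map 0
]
M = similarity_matrix(maps)
kept = select_maps_to_keep(M)
```

Connections leaving a dropped map would then be removed from the following layer, which is where the saving in computation comes from.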

Experiments and comparisons

As there is no standard data set for crowd density estimation (Zhou et al., 2012), the approach is tested on three data sets: PETS_2009 (Ferryman and Ellis, 2010), a Subway video (Ma et al., 2008, Ma et al., 2010) and a video of Chunxi_Road in Chengdu. The experimental results are presented in tables to demonstrate the validity of the improvements in crowd density estimation. To evaluate the performance of our method, some comparisons are

Conclusions

In this paper, we proposed a real-time approach to crowd density estimation based on a cascade of optimized convolutional neural networks. The proposed method is built on the multi-stage ConvNet. Following the key observation that some of the connections in the network are not necessary, similar feature maps are removed from the 2nd stage together with their connections. By this simplification of connections, a lighter network is obtained and the computation burden is significantly

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (61375038), the 973 National Basic Research Program of China (2010CB732501) and the Fundamental Research Funds for the Central Universities (ZYGX2012YB028 and ZYGX2011X015).

References (16)

  • A. Marana et al., 1998. Automatic estimation of crowd density using texture. Saf. Sci.
  • B. Zhou et al., 2012. Higher-order SVD analysis for crowd density estimation. Comput. Vis. Image Understand.
  • Chen, Y., Han, C., Wang, C., Jeng, B., Fan, K., 2015. A cnn-based face detector with a simple feature map and a...
  • A.C. Davies et al., 1995. Crowd monitoring using image processing. Electron. Commun. Eng. J.
  • Ferryman, J., Ellis, A., 2010. PETS2010: dataset and challenge. In: 2010 Seventh IEEE International Conference on...
  • Garcia, C., Delakis, M., 2002. A neural architecture for fast and robust face detection. In: Proceedings of the 16th...
  • D.H. Hubel et al., 1962. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol.
  • G.A. Kim et al., 2012. Estimation of crowd density in public areas based on neural network. KSII Trans. Internet Inf. Syst.
There are more references available in the full text version of this article.
