Pattern Recognition

Volume 82, October 2018, Pages 118-129

Improved spatial pyramid matching for scene recognition

https://doi.org/10.1016/j.patcog.2018.04.025

Highlights

  • A new spatial partitioning scheme combined with a modified pyramid matching kernel is proposed on the basis of spatial pyramid matching.

  • Our modified spatial pyramid matching is shown to outperform conventional spatial pyramid matching.

  • We also propose a method of learning mid-level features using various autoencoders.

  • Local visual descriptors are encoded into different mid-level features with sparsity, robustness and contractiveness, respectively.

Abstract

A scene image is typically composed of successive background contexts and objects with regular shapes. To acquire such spatial information, we propose a new type of spatial partitioning scheme and a modified pyramid matching kernel based on spatial pyramid matching (SPM). A dense histogram of oriented gradients (HOG) is used as a low-level visual descriptor. Furthermore, inspired by the expressive coding ability of autoencoders, we also propose another approach that encodes local descriptors into mid-level features using various autoencoders. The learned mid-level features are encouraged to be sparse, robust and contractive. Then, modified spatial pyramid pooling and local normalization of the mid-level features facilitate the generation of high-level image signatures for scene classification. Comprehensive experimental results on publicly available scene datasets demonstrate the effectiveness of our methods.

Introduction

As one of the most challenging problems in computer vision, scene recognition has received considerable attention due to the rapid development of intelligent machines. Feeding low-level descriptors (e.g., colour histograms, Local Binary Patterns, and the Scale-Invariant Feature Transform) directly into a classifier has been shown to perform poorly [1] because scene images often contain many objects of interest against various backgrounds. Moreover, low-level descriptors depend mainly on edges or corner points and cannot provide adequate semantic information for scene recognition. Therefore, many researchers have focused on identifying intermediate semantic representations to narrow the gap between computers and humans in understanding scenes.

Many researchers have attempted to transform low-level descriptors into richer intermediate representations to improve recognition performance and help computers understand more abstract concepts. One extremely popular method is the Bag-of-Visual-Words (BoVW) [2], which is derived from text analysis. BoVW usually involves the following steps. First, local visual descriptors are extracted from image patches, and then dictionary learning produces the codebook, which includes representative visual words. Finally, the image can be characterized by the frequency histogram of the visual words. BoVW discards the spatial structure information in scene images, which restricts the power of the image representations. To overcome this problem, spatial pyramid matching (SPM) [3] based on the BoVW was proposed as a method of incorporating the spatial information of local visual descriptors into the histograms, and it has achieved significant success.
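
As a minimal illustration of this pipeline, the sketch below clusters local descriptors into a visual-word codebook with scikit-learn's mini-batch K-means and builds a normalized word-frequency histogram per image; the codebook size and the choice of clustering routine are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def learn_codebook(all_descriptors, n_words=200, seed=0):
    """Cluster stacked local descriptors (N x D) into a visual-word codebook."""
    return MiniBatchKMeans(n_clusters=n_words, random_state=seed).fit(all_descriptors)

def bovw_histogram(codebook, image_descriptors):
    """Quantize one image's descriptors and return its normalized word histogram."""
    words = codebook.predict(image_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```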

In this paper, the SPM method is used to capture generic spatial structure information within scene images and to learn mid-level features. We adopt the histogram of oriented gradients (HOG) as the underlying descriptor because HOG descriptors can be extracted easily and rapidly. To incorporate this generic spatial structure information into traditional SPM, a new spatial partitioning scheme is proposed to capture a greater degree of local sensitivity in scene images. Partitions in the horizontal and vertical directions are added to preserve consistent structure information. We also modify the pyramid matching kernel to alleviate the influence of viewpoint changes. The modified SPM achieves better performance than conventional SPM while requiring less computation and storage. The steps of the modified SPM are shown in Fig. 1(a). After K-means clustering of the local visual descriptors, the spatial distribution histograms are calculated. Applying the modified pyramid matching kernel yields the histogram representation of the whole image. Finally, an intersection-kernel SVM (support vector machine) performs the classification.
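
As a minimal sketch of this classification stage, the snippet below computes a histogram-intersection kernel and trains a precomputed-kernel SVM with scikit-learn; the paper's modified pyramid matching kernel and its level weighting are not reproduced here, and the regularization constant is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.svm import SVC

def intersection_kernel(A, B):
    """Pairwise histogram intersection between rows of A (n x d) and B (m x d)."""
    # min(a, b) summed over dimensions, computed with broadcasting
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

def classify(train_hists, train_labels, test_hists, C=1.0):
    clf = SVC(kernel="precomputed", C=C)
    clf.fit(intersection_kernel(train_hists, train_hists), train_labels)
    return clf.predict(intersection_kernel(test_hists, train_hists))
```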

We also propose a second approach in this paper, namely modified spatial pyramid pooling based on various autoencoders. Many models learn representations directly from pixels; in contrast, we explore the encoding of local visual descriptors. As an unsupervised learning technique, the autoencoder is designed to learn over-complete mid-level features. A single autoencoder has fewer parameters than other deep architectures, and as a directed model it is easy to train. Interesting properties of local visual descriptors are exploited using three autoencoder variants: the sparse autoencoder, the denoising autoencoder and the contractive autoencoder. The learned mid-level features are encouraged to be sparse, robust and contractive, respectively. The training of the autoencoder corresponds to dictionary learning in the BoVW framework, and this design makes inference of the mid-level features more efficient. Then, modified spatial pyramid pooling and local normalization of the mid-level feature maps produce the high-level image signature for scene recognition. This architecture merges the complementary strengths of the BoVW framework and autoencoders. Compared with other unsupervised methods, such as sparse coding, inference in this model is simple and fast. The main steps of this approach are shown in Fig. 1(b).
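
The sketch below gives one possible realization of such an encoder: a small denoising autoencoder in PyTorch that maps local descriptors to over-complete codes, which could then be spatially pooled. The layer sizes, the Gaussian corruption level and the training details are assumptions for illustration; the sparse and contractive variants differ only in the penalty added to the reconstruction loss.

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    """Single-layer denoising autoencoder for local descriptors (illustrative sizes)."""
    def __init__(self, in_dim=36, code_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(code_dim, in_dim), nn.Sigmoid())

    def forward(self, x, noise_std=0.1):
        corrupted = x + noise_std * torch.randn_like(x)  # Gaussian corruption
        code = self.encoder(corrupted)                   # mid-level feature
        return self.decoder(code), code

def train_step(model, batch, optimizer):
    recon, _ = model(batch)
    loss = nn.functional.mse_loss(recon, batch)  # reconstruct the clean input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# After training, model.encoder(descriptors) produces the mid-level features
# that are then spatially pooled and locally normalized.
```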

The remainder of this paper is organized as follows. We review related work in Section 2. The basic techniques are introduced in Section 3. The details of our proposed methods are described in Section 4, including the modified SPM based on HOG, the modified spatial pyramid pooling based on various autoencoders, and the intersection-kernel SVM for scene classification. The experimental results and a discussion are provided in Section 5. Finally, we conclude the paper and offer suggestions for future work in Section 6.

Section snippets

Prior work

Due to the limited power of local visual descriptors, many global features for scene recognition, including GIST [4], CENTRIST [5] and LDBP [6], have been proposed to describe the holistic appearance of a scene. Yu et al. [7] used a unified low-dimensional subspace to effectively fuse multiple features for scene recognition. Scene images can also be described as a set of meaningful visual attributes [8], [9], although defining these visual attributes requires significant manual effort.

Local visual descriptors

In our study, local visual descriptors are extracted from patches densely located in the image window, as shown in Fig. 2, where adjacent patches overlap. Local visual descriptors computed on dense regular grids improve scene recognition [21] because they capture the features of uniform regions such as ocean, forest and sky.
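
A minimal sketch of this dense sampling is given below; the patch size and stride are illustrative values rather than the settings used in the paper.

```python
import numpy as np

def dense_patches(image, patch_size=16, stride=8):
    """Yield (row, col, patch) for overlapping patches on a regular grid."""
    H, W = image.shape[:2]
    for r in range(0, H - patch_size + 1, stride):
        for c in range(0, W - patch_size + 1, stride):
            yield r, c, image[r:r + patch_size, c:c + patch_size]
```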

In this work, HOG is selected as the local visual descriptor for scene recognition, since HOG descriptors can be extracted easily and rapidly.
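
For reference, a HOG descriptor for a grayscale patch can be extracted with scikit-image as shown below; the parameter values are common defaults and not necessarily those used in the paper.

```python
from skimage.feature import hog

def hog_descriptor(gray_patch):
    """Return the flattened HOG descriptor of a grayscale patch."""
    return hog(gray_patch,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2),
               block_norm="L2-Hys",
               feature_vector=True)
```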

Modified spatial pyramid matching

Conventional SPM calculates the distribution of local descriptors within even grids at different levels. This spatial partitioning scheme generally splits continuous landscapes into fragments, and the local visual descriptors extracted from these fragments are inconsistent and uninformative. This has a negative impact on the statistical characteristics of the distribution histograms and may result in poor scene recognition performance. To overcome this problem, divisions in the horizontal and vertical directions are added to preserve consistent structure information, as sketched below.
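
The snippet below illustrates one way to build such partition histograms: the whole image, a 2x2 grid, and additional horizontal and vertical stripes. The exact number and weighting of the partitions in the paper may differ, so this layout is an assumption for illustration.

```python
import numpy as np

def region_histogram(words, xs, ys, x0, x1, y0, y1, n_words):
    """Visual-word histogram of descriptors whose locations fall in a region."""
    mask = (xs >= x0) & (xs < x1) & (ys >= y0) & (ys < y1)
    hist = np.bincount(words[mask], minlength=n_words).astype(float)
    return hist / max(hist.sum(), 1.0)

def partition_histograms(words, xs, ys, width, height, n_words):
    """Concatenate histograms over the whole image, a 2x2 grid, and stripes."""
    regions = [(0, width, 0, height)]                                   # whole image
    regions += [(i * width / 2, (i + 1) * width / 2,
                 j * height / 2, (j + 1) * height / 2)
                for i in range(2) for j in range(2)]                    # 2x2 grid
    regions += [(0, width, k * height / 3, (k + 1) * height / 3)
                for k in range(3)]                                      # horizontal stripes
    regions += [(k * width / 3, (k + 1) * width / 3, 0, height)
                for k in range(3)]                                      # vertical stripes
    return np.concatenate([region_histogram(words, xs, ys, *r, n_words)
                           for r in regions])
```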

Datasets and setup

Our proposed algorithms are evaluated on the Scene-15 dataset [3] and the Sports-8 dataset [35]. The Scene-15 dataset has 15 categories of scenes: coast, forest, mountain, open country, highway, inside city, tall building, street, bedroom, kitchen, living room, office, suburb, industrial and store. The dataset contains 4485 images, with the number of images per category ranging from 200 to 400. Example images from each category of the Scene-15 dataset are shown in Fig. 9.

Conclusions

In this paper, we present two methods for scene recognition. In the modified SPM method, HOG descriptors are regarded as low-level features. Our proposed pyramid matching kernel and spatial partitioning scheme achieve better performance than the original SPM while using a lower feature dimensionality. In the modified spatial pyramid pooling method based on various autoencoders, various constraints are added to the basic autoencoder to learn interesting and useful representations. This method merges the complementary strengths of the BoVW framework and autoencoders.

Conflict of interest

None declared.

Acknowledgement

This research was partially supported by the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning and by JSPS KAKENHI Grant Number 15K00159.

References (44)

  • S. Zhang et al.

    Constructing deep sparse coding network for image classification

    Pattern Recognit.

    (2017)
  • X. Song et al.

    Category co-occurrence modeling for large scale scene recognition

    Pattern Recognit.

    (2016)
  • Y. Liu et al.

    Adaptive spatial pooling for image classification

    Pattern Recognit.

    (2016)
  • J. Xiao et al.

    SUN database: large-scale scene recognition from abbey to zoo

  • J. Sivic et al.

    Video Google: a text retrieval approach to object matching in videos

    Towar. Categ. Object Recognit.

    (2003)
  • S. Lazebnik et al.

    Beyond bags of features: spatial pyramid matching for recognizing natural scene categories

  • J. Wu et al.

    CENTRIST: a visual descriptor for scene categorization

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2011)
  • L. Li et al.

    Object bank: a high-level image representation for scene classification & semantic feature sparsification

    Adv. Neural Inf. Process. Syst.

    (2010)
  • L. Torresani et al.

    Efficient object category recognition using classemes

    Proc. European Conf. Comput. Vis.

    (2010)
  • J. Wu et al.

    Beyond the Euclidean distance: creating effective visual codebooks using the histogram intersection kernel

  • J. Yang et al.

    Linear spatial pyramid matching using sparse coding for image classification

  • S. Gao et al.

    Local features are not lonely – Laplacian sparse coding for image classification

    Lin Xie is currently a master's student at the School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology. His current research interests are in the areas of scene recognition and machine learning.

    Feifei Lee received her Ph.D. degree in electronic engineering from Tohoku University in Japan, in 2007. She is currently a professor at the University of Shanghai for Science and Technology. Her research interests include pattern recognition, video indexing, and image processing.

    Li Liu received the Ph.D. degree in pattern recognition and intelligent system from East China Normal University, Shanghai, China, in 2015. She was with the Centre for Pattern Recognition and Machine Intelligence, Concordia University, Montreal, QC, Canada, from 2013 to 2014 as a visiting doctoral student, and in 2016 as a visiting scholar. She is currently a lecturer with Nanchang University. Her research interests include pattern recognition, computer vision, and document image analysis.

    Zhong Yin received the Ph.D. degree in control science and engineering from the East China University of Science and Technology. He has been a lecturer at School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, China, since 2015. His research interests include intelligent human-machine systems, biomedical signal processing and pattern recognition.

    Yan Yan is currently pursuing the Ph.D. degree in control science and engineering at University of Shanghai for Science and Technology. Her current research interests are in the areas of image recognition, computer vision and machine learning.

    Weidong Wang is currently a master's student at the School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology. His current research interests are in the areas of pattern recognition and image retrieval.

    Junjie Zhao is currently a master's student at the School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology. His current research interests are in the areas of pattern recognition and video retrieval.

    Qiu Chen received his Ph.D. degree in electronic engineering from Tohoku University, Japan, in 2004. Since then, he has been an associate professor at Tohoku University and Kogakuin University. His research interests include pattern recognition, computer vision, information retrieval and their applications. He is also a guest professor at the University of Shanghai for Science and Technology. Dr. Chen serves on the editorial boards of several journals, and he is a member of IEEE.

    1 Both authors contributed equally to this work.
