Non-parametric spatially constrained local prior for scene parsing on real-world data

https://doi.org/10.1016/j.engappai.2020.103708

Abstract

Scene parsing aims to recognize the object category of every pixel in scene images, and it plays a central role in image content understanding and computer vision applications. However, accurate scene parsing from unconstrained real-world data is still a challenging task. In this paper, we present the non-parametric Spatially Constrained Local Prior (SCLP) for scene parsing on realistic data. For a given query image, the non-parametric SCLP is learnt by first retrieving a subset of training images most similar to the query image and then collecting, from the retrieved subset, prior information about object co-occurrence statistics between spatial image blocks and between adjacent superpixels. The SCLP is powerful in capturing both long- and short-range context about inter-object correlations in the query image and can be effectively integrated with traditional visual features to refine the classification results. Our experiments on the SIFT Flow and PASCAL-Context benchmark datasets show that the non-parametric SCLP, used in conjunction with superpixel-level visual features, achieves one of the top performances among state-of-the-art approaches.

Introduction

Scene parsing aims to assign every pixel of a query image to a correct semantic category, such as bus, road, sky and tree. It plays a central role in image content understanding and has the potential to support applications in a wide range of fields, such as object recognition, vehicle navigation and risk identification. Although substantial progress has been made in recent years, scene parsing from complicated real-world data is still a very challenging task, due to substantial variability in both objects' properties (e.g., number, type, appearance, size and location) and environmental conditions (e.g., over-exposure, under-exposure, glare and illumination variations). Thus, a primary focus in this field has been developing techniques that can reliably represent the most discriminative characteristics of all objects in the scene while remaining robust against environmental variations.

The context of scene type and spatial layout provides very useful information for scene parsing. It is common knowledge that some objects (e.g., car and road, grass and cow, and sea and sand) are more likely to co-occur in the same scene, whereas other objects (e.g., sea and computer, and boat and train) are unlikely to co-exist. In addition, the spatial layout conveys important contextual information about the location of objects within the scene. For instance, sky has a high probability of appearing in the top part of a city street scene, while road is likely to appear at the bottom. Fig. 1 shows three examples of real-world scenes with such contextual information. Thus, prior context about object co-occurrence statistics and object locations can potentially be used to improve scene parsing results.
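To make this idea concrete, the following is a minimal sketch (not the paper's exact formulation) of how co-occurrence priors between spatial image blocks could be collected from ground-truth label maps; the grid size and the normalization scheme are illustrative assumptions.

```python
import numpy as np

def block_cooccurrence(label_maps, num_classes, grid=(4, 4)):
    """Count how often class c1 in block i co-occurs with class c2 in
    block j, over a set of ground-truth label maps (2-D int arrays)."""
    gh, gw = grid
    n_blocks = gh * gw
    counts = np.zeros((n_blocks, n_blocks, num_classes, num_classes))
    for labels in label_maps:
        h, w = labels.shape
        # classes present in each block of a regular gh x gw grid
        present = []
        for bi in range(gh):
            for bj in range(gw):
                block = labels[bi * h // gh:(bi + 1) * h // gh,
                               bj * w // gw:(bj + 1) * w // gw]
                present.append(np.unique(block))
        # accumulate pairwise presence counts across all block pairs
        for i, ci in enumerate(present):
            for j, cj in enumerate(present):
                for c1 in ci:
                    counts[i, j, c1, cj] += 1
    # normalize each block pair into an empirical co-occurrence distribution
    totals = counts.sum(axis=(2, 3), keepdims=True)
    return counts / np.maximum(totals, 1)
```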

Towards this aim, considerable effort has been invested in the literature, resulting in various types of contextual features such as absolute location (Shotton et al., 2009), relative location (Gould et al., 2008), directional spatial relationships (Jimei et al., 2014, Singhal et al., 2003) and object co-occurrence statistics (Micusik and Kosecka, 2009), as well as different types of graphical models such as Conditional Random Fields (CRFs) (Batra et al., 2008, Lei and Qiang, 2010) and Markov Random Fields (MRFs) (Xiaofeng et al., 2012), and deep learning models such as the Convolutional Neural Network (CNN), Fully Convolutional Network (FCN) (Shelhamer et al., 2017), DeepLab (Chen et al., 2018), Residual Network (ResNet) (Wu et al., 2019) and Context-contrasted Gated Boundary Network (CGBNet) (Ding et al., 2020). Although promising results have been achieved, most of these approaches still suffer from two drawbacks:

  • (1)

    All training data are treated as equally important when collecting contextual features or learning prediction models. However, in real-world scenarios it is often the case that only a small portion of the training data shares characteristics similar to each individual query image. Rare classes also appear frequently in realistic data, and they can have a large impact on scene parsing results (Shuai et al., 2018).

  • (2)

    Limited capacity in capturing long- and short-range context, both of which provide essential information for scene parsing. Context-aware features are essential to overcome large variations in scene content (Wang et al., 2017).

To address the above drawbacks, this paper presents the non-parametric Spatially Constrained Local Prior (SCLP) for scene parsing from complicated real-world data. It first retrieves a subset of training images similar to a query image and then collects query-specific SCLP contextual features from the retrieved subset, representing more specific and useful prior context about inter-class correlations for the query image. The SCLP contextual features are further integrated with visual features in a decision-level fusion strategy to predict the class label of every superpixel in the query image. The power of the non-parametric SCLP lies in its ability to represent both global and local context from the most similar training subset, separately for each query image, and to further integrate contextual and visual features to improve the prediction results. It shows state-of-the-art performance on the SIFT Flow and PASCAL-Context benchmark datasets.
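As an illustration of the short-range component, the sketch below shows one plausible way to accumulate class co-occurrence statistics between adjacent superpixels from ground-truth annotations; the majority-label assignment and the row normalization are assumptions, not necessarily the paper's exact procedure.

```python
import numpy as np

def superpixel_cooccurrence(seg_maps, label_maps, num_classes):
    """Accumulate a num_classes x num_classes matrix counting how often
    two classes appear on opposite sides of a superpixel adjacency."""
    cooc = np.zeros((num_classes, num_classes))
    for seg, labels in zip(seg_maps, label_maps):
        # majority ground-truth label of each superpixel
        sp_label = {}
        for sp in np.unique(seg):
            vals, cnts = np.unique(labels[seg == sp], return_counts=True)
            sp_label[sp] = vals[np.argmax(cnts)]
        # 4-connected pixel pairs straddling a superpixel boundary,
        # canonicalized so each adjacent superpixel pair is counted once
        pairs = set()
        for a, b in ((seg[:, :-1], seg[:, 1:]), (seg[:-1, :], seg[1:, :])):
            m = a != b
            pairs |= {(min(x, y), max(x, y)) for x, y in zip(a[m], b[m])}
        for s1, s2 in pairs:
            c1, c2 = sp_label[s1], sp_label[s2]
            cooc[c1, c2] += 1  # symmetric update over class pairs
            cooc[c2, c1] += 1
    # row-normalize into conditional probabilities P(neighbour = c2 | c1)
    rows = cooc.sum(axis=1, keepdims=True)
    return cooc / np.maximum(rows, 1)
```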

The main objectives of this paper are to (1) present a novel way of scene parsing on real-world data, which is particularly important for a wide range of practical applications. For instance, the proposed approach has already been applied to the segmentation and classification of objects (such as trees and grasses) for roadside fire risk prediction on video data from the state roads in central Queensland, Australia; (2) validate the effectiveness of non-parametric contextual features, which are generated specifically for every query image based on a training subset that has similar scene context to the query image; and (3) provide a simple yet effective framework for the extraction and integration of both global and local context with visual features.

Compared with widely adopted deep learning techniques such as the CNN and its variants, the proposed approach has three unique characteristics. (1) A CNN utilizes convolutional and pooling layers to progressively extract more abstract patterns from the image and thus places more emphasis on local context, while our approach extracts global and local context separately in two specifically designed ways and is thus able to consider both long- and short-range context. (2) A CNN integrates visual and contextual features in a unified framework that often requires training a large number of system parameters, while our approach provides a simple framework involving only a couple of system parameters, which is much easier to train and adjust. (3) A CNN treats all images as equally important during training and thus often has difficulty labelling rare classes due to insufficient training samples, while our approach collects query-specific contextual features from a retrieved training subset to represent more specific and effective context tailored to every query image, including those containing rare classes.
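As an illustration of point (2), a hypothetical decision-level fusion with a single mixing weight, as sketched below, would already exhibit the "couple of system parameters" property; the actual fusion rule and its parameterization in the paper may differ.

```python
import numpy as np

def fuse_predictions(p_visual, p_sclp, alpha=0.5, eps=1e-8):
    """Combine per-superpixel class probabilities from a visual-feature
    classifier with SCLP prior probabilities via a weighted geometric mean.
    p_visual, p_sclp: arrays of shape (n_superpixels, num_classes)."""
    fused = (p_visual + eps) ** alpha * (p_sclp + eps) ** (1.0 - alpha)
    return fused / fused.sum(axis=1, keepdims=True)  # renormalize per superpixel

# usage: final class label of each superpixel
# labels = fuse_predictions(p_vis, p_prior).argmax(axis=1)
```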

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 introduces the proposed non-parametric SCLP approach. The experiments are presented in Section 4 and finally Section 5 draws the conclusions.

Section snippets

Related work

Early approaches (Shotton et al., 2009, Bosch et al., 2007, Kang et al., 2011) to scene parsing are often designed based on a set of visual features that are extracted at a pixel or region (e.g. patch and superpixel) level. The extracted features are then taken as inputs into a prediction algorithm to obtain the class label of every pixel or region. In these approaches, two critical tasks are: (a) extracting visual features that are capable of capturing the most discriminative characteristics

Proposed non-parametric SCLP approach

Fig. 2 shows an overview of the system framework of the proposed non-parametric SCLP approach for scene parsing. For a given query image, the approach processes it through three main streams: (1) non-parametric SCLP extraction and prediction; (2) visual feature extraction and prediction; and (3) prediction fusion and voting.

For non-parametric SCLP extraction, an image retrieval process is introduced to retrieve a subset of images from the training data that has similar content and
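A minimal sketch of this retrieval step is given below, assuming a precomputed global descriptor per image (e.g., GIST or colour histograms) and Euclidean nearest-neighbour search; the descriptor, the distance measure and the subset size k are assumptions rather than the paper's exact choices.

```python
import numpy as np

def retrieve_subset(query_feat, train_feats, k=200):
    """Return indices of the k training images whose global descriptors
    are closest to the query descriptor under Euclidean distance.
    query_feat: (d,) array; train_feats: (n_train, d) array."""
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    return np.argsort(dists)[:k]
```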

Experimental results and analysis

This section evaluates the proposed non-parametric SCLP approach on the widely used SIFT Flow dataset and the more recent PASCAL-Context dataset. On the SIFT Flow dataset, we compare the classification performance obtained with different numbers of retrieved images, as well as the performance on rare classes. The performance is also compared with that of state-of-the-art approaches on both datasets.

Conclusions and future work

This paper presents a new non-parametric approach that utilizes SCLP contextual features for scene parsing on real-world data. A major novelty of the approach lies in the extraction of non-parametric SCLP contextual features for each query image from a small image subset retrieved from the training data. The image subset often shares similar scene content and spatial layout with the query image. Therefore, the non-parametric SCLP is able to reflect more reliable and effective contextual

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (45)

  • Farabet, C., et al. Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. (2013).
  • Felzenszwalb, P., et al. Efficient graph-based image segmentation. Int. J. Comput. Vis. (2004).
  • Fu, J., et al. Dual attention network for scene segmentation.
  • Fulkerson, B., et al. Class segmentation and object localization with superpixel neighborhoods.
  • Gould, S., et al. Multi-class segmentation with relative location prior. Int. J. Comput. Vis. (2008).
  • Hanchuan, P., et al. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. (2005).
  • Hung, W.-C., et al. Scene parsing with global context embedding. (2017).
  • Jimei, Y., et al. Context driven scene parsing with attention to rare classes.
  • Kang, Y., et al. Multiband image segmentation and object recognition for understanding road scenes. IEEE Trans. Intell. Transp. Syst. (2011).
  • Lei, Z., et al. Image segmentation with a unified graphical model. IEEE Trans. Pattern Anal. Mach. Intell. (2010).
  • Lempitsky, V., et al. Pylon model for semantic segmentation. Adv. Neural Inf. Process. Syst. (2011).
  • Lin, D., et al. ZigZagNet: Fusing top-down and bottom-up context for object segmentation.

