Non-parametric spatially constrained local prior for scene parsing on real-world data
Introduction
Scene parsing aims to assign every pixel of a query image to its correct semantic category, such as bus, road, sky and tree. It plays a central role in image content understanding and has the potential to support applications in a wide range of fields, such as object recognition, vehicle navigation, and risk identification. Although substantial progress has been made in recent years, scene parsing on complicated real-world data remains a very challenging task, due to substantial variability in both object properties (e.g., number, type, appearance, size and location) and environmental conditions (e.g., over-exposure, under-exposure, glare and illumination variations). Thus, developing techniques that can robustly represent the most discriminative characteristics of all objects in the scene, while retaining high robustness against environmental variations, has been a primary focus in this field.
Context about scene type and spatial layout provides very useful information for scene parsing. It is general knowledge that some objects (e.g., car and road, grass and cow, sea and sand) are likely to co-occur in the same scene, whereas other objects (e.g., sea and computer, boat and train) are unlikely to co-exist. In addition, the spatial layout conveys important contextual information about where objects are located within the scene. For instance, sky has a high probability of appearing in the top part of a city street scene, while road is likely to appear at the bottom part of the scene. Fig. 1 shows three examples of real-world scenes with such contextual information. Thus, prior context about object co-occurrence statistics and spatial locations can potentially be used to improve scene parsing results.
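The co-occurrence statistics mentioned above can be estimated by counting, over the training label maps, how often each pair of classes appears in the same image. The following is a minimal sketch of that idea; the class indices, function name and toy label maps are illustrative assumptions, not part of the paper:

```python
import numpy as np

NUM_CLASSES = 4  # hypothetical indices: 0=sky, 1=road, 2=car, 3=tree

def cooccurrence_prior(label_maps):
    """Count how often each pair of classes appears in the same image,
    then normalise into conditional probabilities P(b present | a present)."""
    counts = np.zeros((NUM_CLASSES, NUM_CLASSES))
    for labels in label_maps:
        present = np.unique(labels)
        for a in present:
            for b in present:
                counts[a, b] += 1
    row_sums = counts.diagonal().reshape(-1, 1)  # per-class image counts
    return counts / np.maximum(row_sums, 1)

# Toy training set: two "street" scenes and one "park" scene
maps = [
    np.array([[0, 0], [1, 2]]),  # sky, road, car
    np.array([[0, 0], [1, 1]]),  # sky, road
    np.array([[0, 3], [3, 3]]),  # sky, tree
]
prior = cooccurrence_prior(maps)
print(prior[1, 2])  # P(car | road) on the toy data -> 0.5
```

On this toy data, car co-occurs with road in one of the two road-containing images, so the estimated conditional probability is 0.5.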
Towards this aim, considerable effort has been invested in the literature, resulting in various types of contextual features such as absolute location (Shotton et al., 2009), relative location (Gould et al., 2008), directional spatial relationships (Jimei et al., 2014, Singhal et al., 2003), and object co-occurrence statistics (Micusik and Kosecka, 2009), as well as different types of graphical models such as Conditional Random Fields (CRFs) (Batra et al., 2008, Lei and Qiang, 2010) and Markov Random Fields (MRFs) (Xiaofeng et al., 2012), and deep learning models such as the Convolutional Neural Network (CNN), Fully Convolutional Network (FCN) (Shelhamer et al., 2017), DeepLab (Chen et al., 2018), Residual Network (ResNet) (Wu et al., 2019) and Context-contrasted Gated Boundary Network (CGBNet) (Ding et al., 2020). Although promising results have been achieved, most of these approaches still suffer from two drawbacks:
- (1)
All training data are treated as equally important when collecting contextual features or learning prediction models. However, in real-world scenarios, it is often the case that only a small portion of the training data shares similar characteristics with any individual query image. Rare classes also frequently occur in realistic data and can have a large impact on scene parsing results (Shuai et al., 2018).
- (2)
Limited capacity for capturing long- and short-range context, both of which provide essential information for scene parsing. Context-aware features are essential for handling large variations in scene content (Wang et al., 2017).
To address the drawbacks above, this paper presents the non-parametric Spatially Constrained Local Prior (SCLP) for scene parsing on complicated real-world data. It first retrieves a subset of training images similar to a query image and then collects query-specific SCLP contextual features from the retrieved subset, representing more specific and useful prior context about inter-class correlations for that query image. The SCLP contextual features are further integrated with visual features in a decision-level fusion strategy to predict the class label of every superpixel in the query image. The power of the non-parametric SCLP lies in its ability to represent both global and local context from the most similar training subset, separately for each query image, and to integrate contextual and visual features to improve the prediction results. It shows state-of-the-art performance on the SIFT Flow and PASCAL-Context benchmark datasets.
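The decision-level fusion step described above can be illustrated with a short sketch. Assuming per-superpixel class probabilities from a visual classifier and from the SCLP context are already available, one simple fusion rule is a weighted sum followed by an argmax; the weight `alpha` and the function name are hypothetical, not taken from the paper:

```python
import numpy as np

def fuse_predictions(p_visual, p_context, alpha=0.6):
    """Weighted sum of visual and contextual class probabilities,
    followed by argmax to obtain a class label per superpixel."""
    fused = alpha * p_visual + (1.0 - alpha) * p_context
    return fused.argmax(axis=1)

# Toy example: 2 superpixels, 3 classes
p_vis = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.5, 0.3]])
p_ctx = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.2, 0.7]])
labels = fuse_predictions(p_vis, p_ctx)
print(labels)  # context shifts both superpixels away from the visual argmax
```

Here the contextual prior overrides the visual classifier for both superpixels, which is exactly the situation where prior context helps ambiguous visual evidence.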
The main objectives of this paper are to (1) present a novel way of scene parsing on real-world data, which is particularly important for a wide range of practical applications. For instance, the proposed approach has already been applied to the segmentation and classification of objects (such as trees and grasses) for roadside fire risk prediction on video data from the state roads in central Queensland, Australia; (2) validate the effectiveness of non-parametric contextual features, which are generated specifically for every query image based on a training subset that has similar scene context to the query image; and (3) provide a simple yet effective framework for the extraction and integration of both global and local context with visual features.
Compared with widely adopted deep learning techniques such as CNN and its variants, the proposed approach has three unique characteristics: (1) CNN uses convolutional and pooling layers to progressively extract more abstract patterns from the image and thus emphasizes local context, while our approach extracts global and local context separately in two specifically designed ways and is therefore able to consider both long- and short-range context. (2) CNN integrates visual and contextual features in a unified framework that often requires training a large number of system parameters, while our approach provides a simple framework involving only a couple of system parameters and is much easier to train and adjust. (3) CNN treats all images as equally important during training and thus often has difficulty labelling rare classes due to insufficient training samples for those classes, while our approach collects query-specific contextual features from a retrieved training subset to represent more specific and effective context tailored to every query image, including those containing rare classes.
The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 introduces the proposed non-parametric SCLP approach. The experiments are presented in Section 4 and finally Section 5 draws the conclusions.
Section snippets
Related work
Early approaches (Shotton et al., 2009, Bosch et al., 2007, Kang et al., 2011) to scene parsing are often designed based on a set of visual features that are extracted at a pixel or region (e.g. patch and superpixel) level. The extracted features are then taken as inputs into a prediction algorithm to obtain the class label of every pixel or region. In these approaches, two critical tasks are: (a) extracting visual features that are capable of capturing the most discriminative characteristics
Proposed non-parametric SCLP approach
Fig. 2 shows an overview of the system framework for the proposed non-parametric SCLP approach for scene parsing. For a given query image, the approach processes it in three main processing streamlines: (1) non-parametric SCLP extraction and prediction; (2) visual feature extraction and prediction; and (3) prediction fusion and voting.
For non-parametric SCLP extraction, an image retrieval process is introduced to retrieve a subset of images from the training data that has similar content and
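The retrieval step above can be sketched as a nearest-neighbour search over global image descriptors. The following is a minimal illustration only; the feature type (a 64-D global descriptor such as a colour histogram or GIST-style vector), the function name and the value of k are assumptions, not details from the paper:

```python
import numpy as np

def retrieve_subset(query_feat, train_feats, k=3):
    """Return indices of the k training images closest to the query
    in Euclidean distance over global image features."""
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
train = rng.random((100, 64))   # 100 training images, 64-D global features
query = train[7] + 0.01         # a query closely resembling training image 7
subset = retrieve_subset(query, train, k=3)
```

The retrieved indices then define the query-specific training subset from which the SCLP contextual features are collected.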
Experimental results and analysis
This section evaluates the proposed non-parametric SCLP approach on the widely used SIFT Flow dataset and the more recent PASCAL-Context dataset. On the SIFT Flow dataset, we compare the classification performance obtained with different numbers of retrieved images, as well as the performance on rare classes. The performance is also compared against state-of-the-art approaches on both datasets.
Conclusions and future work
This paper presents a new non-parametric approach that utilizes SCLP contextual features for scene parsing on real-world data. A major novelty of the approach lies in the extraction of non-parametric SCLP contextual features for each query image from a small image subset retrieved from the training data. The image subset often shares similar scene content and spatial layouts to the query image. Therefore, the non-parametric SCLP is able to reflect more reliable and effective contextual
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (45)
- et al., Multi-hypothesis contextual modeling for semantic segmentation, Pattern Recognit. Lett. (2019)
- et al., Segmentation and description of natural outdoor scenes, Image Vis. Comput. (2007)
- et al., Leveraging semantic segmentation with learning-based confidence measure, Neurocomputing (2019)
- et al., Wider or deeper: Revisiting the resnet model for visual recognition, Pattern Recognit. (2019)
- et al., Learning class-specific affinities for image labelling
- et al., Semantic segmentation with second-order pooling
- et al., Nonparametric scene parsing: Label transfer via dense scene alignment
- et al., DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell. (2018)
- et al., Convolutional feature masking for joint object and stuff segmentation
- et al., Semantic segmentation with context encoding and multi-path decoding, IEEE Trans. Image Process. (2020)
- Learning hierarchical features for scene labeling, IEEE Trans. Pattern Anal. Mach. Intell.
- Efficient graph-based image segmentation, Int. J. Comput. Vis.
- Dual attention network for scene segmentation
- Class segmentation and object localization with superpixel neighborhoods
- Multi-class segmentation with relative location prior, Int. J. Comput. Vis.
- Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell.
- Scene parsing with global context embedding
- Context driven scene parsing with attention to rare classes
- Multiband image segmentation and object recognition for understanding road scenes, IEEE Trans. Intell. Transp. Syst.
- Image segmentation with a unified graphical model, IEEE Trans. Pattern Anal. Mach. Intell.
- Pylon model for semantic segmentation, Adv. Neural Inf. Process. Syst.
- ZigZagNet: Fusing top-down and bottom-up context for object segmentation
Cited by (1)
- Non-parametric scene parsing: Label transfer methods and datasets, Computer Vision and Image Understanding (2022). Citation excerpt: "The label transfer was formulated as a multi-objective function optimized using the iterated conditional modes (ICM) method. Lately, Zhang (2020) developed the non-parametric Spatially Constrained Local Prior (SCLP) contextual features for parsing of real-world scenes. The global and local SCLP matrices contained prior information about co-occurrences and were designed to capture long-range and short-range context, leading to a significant increase in accuracy."