Semantic image segmentation using low-level features and contextual cues

https://doi.org/10.1016/j.compeleceng.2013.04.017

Highlights

  • Gist representations and super-pixel histograms are used as low-level features.

  • Object co-occurrence and spatial layout relations are used as contextual cues.

  • Low-level features and contextual cues are fused in the inference framework.

Abstract

Semantic image segmentation aims to partition an image into non-overlapping regions and assign a pre-defined object class label to each region. In this paper, a semantic method combining low-level features and high-level contextual cues is proposed to segment natural scene images. The proposed method first takes the gist representation of an image as its global feature. The image is then over-segmented into many super-pixels, and histogram representations of these super-pixels are used as local features. In addition, co-occurrence and spatial layout relations among object classes are exploited as contextual cues. Finally, the features and cues are integrated into an inference framework based on a conditional random field by defining specific potential terms and introducing weighting functions. The proposed method has been compared with state-of-the-art methods on the MSRC database, and the experimental results show its effectiveness.

Introduction

Image segmentation, which aims to partition an image into non-overlapping regions, is an important research topic in the fields of image processing and pattern recognition. Image segmentation has been extensively studied and many methods have been proposed [1], [2], [3]. Recently, semantic segmentation has attracted much attention [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18]. Compared with conventional image segmentation, semantic segmentation not only segments an image into non-overlapping parts but also assigns a pre-defined label, such as sky, road, or car, to each pixel. Semantic segmentation remains a challenging problem because of the diversity of natural scenes and the variability of object class instances. Discriminative representations of object classes using low-level image features are important for semantic segmentation. To obtain more discriminative representations, low-level global features and local features should be exploited simultaneously [14], [15], [16], [17]. In addition, various kinds of high-level contextual cues, including information about detailed interactions among object classes in the scene, have been proven effective in semantic segmentation [4], [5], [6].

There has been significant progress in semantic segmentation [10], [11], [12], [13], [14], [15], [16], [17], [18]. Among current studies, the inference framework based on conditional random field (CRF) has been widely used since various features and cues can be readily integrated into the framework. Using the CRF based method, the segmentation task is formulated as a labeling problem, and the final result is obtained by optimizing an objective function. Shotton et al. [10] proposed to exploit textons for automatic visual recognition and semantic segmentation. The efficient classification and the accurate segmentation were achieved by incorporating the unary potential terms in the CRF based inference framework. Jiang and Tu [13] presented three components and illustrated them on the auto-context algorithm for improving the speed and accuracy of the CRF based model: a new multi-class classifier, a scale-space approach, and a region based voting scheme. Ladicky et al. [14] proposed a hierarchical CRF model to fuse features at different levels by introducing a high-order potential term. Boix et al. [15] put forward the harmony potential for image labeling problems. The harmony potential encoded any possible combination of labels, penalizing only unlikely combinations of classes. Zhang and Ji [16] proposed a unified graphical model for semantic segmentation. The method employed CRF to model the spatial relationships among image super-pixel regions and introduced a multilayer Bayesian network to model the causal dependencies among different image entities. Fu and Qiu [17] introduced the concept of object-like regions and proposed to incorporate the semantic features generated by CRF into the low-level spectral segmentation module.
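For reference, the labeling problem mentioned above can be written in the generic pairwise CRF form below. This is the standard textbook objective, not the specific potentials of any cited method; the symbols (x for the label assignment, V for the set of sites such as pixels or super-pixels, N for the set of neighboring site pairs, and the potentials psi) are notation chosen here for illustration.

```latex
E(\mathbf{x}) = \sum_{i \in V} \psi_i(x_i)
              + \sum_{(i,j) \in N} \psi_{ij}(x_i, x_j),
\qquad
\mathbf{x}^{*} = \arg\min_{\mathbf{x}} E(\mathbf{x})
```

The unary term scores how well label x_i fits the features observed at site i, and the pairwise term penalizes label disagreements between neighboring sites. High-order models add terms defined over larger groups of sites.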

To the best of our knowledge, no existing method has considered low-level global features, local features and high-level contextual cues simultaneously in the CRF based inference framework. In addition, global features are usually integrated into the inference framework by introducing high-order potential terms in current CRF based methods [14], [15]. The introduction of high-order terms results in a complex inference process. In this paper, the CRF based inference framework is followed and an effective method is proposed. Fig. 1 illustrates the diagram of the proposed method. The proposed method computes gist representations of images as global features. After an image is over-segmented into many super-pixels, i.e. small homogeneous regions, using bottom-up methods, the color and texture histograms of the super-pixels are calculated as local features. Two kinds of contextual cues, i.e. co-occurrence and spatial layout relations among object classes, are exploited to help obtain accurate segmentation results. These low-level features and high-level contextual cues are integrated into the CRF based inference framework by defining specific potential terms and introducing weighting functions. The main contribution of the proposed method is the simultaneous utilization of low-level features and high-level contextual cues; furthermore, unlike [14], [15], the integration of these features and cues does not introduce high-order terms.

The remainder of this paper is organized as follows: Section 2 introduces low-level features. Section 3 discusses high-level contextual cues. The integration of features and cues is detailed in Section 4. Experiments are conducted to verify the effectiveness of the proposed method in Section 5. Finally, concluding remarks are given in Section 6.

Section snippets

Low-level features

Low-level features used in the proposed method are composed of gist representations and histogram representations. Gist features are holistic representations of images and they are used as global features. The global gist feature is computed on the whole image which corresponds to a full resolution representation. Histograms are compact and discriminative representations of super-pixels and they are used as local features. In the training stage, the local histogram features are computed on
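As an illustration of local histogram features, the sketch below computes a normalized color histogram for each super-pixel. It is a minimal example under stated assumptions, not the authors' implementation: the function name `superpixel_color_histograms`, the choice of 8 bins per channel, and the use of raw RGB values are all hypothetical, and the super-pixel label map `segments` is assumed to come from any bottom-up over-segmentation method.

```python
import numpy as np


def superpixel_color_histograms(image, segments, bins=8):
    """Return an (n_superpixels, 3 * bins) array of normalized color histograms.

    image:    (H, W, 3) uint8 array.
    segments: (H, W) integer array of super-pixel ids in 0..K-1
              (assumed to be produced by a separate over-segmentation step).
    """
    n_segments = int(segments.max()) + 1
    features = np.zeros((n_segments, 3 * bins), dtype=np.float64)
    for sp in range(n_segments):
        mask = segments == sp
        if not mask.any():
            continue  # unused id: leave an all-zero row
        for c in range(3):
            # Histogram of one color channel restricted to this super-pixel.
            hist, _ = np.histogram(image[..., c][mask], bins=bins, range=(0, 256))
            features[sp, c * bins:(c + 1) * bins] = hist
        # L1-normalize so that super-pixels of different sizes are comparable.
        features[sp] /= features[sp].sum()
    return features
```

Each row can then serve as the feature vector of one super-pixel for a classifier; texture histograms could be concatenated in the same way.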

High-level contextual cues

Co-occurrence and spatial layout relations among object classes are used as high-level contextual cues in the proposed method. The co-occurrence relations impose restrictions on the likelihood that object classes occur simultaneously in the same scene; for example, a cow is more likely to appear in a meadow than on a beach, and a car usually appears in the street rather than in a room. In order to obtain accurate results, these contextual cues should be considered in semantic
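One simple way to quantify co-occurrence relations is to count how often two classes appear together in annotated training images. The sketch below does this with plain relative frequencies. The exact statistic used in the paper is not given in this excerpt, so the function `cooccurrence_frequencies` and the toy label sets are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_frequencies(image_label_sets):
    """Fraction of images in which each unordered pair of classes appears together."""
    n_images = len(image_label_sets)
    if n_images == 0:
        return {}
    pair_counts = Counter()
    for labels in image_label_sets:
        # Sort so that each unordered pair is always stored as (a, b) with a < b.
        for a, b in combinations(sorted(set(labels)), 2):
            pair_counts[(a, b)] += 1
    return {pair: count / n_images for pair, count in pair_counts.items()}


# Hypothetical annotations: the set of classes present in each training image.
training_labels = [
    {"cow", "grass", "sky"},
    {"cow", "grass"},
    {"car", "road", "sky"},
    {"car", "road", "building"},
]
freq = cooccurrence_frequencies(training_labels)
print(freq[("cow", "grass")])          # 0.5
print(freq.get(("car", "cow"), 0.0))   # 0.0
```

Pairs that never co-occur in training are simply absent from the result, which a labeling model could treat as a low prior for assigning both classes in the same image.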

Integration of low-level features and contextual cues

In this section, the integration of low-level features and contextual cues in the CRF based inference framework is detailed. Local features are integrated into the inference framework by defining conventional unary and pair-wise potential terms. Based on the assumption that the scene group label set is the same as the object class label set, global features are incorporated into the unary potential terms. Contextual cues are utilized by adding multiplicative weighting functions to the potential

Experiments and results

In this section, experiments are conducted to assess the performance of the proposed method. The computational time of the proposed method is also discussed. The average classification rate and the global classification rate are calculated to quantitatively evaluate the results. For each object class, the individual class classification rate is defined as the ratio of the number of correctly classified pixels to that of all the pixels of the class. The average classification rate is defined as
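The evaluation measures can be sketched in a few lines. The per-class rate below follows the definition stated above. The definitions of the average and global rates are cut off in this excerpt, so the code assumes the convention commonly used for MSRC results: the average rate is the mean of the per-class rates, and the global rate is the fraction of all pixels that are correctly labeled. The function name `classification_rates` is hypothetical.

```python
def classification_rates(true_labels, predicted_labels):
    """Return (per_class, average, global_rate) for two equal-length label sequences.

    per_class:   correctly classified pixels of a class / all pixels of that class.
    average:     mean of the per-class rates (assumed convention).
    global_rate: correctly classified pixels / all pixels (assumed convention).
    """
    if len(true_labels) != len(predicted_labels) or len(true_labels) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    correct, total = {}, {}
    for t, p in zip(true_labels, predicted_labels):
        total[t] = total.get(t, 0) + 1
        if t == p:
            correct[t] = correct.get(t, 0) + 1
    per_class = {c: correct.get(c, 0) / total[c] for c in total}
    average = sum(per_class.values()) / len(per_class)
    global_rate = sum(correct.values()) / sum(total.values())
    return per_class, average, global_rate


# Toy example: 6 sky pixels all correct, 2 cow pixels of which 1 is correct.
truth = ["sky"] * 6 + ["cow"] * 2
pred = ["sky"] * 6 + ["cow", "sky"]
per_class, average, global_rate = classification_rates(truth, pred)
print(per_class)    # {'sky': 1.0, 'cow': 0.5}
print(average)      # 0.75
print(global_rate)  # 0.875
```

The toy example shows why both numbers are reported: the global rate is dominated by large classes such as sky, while the average rate weights every class equally.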

Conclusions

In this paper, a semantic method combining low-level features and high-level contextual cues has been presented for segmentation of natural scene images. The low-level features include global gist representations and local histograms. The gist acts as a holistic representation of an image and provides global information about the whole image. Local histograms are compact and discriminative representations of super-pixels. The high-level contextual cues are composed of co-occurrence and spatial

Acknowledgments

This work is supported by the National Natural Science Fund of China (Grant Nos. 60632050, 9082004, 61202318), Natural Science Foundation of Fujian Province (Grant No. 2012D109), and the Technology Project of provincial University of Fujian Province (Grant No. JK2011040).

Chongbo Zhou received his B.S. degree in Electronics and Information Processing from Qufu Normal University, China, in 2002 and his M.S. degree in Circuits and Systems from Shanghai University, China, in 2005. He is currently pursuing his Ph.D. degree in Computer Science at Nanjing University of Science and Technology. His research interests include image segmentation and pattern recognition.

References (24)

  • Shotton J, Winn J, Rother C, Criminisi A. Textonboost: joint appearance, shape and context modeling for multi-class...
  • Shotton J, Johnson M, Cipolla R. Semantic texton forests for image categorization and segmentation. In: Proceedings of...

    Chuancai Liu is a full professor in the School of Computer Science and Engineering of Nanjing University of Science and Technology, China. He obtained his Ph.D. degree from China Ship Research Academy in 1997. His research interests include image processing, pattern recognition and computer vision.

    Reviews processed and recommended for publication to Editor-in-Chief by Associate Editor Dr. Eduardo Cabal-Yepez.
