Abstract
The task of extracting a semantic video object
is split into two subproblems, namely, object
segmentation and region segmentation. Object
segmentation relies on a priori assumptions,
whereas region segmentation is data-driven and can be
solved in an automatic manner. These two subproblems
are not mutually independent, and they can benefit from
interactions with each other. In this paper, a framework
for such interaction is formulated. This representation
scheme based on region segmentation and semantic
segmentation is compatible with the view that image
analysis and scene understanding problems can be
decomposed into low-level and high-level tasks.
Low-level tasks pertain to region-oriented processing,
whereas the high-level tasks are closely related to
object-level processing. This approach emulates the
human visual system: what one sees in a scene
depends on the scene itself (region segmentation) as well
as on the cognitive task (semantic segmentation) at
hand. The higher-level segmentation results in a
partition corresponding to semantic video objects.
Semantic video objects do not usually have invariant
physical properties and the definition depends on the
application. Hence, the definition incorporates complex
domain-specific knowledge and is not easy to
generalize. For the specific implementation used in this
paper, motion is used as a clue to semantic information.
In this framework, an automatic algorithm is presented
for computing the semantic partition based on color
change detection. The change detection strategy is
designed to be immune to the sensor noise and local
illumination variations. The lower-level segmentation
identifies the partition corresponding to perceptually
uniform regions. These regions are derived by
clustering in an N-dimensional feature
space, composed of static as well as dynamic image
attributes. We propose an interaction mechanism between
the semantic and the region partitions which allows to
cope with multiple simultaneous objects. Experimental
results show that the proposed method extracts semantic
video objects with high spatial accuracy and temporal
coherence.