Abstract
We introduce an automatic segmentation framework that
blends the advantages of color-, texture-, shape-, and
motion-based segmentation methods in a computationally feasible
way. A spatiotemporal data structure is first constructed for
each group of video frames, in which each pixel is assigned a
feature vector based on low-level visual information. Then, the
smallest homogeneous components, called volumes, are
expanded from selected marker points using an adaptive,
three-dimensional, centroid-linkage method. Self descriptors that
characterize each volume and relational descriptors that capture
the mutual properties between pairs of volumes are determined by
evaluating the boundary, trajectory, and motion of the volumes.
These descriptors are used to measure the similarity between
volumes, and similar volumes are then grouped into objects.
A fine-to-coarse clustering algorithm yields a multiresolution
object tree representation as the output of the segmentation.
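
The volume-growing step can be illustrated with a minimal sketch. It is not the implementation described in this paper: it assumes a single scalar feature per pixel instead of a feature vector, a fixed threshold instead of an adaptive one, and a 6-connected spatiotemporal neighborhood. The function name grow_volume and its parameters are hypothetical. A voxel joins the volume when its feature lies within the threshold of the volume's current centroid (the running mean of its members), which is the defining property of centroid linkage.

```python
from collections import deque


def grow_volume(video, seed, threshold):
    """Grow one volume from `seed` in a grid indexed as video[t][y][x].

    Assumptions (not from the paper): scalar features, a fixed
    threshold, and 6-connectivity across space and time.
    Returns the set of member voxels and the final centroid.
    """
    n_t, n_y, n_x = len(video), len(video[0]), len(video[0][0])
    t0, y0, x0 = seed
    members = {seed}
    total = float(video[t0][y0][x0])  # running sum of member features
    queue = deque([seed])
    # Two temporal neighbors and four spatial neighbors.
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
               (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    while queue:
        t, y, x = queue.popleft()
        for dt, dy, dx in offsets:
            nt, ny, nx = t + dt, y + dy, x + dx
            if not (0 <= nt < n_t and 0 <= ny < n_y and 0 <= nx < n_x):
                continue
            if (nt, ny, nx) in members:
                continue
            # Centroid linkage: compare against the current volume mean,
            # not against the neighboring voxel that was just visited.
            centroid = total / len(members)
            if abs(video[nt][ny][nx] - centroid) <= threshold:
                members.add((nt, ny, nx))
                total += video[nt][ny][nx]
                queue.append((nt, ny, nx))
    return members, total / len(members)
```

Because the centroid is updated after every accepted voxel, the result can depend on the order in which voxels are visited. The sketch uses breadth-first order from the marker point for simplicity.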