Scalable object-based video retrieval in HD video databases

https://doi.org/10.1016/j.image.2010.04.004

Abstract

With the exponentially growing quantity of video content in various formats, including the increasingly popular HD (High Definition) video and cinematographic content, the problem of efficient indexing and retrieval in video databases becomes crucial. Although efficient methods have been designed for frame-based queries on video with local features, object-based indexing and retrieval attracts the attention of the research community with the appealing possibility of formulating meaningful queries on semantic objects. In the case of HD video, the principle of scalability addressed by current compression standards is of great importance: it allows indexing and retrieval at the lower resolutions available in the compressed bit-stream. The wavelet decomposition used in the JPEG2000 standard provides this property. In this paper, we propose a scalable indexing of video content by objects. First, a method for scalable moving object extraction is designed. Using the wavelet data, it relies on the combination of robust global motion estimation with morphological colour segmentation at a low spatial resolution; the result is then refined following the scalable order of the data. Second, a descriptor is built on the extracted objects only. This descriptor is based on multi-scale histograms of the wavelet coefficients of objects. Comparison with SIFT features extracted on segmented object masks gives promising results.

Introduction

With the exponentially growing quantity of video content in various formats, including the increasingly popular HD (High Definition) video and cinematographic content, the problem of efficient indexing and retrieval in video databases becomes crucial. After the success of early indexing and retrieval attempts on video considering global key-frame descriptors [1], [2], the main stream of research nowadays focuses on indexing and retrieval of image and video content with sparse local features. One of the first noticeable papers in this area introduced characteristic feature points for tracking purposes [3]: the points were detected in a single video frame by searching for a strong minimal eigenvalue of the local gradient matrix. This technique was then extended to the temporal dimension for video retrieval in [4]. Nevertheless, for indexing and retrieval purposes, the most widely used sparse feature points and associated descriptors, invariant to luminance, scale and rotation changes, are the SIFT features proposed in the fundamental work by Lowe [5]. Since then, various improvements of this approach have come out, and a good survey of these techniques is presented in [6].

Features sparsely distributed in the frames do not convey semantics in terms of generic objects evolving in a video. Hence, several recent research works try to link sparse features with spatio-temporal objects. Thus, in [7], MSERs (Maximally Stable Extremal Regions, [8]), together with SIFT features computed on them, are used to train object models in video and to recognize them.

Despite the originality of these ideas and the relatively good performance of the proposed methods, these approaches do not start with an explicit object extraction from video sequences. In some sense, they try to avoid solving the classical “chicken and egg” problem of extracting semantic objects from video.

At the same time, a significant effort of the research community related to the elaboration of the MPEG4 [9], MPEG7 [10] and JPEG2000 [11] standards was devoted to the development of automatic segmentation methods for extracting objects from video content. The results of these methods, e.g. [12], [13], [14], while not always ensuring an ideal correspondence of extracted object borders to visually observed contours, were sufficiently good for visual recognition of the extracted object areas, for fine-tuning of encoding parameters and for content description.

Hence, we are strongly convinced that the paradigm consisting in segmenting objects first and then representing them in adequate feature spaces for object-based indexing and retrieval of video remains a promising road to success and a good alternative to local modelling of content by feature points.

Nowadays, it is impossible to imagine visual content exchanged and stored in a raw format. The standard resolution of HD video (1080p) or filmic content (4K) makes raw video content a tremendous mass of data that cannot be handled without compression. The specifications of the DCI (Digital Cinema Initiative, LLC [15]) made MJPEG2000 [16] the digital cinema compression standard. Furthermore, MJPEG2000 is becoming the common standard for archiving [17] cultural cinematographic heritage, with a better quality/compression trade-off than previously used solutions. Nowadays, data coded with this standard constitute databases of audio-visual content so large that access to the digital content requires the development of automatic methods for indexing and retrieval of this compressed content. The metadata resulting from the indexing process are very heterogeneous across different systems and are shaped with MPEG7 [10] and Dublin Core [18]. This is why the JPSearch project [19] aims at establishing standardized interfaces for an abstract image retrieval framework. One of the focuses of JPSearch is the embedding of metadata in image data encoded in the JPEG2000 standard. Hence, the latest research works link content encoding and indexing in the same framework, be it for images or video [20]. In our work we also address this issue and propose object-based indexing of video content jointly with MJPEG2000 compression. The present work continues the RI (Rough Indexing) paradigm we developed for MPEG compressed video [21].

New compression standards such as MJPEG2000 or H.264/SVC [22] have the interesting property of scalability. Hence, a codestream formatted with one of these standards can be sent to users with different processing capabilities and network bandwidths by selectively transmitting and decoding the relevant part of the codestream. The extracted sub-streams correspond to a reduced spatial resolution (spatial scalability), a reduced temporal resolution (temporal scalability), or a reduced quality for a given spatio-temporal resolution and/or a reduced bit-rate (SNR scalability). This scalability property gives an exciting perspective of video retrieval at various resolution levels, hence reducing the computational workload and allowing for “cross-resolution” retrieval. The contribution of our work is two-fold. First, in the framework of object-based “scalable” indexing and retrieval of video content, we propose an object extraction/segmentation method operating on a scalable MJPEG2000 compressed stream. The extraction process has to be performed with only the part of the stream available after transmission, i.e. the segmentation should proceed in one direction only, from low resolution to high resolution. Second, in the paradigm of “object-based” video content retrieval, we propose object descriptors for a “scalable retrieval” of content enabling search at reduced resolution levels.
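The spatial-scalability principle can be illustrated with a minimal numpy sketch. A Haar filter bank is used here as a stand-in for the JPEG2000 wavelet filters, and the function names are ours: keeping only the LL band at each decomposition level yields the frame at a reduced spatial resolution without touching the detail subbands.

```python
import numpy as np

def haar2d(x):
    """One analysis level of a 2-D Haar transform (a stand-in for the
    JPEG2000 filter bank): returns the LL, LH, HL, HH subbands."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # vertical average
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # vertical detail
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    hl = (a[:, 0::2] - a[:, 1::2]) / 2.0
    lh = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def low_resolution(frame, K):
    """Spatially scalable access: recurse on the LL band K times,
    i.e. keep only the image available at resolution level k = K."""
    ll = frame.astype(float)
    for _ in range(K):
        ll, _, _, _ = haar2d(ll)
    return ll

frame = np.arange(16 * 16, dtype=float).reshape(16, 16)
print(low_resolution(frame, 2).shape)  # (4, 4)
```

Each level halves both dimensions, so a retrieval system can work on a K-times-reduced frame extracted directly from the truncated codestream.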

Thus, in our vision, the first step in object-based indexing is foreground object extraction. Several approaches have been proposed in the past, and most of them can be roughly classified either as intra-frame segmentation based or as motion segmentation based methods. In the former approach, each frame of the video sequence is independently segmented into regions of homogeneous intensity or texture, using traditional image segmentation techniques [23], while in the latter approach, a dense motion field is used for segmentation and pixels with a homogeneous motion field are grouped together [24]. Since both approaches have their drawbacks, most object extraction tools combine spatial and temporal segmentation techniques [25], [26]. The challenge resides in applying this scheme without decompressing the video, as low-level content descriptors, such as coefficients in the transform domain, can be efficiently re-used for the content analysis task [21].
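The combination of spatial and temporal segmentation can be sketched as follows. This is a toy illustration, not the exact method of the paper: spatial regions (e.g. from a colour segmentation) are validated against a binary motion mask by a simple voting rule, so that only regions with a sufficient fraction of moving pixels survive.

```python
import numpy as np

def moving_regions(labels, motion_mask, tau=0.5):
    """Toy spatio-temporal combination: keep each spatial region whose
    fraction of motion-detected pixels exceeds the threshold tau."""
    out = np.zeros_like(motion_mask, dtype=bool)
    for r in np.unique(labels):
        region = labels == r
        if motion_mask[region].mean() > tau:
            out |= region
    return out

# Two spatial regions; region 1 has 3 of 4 pixels flagged as moving.
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
motion = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 0]], dtype=bool)
print(moving_regions(labels, motion))
```

The voting rule compensates for both weaknesses mentioned above: spatial regions give accurate borders, while the motion mask selects which regions belong to the foreground.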

In the case of the MJPEG2000 standard, one difficulty is that the standard does not provide motion descriptors, so motion has to be estimated. The ME (Motion Estimation) problem in the wavelet domain has been largely studied in the literature [27]. We also proposed a first approach for motion estimation on the JPEG2000 wavelet pyramid in [28]. In this paper, we present improvements to it that reduce the negative effect of the shift-variance property of the wavelet transform.
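The global motion estimation step can be sketched in its simplest form. This is a simplified illustration with names of our choosing, not the paper's robust estimator: a 6-parameter affine global motion model d = A·p + t is fitted by least squares to sampled displacement vectors (in the paper, measured on wavelet-domain data and made robust to outliers).

```python
import numpy as np

def fit_affine_gme(pts, disp):
    """Least-squares fit of a 6-parameter affine global motion model
    d = A p + t to sampled displacement vectors (simplified sketch;
    a robust estimator would iteratively down-weight outliers)."""
    x, y = pts[:, 0], pts[:, 1]
    M = np.stack([x, y, np.ones_like(x)], axis=1)
    params_x, *_ = np.linalg.lstsq(M, disp[:, 0], rcond=None)
    params_y, *_ = np.linalg.lstsq(M, disp[:, 1], rcond=None)
    return params_x, params_y  # (a1, a2, tx), (a3, a4, ty)

# A pure translation (tx, ty) = (2, -1) is recovered exactly.
pts = np.array([[0, 0], [4, 0], [0, 4], [4, 4]], dtype=float)
disp = np.tile([2.0, -1.0], (4, 1))
px, py = fit_affine_gme(pts, disp)
print(px, py)
```

Pixels whose displacement deviates strongly from the fitted global model are candidate foreground pixels, which is what links motion estimation to object extraction.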

A second step in object-oriented indexing consists in defining features on the objects effectively found. Following the RI paradigm, these features have to be defined in the compressed domain, i.e. the wavelet domain. Several indexing techniques in the wavelet domain exist for JPEG2000 compressed still images. Among them, we can cite the histogram-based techniques: a histogram is computed for each subband and the comparison is made subband by subband [29]. The main drawback of such a technique is that it works only under limited camera operations. To reduce complexity and improve robustness to illumination changes, [30] proposes modelling the histograms with a generalized Gaussian density function. Other indexing techniques are texture oriented [31]. In this work, we propose a global histogram-based descriptor for objects in the wavelet domain at different levels of spatial scalability. We also compare the efficiency of this descriptor with an approach consisting in matching local features extracted from the object area after the segmentation process.
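A subband-histogram descriptor of the kind cited above can be sketched in a few lines (an illustrative toy version with our own function names, not the exact descriptor proposed in Section 4): one normalized histogram of wavelet coefficients per subband, concatenated, with the subband-by-subband comparison reducing to an L1 distance on the concatenation.

```python
import numpy as np

def subband_histogram_descriptor(subbands, bins=16, rng=(-1.0, 1.0)):
    """Toy subband-histogram descriptor: one normalized histogram of
    wavelet coefficients per subband, concatenated into one vector."""
    hists = []
    for sb in subbands:
        h, _ = np.histogram(sb, bins=bins, range=rng)
        hists.append(h / max(h.sum(), 1))
    return np.concatenate(hists)

def l1_distance(d1, d2):
    """Subband-by-subband comparison as an L1 distance on the
    concatenated normalized histograms."""
    return np.abs(d1 - d2).sum()

gen = np.random.default_rng(0)
subbands = [gen.normal(0.0, 0.2, (32, 32)) for _ in range(3)]
d = subband_histogram_descriptor(subbands)
print(d.shape, l1_distance(d, d))
```

Restricting the histograms to coefficients inside the object mask, and computing them at each resolution level, turns this generic still-image idea into the object-based scalable descriptor pursued in this paper.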

The paper is organized as follows. The general framework of our approach is summarized in Section 2. Section 3 describes the spatio-temporal object extraction we designed and Section 4 the object-based scalable statistical descriptor we propose. Section 5 presents object-based search scenarios for different tasks of video retrieval, such as retrieval of post-production effects and retrieval of video clips. Results are given in Section 6. Section 7 concludes the paper and outlines the perspectives of this work.

Section snippets

Notations

Before describing the proposed general framework of object-based scalable indexing, the notations and abbreviations used in the remainder of the paper are presented. The wavelet transform used in the JPEG2000 standard leads to a multi-scale representation. We denote by K+1 the total number of decomposition levels; k is the current level, with k=0 the original image and k=K the lowest resolution level. At a given level k, a wavelet frame is a combination of four subbands obtained by
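Under this notation, each level k halves both spatial dimensions, so the resolution available at each level of the pyramid follows directly from k (a trivial sketch of the convention; the helper name is ours):

```python
def subband_size(width, height, k):
    """Spatial extent of the subbands at decomposition level k,
    with k = 0 the full resolution and k = K the lowest level
    (each level halves both dimensions)."""
    return width >> k, height >> k

# Resolutions available for a 1080p HD frame with K = 3.
K = 3
for k in range(K + 1):
    print(k, subband_size(1920, 1080, k))
```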

Scalable object extraction in the JPEG2000 wavelet domain

Scalable object extraction in the JPEG2000 wavelet domain follows the general principles we previously developed in [32]. This section summarizes the approach and presents the minor changes that have been made to improve the quality of the results.

Scalable object-based descriptors

After the previous object extraction step, a video is now represented as a set of objects, i.e. a multi-scale description of the objects of interest is associated with each frame of the video (see Fig. 3).

Even if the amount of information available has been reduced, the objects remain meaningful and can lead to a high-level semantic interpretation.

Object-based retrieval scenarios

Even if the results of object extraction are acceptable, the segmentation process remains unstable and strongly depends on the relative amplitude between background and object motions. Unless more complex tools, such as multiple-object tracking, are applied, the extraction of full objects in every frame of a clip is not guaranteed. Hence, in the current state of our work, we propose the following query scenarios.

Database for the benchmarking

In order to assess the robustness and efficiency of the proposed features to various transformations of video, such as affine transformations of the image plane, partial masking, fraudulent post-production and so on, one should have access to a content database containing transformed videos and annotated ground truth. For content-based image retrieval, such publicly available databases are numerous. For instance, the benchmark [41] contains four versions of the same still image which has undergone

Conclusion and perspectives

In this paper we proposed a complete method for scalable object-based retrieval of HD video content in the MJPEG2000 compressed domain. Scalability, the exciting property inherent to current image and video compression standards, is fully exploited in this framework.

First of all, the extraction of objects from the video stream is scalable: at each spatial resolution level we obtain a relevant object mask after performing all the necessary operations, such as motion estimation, initial morphological

Acknowledgement

This work is supported by French National grant ANR-06-MDCA-010 "ICOS-HD".

References (46)

  • J. Benois-Pineau et al.

    Hierarchical segmentation of video sequences for content manipulation and adaptive coding

    Signal Processing

    (1998)
  • Y. Wang et al.

    Multimedia content analysis using both audio and visual clues

    IEEE Signal Processing Magazine

    (2000)
  • Y. Zhai, J. Liu, X. Cao, A. Basharat, A. Hakeem, S. Ali, M. Shah, Video understanding and content-based retrieval, in:...
  • J. Shi et al.

    Good features to track

    Proceedings of the Conference on Computer Vision and Pattern Recognition

    (1994)
  • I. Laptev et al.

    On space-time interest points

    International Conference on Computer Vision

    (2003)
  • D.G. Lowe

    Distinctive image features from scale-invariant key-points

    International Journal of Computer Vision

    (2004)
  • K. Mikolajczyk et al.

    A performance evaluation of local descriptors

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2005)
  • D. Liu et al.

    DISCOV: a framework for discovering objects in video

    IEEE Transactions on Multimedia

    (2008)
  • J. Matas, O. Chum, M. Urban, T. Pajdla, Robust wide baseline stereo from maximally stable extremal regions, in...
  • ISO/IEC 14496-2: 2004 Information technology – Coding of audio-visual objects – Part 2: Visual...
  • ISO/IEC 15938-3: 2002 Information technology – Multimedia content description interface – Part 3: Visual...
  • ISO/IEC 15444-1: 2004 Information technology – JPEG 2000 image coding system: Core coding...
  • P. Salembier et al.

    Segmentation-based video coding system allowing the manipulation of objects

    IEEE Transactions on Circuits and Systems for Video Technology

    (1997)
  • S. Jehan-Besson et al.

    A 3-step algorithm using region-based active contours for video objects detection

    EURASIP Journal on Applied Signal Processing

    (2002)
  • Digital Cinema Initiative, url: 〈http://www.dcimovies.com/〉, Last seen:...
  • ISO/IEC 15444-3: 2007 Information technology – JPEG 2000 image coding system: Motion...
  • G. Pearson et al.

    An evaluation of Motion JPEG2000 for video archiving

    Proc. Archiving

    (2005)
  • The Dublin Core Metadata Initiative (DCMI). Dublin core metadata element set—version 1.1: Reference description....
  • F. Dufaux, M. Ansorge, T. Ebrahimi, Overview of Jpsearch a standard for image search and retrieval, in: Proceeding of...
  • N. Adami, A. Boschetti, R. Leonardi, P. Migliorati, Embedded Indexing in Scalable Video Coding, in: Proceeding of the...
  • F. Manerba et al.

    Extraction of foreground objects from mpeg2 video stream in rough indexing framework, Storage and Retrieval Methods and Applications for Multimedia

    SPIE

    (2004)
  • ITU-T H.264, Advanced video coding for generic audiovisual services, November...
  • P. Salembier et al.

    Region-based representations of image and video: segmentation tools for multimedia services

    IEEE Transactions on Circuits and Systems for Video Technologies

    (1999)