Scalable object-based video retrieval in HD video databases
Introduction
With the exponentially growing quantity of video content in various formats, including the popularisation of HD (High Definition) video and cinematographic content, the problem of efficient indexing and retrieval in video databases becomes crucial. After the success of early indexing and retrieval attempts on video using global key-frame descriptors [1], [2], the main stream of research nowadays focuses on indexing and retrieval of image and video content with sparse local features. One of the first noticeable papers in this area introduced "good features to track" [3] for tracking purposes: characteristic points were detected in a single video frame by searching for a strong minimal eigenvalue of the local gradient matrix. This technique was later extended to the temporal dimension for video retrieval in [4]. Nevertheless, for indexing and retrieval purposes, the most widely used sparse feature points and associated descriptors, invariant to luminance, scale and rotation changes, are the SIFT features proposed in the fundamental work by Lowe [5]. Since then, various improvements of this approach have come out, and a good survey of these techniques is presented in [6].
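The minimal-eigenvalue criterion underlying [3] can be sketched as follows. This is a didactic reimplementation, not the original paper's code: the "cornerness" of a point is the smaller eigenvalue of the 2×2 structure matrix of image gradients accumulated over a local window.

```python
import math

def min_eigenvalue_score(img, x, y, half=1):
    """Shi-Tomasi style cornerness: the minimal eigenvalue of the
    2x2 local gradient (structure) matrix summed over a window."""
    sxx = sxy = syy = 0.0
    for j in range(y - half, y + half + 1):
        for i in range(x - half, x + half + 1):
            gx = (img[j][i + 1] - img[j][i - 1]) / 2.0  # central differences
            gy = (img[j + 1][i] - img[j - 1][i]) / 2.0
            sxx += gx * gx
            sxy += gx * gy
            syy += gy * gy
    # Eigenvalues of [[sxx, sxy], [sxy, syy]]; keep the smaller one.
    trace, det = sxx + syy, sxx * syy - sxy * sxy
    disc = math.sqrt(max(trace * trace / 4.0 - det, 0.0))
    return trace / 2.0 - disc

# A pure vertical step edge: gradients only in x, so the minimal
# eigenvalue is (near) zero -- edges are rejected.
edge_img = [[0] * 4 + [10] * 4 for _ in range(8)]
edge = min_eigenvalue_score(edge_img, 4, 4)

# The corner of a bright square: gradients in both directions, so
# both eigenvalues are large -- good feature to track.
sq_img = [[0] * 8 for _ in range(8)]
for j in range(3, 6):
    for i in range(3, 6):
        sq_img[j][i] = 10
corner = min_eigenvalue_score(sq_img, 3, 3)
```

On these toy images the edge point scores zero while the corner point scores strictly positive, which is exactly why thresholding the minimal eigenvalue selects trackable points.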
Features sparsely distributed in the frames do not convey semantics in terms of generic objects evolving in video. Hence, several recent research works try to link sparse features with spatio-temporal objects. Thus, in [7], MSER (Maximally Stable Extremal Regions [8]) together with SIFT features computed on them are used to train object models in video and to recognize them.
Despite the originality of these ideas and the relatively good performance of the proposed methods, these approaches do not start with an explicit object extraction from video sequences. In some sense they try to avoid solving the classical “chicken and egg” problem of semantic object extraction from video.
At the same time, a significant effort of the research community, related to the elaboration of the MPEG4 [9], MPEG7 [10] and JPEG2000 [11] standards, was devoted to the development of automatic methods for segmenting video content into objects. The results of these methods, e.g. [12], [13], [14], while not always ensuring an ideal correspondence of extracted object borders to visually observed contours, were sufficiently good for visual recognition of the extracted object areas, for fine-tuning of encoding parameters and for content description.
Hence, we are strongly convinced that the paradigm consisting in segmenting objects first and then representing them in adequate feature spaces for object-based indexing and retrieval of video remains a promising road to success and a good alternative to local modelling of content by feature points.
Nowadays, it is impossible to imagine visual content exchanged and stored in a raw format. The standard resolutions of HD video (1080p) or filmic content (4K) make raw video content a tremendous mass of data which cannot be handled without compression. The specifications of the DCI (Digital Cinema Initiatives, LLC [15]) made MJPEG2000 [16] the digital cinema compression standard. Furthermore, MJPEG2000 is becoming the common standard for archiving [17] of cultural cinematographic heritage, with a better quality/compression trade-off than previously used solutions. Data coded with this standard now constitute databases of audio-visual content so large that access to the digital content requires the development of automatic methods for indexing and retrieval of this compressed content. The metadata resulting from the indexing process are very heterogeneous between different systems and are shaped with MPEG7 [10] and Dublin Core [18]. This is why the JPSearch project [19] aims at establishing standardized interfaces for an abstract image retrieval framework. One of the focuses of JPSearch is the embedding of metadata in image data encoded in the JPEG2000 standard. Hence, the latest research works link content encoding and indexing in the same framework, be it for images or video [20]. In our work we also address this issue and propose object-based indexing of video content jointly with MJPEG2000 compression. The present work continues the RI (Rough Indexing) paradigm we developed for MPEG compressed video [21].
New compression standards such as MJPEG2000 or H.264/SVC [22] have the interesting property of scalability. Hence, a codestream formatted with one of these standards can be sent to users with different processing capabilities and network bandwidths by selectively transmitting and decoding the relevant part of the codestream. The extracted sub-streams correspond to a reduced spatial resolution (spatial scalability), a reduced temporal resolution (temporal scalability), or a reduced quality for a given spatio-temporal resolution and/or a reduced bitrate (SNR scalability). This scalability property opens an exciting perspective of video retrieval at various resolution levels, reducing the computational workload and allowing for "cross-resolution" retrieval. The contribution of our work is two-fold. First, in the framework of object-based "scalable" indexing and retrieval of video content, we propose an object extraction/segmentation method operating on a scalable MJPEG2000 compressed stream. The extraction process has to be performed with only the part of the stream available after transmission, i.e. the segmentation should proceed in a single direction, from low resolution to high resolution. Secondly, in the paradigm of "object-based" video content retrieval, we propose object descriptors for a "scalable retrieval" of content, enabling search at reduced resolution levels.
Thus, in our vision, the first step in object-based indexing is foreground object extraction. Several approaches have been proposed in the past, and most of them can be roughly classified as either intra-frame segmentation based or motion segmentation based. In the former approach, each frame of the video sequence is independently segmented into regions of homogeneous intensity or texture using traditional image segmentation techniques [23], while in the latter, a dense motion field is estimated and pixels with homogeneous motion are grouped together [24]. Since both approaches have their drawbacks, most object extraction tools combine spatial and temporal segmentation techniques [25], [26]. The challenge resides in applying this scheme without decompressing the video, as low-level content descriptors, such as coefficients in the transform domain, can be efficiently re-used for the content analysis task [21].
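A toy sketch of such a spatio-temporal combination, under simplifying assumptions of our own (frame differencing as the temporal cue, precomputed spatial regions, hypothetical names): a pixel-wise motion mask is regularised by spatial regions, so a whole homogeneous region is kept as foreground when enough of its pixels moved.

```python
def object_mask(prev, cur, regions, diff_thr=5, vote=0.5):
    """Combine a temporal cue (frame differencing) with a spatial
    partition: a region is foreground if >= vote of its pixels moved."""
    h, w = len(cur), len(cur[0])
    moving = [[abs(cur[j][i] - prev[j][i]) > diff_thr for i in range(w)]
              for j in range(h)]
    mask = [[False] * w for _ in range(h)]
    for region in regions:                      # region = list of (j, i)
        hits = sum(moving[j][i] for j, i in region)
        if hits >= vote * len(region):
            for j, i in region:                 # keep the whole region
                mask[j][i] = True
    return mask

# 4x4 frames partitioned into four 2x2 "spatial regions".
regions = [[(j + dj, i + di) for dj in (0, 1) for di in (0, 1)]
           for j in (0, 2) for i in (0, 2)]
prev = [[0] * 4 for _ in range(4)]
cur = [[0] * 4 for _ in range(4)]
cur[0][0] = cur[0][1] = cur[1][0] = 10  # 3 of the 4 top-left pixels move

mask = object_mask(prev, cur, regions)
```

Here pixel (1,1) did not move, yet it is included in the mask because its spatial region voted foreground: the spatial segmentation corrects holes in the purely temporal mask, which is the intuition behind combining the two cues.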
In the case of the MJPEG2000 standard, one difficulty is that the standard does not provide motion descriptors, so they have to be estimated. The ME (Motion Estimation) problem in the wavelet domain has been largely studied in the literature [27]. We previously proposed a first approach for motion estimation on the JPEG2000 wavelet pyramid in [28]. In this paper, we present improvements that reduce the negative effect of the shift-variance property of the wavelet transform.
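As a minimal illustration of motion estimation on a low-resolution wavelet band (a generic exhaustive block matcher, not the estimator of [28]), the displacement of a block is found by minimising the sum of absolute differences (SAD) over a small search range:

```python
def block_match(ref, cur, bx, by, bs=2, sr=2):
    """Exhaustive block matching on a low-resolution (e.g. LL) band:
    returns the displacement (dx, dy) minimising the SAD between the
    block at (bx, by) in `cur` and its shifted version in `ref`."""
    best, best_v = (0, 0), float("inf")
    for dy in range(-sr, sr + 1):
        for dx in range(-sr, sr + 1):
            sad = 0.0
            for j in range(bs):
                for i in range(bs):
                    y, x = by + j + dy, bx + i + dx
                    if not (0 <= y < len(ref) and 0 <= x < len(ref[0])):
                        sad = float("inf"); break   # candidate leaves the band
                    sad += abs(cur[by + j][bx + i] - ref[y][x])
                if sad == float("inf"):
                    break
            if sad < best_v:
                best_v, best = sad, (dx, dy)
    return best

# A small vertical bar in the reference band, shifted right by one
# sample in the current band: the estimated motion is (-1, 0).
ref = [[0] * 6 for _ in range(6)]
ref[2][2] = ref[3][2] = 9
cur = [[0] * 6 for _ in range(6)]
cur[2][3] = cur[3][3] = 9

mv = block_match(ref, cur, bx=2, by=2)
```

On the coarsest LL band such a search is cheap, and the shift-variance issue mentioned above arises because detail-subband coefficients, unlike pixels, do not translate simply under image motion.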
A second step in object-oriented indexing consists in defining features on the objects actually extracted. Following the RI paradigm, these features have to be defined in the compressed domain, i.e. the wavelet domain. Several indexing techniques in the wavelet domain exist for JPEG2000 compressed still images. Among them we can cite histogram-based techniques: a histogram is computed for each subband and comparison is made subband by subband [29]. The main drawback of such a technique is that it works only under limited camera operations. To reduce complexity and improve robustness to illumination changes, [30] proposes modelling the histograms with a generalized Gaussian density function. Other indexing techniques are texture oriented [31]. In this work, we propose a global histogram-based descriptor for objects in the wavelet domain at different levels of spatial scalability. We also compare the efficiency of this descriptor with an approach that matches local features extracted from the object area after the segmentation process.
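The generalized-Gaussian modelling of subband histograms can be sketched with a standard moment-matching estimator (our illustration, not necessarily the exact method of [30]): the shape parameter β solves Γ(2/β)² / (Γ(1/β) Γ(3/β)) = (E|x|)² / E[x²], which reduces each subband histogram to two parameters (shape and scale) instead of a full histogram.

```python
import math

def ggd_shape(coeffs):
    """Moment-matching estimate of the shape parameter beta of a
    zero-mean generalized Gaussian fitted to wavelet coefficients."""
    m1 = sum(abs(c) for c in coeffs) / len(coeffs)   # E|x|
    m2 = sum(c * c for c in coeffs) / len(coeffs)    # E[x^2]
    target = m1 * m1 / m2

    def r(b):  # Gamma(2/b)^2 / (Gamma(1/b) * Gamma(3/b)), increasing in b
        return (math.gamma(2.0 / b) ** 2
                / (math.gamma(1.0 / b) * math.gamma(3.0 / b)))

    lo, hi = 0.1, 5.0                                # bisection bounds
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if r(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Data whose moment ratio is exactly 1/2 is matched by beta = 1
# (the Laplacian case: r(1) = Gamma(2)^2 / (Gamma(1) * Gamma(3)) = 0.5).
beta = ggd_shape([2.0, -2.0, 0.0, 0.0])
```

Two subbands can then be compared through their (β, scale) pairs, e.g. with a closed-form Kullback-Leibler divergence between GGDs, which is far cheaper than bin-by-bin histogram comparison.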
The paper is organized as follows. The general framework of our approach is summarized in Section 2. Section 3 describes the spatio-temporal object extraction we designed and Section 4 the object-based scalable statistical descriptor we propose. Section 5 presents object-based search scenarios for different tasks of video retrieval, such as retrieval of post-production effects and retrieval of video clips. Results are given in Section 6. Section 7 concludes the paper and outlines the perspectives of this work.
Notations
Before describing the proposed general framework of object-based scalable indexing, we present the notations and abbreviations used in the remainder of the paper. The wavelet transform used in the JPEG2000 standard leads to a multi-scale representation. We denote by K+1 the total number of decomposition layers; k is the current level, with k=0 the original image and k=K the lowest resolution level. At a given level k, a wavelet frame is a combination of four subbands obtained by low-pass and high-pass filtering along rows and columns followed by subsampling: the approximation subband LL and the detail subbands HL, LH and HH.
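This multi-scale representation can be sketched as follows (a toy implementation using the Haar filter bank for brevity; JPEG2000 itself uses the 5/3 or 9/7 wavelets, and the function names here are our own):

```python
def haar2d_level(img):
    """One level of a separable 2-D Haar transform: returns the four
    subbands (LL, HL, LH, HH) at half the input resolution."""
    h, w = len(img), len(img[0])
    LL = [[0.0] * (w // 2) for _ in range(h // 2)]
    HL = [[0.0] * (w // 2) for _ in range(h // 2)]
    LH = [[0.0] * (w // 2) for _ in range(h // 2)]
    HH = [[0.0] * (w // 2) for _ in range(h // 2)]
    for j in range(0, h, 2):
        for i in range(0, w, 2):
            a, b = img[j][i], img[j][i + 1]
            c, d = img[j + 1][i], img[j + 1][i + 1]
            LL[j // 2][i // 2] = (a + b + c + d) / 4.0  # approximation
            HL[j // 2][i // 2] = (a - b + c - d) / 4.0  # horizontal detail
            LH[j // 2][i // 2] = (a + b - c - d) / 4.0  # vertical detail
            HH[j // 2][i // 2] = (a - b - c + d) / 4.0  # diagonal detail
    return LL, HL, LH, HH

def decompose(img, K):
    """K-level pyramid: level k=0 is the original image, k=K the
    coarsest approximation; each level keeps its detail subbands."""
    levels = []
    ll = img
    for _ in range(K):
        ll, hl, lh, hh = haar2d_level(ll)
        levels.append((hl, lh, hh))
    return ll, levels

img = [[4.0] * 8 for _ in range(8)]   # constant image: all detail = 0
ll, levels = decompose(img, 2)        # K = 2 -> coarsest LL is 2x2
```

Spatial scalability follows directly from this structure: decoding only the subbands up to level k reconstructs the image at 1/2^k of the full resolution.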
Scalable object extraction in the JPEG2000 wavelet domain
Scalable object extraction in the JPEG2000 wavelet domain follows the general principles we previously developed in [32]. This section summarizes the approach and presents the minor changes that have been brought to improve the quality of the results.
Scalable object-based descriptors
After the object extraction step described above, a video is represented as a set of objects, i.e. a multi-scale description of the objects of interest is associated with each frame of the video (see Fig. 3).

Even though the amount of information available has been reduced, the objects remain meaningful and can lead to an interpretation at a high semantic level.
Object-based retrieval scenarios
Even if the results of object extraction are acceptable, the segmentation process remains unstable and strongly depends on the relative amplitude of background and object motions. Unless more complex tools such as multiple-object tracking are applied, the extraction of full objects in every frame of a clip is not guaranteed. Hence, in the current state of our work, we propose the following query scenarios.
Database for the benchmarking
In order to assess the robustness and efficiency of the proposed features to various transformations of video, such as affine transformations of the image plane, partial masking, fraudulent post-production and so on, one should have access to a content database containing transformed videos and annotated ground truth. For content-based image retrieval, such publicly available databases are numerous. For instance, the benchmark [41] contains four versions of the same still image which has undergone
Conclusion and perspectives
In this paper, we proposed a complete method for scalable object-based retrieval of HD video content in the MJPEG2000 compressed domain. Scalability, an exciting property inherent to current image and video compression standards, is fully exploited in this framework.
First of all, the extraction of objects from the video stream is scalable: at each spatial resolution level we obtain a relevant object mask after having performed all necessary operations such as motion estimation, initial morphological
Acknowledgement
This work was supported by the French national grant ANR-06-MDCA-010 "ICOS-HD".
References
- et al., Hierarchical segmentation of video sequences for content manipulation and adaptive coding, Signal Processing, 1998.
- et al., Multimedia content analysis using both audio and visual clues, IEEE Signal Processing Magazine, 2000.
- Y. Zhai, J. Liu, X. Cao, A. Basharat, A. Hakeem, S. Ali, M. Shah, Video understanding and content-based retrieval, in: ...
- J. Shi, C. Tomasi, Good features to track, Proceedings of the Conference on Computer Vision and Pattern Recognition, 1994.
- et al., On space-time interest points, International Conference on Computer Vision, 2003.
- D.G. Lowe, Distinctive image features from scale-invariant key-points, International Journal of Computer Vision, 2004.
- K. Mikolajczyk, C. Schmid, A performance evaluation of local descriptors, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005.
- et al., DISCOV: a framework for discovering objects in video, IEEE Transactions on Multimedia, 2008.
- J. Matas, O. Chum, M. Urban, T. Pajdla, Robust wide baseline stereo from maximally stable extremal regions, in: ...
- ISO/IEC 14496-2:2004, Information technology – Coding of audio-visual objects – Part 2: Visual.
- Segmentation-based video coding system allowing the manipulation of objects, IEEE Transactions on Circuits and Systems for Video Technology.
- A 3-step algorithm using region-based active contours for video objects detection, EURASIP Journal on Applied Signal Processing.
- An evaluation of Motion JPEG2000 for video archiving, Proc. Archiving.
- Extraction of foreground objects from MPEG2 video stream in rough indexing framework, in: Storage and Retrieval Methods and Applications for Multimedia, SPIE.
- Region-based representations of image and video: segmentation tools for multimedia services, IEEE Transactions on Circuits and Systems for Video Technology.