Visual saliency guided video compression algorithm

https://doi.org/10.1016/j.image.2013.07.003

Highlights

  • A novel two-stage architecture is proposed for video compression.

  • A machine learning scheme over three-dimensional features is used for saliency computation.

  • Saliency computation at macroblock level saves computation.

  • Thresholding of mutual information between successive frames indicates the frames requiring re-computation of saliency.

  • The motion vectors propagate the saliency values for macroblocks in P frames.

Abstract

Recently, saliency maps computed from input images have been used to detect interesting regions in images/videos and to focus processing on these salient regions. This paper introduces a novel, macroblock-level visual saliency guided video compression algorithm. It is modelled as a two-step process, viz. salient region detection and frame foveation. Visual saliency is modelled as a combination of low-level features, as well as high-level features which become important at the higher-level visual cortex. A relevance vector machine is trained over three-dimensional feature vectors, pertaining to global, local and rarity measures of conspicuity, to yield probabilistic values which form the saliency map. These saliency values are used for non-uniform bit allocation over video frames. To achieve these goals, we also propose a novel video compression architecture, incorporating saliency, to save a tremendous amount of computation. This architecture is based on thresholding of mutual information between successive frames to flag frames requiring re-computation of saliency, and on the use of motion vectors for propagation of saliency values.

Introduction

The H.264 video compression standard has a wide range of applications, from low bit-rate internet streaming to HDTV broadcast. The emerging HEVC standard seeks to further improve upon H.264, thereby achieving even higher coding efficiency. This has been made possible through improved exploitation of spatio-temporal prediction and entropy coding. The focus, however, has remained on minimization of the quantifiable ‘objective’ redundancy. In this paper we propose a video compression scheme which exploits features of human perception of video content to enhance the compression efficiency of an H.264-based coding scheme.

It is now well established that the acuity of the human eye is limited to only 1–2° of visual angle [1]. This implies that, when viewed from the recommended distance of 1.2 m, the eye can crisply perceive only a 2 cm radial region (computed as 1.2×tan(2°/2)) on a standard-definition 32″ LCD. Also, a recent eye-tracking study [2] on inter-observer saliency variations in task-free viewing of natural images concluded that images known to have salient regions generate highly correlated eye-fixation maps across different viewers. These findings have reinforced the application of visual saliency to video compression, among numerous other applications such as subjective visual quality assessment, progressive transmission, re-targeting for hand-held devices, effective website design, and thumbnailing. Saliency-based compression is achieved by encoding visually salient regions with high priority, while treating less interesting regions with low priority to save bits. In essence, it allows us to meet our goal of compression without significant degradation of the viewing experience or ‘subjective’ quality.
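The acuity figure above can be verified with a few lines; the 1.2 m viewing distance and 2° visual angle are taken from the text, and the computation is simple trigonometry:

```python
import math

# Radius of the crisply perceivable region on screen:
# r = d * tan(theta / 2), with viewing distance d and visual angle theta.
d = 1.2                      # viewing distance in metres
theta = math.radians(2.0)    # ~2 degrees of sharp foveal vision

r = d * math.tan(theta / 2)
print(f"{r * 100:.1f} cm")   # ~2.1 cm, matching the 2 cm figure in the text
```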

Since 1998, a vast amount of research has gone into modelling the human visual system (HVS) for the purpose of video processing. The earlier approaches were gaze contingent, dependent on eye trackers to record points of fixation. These were restricted to the case of a single viewer with eye-tracking apparatus, and hence not very useful. Later approaches exploited computational neurobiological models to automatically predict the regions likely to attract human attention. However, each model came with its own merits and shortcomings, leaving salient region detection a challenging and exciting area of research.

Itti et al. [3] modelled visual attention as a combination of low-level features pertaining to the degree of dissimilarity between a region and its surroundings. Novel center-surround approaches like [4] model saliency as the fraction of dissimilar pixels in concentric annular regions around each pixel. Hou and Zhang [5] take a completely different approach, suppressing the response to frequently occurring features while capturing deviations. Other transform-domain approaches like [6], [7] follow a similar line of thought. Although these approaches handle psychological patterns with high accuracy, they often fail to detect salient objects in real-life images. Some failure cases of these approaches are shown in Fig. 1; it is evident that these saliency maps are not close to the ground truth.

The failure of these approaches can be attributed to the Gestalt grouping principle [8], which concerns the effect produced when the collective presence of a set of elements becomes more meaningful than their presence as separate elements. Thus, in this work we model saliency as a combination of low-level features, as well as high-level features which become important at the higher-level visual cortex. Many authors, like [9], resort to a linear combination of features such as contrast and skin color, but do not provide any justification for the weights chosen. Hence, we propose a learning-based feature integration algorithm in which we train a Relevance Vector Machine (RVM) [10], [11] on three-dimensional feature vectors to output probabilistic saliency values.

One of the earliest automated (as opposed to gaze-contingent) visual saliency based video compression models was proposed by Itti [12] in 2004. In [12], a small number of virtual foveas attempt to track salient objects over video frames, and non-salient regions are Gaussian blurred to achieve compression. Guo and Zhang [6] use their PQFT approach for proto-object detection, and apply a multi-resolution wavelet-domain foveation filter that suppresses coefficients corresponding to the background. The OPTOPOIHSH project, which aimed at region-of-interest (ROI) based video compression to allow acceptable-quality video transmission over low-bandwidth channels, also employs some form of blurring [13]. Selective blurring can, however, lead to unpleasant artifacts and generally scores low in subjective evaluation. A novel bit allocation strategy through quantization parameter (QP) tuning, achieving compression while preserving visual quality, is presented in [14], which we adopt here.

A saliency-preserving video compression scheme has been presented in [15], which reduces coding artifacts so that the saliency of the region of interest is retained. In [16], a bit allocation strategy has been proposed based upon evaluation of the perceptual distortion sensitivity of each macroblock. In [17], a video coding technique has been proposed which uses visual saliency to adjust image fidelity for compression; its saliency computation scheme differs from the approach presented in this paper.

A simplified flow diagram of our compression model is shown in Fig. 2. In all the existing compression approaches, the saliency map is computed for each frame which can prove to be computationally very expensive. This is avoidable considering the temporal redundancy inherent in videos. We propose here a video coding architecture, incorporating visual saliency propagation, to save on a large amount of saliency computation, and hence time. This scheme uses thresholding of mutual information (MI) between successive frames for flagging frames which require re-computation of saliency; and use of motion vectors for carrying forward the saliency values.
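The flagging step described above can be sketched as follows. This is an illustrative implementation, not the paper's exact formulation: frames are assumed to be 8-bit grayscale NumPy arrays, MI is estimated from a joint histogram, and the threshold value is a hypothetical placeholder.

```python
import numpy as np

def mutual_information(frame_a, frame_b, bins=32):
    """Estimate MI (in bits) between two 8-bit grayscale frames
    from their joint intensity histogram."""
    joint, _, _ = np.histogram2d(frame_a.ravel(), frame_b.ravel(),
                                 bins=bins, range=[[0, 256], [0, 256]])
    pxy = joint / joint.sum()          # joint distribution
    px = pxy.sum(axis=1)               # marginal of frame_a
    py = pxy.sum(axis=0)               # marginal of frame_b
    nz = pxy > 0                       # avoid log(0)
    return float(np.sum(pxy[nz] *
                        np.log2(pxy[nz] / (px[:, None] * py[None, :])[nz])))

def needs_saliency_recomputation(prev_frame, curr_frame, threshold=1.0):
    # Low MI indicates a scene change or large content motion,
    # so the saliency map is recomputed; otherwise it is propagated.
    # The threshold here is a hypothetical value for illustration.
    return mutual_information(prev_frame, curr_frame) < threshold
```

Frames with high mutual information reuse the previous saliency map via motion vectors; only flagged frames pay the full cost of saliency computation.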

The contribution of this paper is thus twofold. First, a supervised procedure to compute the saliency of an image using an RVM over three-dimensional feature vectors, pertaining to global, local and rarity measures of conspicuity, is proposed. Second, a video coding architecture aimed at a significant decrease in computation, and therefore time, is proposed. To arrive at this architecture, a novel saliency propagation and segmentation scheme based upon MI is implemented.

In this work we have used an H.264 encoder and decoder to establish the effectiveness of the strategy. However, the same scheme can be used with an HEVC encoder and decoder. In HEVC, as in H.264, uniform reconstruction quantization (URQ) is used, with quantization scaling matrices supported for the various block sizes [18]. Accordingly, our QP tuning algorithm can be used with HEVC as well. Moreover, HEVC's flexibility in block size can permit grouping of MBs based upon saliency values.
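Saliency-driven QP tuning of this general kind can be sketched as below. This is a hypothetical mapping for illustration only, not the specific strategy of [14]: salient macroblocks receive a lower QP (finer quantization, more bits), background macroblocks a higher one, clamped to the legal H.264 QP range of 0–51.

```python
def qp_for_macroblock(base_qp, saliency, max_delta=8):
    """Map a macroblock saliency value in [0, 1] to a QP.

    saliency = 1.0 -> base_qp - max_delta (finest quantization)
    saliency = 0.0 -> base_qp + max_delta (coarsest quantization)
    max_delta is an illustrative tuning parameter, not from the paper.
    """
    delta = round(max_delta * (1.0 - 2.0 * saliency))
    return int(min(51, max(0, base_qp + delta)))  # H.264 QP range is 0..51
```

For example, with a base QP of 26 and max_delta of 8, a fully salient macroblock is coded at QP 18 and a fully non-salient one at QP 34, shifting bits toward the regions viewers actually fixate on.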

The remainder of this paper is organized as follows. In Section 2 we discuss, in detail, the steps involved in our learning-based saliency algorithm. Since all video coding operations are MB based, we learn saliency at the MB level to save on unnecessary computation. This section also contains some results and a comparison of our algorithm with other leading approaches. In Section 3 we describe our complete video coding architecture, in which various issues relating to saliency propagation/re-calculation and bit allocation are addressed. Compression results on some varied video sequences, and the gain over standard H.264 with RDO, are presented in Section 4, and conclusions are drawn in Section 5.

Section snippets

Our saliency algorithm

We use color spatial variance, center-surround multi-scale ratio of dissimilarity and pulse DCT to construct three feature maps. Then, a soft, learning-based approach is used to arrive at the final saliency map.
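The feature-integration step can be sketched as follows. Since a full RVM implementation is beyond a snippet, this uses a simple logistic model trained by gradient descent as a stand-in for the paper's RVM; the three feature values per macroblock (global, local and rarity conspicuity) and the training labels are assumed inputs, not data from the paper.

```python
import numpy as np

def train_probabilistic_saliency(X, y, lr=0.5, epochs=500):
    """Train a logistic model (stand-in for the paper's RVM).

    X: (n_mb, 3) feature vectors, one row per macroblock
       (global, local, rarity conspicuity measures).
    y: (n_mb,) binary labels: 1 = salient, 0 = non-salient.
    Returns weights w and bias b.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted saliency in (0, 1)
        grad = p - y                             # gradient of the log-loss
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

def saliency_map(X, w, b):
    """Probabilistic saliency value per macroblock."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))
```

The key point, shared with the RVM, is that the output is a probability per macroblock rather than a hard salient/non-salient label, which is what permits graded, non-uniform bit allocation later.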

Our compression architecture

We wish to employ saliency for the purpose of video compression. However, computation of feature maps for each video frame can prove to be computationally very expensive if we rely on video compression techniques such as those proposed in [6], [12], [14] as they necessitate calculation of saliency map of each frame.

There exists consistency between the salient regions of successive frames; Fig. 10 and Fig. 11 illustrate this characteristic in video sequences. So we propose here the use
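The motion-vector propagation of saliency values mentioned in the introduction can be sketched as below. The data layout is assumed for illustration: a 2-D array of per-macroblock saliency values for the reference frame, and per-macroblock motion vectors given in macroblock units (real H.264 motion vectors are in quarter-pel units and would need scaling).

```python
import numpy as np

def propagate_saliency(ref_saliency, motion_vectors):
    """Carry saliency forward from a reference frame to a P frame.

    ref_saliency:   (H, W) per-macroblock saliency of the reference frame.
    motion_vectors: (H, W, 2) per-macroblock (dy, dx) offsets, in MB units,
                    pointing back into the reference frame.
    Each P-frame macroblock inherits the saliency of the reference
    macroblock its motion vector points to.
    """
    h, w = ref_saliency.shape
    out = np.zeros_like(ref_saliency)
    for i in range(h):
        for j in range(w):
            dy, dx = motion_vectors[i, j]
            ri = min(max(i + dy, 0), h - 1)  # clamp to frame boundary
            rj = min(max(j + dx, 0), w - 1)
            out[i, j] = ref_saliency[ri, rj]
    return out
```

Because this is a table lookup rather than a feature-map computation, propagated frames cost almost nothing compared with full saliency re-computation.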

Results

We present here results for videos compressed in two different ways. The first is through Gaussian blurring of raw video frames as front-end pre-processing to the standard H.264/AVC JM encoder [32], leaving the encoder design intact. The second is through incorporation of the scheme of Section 3 into the JM reference software. A comparison of the video sizes, PSNR and bit rate obtained (in .264 format) is made against the standard JM output with rate distortion optimization (RDO) turned ON for 5 video

Conclusions

A vast amount of research has gone into modelling of the HVS with each model having its own merits and shortcomings. The potential which lies in an integration of these models has been demonstrated by the accuracy of our results. A simple and effective learning based approach for such a unification has been presented where an RVM trained over 3 dimensional feature vectors pertaining to global, local and rarity measures of conspicuity outputs probabilistic values which form the saliency map.

References (33)

  • A. Desolneux et al., Computational gestalts and perception thresholds, Journal of Physiology-Paris: Neurogeometry and Visual Perception (2003)
  • Z. Li et al., Visual attention guided bit allocation in video compression, Image and Vision Computing (2011)
  • M. Nicolaou, A. James, A. Darzi, G.-Z. Yang, A study of saccade transition for attention segregation and task strategy...
  • U. Engelke, A. Maeder, H.-J. Zepernick, Analysing inter-observer saliency variations in task-free viewing of natural...
  • L. Itti et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence (1998)
  • R. Huang, N. Sang, L. Liu, Q. Tang, Saliency based on multi-scale ratio of dissimilarity, in: Proceedings of the 20th...
  • X. Hou, L. Zhang, Saliency detection: a spectral residual approach, in: Proceedings of the IEEE Conference on Computer...
  • C. Guo et al., A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression, IEEE Transactions on Image Processing (2010)
  • Y. Yu, B. Wang, L. Zhang, Pulse discrete cosine transform for saliency-based visual attention, in: Proceedings of the...
  • J.-C. Chiang, C.-S. Hsieh, G. Chang, F.-D. Jou, W.-N. Lie, Region-of-interest based rate control scheme with flexible...
  • M.E. Tipping, The relevance vector machine
  • M. Tipping, Sparse Bayesian learning and the relevance vector machine, Journal of Machine Learning Research (2001)
  • L. Itti, Automatic foveation for video compression using a neurobiological model of visual attention, IEEE Transactions on Image Processing (2004)
  • N. Tsapatsoulis, C. Pattichis, A. Kounoudes, C. Loizou, A. Constantinides, J.G. Taylor, Visual attention based region...
  • H. Hadizadeh, I.V. Bajic, Saliency-preserving video compression, in: Proceedings of the IEEE International Conference...
  • C.-W. Tang et al., Visual sensitivity guided bit allocation for video coding, IEEE Transactions on Multimedia (2006)