Visual saliency guided video compression algorithm

https://doi.org/10.1016/j.image.2013.07.003

Highlights

  • A novel two-stage architecture is proposed for video compression.

  • A machine learning scheme over three-dimensional features is used for saliency computation.

  • Saliency computation at macroblock level saves computation.

  • Thresholding of mutual information between successive frames indicates the frames requiring re-computation of saliency.

  • The motion vectors propagate the saliency values for macroblocks in P frames.

Abstract

Recently, saliency maps computed from input images have been used to detect interesting regions in images/videos and to focus processing on these salient regions. This paper introduces a novel, macroblock-level visual saliency guided video compression algorithm. It is modelled as a two-step process, viz. salient region detection and frame foveation. Visual saliency is modelled as a combination of low-level features, as well as high-level features which become important at the higher-level visual cortex. A relevance vector machine is trained over three-dimensional feature vectors, pertaining to global, local and rarity measures of conspicuity, to yield probabilistic values which form the saliency map. These saliency values are used for non-uniform bit allocation over video frames. To achieve these goals, we also propose a novel video compression architecture, incorporating saliency, to save a tremendous amount of computation. This architecture is based on thresholding of mutual information between successive frames to flag frames requiring re-computation of saliency, and on the use of motion vectors for propagation of saliency values.

Introduction

The H.264 video compression standard has a wide range of applications, from low bit-rate internet streaming to HDTV broadcast. The emerging HEVC standard seeks to further improve upon H.264, thereby achieving even higher coding efficiency. This has been made possible through improved exploitation of spatio-temporal prediction and entropy coding. The focus, however, has remained on minimization of the quantifiable ‘objective’ redundancy. In this paper we propose a video compression scheme which exploits features of human perception of video content to enhance the compression efficiency of an H.264-based coding scheme.

It is now well established that the acuity of the human eye is limited to only 1–2° of visual angle [1]. This implies that, when viewed from the recommended distance of 1.2 m, the eye can crisply perceive only a 2 cm radial region (computed as 1.2×tan(2°/2)) on a standard-definition 32″ LCD. Also, a recent eye-tracking study [2] on inter-observer saliency variations in task-free viewing of natural images concluded that images known to have salient regions generate highly correlated eye-fixation maps across different viewers. These findings have reinforced the application of visual saliency to video compression, among numerous other applications such as subjective visual quality assessment, progressive transmission, re-targeting for hand-held devices, effective website design, and thumbnailing. Saliency-based compression is achieved by encoding visually salient regions with high priority, while treating less interesting regions with low priority to save bits. In essence, it allows us to meet our goal of compression without significant degradation of the viewing experience or ‘subjective’ quality.
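The acuity figure above can be verified with a few lines; the 1.2 m viewing distance and 2° visual angle are taken from the text, and the computation is simple trigonometry:

```python
import math

# Radius of the crisply perceivable region on screen:
# r = d * tan(theta / 2), with viewing distance d and visual angle theta.
d = 1.2                      # viewing distance in metres
theta = math.radians(2.0)    # ~2 degrees of sharp foveal vision

r = d * math.tan(theta / 2)
print(f"{r * 100:.1f} cm")   # ~2.1 cm, matching the 2 cm figure in the text
```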

Since 1998, a vast amount of research has gone into modelling the human visual system (HVS) for the purpose of video processing. The earlier approaches were gaze contingent, dependent on eye trackers to record points of fixation. These were restricted to the case of a single viewer with eye-tracking apparatus, and hence not very useful. Later approaches exploited computational neurobiological models to automatically predict the regions likely to attract human attention. However, each model came with its own merits and shortcomings, leaving salient region detection a challenging and exciting area of research.

Itti et al. [3] modelled visual attention as a combination of low-level features pertaining to the degree of dissimilarity between a region and its surroundings. Novel center-surround approaches like [4] model saliency as the fraction of dissimilar pixels in concentric annular regions around each pixel. Hou and Zhang [5] take a completely different approach, suppressing the response to frequently occurring features while capturing deviations. Other transform-domain approaches like [6], [7] follow a similar line of thought. Although these approaches handle psychological patterns with high accuracy, they often fail to detect salient objects in real-life images. Some failure cases of these approaches are shown in Fig. 1; it is evident that these saliency maps are not close to the ground truth.

The failure of these approaches can be attributed to the Gestalt grouping principle [8], which concerns the effect produced when the collective presence of a set of elements becomes more meaningful than their presence as separate elements. Thus, in this work we model saliency as a combination of low-level features, as well as high-level features which become important at the higher-level visual cortex. Many authors, like [9], resort to a linear combination of features such as contrast and skin color, but do not provide any justification for the weights chosen. Hence, we propose a learning-based feature integration algorithm in which we train a Relevance Vector Machine (RVM) [10], [11] on three-dimensional feature vectors to output probabilistic saliency values.

One of the earliest automated (as opposed to gaze-contingent) visual saliency based video compression models was proposed by Itti [12] in 2004. In [12], a small number of virtual foveas attempt to track salient objects over video frames, and non-salient regions are Gaussian blurred to achieve compression. Guo and Zhang [6] use their PQFT approach for proto-object detection, and apply a multi-resolution wavelet-domain foveation filter that suppresses coefficients corresponding to the background. The OPTOPOIHSH project, which aimed at region-of-interest (ROI) based video compression to allow acceptable-quality video transmission over low-bandwidth channels, also employs some form of blurring [13]. Selective blurring can, however, lead to unpleasant artifacts and generally scores low in subjective evaluation. A novel bit allocation strategy through quantization parameter (QP) tuning, achieving compression while preserving visual quality, is presented in [14], which we adopt here.

A saliency-preserving video compression scheme has been presented in [15], which reduces coding artifacts so that the saliency of the region of interest is retained. In [16], a bit allocation strategy has been proposed based upon evaluation of the perceptual distortion sensitivity of each macroblock. In [17], a video coding technique has been proposed which uses visual saliency to adjust image fidelity for compression; its saliency computation scheme differs from the approach presented in this paper.

A simplified flow diagram of our compression model is shown in Fig. 2. In all the existing compression approaches, the saliency map is computed for each frame which can prove to be computationally very expensive. This is avoidable considering the temporal redundancy inherent in videos. We propose here a video coding architecture, incorporating visual saliency propagation, to save on a large amount of saliency computation, and hence time. This scheme uses thresholding of mutual information (MI) between successive frames for flagging frames which require re-computation of saliency; and use of motion vectors for carrying forward the saliency values.
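The flagging step described above can be sketched as follows. This is an illustrative implementation, not the paper's exact formulation: frames are assumed to be 8-bit grayscale NumPy arrays, MI is estimated from a joint histogram, and the threshold value is a hypothetical placeholder.

```python
import numpy as np

def mutual_information(frame_a, frame_b, bins=32):
    """Estimate MI (in bits) between two 8-bit grayscale frames
    from their joint intensity histogram."""
    joint, _, _ = np.histogram2d(frame_a.ravel(), frame_b.ravel(),
                                 bins=bins, range=[[0, 256], [0, 256]])
    pxy = joint / joint.sum()          # joint distribution
    px = pxy.sum(axis=1)               # marginal of frame_a
    py = pxy.sum(axis=0)               # marginal of frame_b
    nz = pxy > 0                       # avoid log(0)
    return float(np.sum(pxy[nz] *
                        np.log2(pxy[nz] / (px[:, None] * py[None, :])[nz])))

def needs_saliency_recomputation(prev_frame, curr_frame, threshold=1.0):
    # Low MI indicates a scene change or large content motion,
    # so the saliency map is recomputed; otherwise it is propagated.
    # The threshold here is a hypothetical value for illustration.
    return mutual_information(prev_frame, curr_frame) < threshold
```

Frames with high mutual information reuse the previous saliency map via motion vectors; only flagged frames pay the full cost of saliency computation.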

The contribution of this paper is thus twofold. First, a supervised procedure to compute the saliency of an image using an RVM over three-dimensional feature vectors, pertaining to global, local and rarity measures of conspicuity, is proposed. Second, a video coding architecture aimed at a significant decrease in computation, and therefore time, is proposed. To arrive at this architecture, a novel saliency propagation and segmentation scheme based upon MI is implemented.

In this work we have used an H.264 encoder and decoder to establish the effectiveness of the strategy. However, the same scheme can be used with an HEVC encoder and decoder. In HEVC, as in H.264, uniform reconstruction quantization (URQ) is used, with quantization scaling matrices supported for the various block sizes [18]. Accordingly, our QP tuning algorithm can be used with HEVC as well. Moreover, HEVC's flexibility in block size can permit grouping of MBs based upon saliency values.
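Saliency-driven QP tuning of this general kind can be sketched as below. This is a hypothetical mapping for illustration only, not the specific strategy of [14]: salient macroblocks receive a lower QP (finer quantization, more bits), background macroblocks a higher one, clamped to the legal H.264 QP range of 0–51.

```python
def qp_for_macroblock(base_qp, saliency, max_delta=8):
    """Map a macroblock saliency value in [0, 1] to a QP.

    saliency = 1.0 -> base_qp - max_delta (finest quantization)
    saliency = 0.0 -> base_qp + max_delta (coarsest quantization)
    max_delta is an illustrative tuning parameter, not from the paper.
    """
    delta = round(max_delta * (1.0 - 2.0 * saliency))
    return int(min(51, max(0, base_qp + delta)))  # H.264 QP range is 0..51
```

For example, with a base QP of 26 and max_delta of 8, a fully salient macroblock is coded at QP 18 and a fully non-salient one at QP 34, shifting bits toward the regions viewers actually fixate on.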

The remainder of this paper is organized as follows. In Section 2 we discuss, in detail, the steps involved in our learning-based saliency algorithm. Since all video coding operations are MB based, we learn saliency at the MB level to save on unnecessary computation. This section also contains some results and a comparison of our algorithm with other leading approaches. In Section 3 we describe our complete video coding architecture, in which various issues relating to saliency propagation/re-calculation and bit allocation are addressed. Compression results on some varied video sequences, and the gain over standard H.264 with RDO, are presented in Section 4, and conclusions are drawn in Section 5.

Section snippets

Our saliency algorithm

We use color spatial variance, center-surround multi-scale ratio of dissimilarity and pulse DCT to construct three feature maps. Then, a soft, learning-based approach is used to arrive at the final saliency map.
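The feature-integration step can be sketched as follows. Since a full RVM implementation is beyond a snippet, this uses a simple logistic model trained by gradient descent as a stand-in for the paper's RVM; the three feature values per macroblock (global, local and rarity conspicuity) and the training labels are assumed inputs, not data from the paper.

```python
import numpy as np

def train_probabilistic_saliency(X, y, lr=0.5, epochs=500):
    """Train a logistic model (stand-in for the paper's RVM).

    X: (n_mb, 3) feature vectors, one row per macroblock
       (global, local, rarity conspicuity measures).
    y: (n_mb,) binary labels: 1 = salient, 0 = non-salient.
    Returns weights w and bias b.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted saliency in (0, 1)
        grad = p - y                             # gradient of the log-loss
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

def saliency_map(X, w, b):
    """Probabilistic saliency value per macroblock."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))
```

The key point, shared with the RVM, is that the output is a probability per macroblock rather than a hard salient/non-salient label, which is what permits graded, non-uniform bit allocation later.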

Our compression architecture

We wish to employ saliency for the purpose of video compression. However, computation of feature maps for each video frame can prove to be computationally very expensive if we rely on video compression techniques such as those proposed in [6], [12], [14] as they necessitate calculation of saliency map of each frame.

There exists consistency between the salient regions of successive frames; Fig. 10 and Fig. 11 illustrate this characteristic in video sequences. So we propose here the use
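The motion-vector propagation of saliency values mentioned in the introduction can be sketched as below. The data layout is assumed for illustration: a 2-D array of per-macroblock saliency values for the reference frame, and per-macroblock motion vectors given in macroblock units (real H.264 motion vectors are in quarter-pel units and would need scaling).

```python
import numpy as np

def propagate_saliency(ref_saliency, motion_vectors):
    """Carry saliency forward from a reference frame to a P frame.

    ref_saliency:   (H, W) per-macroblock saliency of the reference frame.
    motion_vectors: (H, W, 2) per-macroblock (dy, dx) offsets, in MB units,
                    pointing back into the reference frame.
    Each P-frame macroblock inherits the saliency of the reference
    macroblock its motion vector points to.
    """
    h, w = ref_saliency.shape
    out = np.zeros_like(ref_saliency)
    for i in range(h):
        for j in range(w):
            dy, dx = motion_vectors[i, j]
            ri = min(max(i + dy, 0), h - 1)  # clamp to frame boundary
            rj = min(max(j + dx, 0), w - 1)
            out[i, j] = ref_saliency[ri, rj]
    return out
```

Because this is a table lookup rather than a feature-map computation, propagated frames cost almost nothing compared with full saliency re-computation.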

Results

We present here results for videos compressed in two different ways. The first is through Gaussian blurring of raw video frames as front-end pre-processing to the standard H.264/AVC JM encoder [32], leaving the encoder design intact. The second is through incorporation of the scheme of Section 3 into the JM reference software. A comparison of the video sizes, PSNR and bit rate obtained (in .264 format) is made against the standard JM output with rate distortion optimization (RDO) turned ON for 5 video

Conclusions

A vast amount of research has gone into modelling of the HVS with each model having its own merits and shortcomings. The potential which lies in an integration of these models has been demonstrated by the accuracy of our results. A simple and effective learning based approach for such a unification has been presented where an RVM trained over 3 dimensional feature vectors pertaining to global, local and rarity measures of conspicuity outputs probabilistic values which form the saliency map.

References (33)

  • A. Desolneux et al., Computational gestalts and perception thresholds, Journal of Physiology-Paris: Neurogeometry and Visual Perception (2003)
  • Z. Li et al., Visual attention guided bit allocation in video compression, Image and Vision Computing (2011)
  • M. Nicolaou, A. James, A. Darzi, G.-Z. Yang, A study of saccade transition for attention segregation and task strategy...
  • U. Engelke, A. Maeder, H.-J. Zepernick, Analysing inter-observer saliency variations in task-free viewing of natural...
  • L. Itti et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence (1998)
  • R. Huang, N. Sang, L. Liu, Q. Tang, Saliency based on multi-scale ratio of dissimilarity, in: Proceedings of the 20th...
  • X. Hou, L. Zhang, Saliency detection: a spectral residual approach, in: Proceedings of the IEEE Conference on Computer...
  • C. Guo et al., A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression, IEEE Transactions on Image Processing (2010)
  • Y. Yu, B. Wang, L. Zhang, Pulse discrete cosine transform for saliency-based visual attention, in: Proceedings of the...
  • J.-C. Chiang, C.-S. Hsieh, G. Chang, F.-D. Jou, W.-N. Lie, Region-of-interest based rate control scheme with flexible...
  • M.E. Tipping, The relevance vector machine
  • M. Tipping, Sparse Bayesian learning and the relevance vector machine, Journal of Machine Learning Research (2001)
  • L. Itti, Automatic foveation for video compression using a neurobiological model of visual attention, IEEE Transactions on Image Processing (2004)
  • N. Tsapatsoulis, C. Pattichis, A. Kounoudes, C. Loizou, A. Constantinides, J.G. Taylor, Visual attention based region...
  • H. Hadizadeh, I.V. Bajic, Saliency-preserving video compression, in: Proceedings of the IEEE International Conference...
  • C.-W. Tang et al., Visual sensitivity guided bit allocation for video coding, IEEE Transactions on Multimedia (2006)