
1 Introduction

Advances in data acquisition technologies have enabled users to record and share videos through a number of portable devices, such as cell phones, tablets, and digital cameras. Due to this steady increase in multimedia content, a challenging task is to develop efficient mechanisms for storing, indexing, retrieving and transmitting such large amounts of data.

Video summarization [3] consists of automatically generating a short version of a video sequence, allowing the user to quickly assess the relevance of its content from only a set of representative frames. As with temporal video segmentation [4, 6], challenges associated with video summarization include camera motion, varying lighting conditions, diverse video genres, and subjectivity in the evaluation process.

The main contribution of this work is the proposal and evaluation of a video shot segmentation method based on the combination of two inter-frame dissimilarity vectors: color histogram distances and block-based normalized cross-correlation of pixel intensities. In addition, an adaptive local threshold strategy is defined to automatically detect the boundary frames. Experiments conducted on public video sequences demonstrate that the proposed method achieves high accuracy rates.

This paper is organized as follows. Section 2 briefly presents some relevant concepts and works related to the topic under investigation. Section 3 describes the proposed video shot detection methodology. Section 4 presents and discusses some of the results obtained with the proposed method. Finally, Sect. 5 concludes our work and includes some future work suggestions for improving the proposed method.

2 Background

Due to the advances in multimedia technology and large availability of digital content, there is an increasing demand for robust mechanisms for storing, indexing, browsing and retrieving video data. An open research problem is the automatic construction of a compact and meaningful representation of massive video sequences to help users understand the most important information of their content [9].

Temporal segmentation of a video into semantic units is a crucial stage in the analysis of video contents, whose process is known as shot boundary detection. A video shot consists of one or more frames generated contiguously to form a continuous action in time and space. A video summary can be constructed from a set of keyframes that represent the shots. In this context, two categories of transitions between shots are commonly defined: abrupt and gradual transitions. An abrupt transition corresponds to a cut between one frame of a shot and its adjacent frame in the next shot, whereas a gradual transition represents a smooth change over several frames.

Several video shot boundary detection approaches have been proposed in the literature [1, 2, 5, 7]. Two main steps are commonly performed in the cut detection methods: (i) a similarity or dissimilarity measure is initially computed for each pair of consecutive frames and (ii) a cut is detected if the measure is higher than a specified threshold.

3 Methodology

The proposed video cut detection method is based on two different dissimilarities between consecutive frames: the Bhattacharyya distance between color histograms and the inverse normalized cross-correlation between intensity image blocks. The resulting measures are combined through a simple mean fusion and submitted to an adaptive thresholding technique that detects relatively high disparities and classifies the frames as part of a shot transition or not. These main steps are illustrated in Fig. 1.

Fig. 1. A flowchart of the proposed video cut detection method.

3.1 Histogram-Based Dissimilarity

In order to calculate the inter-frame dissimilarity, a quantized color histogram (CH) is extracted from each frame and the dissimilarity between two consecutive frames is measured by the Bhattacharyya distance, as defined in Eq. 1.

$$\begin{aligned} d(H_i, H_{i-1}) = \sqrt{1 - \frac{1}{\sqrt{\overline{H}_i \cdot \overline{H}_{i-1} \cdot N^2}} \sum _{b=1}^{N}{\sqrt{H_i(b) \cdot H_{i-1}(b)}}} \end{aligned}$$
(1)

where \(\overline{H}_k = \frac{1}{N}\sum _j{H_k(j)}\), N is the number of histogram bins, and \(H_i(b)\) is the probability of frame i having a pixel that falls into the color bin b.
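As an illustration, the following is a minimal sketch of this step in Python/NumPy, assuming frames are given as H×W×3 RGB arrays; the function names and the joint-histogram indexing scheme are our own choices rather than details from the original implementation. Note that when both histograms are normalized to sum to one, the normalization factor in Eq. 1 reduces to one and the expression becomes the standard Bhattacharyya distance.

```python
import numpy as np

def quantized_histogram(frame, bins_per_channel=32):
    """Joint RGB histogram with 32 bins per channel (32**3 = 32,768 bins,
    as in Sect. 4), normalized to sum to 1. `frame` is an HxWx3 uint8 array."""
    q = (frame // (256 // bins_per_channel)).reshape(-1, 3).astype(np.int64)
    # Map each quantized (r, g, b) triple to a single joint bin index.
    idx = (q[:, 0] * bins_per_channel + q[:, 1]) * bins_per_channel + q[:, 2]
    hist = np.bincount(idx, minlength=bins_per_channel ** 3).astype(np.float64)
    return hist / hist.sum()

def bhattacharyya_distance(h1, h2):
    """Bhattacharyya distance of Eq. 1: 0 for identical histograms,
    close to 1 for histograms with little overlap."""
    n = h1.size
    norm = np.sqrt(h1.mean() * h2.mean() * n * n)
    bc = np.sum(np.sqrt(h1 * h2)) / norm
    return np.sqrt(max(0.0, 1.0 - bc))  # clamp guards against rounding error
```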

3.2 Block-Based Cross-Correlation

The negative normalized cross-correlation (NCC) is used as a dissimilarity measure over the intensity image, as stated in Eq. 2.

$$\begin{aligned} d(f_i, f_{i-1}) = - \frac{1}{N} \sum _{x,y}{\frac{(f_i(x,y) - \overline{f}_i)(f_{i-1}(x,y) - \overline{f}_{i-1})}{\sigma _{f_i} \sigma _{f_{i-1}}}} \end{aligned}$$
(2)

where \(\overline{f}_i\) and \(\sigma _{f_{i}}\) are the mean and standard deviation of frame \(f_i\), respectively, and N is the number of pixels.

To reduce sensitivity to local changes between frames and to the presence of noise, each video frame is divided into non-overlapping blocks and the negative cross-correlation is calculated for each pair of corresponding blocks. Algorithm 1 summarizes the main steps of the block-based cross-correlation.

Algorithm 1. Block-based cross-correlation between two consecutive frames.

The block with the minimum dissimilarity is chosen: if even the least dissimilar block has changed significantly, then all other blocks have also changed, which indicates a true transition rather than a localized variation.
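A minimal sketch of this block-based negative cross-correlation (Eq. 2 applied per block, following the description of Algorithm 1), assuming grayscale frames of equal size and the 4×4 grid used in Sect. 4; the handling of flat (zero-variance) blocks and the function name are our own assumptions.

```python
import numpy as np

def block_ncc_dissimilarity(frame_a, frame_b, grid=4):
    """Split both grayscale frames into a grid x grid layout of non-overlapping
    blocks, compute the negative NCC (Eq. 2) for each pair of corresponding
    blocks, and return the minimum block dissimilarity."""
    h, w = frame_a.shape
    bh, bw = h // grid, w // grid
    dissimilarities = []
    for r in range(grid):
        for c in range(grid):
            a = frame_a[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw].astype(np.float64)
            b = frame_b[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw].astype(np.float64)
            a -= a.mean()
            b -= b.mean()
            denom = a.std() * b.std()
            if denom == 0:
                # Flat block (e.g. uniform background): treat as fully correlated.
                dissimilarities.append(-1.0)
            else:
                dissimilarities.append(-np.mean(a * b) / denom)
    return min(dissimilarities)
```

The returned value lies in [-1, 1]: near -1 when consecutive frames are nearly identical and closer to 0 or above when the content changes abruptly.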

3.3 Fusion

The dissimilarity vectors produced by the histogram-based distance and the block-based cross-correlation are combined in order to minimize their individual errors and uncertainty. Prior to the fusion process, a z-score normalization followed by a min-max scaling is applied to both vectors. The resulting dissimilarity vector is the position-wise weighted mean of the two normalized vectors. Equation 3 summarizes the process.

$$\begin{aligned} D = \omega \cdot D_{\textit{CH}} + (1 - \omega ) \cdot D_{\textit{B-NCC}} \end{aligned}$$
(3)

where \(D_{\textit{CH}}\) and \(D_{\textit{B-NCC}}\) are the dissimilarity vectors for the color histogram and the block-based cross-correlation, respectively, D is the final vector of dissimilarities between frames, whereas \(\omega \) is the weight applied to each dissimilarity measure.
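A possible implementation of the normalization and fusion step is sketched below; the small epsilon guards against division by zero and the function names are our own additions.

```python
import numpy as np

def normalize(d, eps=1e-12):
    """Z-score normalization followed by min-max scaling to [0, 1]."""
    d = np.asarray(d, dtype=np.float64)
    d = (d - d.mean()) / (d.std() + eps)
    return (d - d.min()) / (d.max() - d.min() + eps)

def fuse(d_ch, d_bncc, omega=0.5):
    """Weighted mean of the two normalized dissimilarity vectors (Eq. 3).
    omega = 0.5 is the value used in the experiments of Sect. 4."""
    return omega * normalize(d_ch) + (1 - omega) * normalize(d_bncc)
```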

3.4 Adaptive Thresholding

The thresholding over the dissimilarity vector is performed locally through a moving window. Since the goal is to find peaks in the frame dissimilarities, this stage is similar to an outlier detection process.

A local median M is calculated for each moving window of size m centered at i. The frames i and \(i-1\) are considered boundary transition frames if their dissimilarity is equal to or greater than the median plus an \(\alpha \) value (\(d_i \ge M + \alpha \)). Furthermore, \(d_i\) must be the maximum value within the window, ensuring that only the dominant peak is labeled as a transition and avoiding redundant detections. Figure 2 illustrates the behavior of the proposed thresholding method.
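A sketch of this adaptive thresholding rule, using the parameter values reported in Sect. 4; the handling of truncated windows at the sequence borders is our own assumption.

```python
import numpy as np

def detect_cuts(d, window=7, alpha=0.2):
    """Adaptive local thresholding over the fused dissimilarity vector D.
    Position i is reported as a cut if d[i] >= local median + alpha and
    d[i] is the maximum inside the centered window (window = 7 and
    alpha = 0.2 are the values reported in Sect. 4)."""
    d = np.asarray(d, dtype=np.float64)
    half = window // 2
    cuts = []
    for i in range(len(d)):
        lo, hi = max(0, i - half), min(len(d), i + half + 1)
        local = d[lo:hi]
        if d[i] >= np.median(local) + alpha and d[i] == local.max():
            cuts.append(i)  # transition between frames i-1 and i
    return cuts
```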

Fig. 2. Adaptive threshold over temporal dissimilarities.

4 Experimental Results

Experiments were conducted on two different annotated data sets. The first one, referred to here as VIDEOSEG’2004 [10], contains 10 video sequences spanning a diversity of genres, such as news, commercials, movies, cartoons, and television shows, as well as other challenging scenarios with low quality digitization, low lighting conditions, fast motion and production effects. The second data set is the shot boundary test collection from TRECVID’2002 [8]. It consists of 18 videos, most of which are documentaries and amateur films with low quality, noise and production artifacts, varying in length, date of creation and production style.

The evaluation protocol follows the TRECVID guidelines, such that the results are assessed in terms of precision, recall and their harmonic mean (\(F_{\textit{score}}\)). Equations 4 and 5 express the precision and recall measures, respectively, for a video V with a detection set S.

$$\begin{aligned} \text {Precision}&= \frac{\left| \{ f_i \in V : S(i) \in \textit{Cut} \wedge i \in \textit{True Cut}\}\right| }{\left| \{ f_i \in V : S(i) \in \textit{Cut}\}\right| } \end{aligned}$$
(4)
$$\begin{aligned} \text {Recall}&= \frac{\left| \{ f_i \in V : S(i) \in \textit{Cut} \wedge i \in \textit{True Cut}\}\right| }{\left| \{ f_i \in V : i \in \textit{True Cut}\}\right| } \end{aligned}$$
(5)

The \(F_{\textit{score}}\) measure is defined as

$$\begin{aligned} F_{\textit{score}} = 2 \, \frac{\text {Precision} \times \text {Recall}}{\text {Precision}+\text {Recall}} \end{aligned}$$
(6)
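For reference, the three measures can be computed for one video as in the simple sketch below, which assumes that detected cuts are matched to the ground truth by exact frame position; this is a simplification of the official TRECVID matching procedure, and the function name is ours.

```python
def evaluate(detected, ground_truth):
    """Precision, recall and F-score (Eqs. 4-6) for a set of detected cut
    positions against the annotated cuts of a single video."""
    detected, ground_truth = set(detected), set(ground_truth)
    true_positives = len(detected & ground_truth)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall > 0 else 0.0)
    return precision, recall, f_score
```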

The adaptive thresholding parameters were empirically determined and applied to all videos in both data sets. In our experiments, values of \(\alpha = 0.2\) and window size \(m = 7\) achieved the best performance. For the histogram, a quantization with 32 bins for each RGB channel (totaling 32,768 colors) was defined. The number of blocks K applied to the cross-correlation was set to 16, generating a \(4 \times 4\) grid on the frames. For the fusion, a constant weight \(\omega = 0.5\) proved to be the best overall value in both data sets.

Table 1 shows the results for the VIDEOSEG’2004 data set. The proposed fusion method was compared with the individual approaches described above and with a feature-tracking baseline available for the data set [10].

Table 1. Video cut detection results (VIDEOSEG’2004).

It is noticeable that our fusion strategy for color histogram and block-based cross-correlation outperforms the respective individual methods and the provided baseline. The block-based normalized cross-correlation performs poorly in comparison to the global approach, since the videos in the VIDEOSEG’2004 data set have different frame dimensions and the block partitioning can discard important information. Nevertheless, such factors did not affect the performance of our fusion method.

Table 2 shows the results for the TRECVID’2002 data set. It is possible to observe that the proposed fusion outperforms all other approaches. Moreover, the block-based cross-correlation outperforms the global cross-correlation.

Table 2. Video cut detection results (TRECVID’2002).

Figure 3 presents the official results, provided by TRECVID’2002, for each participant in the competition. Numbers in parentheses represent the number of submissions for each team. Our results are presented in a similar manner to allow an adequate comparison.

Fig. 3. Precision/Recall performance for (a) participants in the TRECVID’2002 and (b) methods described in this work.

Our proposed method outperforms the majority of the submissions in both precision and recall. Furthermore, even the methods without combination are competitive with those submitted to TRECVID’2002. This corroborates the advantage of our adaptive thresholding method.

Figure 4 illustrates the inter-frame dissimilarities for a video section from TRECVID’2002. It is possible to observe that the number of false positives detected by the color histogram and by the block-based normalized cross-correlation is reduced through the fusion strategy.

Fig. 4. Frame dissimilarities for a video section from TRECVID’2002.

5 Conclusions and Future Work

This work proposed an adaptive video cut detection method based on the combination of color histograms and cross-correlation. Furthermore, a local thresholding strategy was used to search for locally significant peaks.

Although the inter-frame dissimilarity measures are simple, both the fusion and the adaptive thresholding approaches produced significant improvements in the experimental results obtained on two different data sets containing challenging videos. The proposed method also proved to be very competitive with the best submissions to the TRECVID’2002 shot boundary competition.

As directions for future work, we intend to extend the method to address video gradual transitions and the automatic determination of weights in the fusion process.