
1 Introduction

Advances in data acquisition technologies have enabled users to record and share videos through a number of portable devices, such as cell phones, tablets, and digital cameras. Due to this steady increase in multimedia content, a challenging task is to develop efficient mechanisms for storing, indexing, retrieving and transmitting such large amounts of data.

Video summarization [3] consists of automatically generating a short version of a video sequence, allowing the user to quickly assess the relevance of its content from only a set of representative frames. As with temporal video segmentation [4, 6], challenges associated with video summarization include camera motion, varying lighting conditions, diverse video genres, and subjectivity in the evaluation process.

The main contribution of this work is the proposal and evaluation of a video shot segmentation method based on the combination of two inter-frame dissimilarity vectors: color histogram distances and block-based normalized cross-correlation of pixel intensities. In addition, an adaptive local threshold strategy is defined to automatically detect the boundary frames. Experiments conducted on public video sequences demonstrate that the proposed method achieves high accuracy rates.

This paper is organized as follows. Section 2 briefly presents some relevant concepts and works related to the topic under investigation. Section 3 describes the proposed video shot detection methodology. Section 4 presents and discusses some of the results obtained with the proposed method. Finally, Sect. 5 concludes our work and includes some future work suggestions for improving the proposed method.

2 Background

Due to the advances in multimedia technology and large availability of digital content, there is an increasing demand for robust mechanisms for storing, indexing, browsing and retrieving video data. An open research problem is the automatic construction of a compact and meaningful representation of massive video sequences to help users understand the most important information of their content [9].

Temporal segmentation of a video into semantic units is a crucial stage in the analysis of video contents, whose process is known as shot boundary detection. A video shot consists of one or more frames generated contiguously to form a continuous action in time and space. A video summary can be constructed from a set of keyframes that represent the shots. In this context, two categories of transitions between shots are commonly defined: abrupt and gradual transitions. An abrupt transition corresponds to a cut between one frame of a shot and its adjacent frame in the next shot, whereas a gradual transition represents a smooth change over several frames.

Several video shot boundary detection approaches have been proposed in the literature [1, 2, 5, 7]. Two main steps are commonly performed in the cut detection methods: (i) a similarity or dissimilarity measure is initially computed for each pair of consecutive frames and (ii) a cut is detected if the measure is higher than a specified threshold.

3 Methodology

The proposed video cut detection method is based on two different dissimilarities between consecutive frames: the Bhattacharyya distance between color histograms and the inverse normalized cross-correlation between intensity image blocks. The resulting measures are combined through a simple mean fusion and submitted to an adaptive thresholding technique that detects relatively high disparities and classifies the frames as part of a shot transition or not. These main steps are illustrated in Fig. 1.

Fig. 1. A flowchart of the proposed video cut detection method.

3.1 Histogram-Based Dissimilarity

In order to calculate the inter-frame dissimilarity, a quantized color histogram (CH) is extracted from each frame and the dissimilarity between two consecutive frames is measured by the Bhattacharyya distance, as defined in Eq. 1.

$$\begin{aligned} d(H_i, H_{i-1}) = \sqrt{1 - \frac{1}{\sqrt{\overline{H}_i \cdot \overline{H}_{i-1} \cdot N^2}} \sum _{b=1}^{N}{\sqrt{H_i(b) \cdot H_{i-1}(b)}}} \end{aligned}$$
(1)

where \(\overline{H}_k = \frac{1}{N}\sum _j{H_k(j)}\), N is the number of histogram bins, and \(H_i(b)\) is the probability of frame i having a pixel that falls into the color bin b.
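As an illustration, the following is a minimal sketch of this step in Python/NumPy, assuming frames are given as H×W×3 RGB arrays; the function names and the joint-histogram indexing scheme are our own choices rather than details from the original implementation. Note that when both histograms are normalized to sum to one, the normalization factor in Eq. 1 reduces to one and the expression becomes the standard Bhattacharyya distance.

```python
import numpy as np

def quantized_histogram(frame, bins_per_channel=32):
    """Joint RGB histogram with 32 bins per channel (32**3 = 32,768 bins,
    as in Sect. 4), normalized to sum to 1. `frame` is an HxWx3 uint8 array."""
    q = (frame // (256 // bins_per_channel)).reshape(-1, 3).astype(np.int64)
    # Map each quantized (r, g, b) triple to a single joint bin index.
    idx = (q[:, 0] * bins_per_channel + q[:, 1]) * bins_per_channel + q[:, 2]
    hist = np.bincount(idx, minlength=bins_per_channel ** 3).astype(np.float64)
    return hist / hist.sum()

def bhattacharyya_distance(h1, h2):
    """Bhattacharyya distance of Eq. 1: 0 for identical histograms,
    close to 1 for histograms with little overlap."""
    n = h1.size
    norm = np.sqrt(h1.mean() * h2.mean() * n * n)
    bc = np.sum(np.sqrt(h1 * h2)) / norm
    return np.sqrt(max(0.0, 1.0 - bc))  # clamp guards against rounding error
```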

3.2 Block-Based Cross-Correlation

The negative normalized cross-correlation (NCC) is used as a dissimilarity measure over the intensity image, as stated in Eq. 2.

$$\begin{aligned} d(f_i, f_{i-1}) = - \frac{1}{N} \sum _{x,y}{\frac{(f_i(x,y) - \overline{f}_i)(f_{i-1}(x,y) - \overline{f}_{i-1})}{\sigma _{f_i} \sigma _{f_{i-1}}}} \end{aligned}$$
(2)

where \(\overline{f}_i\) and \(\sigma _{f_{i}}\) are the mean and standard deviation of frame \(f_i\), respectively, and N is the number of pixels.

To reduce sensitivity to local changes between frames and to the presence of noise, each video frame is divided into non-overlapping blocks and the negative cross-correlation is calculated for each pair of corresponding blocks. Algorithm 1 summarizes the main steps of the block-based cross-correlation.

Algorithm 1. Block-based cross-correlation between two consecutive frames.

The block with the minimum dissimilarity is chosen: if even the least dissimilar block has changed significantly, then all other blocks have also changed, which indicates a true transition rather than a localized variation.
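A minimal sketch of this block-based negative cross-correlation (Eq. 2 applied per block, following the description of Algorithm 1), assuming grayscale frames of equal size and the 4×4 grid used in Sect. 4; the handling of flat (zero-variance) blocks and the function name are our own assumptions.

```python
import numpy as np

def block_ncc_dissimilarity(frame_a, frame_b, grid=4):
    """Split both grayscale frames into a grid x grid layout of non-overlapping
    blocks, compute the negative NCC (Eq. 2) for each pair of corresponding
    blocks, and return the minimum block dissimilarity."""
    h, w = frame_a.shape
    bh, bw = h // grid, w // grid
    dissimilarities = []
    for r in range(grid):
        for c in range(grid):
            a = frame_a[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw].astype(np.float64)
            b = frame_b[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw].astype(np.float64)
            a -= a.mean()
            b -= b.mean()
            denom = a.std() * b.std()
            if denom == 0:
                # Flat block (e.g. uniform background): treat as fully correlated.
                dissimilarities.append(-1.0)
            else:
                dissimilarities.append(-np.mean(a * b) / denom)
    return min(dissimilarities)
```

The returned value lies in [-1, 1]: near -1 when consecutive frames are nearly identical and closer to 0 or above when the content changes abruptly.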

3.3 Fusion

The dissimilarity vectors produced by the histogram-based distance and the block-based cross-correlation are combined in order to minimize their individual errors and uncertainty. Prior to the fusion process, a z-score normalization followed by a min-max scaling is applied to both vectors. The resulting dissimilarity vector is the position-wise weighted mean of the two normalized vectors. Equation 3 summarizes the process.

$$\begin{aligned} D = \omega \cdot D_{\textit{CH}} + (1 - \omega ) \cdot D_{\textit{B-NCC}} \end{aligned}$$
(3)

where \(D_{\textit{CH}}\) and \(D_{\textit{B-NCC}}\) are the dissimilarity vectors for the color histogram and the block-based cross-correlation, respectively, D is the final vector of dissimilarities between frames, whereas \(\omega \) is the weight applied to each dissimilarity measure.
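A possible implementation of the normalization and fusion step is sketched below; the small epsilon guards against division by zero and the function names are our own additions.

```python
import numpy as np

def normalize(d, eps=1e-12):
    """Z-score normalization followed by min-max scaling to [0, 1]."""
    d = np.asarray(d, dtype=np.float64)
    d = (d - d.mean()) / (d.std() + eps)
    return (d - d.min()) / (d.max() - d.min() + eps)

def fuse(d_ch, d_bncc, omega=0.5):
    """Weighted mean of the two normalized dissimilarity vectors (Eq. 3).
    omega = 0.5 is the value used in the experiments of Sect. 4."""
    return omega * normalize(d_ch) + (1 - omega) * normalize(d_bncc)
```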

3.4 Adaptive Thresholding

The thresholding over the dissimilarity vector is performed locally through a moving window. Since the goal is to find peaks in the frame dissimilarities, this stage is similar to an outlier detection process.

A local median M is calculated for each moving window of size m centered at i. The frames i and \(i-1\) are considered boundary transition frames if their dissimilarity is equal to or greater than the median plus an \(\alpha \) value (\(d_i \ge M + \alpha \)). Furthermore, \(d_i\) must be the maximum value within the window, ensuring that only the dominant peak is labeled as a transition and avoiding redundant detections. Figure 2 illustrates the behavior of the proposed thresholding method.
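A sketch of this adaptive thresholding rule, using the parameter values reported in Sect. 4; the handling of truncated windows at the sequence borders is our own assumption.

```python
import numpy as np

def detect_cuts(d, window=7, alpha=0.2):
    """Adaptive local thresholding over the fused dissimilarity vector D.
    Position i is reported as a cut if d[i] >= local median + alpha and
    d[i] is the maximum inside the centered window (window = 7 and
    alpha = 0.2 are the values reported in Sect. 4)."""
    d = np.asarray(d, dtype=np.float64)
    half = window // 2
    cuts = []
    for i in range(len(d)):
        lo, hi = max(0, i - half), min(len(d), i + half + 1)
        local = d[lo:hi]
        if d[i] >= np.median(local) + alpha and d[i] == local.max():
            cuts.append(i)  # transition between frames i-1 and i
    return cuts
```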

Fig. 2. Adaptive threshold over temporal dissimilarities.

4 Experimental Results

Experiments were conducted on two different annotated data sets. The first one, referred to here as VIDEOSEG’2004 [10], contains 10 video sequences spanning a diversity of genres, such as news, commercials, movies, cartoons, and television shows, as well as other challenging scenarios with low quality digitization, low lighting conditions, fast motion and production effects. The second data set is the shot boundary test collection from TRECVID’2002 [8]. It consists of 18 videos, most of which are documentaries and amateur films with low quality, noise and production artifacts, varying in length, date of creation and production style.

The evaluation protocol follows the TRECVID guidelines, such that the results are assessed in terms of precision, recall and their harmonic mean (\(F_{\textit{score}}\)). Equations 4 and 5 express the precision and recall measures, respectively, for a video V with a detection set S.

$$\begin{aligned} \text {Precision}&= \frac{\left| \{ f_i \in V : S(i) \in \textit{Cut} \wedge i \in \textit{True Cut}\}\right| }{\left| \{ f_i \in V : S(i) \in \textit{Cut}\}\right| } \end{aligned}$$
(4)
$$\begin{aligned} \text {Recall}&= \frac{\left| \{ f_i \in V : S(i) \in \textit{Cut} \wedge i \in \textit{True Cut}\}\right| }{\left| \{ f_i \in V : i \in \textit{True Cut}\}\right| } \end{aligned}$$
(5)

The \(F_{\textit{score}}\) measure is defined as

$$\begin{aligned} F_{\textit{score}} = 2 \, \frac{\text {Precision} \times \text {Recall}}{\text {Precision}+\text {Recall}} \end{aligned}$$
(6)
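For reference, the three measures can be computed for one video as in the simple sketch below, which assumes that detected cuts are matched to the ground truth by exact frame position; this is a simplification of the official TRECVID matching procedure, and the function name is ours.

```python
def evaluate(detected, ground_truth):
    """Precision, recall and F-score (Eqs. 4-6) for a set of detected cut
    positions against the annotated cuts of a single video."""
    detected, ground_truth = set(detected), set(ground_truth)
    true_positives = len(detected & ground_truth)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall > 0 else 0.0)
    return precision, recall, f_score
```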

The adaptive thresholding parameters were empirically determined and applied to all videos in both data sets. In our experiments, values of \(\alpha = 0.2\) and window size \(m = 7\) achieved the best performance. For the histogram, a quantization with 32 bins for each RGB channel (totaling 32,768 colors) was defined. The number of blocks K applied to the cross-correlation was set to 16, generating a \(4 \times 4\) grid on the frames. For the fusion, a constant weight \(\omega = 0.5\) proved to be the best overall value in both data sets.

Table 1 shows the results for the VIDEOSEG’2004 data set. The proposed fusion method was compared with the individual approaches described above and with a feature-tracking baseline available for the data set [10].

Table 1. Video cut detection results (VIDEOSEG’2004).

It is noticeable that our fusion strategy for color histogram and block-based cross-correlation outperforms the respective individual methods and the provided baseline. The block-based normalized cross-correlation performs poorly in comparison to the global approach, since the videos in the VIDEOSEG’2004 data set have different frame dimensions and the block partitioning can discard important information. Nevertheless, such factors did not affect the performance of our fusion method.

Table 2 shows the results for the TRECVID’2002 data set. It is possible to observe that the proposed fusion outperforms all other approaches. Moreover, the block-based cross-correlation outperforms the global cross-correlation.

Table 2. Video cut detection results (TRECVID’2002).

Figure 3 presents the official results, provided by TRECVID’2002, for each participant in the competition. Numbers in parentheses represent the number of submissions for each team. Our results are presented in a similar manner to allow an adequate comparison.

Fig. 3. Precision/Recall performance for (a) participants in the TRECVID’2002 and (b) methods described in this work.

Our proposed method outperforms the majority of the submissions in both precision and recall. Furthermore, even the methods without combination are competitive with those submitted to TRECVID’2002. This corroborates the advantage of our adaptive thresholding method.

Figure 4 illustrates the inter-frame dissimilarities for a video section from TRECVID’2002. It is possible to observe that the number of false positives detected by the color histogram and by the block-based normalized cross-correlation is reduced through the fusion strategy.

Fig. 4. Frame dissimilarities for a video section from TRECVID’2002.

5 Conclusions and Future Work

This work proposed an adaptive video cut detection method based on the combination of color histograms and cross-correlation. Furthermore, a local thresholding strategy was used to search for locally significant peaks.

Although the inter-frame dissimilarity measures are simple, both the fusion and the adaptive thresholding approaches produced significant improvements in the experimental results obtained on two different data sets containing challenging videos. The proposed method also proved to be very competitive with the best submissions to the TRECVID’2002 shot boundary competition.

As directions for future work, we intend to extend the method to address video gradual transitions and the automatic determination of weights in the fusion process.