New Gradient-Spatial-Structural Features for video script identification

https://doi.org/10.1016/j.cviu.2014.09.003

Highlights

  • Dominant pixel selection by exploring gradient information with histograms.

  • Proposing Gradient-Spatial-Features for text candidate selection.

  • Proposing Gradient-Structural-Features in a new way for text candidate selection.

  • Determining weights based on templates for correct classification.

  • Integrating Gradient-Spatial-Structural-Features in a novel way for classification.

Abstract

Multi-script identification helps in automatically selecting an appropriate OCR when a video contains several scripts; however, script identification in video frames is challenging because the low resolution and complex background of video often cause disconnections or the loss of text information. This paper presents a novel idea that integrates the Gradient-Spatial-Features (GSpF) and the Gradient-Structural-Features (GStF) at the block level, based on an error factor and the weights of the features, to identify six video scripts, namely, Arabic, Chinese, English, Japanese, Korean and Tamil. Horizontal and vertical gradient values are first computed for each text block to increase the contrast of text pixels. Then the method divides the horizontal and the vertical gradient blocks into two equal parts at the centroid in the horizontal direction. A histogram operation is performed on each part to select dominant text pixels from the respective subparts of the horizontal and the vertical gradient blocks, which results in text components. After extracting GSpF and GStF from the text components, we finally propose to integrate the spatial and the structural features, based on end points, intersection points, junction points and the straightness of the skeleton of text components, in a novel way to identify the scripts. The method is evaluated on 970 video frames of six scripts which involve font, font size or contrast variations, and is compared with an existing method in terms of classification rate. Experimental results show that the proposed method achieves an 83.0% average classification rate for video script identification. The method is also evaluated by testing on noisy images and scanned low resolution documents, illustrating the robustness and the extensibility of the proposed Gradient-Spatial-Structural Features.
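
As an illustration of the pipeline sketched above, the following Python fragment (using NumPy and OpenCV) shows one plausible form of the dominant text pixel selection stage. The bin count, the clamping of the centroid split and the rule for picking the dominant histogram bin are our assumptions for this sketch; the exact procedure is defined in the full methodology section of the paper.

    import cv2
    import numpy as np

    def dominant_text_pixels(gray_block, n_bins=64):
        # gray_block: 2-D uint8 grayscale image of a detected text block.
        g = gray_block.astype(np.float32)
        grad_h = np.abs(cv2.Sobel(g, cv2.CV_32F, 1, 0, ksize=3))  # horizontal gradient
        grad_v = np.abs(cv2.Sobel(g, cv2.CV_32F, 0, 1, ksize=3))  # vertical gradient

        def dominant_mask(grad):
            # Split the gradient block into two parts at the centroid row
            # (the split runs in the horizontal direction, per the abstract).
            rows = np.arange(grad.shape[0], dtype=np.float32)
            cy = int((grad.sum(axis=1) * rows).sum() / (grad.sum() + 1e-9))
            cy = min(max(cy, 1), grad.shape[0] - 1)  # keep both parts non-empty
            mask = np.zeros(grad.shape, dtype=bool)
            for top, part in ((0, grad[:cy]), (cy, grad[cy:])):
                # Histogram of gradient magnitudes in this part; keep pixels
                # in the most populated high-magnitude bin as "dominant"
                # (an assumed selection rule for this sketch).
                hist, edges = np.histogram(part, bins=n_bins)
                k = n_bins // 2 + int(np.argmax(hist[n_bins // 2:]))
                mask[top:top + part.shape[0]] = (part >= edges[k]) & (part <= edges[k + 1])
            return mask

        # The union of dominant pixels from both gradient directions yields
        # the candidate text components.
        return dominant_mask(grad_h) | dominant_mask(grad_v)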

Introduction

Due to advances in information technology and the increase in the usage of video delivered via TV broadcasting, the Internet and wireless networks, the size of video databases is increasing drastically [1], [2], [3], [4]. In order to enable users to quickly locate content of interest in such an enormous quantity of video data, there must be powerful indexing, retrieval and summarization techniques. In this regard, the current methods proposed in the field of content based image retrieval are not very effective for a large video database, especially for retrieving and labeling video events, which requires high-level semantics rather than only low-level features [5]. Since content based image retrieval methods are not effective in bridging the gap between high-level semantics and low-level features, text detection, script identification and recognition have come into play with the objective of filling the gap with the help of an Optical Character Recognizer (OCR). In this way, text detection and recognition help in generating meaning that is close to the content of the video. Thus, text information can be used effectively for video labeling or event retrieval [1], [2], [3], [4], [5].

Text detection and recognition is a familiar problem for the document analysis community, where researchers detect and extract text information from scanned document images with a high resolution. Since a document image contains a plain or homogeneous background with a high resolution, text detection based on connected component analysis is a successful technique in this field [6], [7]. However, the same methods cannot be used for detecting and extracting text from natural scene images because of complex backgrounds, font and font size variations, and color bleeding [4], [7].

To solve this problem, several methods [8], [9], [10], [11], [12], [13] have been proposed in the literature based on bottom-up analysis or region segmentation. However, these methods still rely on the high resolution of text, which gives high clarity and visibility compared to the background. These methods directly or indirectly expect the complete shape of a character, or consider the character as one component. Hence the methods depend heavily on connected component analysis. Thus, these methods may not be used directly on video for text detection and recognition, because video suffers from both low resolution and complex background, which often cause disconnections, the loss of information, distortion, etc. It is evident that these methods [14], [15] have so far achieved character recognition rates of only 60% to 72%. This poor accuracy is due to the complex background; for example, sports video usually contains scene text embedded in a complex background with greenery, buildings, advertisements, etc. In addition, video usually contains two types of text: graphics text, which is superimposed, and scene text, which is a part of the image. Since graphics text is edited text, it is easy to process and we can expect good clarity and visibility, whereas scene text is unpredictable due to the variations in its characteristics. Further, graphics text is useful for news video analysis, while scene text is useful for navigation applications, such as assisting blind people, vehicle tracking, assisting tourists, sports event extraction and exciting event extraction [1], [2], [3], [4]. The presence of both such texts in video makes the problem more complex and challenging compared to text detection and recognition from either natural scene images or scanned images. In the same way, experimental results show that when we apply conventional character recognition methods directly on video, the character recognition rate achieved is typically between 0% and 45% [16], [17], which is much lower than the recognition rate of scene text in natural scene images. This poor accuracy is because of the presence of both graphics and scene text, low resolution and complex background. Note that although high resolution cameras for capturing images and video are widely available, low resolution cameras are still required in many real life applications, such as mobile services and surveillance systems, because shortages of memory, battery power, computation power or even operating cost limit capture to relatively low resolutions [18], [19]. Therefore, text extraction and recognition from video is still challenging and has thus become an emerging area for researchers.

To address the problems of video, several methods have been proposed in the past years. These methods can be classified broadly as follows: connected component-based [6], [20], texture-based [21], [22], [23], [24], and gradient- or edge-based methods [25], [26], [27], [28]. Although these methods achieve good accuracy for text detection or text block detection irrespective of fonts, font sizes, types of text and orientations, they do not have the ability to differentiate multi-script frames in video, because the goal of these methods was to detect text blocks or text regions regardless of script.

Similarly, we can see scene text recognition from natural scene images [29], [30], [31] and from video [16], [17], [32], [33], which takes the output of text detection, such as text regions, blocks, lines, words and characters, as the input for recognition. Generally, these methods take care of the segmentation and binarization of text areas/blocks with the help of enhancement criteria to increase the contrast of text lines before feeding them to an OCR [34]. Most of the methods either use a publicly available OCR for binarization or their own classifier to recognize text in video. However, these methods are limited to single-script frames in video/images, not video containing multiple scripts, because the extracted features and the OCR are usually designed for a specific language. In addition, there is no universal OCR for recognizing different scripts in video, since that is a hard task. Therefore, without script identification, which helps in automatically choosing an appropriate OCR, there will be a big gap between text detection and recognition. Thus script identification is essential when the environment is multi-script and multi-lingual, as in Singapore, Malaysia and India, where there are more than two official languages.

Script identification at the block level, text line level and word level for a given document image with a plain background has attained high recognition rates in the field of document analysis and recognition [35], [36], [37], [38]. In contrast to document images, script identification in video frames is a new topic and is challenging, as video suffers from low resolution, complex background and font variation, as mentioned above. Besides, the presence of multiple scripts in a video frame makes script identification and recognition more complex and challenging for researchers. As noted from the methods on script identification in document images [39], [40], the Japanese, Korean, Chinese and Tamil scripts worsen the performance of the methods in terms of classification rate because these scripts share common features. Therefore, there is a great demand for developing a new method for automatically identifying scripts in video frames irrespective of text orientation, font variation, font size variation and contrast. In this work, we consider six scripts, namely, Arabic, Chinese, English, Japanese, Korean and Tamil, since these languages represent a wide range of users in the world and are very popular.

Section snippets

Related work

Identification of different scripts in a document with a plain background and a high resolution is a familiar problem in document analysis. In this regard, we can see a lot of methods in the literature. An overview of script identification methodologies based on structure and visual appearance is presented in [35]. It is noted from this review that the surveyed methods work well for camera-based images but not for video frames, since the latter have low contrast and complex backgrounds, and the …

The proposed methodology

As discussed in the introduction section, we can find several sophisticated methods in the literature for text block detection and text region detection in video frames. These methods work well for text of arbitrary orientations, low contrast text, complex backgrounds and, more importantly, multi-script frames in video. For this work, we propose to use our earlier text detection method [56] to detect text blocks from video frames. This method explores wavelet-moments features with mutual …
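
The snippet above is truncated before the feature definitions, but the abstract names end points, intersection points, junction points and the straightness of the skeleton as the structural cues. As a rough sketch under that assumption, the standard neighbor-counting convention on a skeleton (one neighbor marks an end point, three or more a junction/intersection) can produce such counts; this is a stand-in, not the paper's exact GStF definition.

    import numpy as np
    from skimage.morphology import skeletonize

    def skeleton_structural_counts(component_mask):
        # component_mask: 2-D boolean mask of one candidate text component.
        skel = skeletonize(component_mask.astype(bool))
        # Count the 8-neighbours of every pixel via shifted copies of the
        # zero-padded skeleton (the padding absorbs np.roll's wrap-around).
        padded = np.pad(skel, 1).astype(np.uint8)
        neighbors = sum(
            np.roll(np.roll(padded, dy, axis=0), dx, axis=1)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0)
        )[1:-1, 1:-1]
        end_points = int(np.sum(skel & (neighbors == 1)))  # exactly one neighbour
        junctions = int(np.sum(skel & (neighbors >= 3)))   # three or more neighbours
        return end_points, junctions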

Experimental results

We create our own dataset chosen from different sources, such as sports news, weather news, and entertainment video, to show that the proposed method works well for different varieties of video frames. For template generation, we choose 50 blocks manually for each script, thus a total of 300 blocks, to construct their templates. Our testing dataset includes 200 Arabic frames, 200 Chinese frames, 200 English frames, 200 Japanese frames, 200 Korean frames and 200 Tamil frames. In total, 1200 …
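
The template construction described here amounts to averaging the feature vectors of the sample blocks of each script. A minimal sketch follows, with feature extraction abstracted into precomputed vectors; the function names are hypothetical, and the plain Euclidean nearest-template rule is an assumption (the paper's soft integration instead weights the features).

    import numpy as np

    def build_templates(features_by_script):
        # features_by_script: {script_name: list of 1-D feature vectors,
        # one per manually chosen sample block (50 per script here)}.
        return {script: np.mean(np.stack(vecs), axis=0)
                for script, vecs in features_by_script.items()}

    def nearest_template(feature_vec, templates):
        # Assign the script whose mean-feature template is closest.
        return min(templates, key=lambda s: np.linalg.norm(feature_vec - templates[s]))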

Comparative study

According to the literature review in Section 2, most of the methods use an SVM classifier for script classification and identification. In our work, however, we propose to use templates constructed from the average feature values of 50 sample blocks for each script, chosen randomly from the respective databases. Therefore, to show that our soft integration method is effective, we compare it with the SVM classifier results. We use a multi-class SVM classifier with an RBF kernel on the same dataset …
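
For reference, an SVM baseline of this kind can be set up along the following lines with scikit-learn. Here X_train/y_train and X_test are placeholders for the GSpF+GStF vectors and script labels of the blocks, and the hyperparameter values shown are illustrative defaults, not the paper's settings.

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Multi-class SVM with an RBF kernel (scikit-learn handles the
    # multi-class case internally via one-vs-one decomposition).
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    svm.fit(X_train, y_train)        # X_train: block feature vectors, y_train: script labels
    predicted = svm.predict(X_test)  # one script label per test block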

Conclusion and future work

We have proposed a new soft integration method which uses the Gradient-Spatial-Structural-Features for identifying six video scripts. Dominant text pixel selection is based on histograms of the horizontal and vertical gradient values of the input frames. Then we propose new features based on the spatial and structural information of candidate text components in the blocks to identify the scripts. The experimental results on the individual feature extraction methods and the combined …
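
To make the soft integration idea concrete, one plausible reading is a per-script weighted combination of the distances from a block's GSpF and GStF vectors to that script's templates, with the lowest combined score winning. The linear form and the weight source below are our assumptions; the actual combination rule, based on an error factor and template-derived weights, is defined in the full methodology section.

    import numpy as np

    def identify_script(gspf_vec, gstf_vec, templates, weights):
        # templates: {script: (gspf_template, gstf_template)}
        # weights:   {script: (w_spatial, w_structural)} derived from the templates
        scores = {}
        for script, (t_sp, t_st) in templates.items():
            w_sp, w_st = weights[script]
            scores[script] = (w_sp * np.linalg.norm(gspf_vec - t_sp)
                              + w_st * np.linalg.norm(gstf_vec - t_st))
        return min(scores, key=scores.get)  # lowest combined score wins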

Acknowledgments

We are grateful to the anonymous reviewers for their comments and suggestions, which greatly enhanced the quality and clarity of this work. The work described in this paper was partly supported by the Natural Science Foundation of China under Grant Nos. 61272218 and 61321491, the 973 Program of China under Grant No. 2010CB327903, and the Program for New Century Excellent Talents under NCET-11-0232. This research was also supported in part under Grant No. UM.TNC2/IPPP/UPGP/261/15.

References (61)

  • P.B. Pati et al., Word level multi-script identification, Pattern Recogn. Lett. (2008)
  • P. Shivakumara et al., A novel mutual nearest neighbor based symmetry for text frame classification in video, Pattern Recogn. (2011)
  • N. Sharma, U. Pal, M. Blumenstein, Recent advances in video based document processing: a review, in: Proc. DAS, 2012, ...
  • J. Zang, R. Kasturi, Extraction of text objects in video documents: recent progress, in: Proc. DAS, 2008, pp. ...
  • D. Doermann, J. Liang, J. Li, Progress in camera-based document image analysis, in: Proc. ICDAR, 2003, pp. ...
  • V.Y. Mariano, R. Kasturi, Locating uniform-colored text in video frames, in: Proc. ICPR, 2000, pp. ...
  • Q. Ye, Q. Huang, W. Gao, D. Zhao, Fast and robust text detection in images and video frames, in: Proc. Image and ...
  • C.M. Thillou et al., Color text extraction with selective metric-based clustering, Comput. Vis. Image Underst. (2007)
  • C.M. Gracia et al., Fast perspective recovery of text in natural scenes, Image Vis. Comput. (2013)
  • L. Rong, W. Suyu, Z. Shi, A two level algorithm for text detection in natural scene images, in: Proc. DAS, 2014, pp. ...
  • P. Zhou, L. Li, C.L. Tan, Character recognition under severe perspective distortion, in: Proc. ICDAR, 2009, pp. ...
  • Y. Zhou, J. Field, E.L. Miller, R. Wang, Scene text segmentation via inverse rendering, in: Proc. ICDAR, 2013, pp. ...
  • K.L. Bouman et al., A low complexity sign detection and text localization method for mobile applications, IEEE Trans. Multimedia (2011)
  • A. Hartl, G. Reitmayr, Rectangular target extraction for mobile augmented reality applications, in: Proc. ICPR, 2012, ...
  • V. Wu et al., Text finder: an automatic system to detect and recognize text in images, IEEE Trans. PAMI (1999)
  • K.L. Kim et al., Texture-based approach for text detection in images using support vector machines and continuous adaptive mean shift algorithm, IEEE Trans. PAMI (2003)
  • P. Shivakumara et al., A Laplacian approach to multi-oriented text detection in video, IEEE Trans. PAMI (2011)
  • P. Shivakumara et al., Multi-oriented video scene text detection through Bayesian classification and boundary growing, IEEE Trans. CSVT (2012)
  • N. Sharma, P. Shivakumara, U. Pal, M. Blumenstein, C.L. Tan, A new method for arbitrarily-oriented text detection in ...
  • P. Shivakumara et al., Gradient vector flow and grouping based method for arbitrarily-oriented scene text detection in video images, IEEE Trans. Circ. Syst. Video Technol. (TCSVT) (2013)
Cited by (33)

    • Fractional means based method for multi-oriented keyword spotting in video/scene/license plate images

      2019, Expert Systems with Applications
      Citation Excerpt:

      In addition, these methods showed that conventional approaches, such as those proposed for printed document images, may not work well for handwritten documents. Furthermore, the method proposed in Shivakumara et al. (2015a,b) is the most recent that addressed the issue of keyword spotting in video images, like the proposed work. Therefore, we use this method for comparative studies.

    • Improving patch-based scene text script identification with ensembles of conjoined networks

      2017, Pattern Recognition
      Citation Excerpt:

      The unconstrained text understanding problem for large collections of images from unknown sources has not been considered until very recently [7–11]. While there exists some previous research on script identification of text over complex backgrounds [12,13], such methods have so far been limited to video overlaid text, which in general presents different challenges than scene text. This paper addresses the problem of script identification in natural scene images, paving the road towards true multi-lingual end-to-end scene text understanding (Fig. 1).

    • A blind deconvolution model for scene text detection and recognition in video

      2016, Pattern Recognition
      Citation Excerpt:

      Therefore, scene text detection and recognition is challenging compared to graphics text, and it has several real-time applications, as mentioned above. However, it is observed from the literature [4,7] on text detection and recognition that distortion due to motion blur is a major issue among all other artifacts, because blur interferes with the structure of the components, which in turn changes their shape. It is known that text detection and recognition methods usually use the structure of the components, directly or indirectly, for achieving good results.


    This paper has been recommended for acceptance by Daniel Lopresti.
