Pattern Recognition Letters

Volume 35, 1 January 2014, Pages 46-57

An annotation assistance system using an unsupervised codebook composed of handwritten graphical multi-stroke symbols

https://doi.org/10.1016/j.patrec.2012.11.018

Abstract

Most current recognition systems rely on ground-truthed datasets for training, evaluation, and testing, but creating such datasets is a tedious task. This paper proposes an iterative unsupervised learning framework for handwritten graphical symbols that can assist this labeling task. Initializing each stroke as a segment, we construct a relational graph in which the nodes are the segments and the edges are the spatial relations between them. To extract the relevant patterns, both the segments and the spatial relations are quantized. Discovering graphical symbols then becomes the problem of finding repetitive sub-graphs according to the Minimum Description Length (MDL) principle. The discovered graphical symbols become the new segments for the next iteration. At each iteration, the quantization of the segments yields a codebook in which the user can label graphical symbols. This method has first been applied to a dataset of simple mathematical expressions. The results reported in this work show that only 58.2% of the strokes have to be manually labeled.

Highlights

► A semi-automatic annotation system for unknown 2D graphical languages is proposed.
► A multi-stroke symbol codebook based on recurrent patterns is automatically defined.
► A relational graph between graphical units and Minimum Description Length are used.
► First experiments on a handwritten corpus show a reduction of the labeling cost.

Introduction

Graphical symbols, the lexical units of graphical languages, are composed of a spatial layout of one or several strokes. Writers usually share some conventions about symbol shapes; these conventions allow individuals to read graphical messages comprising similar symbols. Many existing recognition systems (Tappert et al., 1990) likewise require the definition of the character or symbol set and rely on a training dataset that defines the ground-truth at the symbol level. The machine learning algorithms in such recognition systems can consequently be trained to recognize symbols from large, realistic corpora of ground-truthed input. Such datasets are essential for the training, evaluation, and testing stages of recognition systems. However, collecting all the ink samples and labeling them at the symbol level is a very long and tedious task. Hence, it would be very useful to assist this process, so that most of the tedious work is done automatically and only a high-level supervision needs to be provided to conclude the labeling process.

In this regard, we propose to automatically extract a finite set of relevant patterns, called a codebook, from an unlabeled dataset. Searching for these relevant patterns and extracting them aims to reduce the redundancy, in a large collection of handwritten scripts, of basic regular shapes and of the regular layouts of these shapes.

For the targeted application, an on-line handwritten corpus of mathematical numerical expressions, we consider that the basic units are the strokes, i.e. the sequences of points between a pen-down and a pen-up. Should this assumption not hold, an additional segmentation process would be required so that every basic graphical unit belongs to a unique symbol. Conversely, a symbol can be made of one or several strokes, which are not necessarily drawn consecutively, i.e. we do not exclude interspersed symbols. In other words, a symbol is made of a single stroke or of several strokes arranged in a specific spatial composition. The problem is to identify symbols in a large collection of handwritten strokes laid out in space. Let us illustrate the problems with some simple examples.
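
To make these assumptions concrete, the following minimal sketch (in Python, with hypothetical names not taken from the paper) shows one way to represent strokes and segments, and the initial segmentation where every stroke is its own segment.

```python
from dataclasses import dataclass
from typing import List, Tuple, FrozenSet

# Illustrative data structures only; the names are assumptions, not the paper's code.

@dataclass
class Stroke:
    """A stroke: the points sampled between a pen-down and a pen-up event."""
    index: int                          # position of the stroke in the document
    points: List[Tuple[float, float]]   # (x, y) coordinates in writing order

@dataclass(frozen=True)
class Segment:
    """A segment: a set of strokes hypothesised to form one graphical symbol."""
    stroke_indices: FrozenSet[int]

def initial_segmentation(strokes: List[Stroke]) -> List[Segment]:
    """Initially, each stroke is its own segment, as assumed in the paper."""
    return [Segment(frozenset({s.index})) for s in strokes]
```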

Imagine a document with only two different shapes of stroke, e.g. “-” and “>”. Without any context, “-” and “>” might be regarded as two different symbols, “minus” and “greater than” respectively: each stroke corresponds directly to a single symbol. If the two strokes are placed together, as in “→”, we can imagine that they become another symbol, “arrow”; each stroke is then only a part of a symbol. Thus, depending on the context, the same kind of stroke will be either a single symbol or a piece of a more complex symbol. The first problem is therefore to search for the different shapes of strokes, termed graphemes.
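
One plausible way to obtain such graphemes is to describe each stroke by a normalized shape vector and to cluster the vectors. The sketch below is an assumption made for illustration (the resampling feature and the agglomerative clustering are not necessarily the paper's exact choices).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def stroke_features(points, n=16):
    """Resample a stroke to n points and normalise it into a unit box."""
    pts = np.asarray(points, dtype=float)
    # Cumulative arc length used as the resampling parameter.
    d = np.r_[0.0, np.cumsum(np.linalg.norm(np.diff(pts, axis=0), axis=1))]
    t = np.linspace(0.0, d[-1] if d[-1] > 0 else 1.0, n)
    resampled = np.column_stack([np.interp(t, d, pts[:, 0]),
                                 np.interp(t, d, pts[:, 1])])
    resampled -= resampled.min(axis=0)
    scale = resampled.max() or 1.0
    return (resampled / scale).ravel()

def quantize_graphemes(stroke_point_lists, n_graphemes=10):
    """Assign each stroke a grapheme label by clustering its shape vector."""
    X = np.vstack([stroke_features(p) for p in stroke_point_lists])
    return AgglomerativeClustering(n_clusters=n_graphemes).fit_predict(X)
```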

Let us now put two strokes together: there exist many composition rules, named spatial relations. Using the same two graphemes, two different symbols can be constructed; the only difference between them is that the “-” is arranged on the right side in one symbol and on the left side in the other. This left/right relation is easy to define manually.

It is possible to design new symbols with more graphemes and spatial relations. For instance, a new symbol can be constructed using the grapheme set {<, -, >}: we can say that “-” is between “<” and “>”. In this case, between implies a relationship among three strokes, which is the cardinality of this spatial relation (Clementini, 2009). In this paper the cardinality of spatial relations is limited to two strokes, from a reference stroke to an argument stroke; that is, a pairwise spatial relation. However, with only 3 strokes we already have to consider 6 different ordered pairs of strokes to cover all the alternatives, for example (“<”, “-”), (“-”, “<”), (“<”, “>”), etc. The number of spatial relation couples grows rapidly with the number of strokes in a layout (Li et al., 2011). Automatically searching for the different pairwise spatial relations is the second problem.
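
As a purely illustrative example of a pairwise spatial relation (the paper learns its relations from data, whereas this hand-made quantization into four directions is only an assumption), one could compute the direction between bounding-box centers and enumerate all ordered pairs, which already makes the quadratic growth visible.

```python
import math
from itertools import permutations

def bbox_center(points):
    xs, ys = zip(*points)
    return ((min(xs) + max(xs)) / 2.0, (min(ys) + max(ys)) / 2.0)

def relation(reference_points, argument_points):
    """Direction from the reference stroke to the argument stroke, quantised
    into four classes (assuming a y-up coordinate frame)."""
    (xr, yr), (xa, ya) = bbox_center(reference_points), bbox_center(argument_points)
    angle = math.degrees(math.atan2(ya - yr, xa - xr)) % 360.0
    return ["right", "above", "left", "below"][int(((angle + 45.0) % 360.0) // 90.0)]

def all_pairwise_relations(stroke_point_lists):
    """Every ordered (reference, argument) pair: O(n^2) pairs for n strokes."""
    n = len(stroke_point_lists)
    return {(i, j): relation(stroke_point_lists[i], stroke_point_lists[j])
            for i, j in permutations(range(n), 2)}
```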

Consider a more complicated example: Fig. 1(a) shows four different symbols, “arrow”, “connection”, “process”, and “terminator”, whose ground-truths are unknown in advance. To avoid the ambiguity that some strokes share the same grapheme, each stroke is referenced by its index, noted (·). Which set of strokes (a segment) represents a symbol? Why is the combination of strokes {(1), (2), (3)} a valid symbol (actually “arrow”)? An intuitive answer is that this spatial composition is “frequent”: there exist two similar patterns in the layout, {(1), (2), (3)} and {(5), (6), (7)}, comprising respectively the same graphemes and the same spatial relations (which come from the previous two problems). But the equally frequent combination of fewer strokes {(1), (2)} does not constitute a symbol. Moreover, the third arrow {(11), (12)} contains only two strokes, yet its shape is similar to that of the previous two arrows. Graphical symbols with the same ground-truth can thus contain different numbers of strokes and different graphemes. Hence, the third problem is how to search for repetitive patterns in a layout, yielding the graphical symbols. A segmentation is thereby generated at the symbol level.
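
A toy version of the MDL selection criterion (in the spirit of the cited substructure-discovery work of Cook and Holder) can make the intuition explicit: a candidate pattern is worth keeping if rewriting each of its occurrences as a single node shrinks the overall description of the graph. Here the description length is crudely approximated by counting graph elements; the paper's actual encoding is more refined.

```python
def description_length(n_nodes, n_edges):
    return n_nodes + n_edges          # one "unit" per graph element (toy model)

def mdl_gain(graph_nodes, graph_edges, pattern_nodes, pattern_edges, n_occurrences):
    """Gain = DL(graph) - [DL(pattern) + DL(graph compressed with the pattern)]."""
    before = description_length(graph_nodes, graph_edges)
    # Each occurrence collapses pattern_nodes nodes into 1 and removes its internal edges.
    compressed_nodes = graph_nodes - n_occurrences * (pattern_nodes - 1)
    compressed_edges = graph_edges - n_occurrences * pattern_edges
    after = description_length(pattern_nodes, pattern_edges) \
          + description_length(compressed_nodes, compressed_edges)
    return before - after

# Illustrative numbers only: a 13-node layout where a 3-node "arrow" pattern occurs twice.
print(mdl_gain(graph_nodes=13, graph_edges=20, pattern_nodes=3, pattern_edges=3,
               n_occurrences=2))   # positive gain -> the pattern compresses the graph
```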

By grouping graphemes into segments, we obtain a small finite set of symbol hypotheses, called a codebook, with a higher semantic level. This codebook requires fewer annotation operations, as in Fig. 1(b): only 3 segments have to be labeled instead of the 6 symbols comprising 13 strokes in Fig. 1(a). However, the similar segments gathered in a cluster of the codebook do not all carry the same ground-truth: different symbols can be mixed in one cluster. For instance, the stroke (4) of the symbol “connection” and the stroke (13) of the symbol “terminator” are merged into the same cluster because their shapes are similar. The ground-truth depends not only on the shape but also on the context and the meaning. Annotating the segments of the codebook is the fourth problem.

Our previous work (Li et al., 2011) studied unsupervised symbol segmentation using the MDL principle, and Li et al. (2012) focused on spatial relation learning. This paper proposes to use this unsupervised, MDL-based symbol segmentation to reduce the symbol labeling cost. Section 2 gives a brief survey of cluster labeling in text and in off-line characters, of codebook generation using unsupervised natural language learning built on two-dimensional spatial relations, and of annotation with a codebook. The proposed learning framework is described in Section 3. In this framework, we extract a codebook composed of multi-stroke symbols which the user can label. Section 5 describes an annotation measure used to evaluate the performance on on-line handwriting corpora. Finally, the conclusion of this work is presented in Section 6.

Section snippets

State of the art

To the authors' knowledge, there is no existing work on unsupervised symbol extraction from on-line handwriting for annotation assistance. However, several related works are discussed in this section: reducing the annotation workload, handwriting grapheme extraction, and graphical symbol analysis.

Proposed unsupervised multi-stroke symbol codebook learning framework

Our proposed automatic multi-stroke symbol extraction system is illustrated in Fig. 2. The on-line handwriting is imported into the system, and the codebook is then exported for the annotation. The system is an iterative learning process involving six main steps. In the raw on-line data, the basic unit is the stroke. In our iterative learning framework, we consider the segment, which may contain a multi-stroke structure, as the basic unit. The initial segmentation is set up with each
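
The control flow of this iteration can be sketched as follows. Only the loop structure is shown; the step functions are passed in as callables because their real implementations (quantization, relational-graph building, MDL search, merging) are what the paper defines, and the names used here are assumptions.

```python
def iterative_codebook_learning(strokes,
                                quantize_segments,       # segments -> codebook (clusters)
                                build_relational_graph,  # segments -> relational graph
                                find_mdl_patterns,       # graph -> repetitive sub-graphs
                                merge_patterns,          # (segments, patterns) -> new segments
                                max_iterations=10):
    # Initially, each stroke is its own segment.
    segments = [frozenset({i}) for i in range(len(strokes))]
    codebook = None
    for _ in range(max_iterations):
        codebook = quantize_segments(segments)           # clusters the user could label
        graph = build_relational_graph(segments)         # nodes: segments, edges: relations
        patterns = find_mdl_patterns(graph)              # sub-graphs that reduce the DL
        if not patterns:                                 # nothing left to compress
            break
        segments = merge_patterns(segments, patterns)    # merged patterns become new segments
    return codebook, segments
```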

Annotation using the codebook

In the previous sections, the codebook composed of multi-stroke symbols has been obtained. We choose the center segment of each cluster to generate the visualized codebook. The user therefore labels these chosen segments stroke by stroke. In this section, we discuss how to label the segments of the dataset with this small labeled codebook.
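
One possible realization of this step, sketched below under the assumption that segments are described by numeric feature vectors (an assumption of this illustration, not the paper's exact matching procedure), is to show the medoid of each cluster to the user and to let every segment of the cluster inherit the label given to its medoid.

```python
import numpy as np

def cluster_medoids(features, cluster_ids):
    """features: (n, d) array, cluster_ids: length-n list.
    Return {cluster_id: index of the segment closest to all others in its cluster}."""
    medoids = {}
    for c in set(cluster_ids):
        idx = [i for i, ci in enumerate(cluster_ids) if ci == c]
        sub = features[idx]
        dist = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=-1)
        medoids[c] = idx[int(dist.sum(axis=1).argmin())]
    return medoids

def propagate_labels(cluster_ids, user_labels):
    """user_labels: {cluster_id: label typed by the user for its medoid}."""
    return [user_labels[c] for c in cluster_ids]
```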

In the visualized codebook, the segments in a cluster are not always the same single symbol. Different symbols may be mixed in a cluster since the label is dependent

Experiment

In this section, we first present a cost function to evaluate the labeling procedure. Two on-line handwriting corpora are then described, and the labeling procedure using the different learned codebooks is tested on them.
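
The exact cost function is defined in the full text; one plausible reading of its spirit, consistent with the figure reported in the abstract, is the fraction of strokes the user still has to label by hand. The sketch below uses illustrative numbers only.

```python
def manual_labeling_ratio(manually_labeled_strokes: int, total_strokes: int) -> float:
    """Fraction of strokes that still require a manual label (lower is better)."""
    return manually_labeled_strokes / total_strokes

print(f"{manual_labeling_ratio(582, 1000):.1%}")   # e.g. 58.2%, illustrative counts only
```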

Conclusion and discussion

In this paper, we propose an iterative learning framework to assist the labeling of handwritten graphical symbols. Each stroke is initialized as a segment. A relational graph between the segments is then built. We quantize the segments (nodes) and the spatial relations (edges). The repetitive sub-graphs (symbols) are extracted according to the Minimum Description Length (MDL) principle and are merged into new segments for the next iteration. As a result of the quantization of the segments, a

References (30)

  • Chartrand, G., 1985. Introductory Graph Theory.
  • Clementini, E., 2009. A conceptual framework for modelling spatial relations. Ph.D. Thesis, INSA, ...
  • Cook, D.J., et al., 1994. Substructure discovery using minimum description length and background knowledge. J. Artif. Intell. Res.
  • Cook, D.J., Holder, L.B., 2011. Substructure discovery using examples. ...
  • Delaye, A., Mac, S., Anquetil, E., 2009. Modeling relative positioning of handwritten patterns. In: 14th Biennial Conf. ...