An annotation assistance system using an unsupervised codebook composed of handwritten graphical multi-stroke symbols
Highlights
► A semi-automatic annotation system for unknown 2D graphical languages is proposed.
► A multi-stroke symbol codebook based on recurrent patterns is automatically defined.
► A relational graph between graphical units and Minimum Description Length are used.
► First experiments on a handwritten corpus show a reduction of the labeling cost.
Introduction
Graphical symbols, the lexical units of graphical languages, are composed of a spatial layout of one or several strokes. People usually share conventions about symbol shapes, and these conventions allow individuals to read graphical messages comprising similar symbols. Many existing recognition systems (Tappert et al., 1990) analogously require the definition of the character or symbol set, and rely on a training dataset which defines the ground-truth at the symbol level. A machine learning algorithm in a recognition system can consequently be trained to recognize symbols from large, realistic corpora of ground-truthed input. Such datasets are essential for the training, evaluation, and testing stages of recognition systems. However, collecting all the ink samples and labeling them at the symbol level is a very long and tedious task. Hence, it would be valuable to assist this process, so that most of the tedious work is done automatically and only high-level supervision is needed to conclude the labeling process.
In this regard, we propose to automatically extract a finite set of relevant patterns, called a codebook, from an unlabeled dataset. Searching for relevant patterns and extracting them aims to reduce the redundancy in the appearance of basic regular shapes, and of regular layouts of these shapes, in a large collection of handwritten scripts.
For the targeted application, which is related to an on-line handwritten corpus of mathematical numerical expressions, we consider that the basic units are the strokes, where a stroke is a sequence of points between a pen-down and a pen-up. Should this assumption not hold, an additional segmentation process will be required, so that every basic graphical unit belongs to a unique symbol. Conversely, a symbol can be made of one or several strokes, which are not necessarily drawn consecutively, i.e. we do not exclude interspersed symbols. A symbol is thus made of a single stroke, or of several strokes arranged in a specific spatial composition. The problem is to identify symbols in a large collection of handwritten strokes in spatial layouts. Let us consider some simple examples to illustrate the problems.
Imagine a document with only two different shapes of stroke, e.g. “” and “”. Without any context, “” and “” might be regarded as two different symbols, “minus” and “greater than” respectively: each stroke corresponds directly to a single symbol. If the two strokes are placed together like “”, we can imagine that they become another symbol, “arrow”; each stroke is then only a part of a symbol. Hence, depending on the context, the same kind of stroke will be either a single symbol or a piece of a more complex symbol. So the first problem is to search for the different shapes of strokes, termed graphemes.
Let us put two strokes together: there exist many composition rules, named spatial relations. Using the same two graphemes, two different symbols, “” and “”, can be constructed. The only difference between them is that “” is arranged on the right side in “” and on the left side in “”. This left and right relation is easily defined manually.
It is possible to design new symbols made of more graphemes and spatial relations. For instance, a new symbol “” is constructed using the grapheme set . We can say that “” is between “” and “”. In this case, between implies a relationship among three strokes; this number is the cardinality of the spatial relation (Clementini, 2009). In this paper the cardinality of a spatial relation is limited to two strokes: from a reference stroke to an argument stroke; that is, a pairwise spatial relation. However, with only 3 strokes we have to consider 6 different ordered pairs of strokes to cover all appropriate alternatives, for example (“”,“”), (“”, “”), (“”,“”), etc. The number of spatial relation couples grows rapidly with the number of strokes in a layout (Li et al., 2011). Automatically searching for the different pairwise spatial relations is the second problem.
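The count of 6 pairs for 3 strokes follows from treating each relation as directed, from a reference stroke to an argument stroke, which gives n(n-1) pairs for n strokes. The short sketch below only illustrates this counting; the function name and the integer stroke indices are assumptions made for the example and are not taken from the paper.

```python
from itertools import permutations


def ordered_stroke_pairs(stroke_ids):
    """Return every (reference, argument) pair of distinct strokes."""
    return list(permutations(stroke_ids, 2))


# Three strokes give 3 * 2 = 6 directed pairs.
pairs = ordered_stroke_pairs([1, 2, 3])
print(len(pairs))

# The count grows quadratically: n strokes give n * (n - 1) pairs.
for n in (3, 13, 100):
    print(n, len(ordered_stroke_pairs(range(n))))
```

With 13 strokes the exhaustive enumeration already yields 156 directed pairs, which shows why the number of candidate relations grows quickly with the size of the layout.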
Considering a more complicated example, Fig. 1(a) shows four different symbols, “arrow”, “connection”, “process”, and “terminator”. However, the ground-truths are unknown in advance. To avoid the ambiguity that some strokes share the same grapheme, each stroke is referenced by its index. Which set of strokes (a segment) represents a symbol? Why is the combination of the strokes {(1), (2), (3)} a valid symbol (actually “arrow”)? An intuitive answer is that the spatial composition is “frequent”: there exist two similar patterns in the layout, {(1), (2), (3)} and {(5), (6), (7)}, comprising the same graphemes and the same spatial relations respectively (which come from the previous two problems). But the equally frequent combination of fewer strokes {(1), (2)} does not constitute a symbol. Moreover, the third arrow {(11), (12)} contains only two strokes, but its shape is similar to that of the previous two arrows. Graphical symbols with the same ground-truth can contain different numbers of strokes and different graphemes. Hence, the third problem is how to search for repetitive patterns in a layout that correspond to the graphical symbols. A segmentation will therefore be generated at the symbol level.
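One way to see why a larger repeated pattern can be preferred over an equally frequent smaller one is a description-length argument. The sketch below is a deliberately simplified toy, not the formulation used in this paper: it counts cost in strokes rather than bits, ignores spatial relations, and the function name and cost model are assumptions introduced only for illustration. It uses the 13 strokes of the Fig. 1(a) example.

```python
def description_length(total_strokes, pattern_size, instances):
    """Toy description length, counted in strokes rather than bits.

    Cost = strokes needed to define the pattern once
         + units left in the layout after each instance of the
           pattern is replaced by a single unit.
    """
    compressed_layout = total_strokes - instances * (pattern_size - 1)
    return pattern_size + compressed_layout


total = 13  # strokes in the example layout

# Candidate {(1), (2), (3)}: 3 strokes, found twice.
three_stroke = description_length(total, pattern_size=3, instances=2)

# Candidate {(1), (2)}: 2 strokes, also found twice.
two_stroke = description_length(total, pattern_size=2, instances=2)

print(three_stroke, two_stroke)
```

Under this toy cost, the three-stroke candidate gives a shorter description (12) than the two-stroke candidate (13) at the same frequency, so frequency alone does not decide which pattern is retained.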
By grouping graphemes into segments, we obtain a small finite set of symbol hypotheses, called a codebook, with a higher semantic level. This codebook requires fewer annotation operations, as in Fig. 1(b): only 3 segments have to be labeled instead of the 6 symbols, comprising 13 strokes, of Fig. 1(a). However, the similar segments in a cluster of the codebook do not all share the same ground-truth: different symbols can be mixed in one cluster. For instance, the stroke (4) of symbol “connection” and the stroke (13) of symbol “terminator” are merged in the same cluster because their shapes are similar. The ground-truth depends not only on the shape but also on context and meaning. Annotating the segments in a codebook is the fourth problem.
Our previous work (Li et al., 2011) studies unsupervised symbol segmentation using the Minimum Description Length (MDL) principle, and Li et al. (2012) focuses on spatial relation learning. This paper proposes to use this MDL-based unsupervised symbol segmentation to reduce the symbol labeling cost. Section 2 gives a brief survey of cluster labeling in text and in off-line characters, of codebook generation using unsupervised natural language learning built on two-dimensional spatial relations, and of annotation on a codebook. The proposed learning framework is presented in Section 3. In this framework, we extract the codebook composed of multi-stroke symbols which the user can label. Section 5 describes an annotation measure to evaluate the performance on the on-line handwriting corpora. Finally, the conclusion of this work is presented in Section 6.
State of the art
To the authors' knowledge, there is no existing work on unsupervised symbol extraction from on-line handwriting for annotation assistance. However, several related works are discussed in this section: reducing annotation workload, handwriting grapheme extraction, and graphical symbol analysis.
Proposed unsupervised multi-stroke symbol codebook learning framework
Our proposed automatic multi-stroke symbol extraction system is illustrated in Fig. 2. The on-line handwriting is imported into the system, and the codebook is then exported for annotation. The system is an iterative learning procedure comprising six main steps. In raw on-line data, the basic unit is the stroke. In our iterative learning framework, we consider the segment as the basic unit, which may contain a multi-stroke structure. The initial segmentation is set up with each
Annotation using the codebook
In the previous sections, the codebook composed of multi-stroke symbols has been obtained. We choose the center segment of each cluster to generate the visualized codebook. The user then labels these chosen segments stroke by stroke. In this section, we discuss how to label the segments in the dataset with the small labeled codebook.
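The idea of labeling one center segment per cluster and reusing that label for the other members can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of this paper: the center is taken to be the medoid, each segment is reduced to a hypothetical 2-D feature vector, the distance is Euclidean, the label is given per segment rather than stroke by stroke, and `ask_user` is a stand-in for the human annotator.

```python
import math


def medoid(segments, distance):
    """Return the segment whose summed distance to all others is smallest."""
    return min(segments, key=lambda s: sum(distance(s, t) for t in segments))


def propagate_labels(clusters, ask_user, distance=math.dist):
    """Ask for one label per cluster and copy it to every member segment."""
    labels = {}
    for members in clusters.values():
        label = ask_user(medoid(members, distance))
        for segment in members:
            labels[segment] = label
    return labels


# Hypothetical data: each segment is reduced to a 2-D feature vector.
clusters = {
    "c0": [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0)],
    "c1": [(5.0, 5.0), (5.2, 5.0)],
}
queries = []


def ask_user(segment):
    """Stand-in for the annotator: records the query, returns a dummy label."""
    queries.append(segment)
    return f"label_{len(queries)}"


labels = propagate_labels(clusters, ask_user)
print(len(queries), "user labels for", len(labels), "segments")
```

In this sketch the annotator is queried once per cluster, so two answers label five segments. Every member inherits the label of its cluster center, which is exactly where a cluster mixing different symbols would introduce labeling errors.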
In the visualized codebook, the segments in a cluster are not always the same single symbol. Different symbols may be mixed in a cluster since the label is dependent
Experiment
In this section, we first present a cost function to evaluate the labeling procedure. Two on-line handwriting corpora are then described, and the labeling procedure using the different learned codebooks is tested on them.
Conclusion and discussion
In this paper, we propose an iterative learning framework to assist the handwritten graphical symbol labeling. Each stroke is initialized as a segment. A relational graph between the segments is then built. We quantify the segments (nodes) and the spatial relations (edges). The repetitive sub-graphs (symbols) are extracted according to the Minimum Description Length (MDL) principle, and are merged as the new segments for the next iteration. As a result of the quantization of the segments, a
References (30)
- et al. Efficient search strategy in structural analysis for handwritten mathematical expression recognition. Pattern Recognition (2009).
- Modeling by shortest data description. Automatica (1978).
- et al. Automatic writer identification framework for online handwritten documents using character prototypes. Pattern Recognition (2009).
- et al. The Handbook of Computational Linguistics and Natural Language Processing (2010).
- Awal, A.M., Mouchère, H., Viard-Gaudin, C., 2010. A hybrid classifier for handwritten mathematical expression...
- Awal, A.M., Feng, G., Mouchère, H., Viard-Gaudin, C., 2011. First experiments on a new online handwritten flowchart...
- Psychophysical analysis of visual space (1970).
- Bouteruche, F., Macé, S., Anquetil, E., 2006. Fuzzy relative positioning for on-line handwritten stroke analysis. In:...
- Bulacu, M., Schomaker, L., 2006. Combining multiple features for text-independent writer identification and...
- Bulacu, M., Schomaker, L., Brink, A., 2007. Text-independent writer identification and verification on offline arabic...
- Introductory Graph Theory.
- Substructure discovery using minimum description length and background knowledge. J. Artif. Intell. Res.