1 Introduction

With the advent of pen- and touch-based input devices in PDAs and other computing devices, human-computer interfaces have attracted the interest of many researchers across the world. Many mobile and tablet applications have been built on these technologies to facilitate the processing of information, which is mostly collected through forms filled in by people. These touch-based interfaces have made it possible to handwrite documents on computers. In these devices, pen or finger movements between successive PEN-DOWN and PEN-UP events are picked up by sensors. This data, considered a dynamic representation of handwriting, is known as digital ink. In an online handwriting recognition system, this digital ink is converted to characters of a language. A lot of work in handwriting recognition has been done for non-Indian languages such as simplified Chinese, traditional Chinese, English, Japanese, and Korean [1,2,3], whereas not much work has been done for Indian languages. Gurmukhi is the script used to write the Punjabi language. Table 1 contains the characters used in Gurmukhi script. The script is written from left to right. Gurmukhi script contains 41 consonants: 35 base consonants, or akhars, and 6 consonants with a subscript dot (last column under consonants in Table 1). ੳ, ਅ, and ੲ are the three vowel holders in Gurmukhi script. A vowel holder is used where there is no consonant between vowel sounds. The script contains 10 vowels (also called lāga mātrās); 9 of them have their own symbols, and the remaining one, known as mukta, has no symbol. The first column under vowels in Table 1 shows these 9 vowels with their vowel modifiers, and the last column shows the usage of these vowels with a consonant. A mukta is pronounced between consonants wherever no other vowel is present, unless otherwise indicated.
The vowel symbols can appear above, below, or to either side of consonants or their respective vowel holders. The script also has two symbols for nasal sounds, bindī and ṭippī, and one symbol, addak, which duplicates the sound of any consonant. Table 1 also shows the subjoined symbols that can be placed at the bottom of other characters. These symbols are also known as paireen characters. Gurmukhi characters are written in various styles, which magnifies the difficulty of recognizing a particular character. A stroke of a Gurmukhi character may lie in one of three horizontal zones, namely the upper zone, the middle zone, and the lower zone, as depicted in Fig. 1. The upper zone denotes the region above the headline, where some of the vowels and sub-parts of some other vowels reside, while the middle zone represents the area below the headline, where the consonants and some sub-parts of vowels are present. The middle zone contains most of the Gurmukhi characters. The lower zone represents the area below the middle zone, where some vowels and subjoined symbols are present. A Gurmukhi character can be written with a single stroke or a combination of strokes, with the stroke considered the smallest unit. These strokes may appear in a single zone or in a combination of the three zones described above. Identification of the headline and baseline is a major task in identifying the strokes located in these three zones. To initiate the recognition process, various classes of strokes were identified. It has been observed that similar stroke shapes appear in the three different zones, and different characters are formed depending on the zone in which these strokes are written. In the current work, 74 different stroke classes, distinguishable from each other in shape and formation, have been identified, as illustrated in Table 2.
Table 3 shows some examples where similar stroke shapes written in different zones result in the formation of different characters. A writing zone identification algorithm is proposed in this work so that the Gurmukhi character can be identified correctly.

Table 1 Chart of Gurmukhi characters
Fig. 1 Zone division of Gurmukhi characters

Table 2 Gurmukhi strokes used for classification
Table 3 Some issues in character formation caused by writing zoning

2 Related Work in Handwriting Recognition System

This section briefly reviews the literature on handwriting recognition systems. Although keyboards remain an effective interface for interacting with computers, touch-based interfaces are gaining attention owing to recent advances in computational technology. According to Yaeger et al. [4], handwriting is the natural way of interacting with computers. The various processes involved in a handwriting recognition system, from input to the final recognition of a character, were elaborated by Plamondon and Srihari [5] in their survey of online and offline handwriting recognition systems. In their survey, they grouped recognition methods into two categories, namely structural/rule-based methods and statistical methods. Structural/rule-based methods involve defining robust and reliable rules for recognition. In statistical methods, by contrast, the shape of a graphical mark, known as a stroke, is described by a fixed number of features, and the classes of these graphical marks are described by multidimensional probability distributions. Online handwriting recognition for non-Indic languages such as simplified Chinese, traditional Chinese, Japanese, Korean, and English has been explored by many researchers [1,2,3]. Online handwriting recognition for Arabic script was explored by Almuallim and Yamaguchi [6], who proposed geometrical and topological features for classification of strokes. Kurtzberg [7], in his study on recognition of unconstrained handwritten discrete symbols, proposed an elastic matching approach that matches individual strokes against a set of prototypes generated by individual writers. A neural predictive system for writer-independent online character recognition was presented by Garcia et al. [8]. Each letter was modelled using predictive Artificial Neural Networks (ANNs). The system was also extended to a durational HMM framework. HMM-based recognition of online handwriting was investigated by Hu et al. 
[9] and Takahashi et al. [10]. Hu et al. [9] achieved a writer independent recognition rate of 94.5% on 3,823 unconstrained online handwritten word samples from 18 writers covering a 32 word vocabulary, whereas, 90.0% recognition rate was achieved by Takahashi et al. [10] for 881 Kanji characters.

A fair amount of research on online handwriting recognition has been done for Indic languages such as Devanagari, Bangla, Tamil, and Telugu. Connell et al. [11] utilized structural properties of Devanagari characters to improve the performance of HMM-based recognition of unconstrained online Devanagari characters, achieving an accuracy of 86.5% with no rejects. Joshi et al. [12] developed a three-stage system consisting of structural recognition, feature-based recognition, and output mapping for recognizing online handwritten Devanagari characters. Online handwriting recognition of the Bangla language has been explored by many researchers [13,14,15,16,17,18]. Dutta and Chaudhury [13] proposed a curvature-feature-based recognition method for Bangla alphanumeric characters and found the technique effective with a two-stage feed-forward neural network. Bhattacharya et al. [14] proposed a scheme for recognition of offline Bangla characters using a local chain code histogram of the input character shape. Bhattacharya et al. [15] proposed a direction code based feature for recognition of online Bangla handwritten basic characters and achieved recognition accuracies of 93.9% and 83.6% on the training and test sets, respectively. An HMM-based recognition system for Bangla online handwritten numerals was proposed by Parui et al. [16]. A Support Vector Machine based handwritten Bangla character recognition system was proposed by Bhowmik et al. [17]. Biswas et al. [18] proposed a two-stage approach for classification of Bangla characters and obtained a character-level recognition accuracy of 91.8% on a test set of 8,616 samples. Recognition of the scripts of Dravidian languages such as Tamil, Telugu, and Kannada has also been studied by many researchers. A comparison of various elastic matching algorithms for recognition of online handwritten Tamil characters was presented by Joshi et al. [19]. Aparna et al. 
[20] discussed recognition of characters using string matching approach along with a shape based feature vector.

Lehal and Singh [21] presented a system for recognition of machine-printed offline Gurmukhi script that operated at the sub-character level and achieved a recognition rate of 96.6%. An online handwriting recognition system for Gurmukhi script was proposed by Sharma et al. [22, 23]. Sharma et al. [24, 25] discussed variation in the sequence of strokes used to write Gurmukhi script and proposed a rearrangement mechanism to group strokes, so that correct characters can be formed in the postprocessing phase of the recognition process. Sharma et al. [26] proposed an HMM-based approach for recognition of online handwritten Gurmukhi characters. They achieved a recognition rate of 91.9% for 60 handwritten samples each of 41 Gurmukhi characters. Kumar and Sharma [27] proposed a postprocessing algorithm for generating characters from the set of intermediate strokes produced by the recognition system. Kumar et al. [28] presented a postprocessing technique for Gurmukhi script. The authors achieved an accuracy of 93.3% for characters with mukta, whereas an accuracy of 33.9% was reported for characters with nasals. Verma and Sharma [29] analysed zone-based features for recognition of the strokes used to write Gurmukhi script. Verma and Sharma [30] reported a character recognition rate of 96.7% using a voting-based classifier built from three different HMM- and SVM-based classifiers, whereas a stroke recognition rate of 96.4% was achieved with an HMM-based classifier. These recognition rates were achieved on a dataset of 1750 Gurmukhi characters covering the 35 basic Gurmukhi characters. These characters did not contain vowel modifiers, subjoined symbols, or other symbols. Baseline detection is one of the important challenges in successfully recognizing a script. Pechwitz and Märgner [31] proposed a baseline detection algorithm to detect the orientation of a word. They explained the importance of the baseline as a precondition for a handwriting recognition system. 
Skew normalization and character segmentation can be performed with the help of the baseline in any script. Jayaraman et al. [32] used a similar baseline detection approach for the recognition of Telugu script. They described a modular approach that distributes the strokes among three classifiers based on the upper and lower baselines in Telugu script. Prasad et al. [33] presented a divide-and-conquer approach to recognizing Kannada characters, dividing the strokes into various auxiliaries and using a classifier for each auxiliary to predict the Kannada character through a rule base. They argued that, apart from baseline detection, the choice of features for the recognition engine is another area on which research needs to focus.

The recognition accuracy of Gurmukhi characters still requires considerable improvement. Section 1 introduced the problem of similarly shaped strokes appearing in different zones. When vowel modifiers, subjoined symbols, or other symbols are included in a character, finding the boundaries/baselines of the three zones is a challenging task in addition to classifying the shape of the stroke.

3 Gurmukhi Script Recognition

This section describes the process of data collection for online handwritten Gurmukhi character recognition and also explains various stages of online handwritten Gurmukhi character recognition.

3.1 Data Collection

In this work, we have addressed the problem of identifying the characters of Gurmukhi script. A character in Gurmukhi script can be written with one or more stroke combinations. One of the major steps in recognizing Gurmukhi characters is the recognition of these strokes. Based on the recognized strokes and their combination with nearby strokes, the final Gurmukhi character is formed. To start with, an HMM-based stroke classifier is trained for recognition of strokes. A total of 44,301 samples of Gurmukhi words have been collected from 124 writers using a Tablet PC device. For each word, all strokes along with their sequence have been recorded in an XML format. For each stroke, the x-y traces of the digital pen on the touch screen between successive PEN-DOWN and PEN-UP events have been recorded. The writers selected were well versed in writing Gurmukhi script. In order to have variation in the handwriting samples, the writers were selected from various regions of Punjab (India). For training the classifier, the strokes that writers wrote as per their correct script formation in an unconstrained environment were selected and annotated. A total of 74 stroke classes, mentioned above, have been identified. Table 2 contains the shapes of the strokes, the classIDs associated with them, and the number of samples used for training. All the words collected in the database have been annotated with these classes and the respective zones at stroke level.
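The stroke-level records described above could be loaded along the following lines. This is only a sketch: the XML schema is not given in the text, so the element and attribute names (`word`, `stroke`, `pt`, `seq`, `classID`, `zone`, `writer`) and the sample values are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record; tag names, attribute names and values are
# invented for illustration and are not the schema used by the authors.
SAMPLE = """
<word writer="w001">
  <stroke seq="2" classID="40" zone="U">
    <pt x="20" y="105"/><pt x="30" y="120"/>
  </stroke>
  <stroke seq="1" classID="12" zone="M">
    <pt x="10" y="40"/><pt x="12" y="55"/><pt x="15" y="70"/>
  </stroke>
</word>
"""


def load_word(xml_text):
    """Return the strokes of one word, ordered by writing sequence."""
    root = ET.fromstring(xml_text)
    strokes = []
    for node in root.findall("stroke"):
        # One (x, y) pair per sampled pen position between PEN-DOWN and PEN-UP.
        points = [(float(p.get("x")), float(p.get("y")))
                  for p in node.findall("pt")]
        strokes.append({
            "seq": int(node.get("seq")),
            "class_id": node.get("classID"),   # annotated stroke class
            "zone": node.get("zone"),          # annotated zone: U, M or L
            "points": points,
        })
    strokes.sort(key=lambda s: s["seq"])
    return strokes


strokes = load_word(SAMPLE)
```

Keeping the annotated class and zone next to the raw trace makes it straightforward to use the same records both for training the stroke classifier and for checking zone predictions.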

Fig. 2 Stages of online handwritten Gurmukhi character recognition

3.2 Steps Involved in Recognition of Characters

In order to recognize online handwritten characters, the set of strokes goes through various phases, as shown in Fig. 2. An online handwritten stroke is written by a user with a touch or pen device on a tablet. Each stroke consists of x-y trace points. These strokes are fed to the zone identification phase, which identifies the zone of each stroke based on its position relative to the other strokes. A stroke may lie in the upper, middle, or lower zone, as illustrated in Fig. 1. The x-y traces of the stroke are preprocessed for noise removal, normalization, centering, and smoothing. This gives a set of 64 x-y traces to be used in further processing. An HMM-based classifier is trained using features extracted from these 64 points. Each stroke is normalized to a \(300\times 300\) window. In another study by the authors [30], classifiers were tested on three different window sizes, \(200\times 200\), \(300\times 300\), and \(400\times 400\). Of these three, the classifiers performed best on the \(300\times 300\) window. This window is further divided into 100 small equi-sized windows labelled \(w_1\), \(w_2\), ..., \(w_{100}\). Each of the 64 preprocessed trace points lies in one of these 100 windows. The number of the window in which the point lies is taken as a feature value. This feature set is referred to as \(R_{xy}\) in this paper. The feature \(R_{xy}\) has been used to build an HMM model \(\lambda =(\pi , A, B)\) for each stroke class, as described by Rabiner and Juang [34]. The classifier was trained on a dataset of 74 stroke classes, with between 83 and 141 samples per class, as given in Table 2. During classification, each stroke is tested against the HMM models to find its matching class. After classification of strokes, identification of the zone of a stroke is required in order to recognize the character. The next section explains the process of identifying the zone of a stroke. 
After identification of zone, certain post-processing steps are applied on recognized strokes to form the character, as explained in Sect. 4.
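The \(R_{xy}\) feature described above can be sketched as follows. The text fixes the 64 points, the \(300\times 300\) window, and the 100 windows; everything else here is an assumption made for illustration: aspect-preserving scaling with centering, resampling by equal arc length, a \(10\times 10\) grid layout, and row-major numbering of \(w_1\) to \(w_{100}\). Noise removal and smoothing are omitted.

```python
import math

WINDOW = 300.0   # side of the normalization window (from the text)
GRID = 10        # assumed 10 x 10 layout of the 100 windows
N_POINTS = 64    # points kept per stroke after preprocessing (from the text)


def normalize(points, size=WINDOW):
    """Scale a stroke into a size x size window, keeping aspect ratio, centred."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, min_y = min(xs), min(ys)
    width, height = max(xs) - min_x, max(ys) - min_y
    scale = size / (max(width, height) or 1.0)
    off_x = (size - width * scale) / 2.0
    off_y = (size - height * scale) / 2.0
    return [((x - min_x) * scale + off_x, (y - min_y) * scale + off_y)
            for x, y in points]


def resample(points, n=N_POINTS):
    """Resample a stroke to n points equally spaced along its arc length."""
    if len(points) == 1:
        return [points[0]] * n
    dists = [0.0]  # cumulative arc length at each original point
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dists.append(dists[-1] + math.hypot(x1 - x0, y1 - y0))
    total = dists[-1]
    if total == 0:
        return [points[0]] * n
    out, j = [], 0
    for i in range(n):
        target = total * i / (n - 1)
        # Advance to the segment that contains the target arc length.
        while j < len(points) - 2 and dists[j + 1] < target:
            j += 1
        seg = dists[j + 1] - dists[j]
        t = 0.0 if seg == 0 else (target - dists[j]) / seg
        out.append((points[j][0] + t * (points[j + 1][0] - points[j][0]),
                    points[j][1] + t * (points[j + 1][1] - points[j][1])))
    return out


def window_features(points, size=WINDOW, grid=GRID):
    """Map each point to the number (1..grid*grid) of the window it falls in."""
    cell = size / grid
    feats = []
    for x, y in points:
        col = min(int(x // cell), grid - 1)  # clamp points on the far edge
        row = min(int(y // cell), grid - 1)
        feats.append(row * grid + col + 1)   # assumed row-major numbering
    return feats


def rxy_features(points):
    """Raw x-y trace -> sequence of 64 window numbers."""
    return window_features(resample(normalize(points)))


features = rxy_features([(0, 0), (10, 10)])
```

The resulting sequence of 64 symbols, each drawn from an alphabet of 100 window numbers, is the kind of discrete observation sequence on which a discrete HMM \(\lambda =(\pi , A, B)\) per stroke class can be trained.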

Fig. 3 Various cases of writing zone detection

4 Writing Zone Identification

Identification of the writing zone of a stroke is an important process in the recognition model proposed in this work, since it is critical for subsequently identifying the character. We have proposed an algorithm, MarkStrokeWritingZone, that identifies the writing zone of a given stroke as U (upper), M (middle), or L (lower). Based on the original x-y traces of each stroke, the strokes are grouped into sets according to their x-axis projections. A stroke set may contain one or more strokes. These stroke sets are processed further to find the zone of each stroke in the set according to the y-projection of each stroke. At the start of the zone identification process, the MaxYSpan of the stroke set is calculated. Two imaginary boundaries, named UpperY and LowerY, are also obtained from the span of the stroke set. These boundaries are used to define the upper, middle, and lower zones in the stroke set. UpperY and LowerY are defined as the y-values at the upper 20% and lower 20% of MaxYSpan, respectively. These percentages were selected after analysing a large dataset of Gurmukhi characters. The remaining 60% of the area is thus initially marked as the middle zone. The actual zones are found using Algorithm 1. A StrokeSet is passed as an argument for identifying the zone of each stroke in the set. The original x-y traces of each stroke are used for handling the zonal information in this process. The process has two parts. In the first part, we analyse the StrokeSet for the presence of a shirorekha. The shirorekha is an upper horizontal bar that connects the characters in a word and acts as the boundary between the upper and middle zones of the character. It has been observed empirically that the shirorekha appears in the upper 35% span of the StrokeSet. The highest horizontal bar in the upper 35% span of the StrokeSet is marked as the shirorekha. 
If a shirorekha is found in the StrokeSet, the StrokeMinY of the shirorekha stroke is taken as ActualUpperY, the boundary between the upper and middle zones. ActualLowerY, the boundary between the middle and lower zones, is derived from ActualUpperY by subtracting 0.6 times the MaxYSpan. If no shirorekha is found in the set, the UpperY and LowerY calculated during initialization are taken as ActualUpperY and ActualLowerY, respectively. After these boundaries are found, in the second part, each stroke is compared against the boundary values, and zonal information is tagged to each stroke accordingly. If the stroke's upper (StrokeYMax) and lower (StrokeYMin) boundaries are both above ActualUpperY, the stroke is marked as an upper zone stroke. Similarly, if the stroke's upper and lower boundaries are both below ActualLowerY, the stroke is marked as a lower zone stroke. If the upper and lower boundaries of the stroke lie between ActualUpperY and ActualLowerY, the stroke is tagged as a middle zone stroke. To handle paireen characters, if more than 50% of the span of the stroke is below ActualLowerY, the stroke is marked as a lower zone stroke. In all remaining conditions, the stroke is marked as a middle zone stroke. All the cases described above are depicted in Fig. 3.
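The zone rules above can be sketched in code as follows. This is a minimal reading of the prose and not a reproduction of Algorithm 1. Two assumptions are made: y is taken to increase upward (which matches ActualLowerY being obtained by subtraction, and means device coordinates with y growing downward would need flipping first), and a "horizontal bar" is taken to be a stroke whose width is at least `min_aspect` times its height, a hypothetical criterion since the text does not define one.

```python
def bbox(stroke):
    """Bounding box (min_x, max_x, min_y, max_y) of a list of (x, y) points."""
    xs = [p[0] for p in stroke]
    ys = [p[1] for p in stroke]
    return min(xs), max(xs), min(ys), max(ys)


def is_horizontal_bar(box, min_aspect=4.0):
    """Assumed test for a bar-like stroke: much wider than it is tall."""
    min_x, max_x, min_y, max_y = box
    width, height = max_x - min_x, max_y - min_y
    return width > 0 and width >= min_aspect * height


def mark_stroke_writing_zone(stroke_set):
    """Return 'U', 'M' or 'L' for each stroke of one stroke set."""
    boxes = [bbox(s) for s in stroke_set]
    set_min_y = min(b[2] for b in boxes)
    set_max_y = max(b[3] for b in boxes)
    max_y_span = set_max_y - set_min_y

    # Initial boundaries: upper 20% and lower 20% of the span.
    upper_y = set_max_y - 0.20 * max_y_span
    lower_y = set_min_y + 0.20 * max_y_span

    # Part 1: look for the shirorekha among bars in the upper 35% of the span.
    band_floor = set_max_y - 0.35 * max_y_span
    bars = [b for b in boxes if is_horizontal_bar(b) and b[2] >= band_floor]
    if bars:
        shirorekha = max(bars, key=lambda b: b[3])   # highest bar
        actual_upper_y = shirorekha[2]               # its minimum y
        actual_lower_y = actual_upper_y - 0.60 * max_y_span
    else:
        actual_upper_y, actual_lower_y = upper_y, lower_y

    # Part 2: tag each stroke against the two boundaries.
    zones = []
    for _, _, y_min, y_max in boxes:
        height = y_max - y_min
        if y_min > actual_upper_y:                    # entirely above
            zones.append("U")
        elif y_max < actual_lower_y:                  # entirely below
            zones.append("L")
        elif y_min >= actual_lower_y and y_max <= actual_upper_y:
            zones.append("M")
        elif height > 0 and (actual_lower_y - y_min) / height > 0.5:
            zones.append("L")                         # paireen case
        else:
            zones.append("M")
    return zones


# Illustrative stroke set (invented coordinates): a shirorekha, a consonant
# body hanging from it, an upper-zone vowel and a lower-zone symbol.
example_set = [
    [(0, 100), (50, 101)],
    [(10, 100), (10, 40), (40, 40), (40, 100)],
    [(20, 105), (30, 120)],
    [(15, 20), (25, 0)],
]
example_zones = mark_stroke_writing_zone(example_set)
```

Because the thresholds (20%, 35%, 60%, 50%) come from the text while the bar test does not, `min_aspect` is the parameter most likely to need tuning on real data.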

Table 4 Userwise stroke writing zone identification accuracy

4.1 Verification and Validation of Writing Zone Identification

The writing zone identification algorithm has also been verified and validated in this work. For verification, the proposed algorithm was tested on a dataset of 5000 characters from the annotated data described in Sect. 3.1. Zone identification accuracies of 98.4%, 99.3%, and 98.7% were achieved for the upper, middle, and lower zones, respectively. For validation, a set of 4280 handwritten Gurmukhi characters was collected from 10 new users who had not participated in the data collection for the preparation of the recognition models. The handwritten characters collected from these 10 users were tested for zone identification and also for character recognition. The accuracy of zone identification was verified by visually inspecting each stroke and the zone predicted for it by the zone identification algorithm. Manual inspection of the stroke's zone was required for all written strokes because, unlike the training data described in Sect. 3.1, this dataset is not annotated. Table 4 shows the accuracy of the zone identification process for the various strokes. Zone accuracy is defined as the number of strokes correctly marked in a zone divided by the total number of strokes marked for that zone. A validation accuracy of 97.7% has been achieved for the zone identification process on the 4280 handwritten characters.
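The zone accuracy defined above, correct marks in a zone over all marks made for that zone, can be computed as in the short sketch below. The function name and the sample labels are illustrative only and are not taken from Table 4.

```python
from collections import Counter


def zone_accuracy(true_zones, predicted_zones):
    """Per-zone accuracy: correctly marked strokes / strokes marked for the zone."""
    marked = Counter(predicted_zones)
    correct = Counter(p for t, p in zip(true_zones, predicted_zones) if t == p)
    return {zone: correct[zone] / marked[zone] for zone in marked}


# Invented labels for four strokes: manual inspection vs. algorithm output.
inspected = ["U", "M", "M", "L"]
predicted = ["U", "M", "L", "L"]
per_zone = zone_accuracy(inspected, predicted)
```

Under this definition the denominator is the number of strokes the algorithm assigned to a zone, so a zone that is never predicted has no accuracy value and is simply absent from the result.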

4.2 Character Recognition

For recognition of Gurmukhi characters, the strokes are classified using the HMM-based stroke classifier explained in Sect. 3.2. The stroke classifier achieved a recognition accuracy of 95.3% under 5-fold cross-validation. To predict a character from the set of strokes that emerges from the recognition process, a rule-based approach has been developed. The rules define the combinations of strokes that form a character. They were developed by analysing the collected dataset for the various combinations of strokes with which a character can be written. The postprocessing algorithm proposed in [27] has been used to predict the character after the recognition of strokes. During testing of the recognition process, the test set consisted of 41 Gurmukhi characters, along with their combinations with the nasal symbols (bindī and ṭippī), addak, and the 9 lāga mātrās. The character recognition accuracy for each user is shown in Table 5. When characters without mātrās and nasal symbols are considered, the accuracy is in the range of 95.0–99.0%, with an average of 97.1%. For complex character combinations with nasals, with mātrās, and with both, the average accuracy goes down to 95.0%, 88.5%, and 73.0%, respectively. Zone identification and stroke recognition for the strokes in these nasal and mātrā combinations were found to be comparable to the overall zone identification accuracies of the three zones and the stroke recognition accuracy mentioned above, respectively. A major reason for the decrease in character recognition rates is the variation in the sequence of Unicode characters generated when writing these complex characters in Gurmukhi script. The character recognition accuracy achieved in this work is 88.4% when basic characters, characters with vowel modifiers, characters with subjoined symbols, and characters with other symbols are considered.
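The idea of rules that map combinations of recognized strokes to a character can be illustrated with a small lookup keyed on (stroke class, zone) pairs. The class IDs, character labels, and rules below are hypothetical placeholders, not the rule base developed in this work or the algorithm of [27]. The sketch only shows how the same stroke class in different zones can lead to different outputs.

```python
# Hypothetical rules: each key is a combination of (classID, zone) pairs.
_RAW_RULES = {
    (("S01", "M"),): "CHAR_A",
    (("S01", "M"), ("S02", "M")): "CHAR_B",
    (("S01", "M"), ("S40", "U")): "CHAR_A + upper-zone modifier",
    (("S01", "M"), ("S40", "L")): "CHAR_A + lower-zone symbol",
}

# Store keys in sorted order so that the stroke writing order does not matter.
RULES = {tuple(sorted(key)): label for key, label in _RAW_RULES.items()}


def form_character(recognized_strokes):
    """Look up the character for a list of (classID, zone) pairs, or None."""
    return RULES.get(tuple(sorted(recognized_strokes)))


# The same stroke class S40 yields different results in different zones.
upper_case = form_character([("S40", "U"), ("S01", "M")])
lower_case = form_character([("S40", "L"), ("S01", "M")])
```

A plain dictionary keeps every rule explicit and easy to audit, at the cost of needing one entry per valid stroke combination.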

Table 5 Userwise character recognition accuracy

5 Results and Discussion

The work presented in this paper concerns a handwritten character recognition system for Gurmukhi script. A writing zone identification algorithm for classifying strokes written in the upper, middle, and lower zones of Gurmukhi script has been proposed and tested. Recognition of a stroke has been done using an HMM with a feature set derived from the normalized x-y traces of the stroke. In an earlier experiment, Sharma et al. [23] worked with 40 stroke classes in a Gurmukhi handwritten character recognition system and achieved character recognition accuracies of 87.4% and 92.0% with the elastic matching technique and HMM, respectively. Their study used 130 samples each of the 35 basic consonants. In the present work, we have achieved an accuracy of 88.4% when Gurmukhi consonants and their combinations with vowels, subjoined symbols, and other symbols are considered. The accuracy achieved is 97.1% when only basic Gurmukhi consonants are considered (Table 5). The writing zone identification algorithm proposed in this work performs well, with an accuracy of 97.7%. As future work in this direction, one can aim for higher character recognition rates by building a statistical model of stroke sequences to replace the rule-based approach used in the recognition of characters. The postprocessing algorithm can also be extended to handle the different Unicode combinations that generate a single character in Gurmukhi script.