1 Introduction

Optical character recognition (OCR) is a highly researched domain in image processing and computer vision. Commercial OCR systems can be divided into four generations, depending on their robustness and efficiency [1]. First-generation OCR is characterized by the constrained letter shapes it can read, whereas the second generation is characterized by its ability to recognize the set of regular machine-printed characters. Third-generation OCR focuses on poorly printed and handwritten characters. OCR that deals with complex documents intermixing text, graphics, tables and mathematical symbols, with unconstrained handwritten characters, and with low-quality noisy documents belongs to the fourth generation [1]. However, there are several instances where OCR systems fail if the characters are slightly distorted. In most such situations, a school-going child can outperform highly sophisticated OCR systems. Therefore, there is a need to understand how a person recognizes a letter irrespective of its variations.

Psychology, cognitive science and neuroscience offer some foundations that may help to address this question. It has rightly been said that letter recognition is the foundation of human reading [2]. However, attempts to understand letter perception are rare. To explain the nature of letter representation, two broad theories of visual recognition have been proposed, viz. template matching and the feature-based approach. In template matching, letter recognition is achieved by matching the letter stimulus to an internal template [3]. In the feature-based approach, the visual features of the letter are extracted in the early stage of processing, and the letter is recognized by comparing those features with a list of features stored in memory [4]. It is hard to believe that we store all the features and templates of all the characters to be recognized. Therefore, we try to understand how a letter is processed under various levels of distortion and which features a reader considers while recognizing it.

There is a cognitive link between eye movements and the brain. Therefore, eye tracking seems to be a promising technology to shed more light on visual letter processing. Rayner [5] stated that eye movement data reflect moment-to-moment cognitive processes in various tasks. In the literature, researchers have used eye tracking to understand language expertise [6], the number of words read [7], the type of document read [8], etc. However, none of them has addressed letter recognition or letter processing. In addition, there has hardly been any attempt to understand the Devanagari script using eye tracking. Therefore, this paper aims to identify the visual features and understand the visual processing of Devanagari letters by humans using eye tracking metrics.

This paper is divided into four sections. Section 1 introduces the work; the experimental setup is discussed in Sect. 2. Section 3 presents the results and discusses them. Concluding remarks with possible future extensions are presented in the last section.

2 Experimental Setup and Details

We performed the eye tracking experiment on 20 healthy graduate participants (7 females and 13 males) with normal or corrected-to-normal vision, aged between 20 and 30 years. These participants were frequent readers of the Devanagari script. To capture eye movements, we used the Tobii T120, a camera-based remote eye tracker with a sampling frequency of 120 Hz.

2.1 Stimulus Design

Initially, we grouped the characters based on their common elements, as shown in Fig. 1. We then selected for our experiment some characters that show high variation in their structure, as well as characters that are frequently used. The stimuli were designed in the Mangal font at a point size of 72 and reduced to a single-pixel contour through morphological operations. Our aim was to incorporate handwritten variations into the characters and to understand how readers process the resulting letters. Based on the variations commonly observed in the different parts of handwritten letters, we divided each letter into parts, as shown in Fig. 2. Each letter part was then scaled by a factor in the range from 0.2 to 5. The stimuli were formed by combining the scaled letter part with the unchanged (unscaled) parts of the same letter. The letter thus formed was termed a distorted letter and was presented in black against a white background. The prototype of the stimuli is shown in Table 1.
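The part-wise scaling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to build the stimuli: it assumes that the single-pixel contour is available as an array of (x, y) points with one part label per point, and that a part is scaled about its own centroid. The function name `scale_part`, the label encoding and the choice of anchor point are all hypothetical.

```python
import numpy as np

def scale_part(points, labels, part, factor):
    """Scale the contour points of one letter part about that part's centroid.

    points: (N, 2) array of contour coordinates (assumed representation)
    labels: length-N array with the part id of each point (hypothetical)
    part:   id of the part to distort
    factor: scale factor; the text reports values in the range 0.2 to 5
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    if not 0.2 <= factor <= 5:
        raise ValueError("factor outside the 0.2 to 5 range reported in the text")
    mask = labels == part
    if not mask.any():
        raise ValueError("no points carry the requested part label")
    distorted = points.copy()                 # unscaled parts stay untouched
    centroid = points[mask].mean(axis=0)      # anchor choice is an assumption
    distorted[mask] = centroid + factor * (points[mask] - centroid)
    return distorted
```

Scaling about the centroid keeps the distorted part roughly in place; anchoring at the junction with the neighbouring part would be an equally plausible choice when the parts must stay connected.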

Fig. 1. Bhagwat’s group based on graphical similarity [9]

Fig. 2. Dividing the letter in different parts

Table 1. Prototype of the stimuli

2.2 Experimental Procedure

All the equipment required for the experiment was set up in the experimental room. Before starting, participants were given full instructions about the experiment and their exact role in the study was explained. Those who agreed to participate were asked to sign a consent form, and details such as name, age, eyesight and first language were recorded. Each participant was seated in front of the eye tracker at a distance of 60–70 cm. The session began with a welcome message on the screen, followed by instructions and a trial. The experiment was conducted in three phases. In each phase, participants were presented with 6–7 letters with 25 variations each. The eye tracker was calibrated to the participant’s eyes before recording. To maintain good-quality eye tracking data, calibration was done separately in every phase, and participants were allowed to take a 2 min break after each phase if they wished.

After successful calibration, the actual experiment started. The stimuli were presented as different variations of different letters chosen at random; however, a decreasing order of distortion was maintained, i.e. high distortion of letter 1, high distortion of letter 2, lesser distortion of letter 1, and so on. Participants were asked to press the arrow key on the keyboard and to say the letter aloud if they recognized it. If they could not recognize the letter, they could skip it by pressing the space bar. This procedure continued until the end of the phase. The total time spent on a letter indicates where participants faced difficulty in recognition. The key presses tell us the level of distortion at which participants could still recognize the letter.
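One possible reading of this presentation scheme is sketched below. It assumes that the variations of each letter can be ranked into discrete distortion levels and that the letters are shuffled independently within each level; both points are assumptions, as are the function name `presentation_order` and the placeholder letter names.

```python
import random

def presentation_order(letters, n_levels, seed=None):
    """Return (letter, level) pairs, from the most to the least distorted level.

    letters:  letters shown in one phase (placeholder names in the example)
    n_levels: number of distortion levels per letter; level n_levels is taken
              to be the most distorted (an assumed encoding)
    seed:     optional seed so that an order can be reproduced
    """
    rng = random.Random(seed)
    order = []
    for level in range(n_levels, 0, -1):      # decreasing distortion
        block = list(letters)
        rng.shuffle(block)                    # random letter order per level
        order.extend((letter, level) for letter in block)
    return order

# Example: three placeholder letters with 25 variations each.
sequence = presentation_order(["letter1", "letter2", "letter3"], 25, seed=0)
```

Every letter is shown once at a given level before any letter is shown at the next, less distorted level, which matches the "high distortion of letter 1, high distortion of letter 2, lesser distortion of letter 1" pattern.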

The eye tracking data were collected throughout all the phases. To obtain fixations and saccades from the raw eye tracking data, as shown in Fig. 3, we used a velocity-based classification algorithm, the Velocity-Threshold Identification (I-VT) fixation classification algorithm. The algorithm classifies eye movements into fixations and saccades based on the velocity of the directional shifts of the eye [10].

Fig. 3. Gaze plot generated during reading. Each circle shows a fixation, where the reader gazes at a particular location for a certain duration; the line joining two fixations is a saccade, which indicates a rapid movement of the eyes.
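The idea behind velocity-threshold classification can be sketched in a few lines. The code below is a simplified illustration rather than the filter used in the study: it computes point-to-point velocities in whatever units the gaze coordinates are given, labels samples below a threshold as fixation samples, and merges consecutive fixation samples into fixations. The function names and the threshold value are assumptions; the text does not report the threshold used, and practical I-VT filters usually work with angular velocity and add noise reduction and merging steps that are omitted here.

```python
import numpy as np

def ivt_fixations(x, y, t, velocity_threshold):
    """Velocity-threshold (I-VT) sketch: returns a list of fixation records.

    x, y: gaze coordinates per sample (units are an assumption, e.g. pixels)
    t:    sample timestamps in seconds (1/120 s apart for a 120 Hz tracker)
    velocity_threshold: samples slower than this count as fixation samples;
        a free parameter here, since the text does not report a value
    """
    x, y, t = (np.asarray(a, dtype=float) for a in (x, y, t))
    if len(t) < 2:
        return []
    # Point-to-point velocity between consecutive samples.
    velocity = np.hypot(np.diff(x), np.diff(y)) / np.diff(t)
    # The first sample has no predecessor, so it reuses the first velocity.
    velocity = np.concatenate(([velocity[0]], velocity))
    is_fixation = velocity < velocity_threshold

    fixations, start = [], None
    for i, flag in enumerate(is_fixation):
        if flag and start is None:
            start = i                      # a run of fixation samples begins
        elif not flag and start is not None:
            fixations.append(_summarise(x, y, t, start, i))
            start = None
    if start is not None:                  # close a run that reaches the end
        fixations.append(_summarise(x, y, t, start, len(t)))
    return fixations

def _summarise(x, y, t, start, stop):
    """Collapse samples start..stop-1 into one fixation record."""
    return {
        "start": t[start],
        "duration": t[stop - 1] - t[start],
        "x": x[start:stop].mean(),
        "y": y[start:stop].mean(),
    }
```

Samples above the threshold are treated as belonging to saccades, so the gap between two consecutive fixation records corresponds to a line in the gaze plot.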

3 Results and Discussions

The eye tracking data were analysed based on fixation duration and fixation count. Participants were allowed to spend some time on recognizing each letter and could proceed by skipping the letter if they were unable to recognize it. The key press and the spoken letter corresponding to a correct recognition enabled us to know the level of distortion each participant was able to tolerate. The correct recognition responses are plotted for each letter in Figs. 4, 5 and 6.

Fig. 4. Recognition of Devanagari letters in phase I

Fig. 5. Recognition of Devanagari letters in phase II

Fig. 6. Recognition of Devanagari letters in phase III

From these responses, it can be observed that almost all the participants successfully recognized the letters in less time, i.e. at higher levels of distortion. On the other hand, the letters were recognized somewhat later by most of the participants. The recognition of letters sharing the same structural features such as and and and occurred at almost the same time. The eye tracking patterns of the participants were similar, and a particular eye fixation pattern was observed across participants as well. The maximum fixation duration on a character region implies difficulty in understanding or encoding the information: participants have to spend much more time encoding what exactly the region represents. As the distortion level changed, the responses changed in an interesting way as well.

3.1 Delayed Letter Recognition

There were many incorrect responses in the recognition of letters sharing similar visual features. Therefore, the correct recognition of these letters was delayed and occurred only when the letter had a lower distortion level, as can be observed from Figs. 4, 5 and 6. This was due to the different way in which these letters were processed. Most of the participants reported as as as as etc. at a higher level of distortion. As the letter approached its normal form, the participants fixated on different locations and then recognized the letter correctly. We created heat maps using the number of fixations and the fixation duration, as shown in Fig. 7. The variation in fixation duration and number of fixations enabled us to understand how the letter was processed and how the fixation pattern changed so that the participant could recognize the letter correctly. The particular observations are tabulated in Table 2. The spoken letter gave us an exact idea of what the participant had recognized and provided a cross-check on whether the recognition was correct.

Fig. 7. The heat map based on the fixation count (number of fixations) for (a), (b) and fixation duration for (c), (d)
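The two quantities behind such heat maps can be accumulated on a grid as in the sketch below. It is a hedged illustration only: the grid cell size, the function name `fixation_maps` and the (x, y, duration) input format are assumptions, and the smoothing that heat map renderers typically apply around each fixation is left out.

```python
import numpy as np

def fixation_maps(fixations, width, height, cell=20):
    """Accumulate fixation count and total fixation duration per grid cell.

    fixations: iterable of (x, y, duration) tuples in stimulus coordinates
    width, height: stimulus size in the same units as x and y
    cell: side length of one grid cell (hypothetical value)
    """
    rows = -(-height // cell)                 # ceiling division
    cols = -(-width // cell)
    count = np.zeros((rows, cols), dtype=int)
    duration = np.zeros((rows, cols), dtype=float)
    for fx, fy, dur in fixations:
        if 0 <= fx < width and 0 <= fy < height:   # ignore off-stimulus gaze
            r, c = int(fy // cell), int(fx // cell)
            count[r, c] += 1
            duration[r, c] += dur
    return count, duration
```

Keeping the count map and the duration map separate mirrors the distinction between panels (a), (b) and panels (c), (d): a region can attract many short fixations or a few long ones, and the two maps tell these cases apart.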

Table 2. Understanding the processing of the Devanagari letters through eye fixations

4 Conclusions

The proposed research offers insights into how Devanagari letters are processed by readers. Eye tracking seems to be a promising technique for understanding the moment-to-moment processing of a letter. In this work, we incorporated as many of the variations generally seen in handwritten characters as possible and recorded readers’ behavior for these characters using eye tracking. The results demonstrate that the reader’s attention is greatest along the curves, knots, loops and contacts with the headline. The fixation patterns also change with the distortion level. This peculiar eye movement behavior might have provided some crucial visual cues to the participant for efficient recognition. In our future work, these visual cues will be used to build a smart OCR model for robust character recognition.