Pattern Recognition

Volume 108, December 2020, 107482
Accurate, data-efficient, unconstrained text recognition with convolutional neural networks

https://doi.org/10.1016/j.patcog.2020.107482

Highlights

  • We propose a novel neural network architecture for unconstrained text recognition.

  • Our proposed architecture uses only highly efficient convolutional primitives.

  • We achieve state-of-the-art results on seven public benchmark datasets.

  • Our proposed model has won the ICFHR2018 Competition on Automated Text Recognition.

Abstract

Unconstrained text recognition is an important computer vision task, featuring a wide variety of sub-tasks, each with its own set of challenges. One of the biggest promises of deep neural networks has been the convergence and automation of feature extraction from raw input signals, allowing for the highest possible performance with minimal required domain knowledge. To this end, we propose a data-efficient, end-to-end neural network model for generic, unconstrained text recognition. In our proposed architecture we strive for simplicity and efficiency without sacrificing recognition accuracy. Our proposed architecture is a fully convolutional network without any recurrent connections, trained with the CTC loss function. Thus it operates on arbitrary input sizes and produces strings of arbitrary length in a very efficient and parallelizable manner. We show the generality and superiority of our proposed text recognition architecture by achieving state-of-the-art results on seven public benchmark datasets, covering a wide spectrum of text recognition tasks, namely: Handwriting Recognition, CAPTCHA recognition, OCR, License Plate Recognition, and Scene Text Recognition. Our proposed architecture has won the ICFHR2018 Competition on Automated Text Recognition on a READ Dataset.

Introduction

Text recognition is considered one of the earliest computer vision tasks to be tackled by researchers. For more than a century since its inception as a field of research, work on it has never stopped. This longevity can be attributed to two important factors.

First, text is pervasive and important to our everyday life: as a visual encoding of language, it is used extensively to communicate and preserve all kinds of human thought. Second, this necessity and pervasiveness have placed high demands on how text is delivered and received, which has led to a large and ever-increasing variability in its visual forms. Text can originate either as printed or handwritten, with large possible variability in handwriting styles, printing fonts, and formatting options. Text can be found organized in documents as lines, tables, or forms, or cluttered in natural scenes. Text can suffer countless types and degrees of degradation, viewing distortion, occlusion, background clutter, spacing, slant, and curvature. Text from spoken languages alone (an important subset of human-produced text) is available in dozens of scripts corresponding to thousands of languages. All of this has contributed to the long-standing, complicated nature of unconstrained text recognition.

Since a deep Convolutional Neural Network (CNN) won the ImageNet image classification challenge [1], Deep Learning based techniques have spread to most computer vision tasks, either equating or surpassing all previous methods at a fraction of the required domain knowledge and field expertise. Text recognition was no exception, and methods based on CNNs and Recurrent Neural Networks (RNNs) have dominated all text recognition tasks like OCR [2], handwriting recognition [3], scene text recognition [4], and license plate recognition [5], and have become the de facto standard for the task.

Despite their success, one can spot a number of shortcomings in these works. First, for many tasks, an RNN is required to achieve state-of-the-art results, which introduces non-trivial latency due to the sequential processing nature of RNNs. This is surprising given that, for purely visual text recognition, long-range dependencies have little effect and only a local neighborhood should influence the final frame or character classification. Second, for each of these tasks, there is a separate model that can, with its own set of tweaks and tricks, achieve state-of-the-art results in a single task or a small number of tasks. Thus, no single model has been demonstrated effective across the wide spectrum of text recognition tasks. Choosing a different model or feature extractor for different input data, even within a relatively limited problem like text recognition, is a great burden for practitioners and clearly contradicts the idea of automatic, data-driven representation learning promised by deep learning methods.

In this work, we propose a novel, purely feed-forward neural network architecture for efficient, generic, unconstrained text recognition. Our proposed architecture is a fully convolutional network [6] that consists mostly of depthwise separable convolutions with novel inter-layer residual connections [7] and gating, trained on full line or word labels using the Connectionist Temporal Classification (CTC) loss [8]. We also propose a set of generic data augmentation techniques that are suitable for any text recognition task and show how they affect the performance of the system. We demonstrate the superior performance of our proposed system through extensive experimentation on seven public benchmark datasets. We were also able, for the first time, to demonstrate human-level performance on the reCAPTCHA dataset proposed recently in a Science paper [9], a more than 20% absolute increase in CAPTCHA recognition rate compared to their proposed RCN system. We also achieve state-of-the-art performance on SVHN [10] (the full sequence version), the unconstrained settings of the IAM English offline handwriting dataset [11], the KHATT Arabic offline handwriting dataset [12], the University of Washington (UW3) OCR dataset [13], and the AOLP license plate recognition dataset [14] (in all divisions). Our proposed system has also won the ICFHR2018 Competition on Automated Text Recognition on a READ Dataset [15], achieving a more than 25% relative decrease in Character Error Rate (CER) compared to the second-place entry.

To summarize, we address the unconstrained text recognition problem. In particular, we make the following contributions:

  • We propose a novel neural network architecture that is able to achieve state-of-the-art performance, with feed forward connections only (no recurrent connections), and using only the highly efficient convolutional primitives.

  • We propose a set of data augmentation procedures that are generic to any text recognition task and can boost the performance of any neural network architecture on text recognition.

  • We conduct an extensive set of experiments on seven benchmark datasets to demonstrate the validity of our claims about the generality of our proposed architecture. We also perform an extensive ablation study on our proposed model to demonstrate the importance of each of its submodules.

Section 2 gives an overview of related work in the field. Section 3 describes our architecture design and its training process in detail. Section 4 describes our extensive set of experiments and presents their results.


Related work

Work on text recognition is enormous and spans more than a century; here we focus on methods based on deep learning, as they have represented the state of the art for at least five years. Traditional methods are reviewed in [16].

There are two major trends in deep-learning-based sequence transduction, both of which avoid the need for fine-grained alignment between source and target sequences. The first is using CTC [8], and the second is using sequence-to-sequence (encoder-decoder) models [17], usually

Methodology

In this section, we present the details of our proposed architecture. First, Section 3.1 gives an overview of the proposed architecture. Then, Section 3.2 presents the derivation of the gating mechanism utilized by our architecture. After that, Section 3.3 discusses the data augmentation techniques we use in the course of training. Finally, Section 3.4 describes our model training and evaluation process.
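The paper's own gating derivation appears in Section 3.2 and is not reproduced in this excerpt. For illustration only, the sketch below shows a common convolutional-network gating pattern (a GLU/highway-style elementwise gate, an assumption rather than the paper's exact mechanism): a sigmoid gate in (0, 1) decides, per element, how much of a candidate activation f versus the identity path x is passed on.

```python
import numpy as np

def sigmoid(z):
    """Numerically plain logistic function, mapping R -> (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def gated_mix(x, f, g):
    """Elementwise gated residual mix: gate * f + (1 - gate) * x.

    x: identity (skip) path, f: candidate activation, g: gate pre-activation.
    A very negative g passes x through; a very positive g passes f through.
    """
    gate = sigmoid(g)
    return gate * f + (1.0 - gate) * x
```

Gates of this form let a deep stack learn, per channel and position, whether to transform its input or simply carry it forward, which is one standard way to ease optimization in networks without recurrence.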

Experiments

In this section we evaluate and compare the proposed architecture to the previous state-of-the-art techniques by conducting extensive experiments on seven public benchmark datasets covering a wide spectrum of text recognition tasks.

Conclusion

In this paper, we tackled the problem of general, unconstrained text recognition. We presented a novel, data and computation efficient, neural network architecture that can be trained end-to-end on variable-sized images using variable-sized line level transcriptions. We conducted an extensive set of experiments on seven public benchmark datasets covering a wide range of text recognition sub-tasks, and demonstrated state-of-the-art performance on each one of them using the same architecture and

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

Mohamed Yousef received the B.S. and M.S. degrees in Computer Science from the Faculty of Computers and Information, Assiut University, Egypt, in 2010 and 2013, respectively. His research interests include computer vision, deep neural networks, and generic text detection and recognition.

References (74)

  • K. He et al., Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • A. Graves et al., Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, Proceedings of the 23rd International Conference on Machine Learning (2006)
  • D. George et al., A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs, Science (2017)
  • Y. Netzer et al., Reading digits in natural images with unsupervised feature learning, NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2011)
  • U.-V. Marti et al., The IAM-database: an English sentence database for offline handwriting recognition, Int. J. Doc. Anal. Recogn. (2002)
  • I. Phillips, User's reference manual for the UW English/technical document image database III, UW-III English/Technical...
  • G.-S. Hsu et al., Application-oriented license plate recognition, IEEE Trans. Veh. Technol. (2013)
  • T. Strauß et al., ICFHR2018 competition on automated text recognition on a READ dataset, 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR) (2018)
  • Q. Ye et al., Text detection and recognition in imagery: a survey, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
  • K. Cho et al., Learning phrase representations using RNN encoder–decoder for statistical machine translation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014)
  • D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in:...
  • A. Hannun, Sequence modeling with CTC, Distill (2017)
  • B. Su et al., Accurate scene text recognition based on recurrent neural network, Asian Conference on Computer Vision (2014)
  • C. Zhai et al., Chinese image text recognition with BLSTM-CTC: a segmentation-free method, Chinese Conference on Pattern Recognition (2016)
  • L. Sun et al., Deep LSTM networks for online Chinese handwriting recognition, 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR) (2016)
  • T.M. Breuel, High performance text recognition using a hybrid convolutional-LSTM implementation, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) (2017)
  • J. Puigcerver, Are multidimensional recurrent layers really necessary for handwritten text recognition?, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) (2017)
  • K. Dutta et al., Improving CNN-RNN hybrid networks for handwriting recognition, 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR) (2018)
  • Z. Xie et al., Learning spatial-semantic context with fully convolutional recurrent network for online handwritten Chinese text recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2017)
  • Y.-C. Wu et al., Handwritten Chinese text recognition using separable multi-dimensional recurrent neural network, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) (2017)
  • S. Bai, J.Z. Kolter, V. Koltun, An empirical evaluation of generic convolutional and recurrent networks for sequence...
  • F. Yin, Y.-C. Wu, X.-Y. Zhang, C.-L. Liu, Scene text recognition with sliding convolutional character models,...
  • B. Shi et al., ASTER: an attentional scene text recognizer with flexible rectification, IEEE Trans. Pattern Anal. Mach. Intell. (2018)
  • A. Chowdhury et al., An efficient end-to-end neural model for handwritten text recognition, BMVC (2018)
  • S. Ioffe et al., Batch normalization: accelerating deep network training by reducing internal covariate shift, International Conference on Machine Learning (2015)
  • J.L. Ba, J.R. Kiros, G.E. Hinton, Layer normalization, arXiv:1607.06450...
  • S. Ioffe, Batch renormalization: towards reducing minibatch dependence in batch-normalized models, Advances in Neural Information Processing Systems (2017)

    Khaled F. Hussain received the B.S. and M.S. degrees in electrical engineering from Assiut University, Assiut, Egypt, in 1994 and 1996, respectively, and the Ph.D. degree in computer science from University of Central Florida, Orlando, FL, USA, in 2001. From 2002 to 2006, he was a Visiting Assistant Professor with University of Central Florida. Since 2007, he has been with the Computer Science Department, Faculty of Computers and Information, Assiut University, where he is currently an Associate Professor, the Executive Director of the Multimedia Laboratory, and a Vice Dean. His major research interests include computer vision, computer graphics, augmented reality, and computer animation.

    Usama Sayed received his B.Sc. and M.Sc. degrees from Assiut University, Assiut, Egypt, in 1985 and 1993, respectively, and his Ph.D. degree from the Czech Technical University in Prague, Czech Republic, in 2000, all in electrical engineering. From 1988 to 1996, he was an Assistant Lecturer at the Faculty of Engineering, Assiut University. From February 1997 to July 2000, he was a research assistant in the Department of Telecommunications Technology at the Czech Technical University in Prague. From December 1999 to March 2000, he was a research assistant at the University of California, Santa Barbara (UCSB), USA. From November 2001 to April 2002, he was a Post-Doctoral Fellow with the Faculty of Engineering, Czech Technical University in Prague. From February 2006 to July 2011, he was an Associate Professor with the Faculty of Engineering, Assiut University, Egypt. Dr. Usama is a Professor with the Faculty of Engineering at Assiut University, and he was the head of the Electrical Engineering Department and the Vice Dean for Graduate Studies and Research, Faculty of Engineering, Assiut University, Egypt. He has authored and co-authored more than 140 scientific papers and was selected for inclusion in the 2010 edition of the USA Marquis Who's Who in the World. His research interests include telecommunication technology, wireless technology, image recognition, image coding, statistical signal processing, blind signal separation, and video coding.