Pattern Recognition

Volume 108, December 2020, 107482
Accurate, data-efficient, unconstrained text recognition with convolutional neural networks

https://doi.org/10.1016/j.patcog.2020.107482

Highlights

  • We propose a novel neural network architecture for unconstrained text recognition.

  • Our proposed architecture uses only highly efficient convolutional primitives.

  • We achieve state-of-the-art results on seven public benchmark datasets.

  • Our proposed model has won the ICFHR2018 Competition on Automated Text Recognition.

Abstract

Unconstrained text recognition is an important computer vision task, featuring a wide variety of sub-tasks, each with its own set of challenges. One of the biggest promises of deep neural networks has been the convergence and automation of feature extraction from raw input signals, allowing for the highest possible performance with minimal required domain knowledge. To this end, we propose a data-efficient, end-to-end neural network model for generic, unconstrained text recognition. In our proposed architecture we strive for simplicity and efficiency without sacrificing recognition accuracy. Our proposed architecture is a fully convolutional network without any recurrent connections, trained with the CTC loss function. Thus it operates on arbitrary input sizes and produces strings of arbitrary length in a very efficient and parallelizable manner. We show the generality and superiority of our proposed text recognition architecture by achieving state-of-the-art results on seven public benchmark datasets, covering a wide spectrum of text recognition tasks, namely: Handwriting Recognition, CAPTCHA recognition, OCR, License Plate Recognition, and Scene Text Recognition. Our proposed architecture has won the ICFHR2018 Competition on Automated Text Recognition on a READ Dataset.

Introduction

Text recognition is considered one of the earliest computer vision tasks to be tackled by researchers. For more than a century since its inception as a field of research, work on it has never stopped. This longevity can be attributed to two important factors.

First, text is pervasive and important to our everyday life: as a visual encoding of language, it is used extensively to communicate and preserve all kinds of human thought. Second, this necessity and pervasiveness have placed high demands on how text is delivered and received, which has led to a large and ever-increasing variability in its visual forms. Text can originate either as printed or handwritten, with large possible variability in handwriting styles, printing fonts, and formatting options. Text can be found organized in documents as lines, tables, or forms, or cluttered in natural scenes. Text can suffer countless types and degrees of degradation, viewing distortion, occlusion, background clutter, spacing, slant, and curvature. Text from spoken languages alone (an important subset of human-produced text) is available in dozens of scripts corresponding to thousands of languages. All of this has contributed to the long-standing, complicated nature of unconstrained text recognition.

Since a deep Convolutional Neural Network (CNN) won the ImageNet image classification challenge [1], Deep Learning based techniques have spread to most computer vision tasks, either equating or surpassing all previous methods at a fraction of the required domain knowledge and field expertise. Text recognition was no exception, and methods based on CNNs and Recurrent Neural Networks (RNNs) have dominated all text recognition tasks like OCR [2], handwriting recognition [3], scene text recognition [4], and license plate recognition [5], and have become the de facto standard for the task.

Despite their success, one can spot a number of shortcomings in these works. First, for many tasks, an RNN is required to achieve state-of-the-art results, which introduces non-trivial latency due to the sequential processing nature of RNNs. This is surprising given that, for purely visual text recognition, long-range dependencies have little effect and only a local neighborhood should influence the final frame or character classification. Second, for each of these tasks, there is a separate model that can, with its own set of tweaks and tricks, achieve state-of-the-art results in a single task or a small number of tasks. Thus, no single model has been demonstrated effective across the wide spectrum of text recognition tasks. Choosing a different model or feature extractor for different input data, even within a relatively limited problem like text recognition, is a great burden for practitioners and clearly contradicts the idea of automatic, data-driven representation learning promised by deep learning methods.

In this work, we propose a novel, purely feed-forward neural network architecture for efficient, generic, unconstrained text recognition. Our proposed architecture is a fully convolutional network [6] that consists mostly of depthwise separable convolutions with novel inter-layer residual connections [7] and gating, trained on full line or word labels using the Connectionist Temporal Classification (CTC) loss [8]. We also propose a set of generic data augmentation techniques that are suitable for any text recognition task and show how they affect the performance of the system. We demonstrate the superior performance of our proposed system through extensive experimentation on seven public benchmark datasets. We were also able, for the first time, to demonstrate human-level performance on the reCAPTCHA dataset proposed recently in a Science paper [9], a more than 20% absolute increase in CAPTCHA recognition rate compared to their proposed RCN system. We also achieve state-of-the-art performance on SVHN [10] (the full sequence version), the unconstrained settings of the IAM English offline handwriting dataset [11], the KHATT Arabic offline handwriting dataset [12], the University of Washington (UW3) OCR dataset [13], and the AOLP license plate recognition dataset [14] (in all divisions). Our proposed system has also won the ICFHR2018 Competition on Automated Text Recognition on a READ Dataset [15], achieving a more than 25% relative decrease in Character Error Rate (CER) compared to the second-place entry.

To summarize, we address the unconstrained text recognition problem. In particular, we make the following contributions:

  • We propose a novel neural network architecture that is able to achieve state-of-the-art performance, with feed forward connections only (no recurrent connections), and using only the highly efficient convolutional primitives.

  • We propose a set of data augmentation procedures that are generic to any text recognition task and can boost the performance of any neural network architecture on text recognition.

  • We conduct an extensive set of experiments on seven benchmark datasets to demonstrate the validity of our claims about the generality of our proposed architecture. We also perform an extensive ablation study on our proposed model to demonstrate the importance of each of its submodules.

Section 2 gives an overview of related work in the field. Section 3 describes our architecture design and its training process in detail. Section 4 describes our extensive set of experiments and presents their results.


Related work

Work on text recognition is enormous and spans more than a century; here we focus on methods based on deep learning, as they have represented the state of the art for at least five years. Traditional methods are reviewed in [16].

There are two major trends in deep-learning-based sequence transduction, both of which avoid the need for fine-grained alignment between source and target sequences. The first is using CTC [8], and the second is using sequence-to-sequence (encoder-decoder) models [17], usually

Methodology

In this section, we present the details of our proposed architecture. First, Section 3.1 gives an overview of the proposed architecture. Then, Section 3.2 presents the derivation of the gating mechanism utilized by our architecture. After that, Section 3.3 discusses the data augmentation techniques we use in the course of training. Finally, Section 3.4 describes our model training and evaluation process.
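The paper's own gating derivation appears in Section 3.2 and is not reproduced in this excerpt. For illustration only, the sketch below shows a common convolutional-network gating pattern (a GLU/highway-style elementwise gate, an assumption rather than the paper's exact mechanism): a sigmoid gate in (0, 1) decides, per element, how much of a candidate activation f versus the identity path x is passed on.

```python
import numpy as np

def sigmoid(z):
    """Numerically plain logistic function, mapping R -> (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def gated_mix(x, f, g):
    """Elementwise gated residual mix: gate * f + (1 - gate) * x.

    x: identity (skip) path, f: candidate activation, g: gate pre-activation.
    A very negative g passes x through; a very positive g passes f through.
    """
    gate = sigmoid(g)
    return gate * f + (1.0 - gate) * x
```

Gates of this form let a deep stack learn, per channel and position, whether to transform its input or simply carry it forward, which is one standard way to ease optimization in networks without recurrence.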

Experiments

In this section we evaluate and compare the proposed architecture to the previous state-of-the-art techniques by conducting extensive experiments on seven public benchmark datasets covering a wide spectrum of text recognition tasks.

Conclusion

In this paper, we tackled the problem of general, unconstrained text recognition. We presented a novel, data and computation efficient, neural network architecture that can be trained end-to-end on variable-sized images using variable-sized line level transcriptions. We conducted an extensive set of experiments on seven public benchmark datasets covering a wide range of text recognition sub-tasks, and demonstrated state-of-the-art performance on each one of them using the same architecture and

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

Mohamed Yousef received the B.S. and M.S. degrees in Computer Science from the Faculty of Computers and Information, Assiut University, Egypt, in 2010 and 2013, respectively. His research interests include computer vision, deep neural networks, and generic text detection and recognition.

References (74)

  • K. He et al., Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • A. Graves et al., Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, Proceedings of the 23rd International Conference on Machine Learning (2006)
  • D. George et al., A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs, Science (2017)
  • Y. Netzer et al., Reading digits in natural images with unsupervised feature learning, NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2011)
  • U.-V. Marti et al., The IAM-database: an English sentence database for offline handwriting recognition, Int. J. Doc. Anal. Recogn. (2002)
  • I. Phillips, User's reference manual for the UW English/technical document image database III, UW-III English/Technical...
  • G.-S. Hsu et al., Application-oriented license plate recognition, IEEE Trans. Veh. Technol. (2013)
  • T. Strauß et al., ICFHR2018 competition on automated text recognition on a READ dataset, 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR) (2018)
  • Q. Ye et al., Text detection and recognition in imagery: a survey, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
  • K. Cho et al., Learning phrase representations using RNN encoder–decoder for statistical machine translation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014)
  • D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in:...
  • A. Hannun, Sequence modeling with CTC, Distill (2017)
  • B. Su et al., Accurate scene text recognition based on recurrent neural network, Asian Conference on Computer Vision (2014)
  • C. Zhai et al., Chinese image text recognition with BLSTM-CTC: a segmentation-free method, Chinese Conference on Pattern Recognition (2016)
  • L. Sun et al., Deep LSTM networks for online Chinese handwriting recognition, 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR) (2016)
  • T.M. Breuel, High performance text recognition using a hybrid convolutional-LSTM implementation, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) (2017)
  • J. Puigcerver, Are multidimensional recurrent layers really necessary for handwritten text recognition?, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) (2017)
  • K. Dutta et al., Improving CNN-RNN hybrid networks for handwriting recognition, 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR) (2018)
  • Z. Xie et al., Learning spatial-semantic context with fully convolutional recurrent network for online handwritten Chinese text recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2017)
  • Y.-C. Wu et al., Handwritten Chinese text recognition using separable multi-dimensional recurrent neural network, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) (2017)
  • S. Bai, J.Z. Kolter, V. Koltun, An empirical evaluation of generic convolutional and recurrent networks for sequence...
  • F. Yin, Y.-C. Wu, X.-Y. Zhang, C.-L. Liu, Scene text recognition with sliding convolutional character models,...
  • B. Shi et al., ASTER: an attentional scene text recognizer with flexible rectification, IEEE Trans. Pattern Anal. Mach. Intell. (2018)
  • A. Chowdhury et al., An efficient end-to-end neural model for handwritten text recognition, BMVC (2018)
  • S. Ioffe et al., Batch normalization: accelerating deep network training by reducing internal covariate shift, International Conference on Machine Learning (2015)
  • J.L. Ba, J.R. Kiros, G.E. Hinton, Layer normalization, arXiv:1607.06450...
  • S. Ioffe, Batch renormalization: towards reducing minibatch dependence in batch-normalized models, Advances in Neural Information Processing Systems (2017)

    Khaled F. Hussain received the B.S. and M.S. degrees in electrical engineering from Assiut University, Assiut, Egypt, in 1994 and 1996, respectively, and the Ph.D. degree in computer science from University of Central Florida, Orlando, FL, USA, in 2001. From 2002 to 2006, he was a Visiting Assistant Professor with University of Central Florida. Since 2007, he has been with the Computer Science Department, Faculty of Computers and Information, Assiut University, where he is currently an Associate Professor, the Executive Director of the Multimedia Laboratory, and a Vice Dean. His major research interests include computer vision, computer graphics, augmented reality, and computer animation.

    Usama Sayed received his B.Sc. and M.Sc. degrees from Assiut University, Assiut, Egypt, in 1985 and 1993, respectively, and his Ph.D. degree from the Czech Technical University in Prague, Czech Republic, in 2000, all in electrical engineering. From 1988 to 1996, he was an Assistant Lecturer at the Faculty of Engineering, Assiut University. From February 1997 to July 2000, he was a research assistant in the Department of Telecommunications Technology at the Czech Technical University in Prague. From December 1999 to March 2000, he was a research assistant at the University of California, Santa Barbara (UCSB), USA. From November 2001 to April 2002, he was a Post-Doctoral Fellow with the Faculty of Engineering, Czech Technical University in Prague. From February 2006 to July 2011, he was an Associate Professor with the Faculty of Engineering, Assiut University, Egypt. Dr. Usama is a Professor with the Faculty of Engineering at Assiut University, and he was the head of the Electrical Engineering Department and the Vice Dean for Graduate Studies and Research, Faculty of Engineering, Assiut University, Egypt. He has authored and co-authored more than 140 scientific papers and was selected for inclusion in the 2010 edition of the USA Marquis Who's Who in the World. His research interests include telecommunication technology, wireless technology, image recognition, image coding, statistical signal processing, blind signal separation, and video coding.