A selectional auto-encoder approach for document image binarization

doi:10.1016/j.patcog.2018.08.011

Pattern Recognition

Volume 86, February 2019, Pages 37-47

https://doi.org/10.1016/j.patcog.2018.08.011

Highlights

• A selectional autoencoder approach for document image binarization is studied.
• The neural network is devoted to learning an image-to-image binarization.
• Comprehensive experimentation with datasets of different typology is presented.
• Results demonstrate that the approach is able to outperform the state of the art.

Abstract

Binarization plays a key role in automatic information retrieval from document images. This process is usually performed in the first stages of document analysis systems and serves as a basis for subsequent steps. Hence, it has to be robust for the full analysis workflow to succeed. Several methods for document image binarization have been proposed so far, most of which are based on hand-crafted image processing strategies. Recently, Convolutional Neural Networks have shown remarkable performance in many disparate computer vision tasks. In this paper we discuss the use of convolutional auto-encoders devoted to learning an end-to-end map from an input image to its selectional output, in which activations indicate the likelihood of pixels being either foreground or background. Once trained, documents can therefore be binarized by passing them through the model and applying a global threshold. This approach has proven to outperform existing binarization strategies on a number of document types.

Introduction

Image binarization consists of assigning a binary value to every single pixel of an image. Within the context of document analysis systems, the main objective is to distinguish the foreground (meaningful information) from the background.

Binarization plays a key role in the workflow of many document analysis and recognition systems [1], [2], [3], [4], [5]. It not only helps to reduce the complexity of the task but is also advisable for procedures involving morphological operations, detection of connected components, or histogram analysis, among others. Many methods have been proposed to accomplish this task. However, it is often difficult to attain good results because documents may exhibit several difficulties (such as irregular leveling, blots, and bleed-through) that may cause the process to fail.

In addition to all these obstacles, it is worth emphasizing that it is very difficult for the same method to work successfully across a number of document styles, since the set of potential domains is very heterogeneous. In order to deal with this situation, we discuss a machine learning framework with which to binarize document images. That is, a ground-truth set of examples is used to train a model to perform the binarization task. This allows the same approach to be used on a wide range of documents, provided there is specific ground-truth data to train a new model for each document type.

Specifically, we make use of Convolutional Neural Networks (CNNs) [6]. These networks involve multi-layer architectures that apply a series of transformations (convolutions) to the input signal. The parameters of these transformations are adjusted through a training process. CNNs have dramatically improved the state of the art in many tasks such as image, video, and speech processing [7]. Thus, their use for document binarization is promising. In this case, we consider an image-to-image convolutional architecture, which is trained to transform an input image into its binarized version.

Our experiments focus on testing this strategy in different document types, namely Latin text documents, palm leaf scripts, Persian documents, and music scores. We also compare the approach against other classical and state-of-the-art algorithms for binarization, showing that this approach leads to a significant improvement.

The remainder of the paper is structured as follows: work related to document image binarization is introduced in Section 2; the image-to-image binarization framework based on convolutional models is described in Section 3; the experiments are presented in Section 4, including model configuration, training strategies, comparison with existing techniques, and cross-document adaptation; and finally, the work is concluded in Section 5.

Section snippets

Background

The most straightforward procedure for image binarization is to resort to simple thresholding, in which all pixels under a certain grayscale value are set to 0, and those above to 1. This threshold can be fixed by hand, yet algorithms such as Otsu’s [8] automatically estimate a value according to the input image. However, as the complexity of the document to process increases, this simple procedure usually leads to poor or irregular binarization, and so it is preferable to resort to other kind
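The simple thresholding described in this snippet can be illustrated with a minimal NumPy sketch. It assumes an 8-bit grayscale image stored as a `uint8` array; the function names `global_threshold` and `otsu_threshold` are illustrative and do not come from the paper. The second function follows the usual formulation of Otsu's method (choosing the threshold that maximizes the between-class variance of the histogram) and is not meant to reproduce any particular implementation used by the authors.

```python
import numpy as np

def global_threshold(gray, t):
    """Set pixels above t to 1 and all other pixels to 0."""
    return (gray > t).astype(np.uint8)

def otsu_threshold(gray):
    """Estimate a global threshold by maximizing between-class variance."""
    # Normalized histogram of the 256 possible gray levels.
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)

    w0 = np.cumsum(prob)            # weight of the class with values <= t
    w1 = 1.0 - w0                   # weight of the class with values > t
    mu = np.cumsum(prob * levels)   # cumulative mean up to each level
    mu_total = mu[-1]

    # Between-class variance, defined only where both classes are non-empty.
    valid = (w0 > 0) & (w1 > 0)
    sigma_b = np.zeros(256)
    sigma_b[valid] = (mu_total * w0[valid] - mu[valid]) ** 2 / (w0[valid] * w1[valid])
    return int(np.argmax(sigma_b))

# Toy image: left half dark (50), right half bright (200).
toy = np.full((4, 8), 50, dtype=np.uint8)
toy[:, 4:] = 200
t = otsu_threshold(toy)
binary = global_threshold(toy, t)
```

On a clean two-level image such as this toy example, any threshold between the two gray levels separates them; the difficulties mentioned in the text arise when degradations make a single global value insufficient.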

Selectional auto-encoder for document image binarization

From a machine learning point of view, image binarization can be formulated as a two-class classification task at pixel level. The framework proposed follows this idea and, therefore, consists in learning to estimate which label must be given to every single pixel of an image. Since we are dealing with images of documents, we define the set of labels as foreground and background. As mentioned above, a way to implement this approach is to use a neural network to estimate which of these labels

Experiments

This section details the experimentation carried out to evaluate the discussed approach. The performance of a binarization algorithm can be evaluated in several ways. For instance, if the algorithm is part of a workflow to perform a particular task, an interesting way to measure the performance is in relation to the final performance. However, this implies that the evaluation of the algorithm may not be totally fair, since it would be strongly related to the performance of the rest of the

Conclusions

In this paper an approach for document image binarization has been presented. The strategy is to train a Selectional Auto-Encoder (SAE) that is able to learn an end-to-end transformation to binarize an image. Given an image patch of a fixed size, the model outputs a selectional value for each pixel of the patch, reflecting the confidence that the pixel belongs to the foreground of the document. These values are eventually thresholded to yield a discrete binary result.
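The inference procedure summarized above (fixed-size patches, a per-pixel selectional value, then a threshold) can be sketched as follows. This is only an illustration under stated assumptions: `predict_patch` is a hypothetical stand-in for a trained model, the patch size of 256, the threshold of 0.5, the edge padding, and the scaling to [0, 1] are all choices made for the example and are not taken from the paper.

```python
import numpy as np

def binarize_with_patches(image, predict_patch, patch=256, threshold=0.5):
    """Binarize a 2-D grayscale image by processing fixed-size patches.

    predict_patch is a hypothetical callable that maps a (patch, patch)
    float array to an array of the same shape holding per-pixel
    foreground confidences in [0, 1].
    """
    h, w = image.shape
    # Pad so that both dimensions are multiples of the patch size (assumption).
    pad_h = (-h) % patch
    pad_w = (-w) % patch
    padded = np.pad(image, ((0, pad_h), (0, pad_w)), mode="edge")

    out = np.zeros(padded.shape, dtype=np.float32)
    for y in range(0, padded.shape[0], patch):
        for x in range(0, padded.shape[1], patch):
            tile = padded[y:y + patch, x:x + patch].astype(np.float32) / 255.0
            out[y:y + patch, x:x + patch] = predict_patch(tile)

    # Crop the padding away and apply the global threshold.
    return (out[:h, :w] > threshold).astype(np.uint8)

def toy_model(tile):
    """Stand-in for a trained network: darker pixels get higher confidence."""
    return 1.0 - tile

# Toy page: white background (255) with a few black "ink" pixels (0).
page = np.full((5, 7), 255, dtype=np.uint8)
page[1:3, 2:5] = 0
result = binarize_with_patches(page, toy_model, patch=4)
```

In practice `toy_model` would be replaced by the forward pass of the trained network; the surrounding tiling and thresholding logic stays the same.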

The influence of the

Acknowledgments

This work was partially supported by the Social Sciences and Humanities Research Council of Canada, the Spanish Ministerio de Ciencia, Innovación y Universidades through Juan de la Cierva - Formación grant (Ref. FJCI-2016-27873), and the Universidad de Alicante through grant GRE-16-04.

Jorge Calvo-Zaragoza received his Ph.D. degree in computer science from the University of Alicante in June 2016. He joined McGill University (Canada) in 2017 as a postdoctoral fellow. Currently, he is the recipient of a Juan de la Cierva - Formación research grant from the Spanish government. He has authored more than 30 papers in peer-reviewed journals and international conferences. His main interests are focused on Pattern Recognition and Document Analysis.

References (43)

G. Louloudis et al., Text line detection in handwritten documents, Pattern Recognit. (2008)
R. Saabni et al., Text line extraction for historical document images, Pattern Recognit. Lett. (2014)
S. He et al., Junction detection in handwritten documents and its application to writer identification, Pattern Recognit. (2015)
A.P. Giotis et al., A survey of document image word spotting techniques, Pattern Recognit. (2017)
J. Sauvola et al., Adaptive document image binarization, Pattern Recognit. (2000)
B. Gatos et al., Adaptive degraded document image binarization, Pattern Recognit. (2006)
J. Zhao, M. Mathieu, R. Goroshin, Y. LeCun, Stacked What-where Auto-encoders,...
J. Calvo-Zaragoza et al., Avoiding staff removal stage in optical music recognition: application to scores written in white mensural notation, Pattern Anal. Appl. (2015)
M.D. Zeiler et al., Visualizing and understanding convolutional networks, Proceedings of the European Conference on Computer Vision (2014)
Y. LeCun et al., Deep learning, Nature (2015)
N. Otsu, A threshold selection method from gray-level histograms, Automatica (1975)
W. Niblack, An Introduction to Digital Image Processing (1985)
C. Wolf et al., Text localization, enhancement and binarization in multimedia documents, Proceedings of the International Conference on Pattern Recognition (2002)
G. Lazzara et al., Efficient multiscale Sauvola's binarization, Int. J. Doc. Anal. Recognit. (2014)
B. Su et al., Robust document image binarization technique for degraded document images, IEEE Trans. Image Process. (2013)
N.R. Howe, A Laplacian energy for document binarization, Proceedings of the 2011 International Conference on Document Analysis and Recognition (ICDAR) (2011)
N.R. Howe, Document binarization with automatic parameter tuning, Int. J. Doc. Anal. Recognit. (2013)
S. Katz et al., Direct visibility of point sets, ACM Transactions on Graphics (TOG) (2007)
I. Pratikakis et al., ICFHR 2016 handwritten document image binarization contest (H-DIBCO 2016), Proceedings of the 15th International Conference on Frontiers in Handwriting Recognition, ICFHR, Shenzhen, China (2016)
Z. Chi et al., A two-stage binarization approach for document images, Proceedings of the International Symposium on Intelligent Multimedia, Video and Speech Processing (2001)
J.L. Hidalgo et al., Enhancement and cleaning of handwritten data by using neural networks, Proceedings of the 2nd Iberian Conference on Pattern Recognition and Image Analysis (2005)


Antonio-Javier Gallego is assistant professor at the Department of Software and Computing Systems of the University of Alicante. He received a B.S. and M.S. degree in Computer Science from the University of Alicante in 2004, and the Ph.D. in Computer Science and Artificial Intelligence from the same university in 2012. He has more than 30 publications, including international journals, international conferences, books and book chapters. His research interests include Deep Learning, Pattern Recognition, and Computer Vision.
