RectiNet-v2: A stacked network architecture for document image dewarping

doi:10.1016/j.patrec.2022.01.014

Pattern Recognition Letters

Volume 155, March 2022, Pages 41-47

https://doi.org/10.1016/j.patrec.2022.01.014

Highlights

• A Gated and Bifurcated stacked network is proposed to rectify warped documents.
• Residual Paths are introduced to enhance the flow of information within the network.
• A novel boundary weighted loss is used to enable greater focus on boundaries.
• Results indicating dewarp quality are summarized with relevant comparisons.

Abstract

With the advent of mobile and hand-held cameras, document images have found their way into almost every domain. Dewarping of these images for the removal of perspective distortions and folds is essential so that they can be understood by document recognition algorithms. For this, we propose an end-to-end CNN architecture that can produce distortion-free document images from warped documents it takes as input. We train this model on warped document images simulated synthetically to compensate for the lack of enough natural data. Our method is novel in the use of a bifurcated decoder with shared weights to prevent intermingling of grid coordinates, in the use of residual networks in the U-Net skip connections to allow the flow of data from different receptive fields in the model, and in the use of a gated network to help the model focus on structure and line-level detail of the document image. We evaluate our method on the DocUNet dataset, a benchmark in this domain, and obtain results comparable to state-of-the-art methods.

Introduction

Photographing a document with a camera is the most popular method of storing it. With the large-scale popularization of mobile devices with built-in cameras and storage, capturing document images has become the norm for storing data. These captures, however, are more often than not done casually, resulting in distorted and warped images that can be interpreted by humans but not by document recognition systems, owing to large differences in illumination, placement, and condition of the documents. For machines to understand the data contained in captured document images, dewarping such images is a necessity.

A large number of classical image processing and optimization-based methods have been proposed for dewarping document images. These, however, fail when curves and folds occur simultaneously in document images, which require a more in-depth and varied analysis. To rectify these complex document images, deep learning methods have been introduced recently by [1,14,3] and [13]. These deep learning methods treat the problem of document dewarping as the prediction of a dense grid that can aid in the dewarping process. The dense grid-based approach for dewarping images is preferred to the sparse grid-based method as it can effectively capture very fine distortion that a very limited set of dewarping points or a sparse grid cannot. As a result, deep learning methods for document dewarping have been able to dewarp images of a complex nature with significantly higher precision as compared to their image processing counterparts.
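
To make the dense-grid idea concrete, the following minimal sketch shows how such a grid could be applied once it has been predicted. It assumes the grid is a backward map that stores, for every output pixel, the normalized (x, y) source location in [0, 1]. This convention, the function names `dewarp_with_grid` and `identity_grid`, and the nearest-neighbour sampling are illustrative assumptions, not the formulation used by the cited methods.

```python
import numpy as np

def dewarp_with_grid(image, grid):
    """Sample `image` at the locations stored in a dense backward-mapping grid.

    image: (H, W) or (H, W, C) array.
    grid:  (h, w, 2) array; grid[i, j] = (x, y) in [0, 1] is the source
           location for output pixel (i, j). Assumed convention, for
           illustration only.
    """
    H, W = image.shape[:2]
    # Convert normalized coordinates to integer pixel indices (nearest neighbour).
    xs = np.clip(np.rint(grid[..., 0] * (W - 1)).astype(int), 0, W - 1)
    ys = np.clip(np.rint(grid[..., 1] * (H - 1)).astype(int), 0, H - 1)
    # Fancy indexing gathers one source pixel per output pixel.
    return image[ys, xs]

def identity_grid(h, w):
    """Grid that maps every output pixel to the same location in the input."""
    ys, xs = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    return np.stack([xs, ys], axis=-1)
```

Because every output pixel carries its own source coordinate, a dense grid can express local folds and curls that a handful of control points cannot. A practical implementation would likely use bilinear rather than nearest-neighbour sampling.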

Inspired by the stacked network architecture proposed by [14], we use a similar architecture in our method, comprising a primary U-Net and a secondary U-Net. More specifically, we propose modifications to existing methods with a bifurcated secondary U-Net, a gated module, and residual paths in the primary U-Net to improve dense grid predictions and thereby dewarp quality. Our contributions can be summarized as follows:

• Use of a bifurcated network that takes in images of dimension 256x256 and regresses a dense grid that can unwarp the document represented by the image. This unwarping grid can be interpolated later so that the images are dewarped at their original resolution. The bifurcated network allows us to prevent the intermingling of dense-grid values.
• Use of Residual blocks in the skip connections of the U-Nets used in the stacked module. The use of residual blocks as proposed by [8] enables us to leverage different receptive fields in the skip connections and allows us to pass on information from various levels to the decoder layers.
• Use of Gated Convolutional Layers in the model architecture, inspired by [18]. The presence of gates in these layers helps to capture edge and line-level data and pass it on in later layers as information on which the model has to focus. In other words, the GCN (Gated Convolutional Network) acts as an attention module to the Secondary U-Net.
• Use of a Boundary Weighted mean squared loss function that focuses more on the boundary of the dense-grids predicted by a Secondary U-Net. This ensures that poor detection of boundaries by the network is penalized more, and unwarps obtained from the module contain minimal background data of the document image.
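
The boundary-weighted loss in the last contribution can be illustrated with a short sketch. The exact weighting scheme is not given in this excerpt, so everything below is an assumption made for illustration: the boundary is taken to be the document-mask pixels that touch the background, those pixels receive a constant weight `boundary_weight`, and all other pixels receive weight 1. The function names are hypothetical.

```python
import numpy as np

def boundary_mask(mask):
    """Mark foreground pixels that have at least one 4-neighbour in the background."""
    m = mask.astype(bool)
    padded = np.pad(m, 1, constant_values=False)
    # A pixel is interior when its up, down, left and right neighbours are all foreground.
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    return m & ~interior

def boundary_weighted_mse(pred, target, mask, boundary_weight=5.0):
    """Weighted mean squared error between two dense grids.

    pred, target: (H, W, 2) grids; mask: (H, W) document mask.
    The constant boundary weight is an assumed, illustrative choice.
    """
    weights = np.ones(mask.shape, dtype=float)
    weights[boundary_mask(mask)] = boundary_weight
    sq_err = ((pred - target) ** 2).sum(axis=-1)  # squared error per pixel
    return float((weights * sq_err).sum() / weights.sum())
```

With this weighting, an error of the same size costs `boundary_weight` times more on the document edge than in the interior, which pushes the network to place the page boundary accurately.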

We train our network on a randomly sampled subset of Doc3D, a synthetically warped image dataset proposed by [3] that has warps created by emulating naturally found deformations in documents. Our gated and bifurcated network achieves an SSIM value of 0.50, an MS-SSIM value of 0.45, and a Local Distortion of 10.40 on the DocUNet benchmark proposed by [14].

Previous works

In the past several years, we have seen significant progress in the domain of document image dewarping. The methods proposed in the past can be grouped into the following categories:

1. Image Processing based methods
2. Deep Learning based methods

Dataset

We make use of the data generation procedure proposed by [3], as their data is significantly more realistic and generalizes better to natural images than that of [14]. In the dataset proposed by [3], 3D shapes and textures of naturally deformed documents were captured and rendered on images with the help of path tracing, using many camera positions and a variety of illumination effects and conditions. This allowed the creation of a large-scale image dataset with the data being highly

Architecture overview

The overall architecture of our method is shown in Fig. 2. We have made significant changes to the stacked U-Net architecture originally proposed by [14]. The major changes are the addition of a gated convolutional network for proper processing of line-level information and a bifurcation in the secondary U-Net of the stack. Inspired by [8], we also add residual networks in the skip connections of our model to enhance the features that are being concatenated in the later

Experiments and analysis

We sample 45,000 synthetically warped images from the dataset released by [3]. We split these 9:1 into training and validation sets and train our model on the training set. We keep the model weights that achieve the minimum validation loss and perform experiments on the benchmark dataset proposed by [14].
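
The data protocol above can be sketched in a few lines. This is a framework-agnostic illustration, not the authors' training code: the helper names, the fixed random seed, and the list of per-epoch validation losses are all assumptions made for the example.

```python
import random

def split_train_val(items, val_fraction=0.1, seed=0):
    """Shuffle `items` and split them into (train, validation) lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_val = int(round(len(items) * val_fraction))
    return items[n_val:], items[:n_val]

def best_checkpoint(val_losses):
    """Return (epoch_index, loss) for the lowest validation loss seen."""
    best_epoch, best_loss = None, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # A real training loop would save the model weights at this point.
            best_epoch, best_loss = epoch, loss
    return best_epoch, best_loss
```

For 45,000 sampled images, a 9:1 split gives 40,500 training and 4,500 validation images.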

To compare the dewarp quality of our method with that of others, we use MS-SSIM (Multi-Scale Structural Similarity Index), SSIM (Structural Similarity Index), and LD (Local Distortion).
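
As a rough illustration of what SSIM measures, the sketch below computes a simplified, single-window variant from global image statistics. Standard SSIM is computed over local windows and averaged, and MS-SSIM repeats this over several scales, so the values here are not comparable to the benchmark numbers. The function name `global_ssim` is hypothetical; the constants 0.01 and 0.03 are the conventional SSIM stabilizers.

```python
import numpy as np

def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM between two equally sized grayscale images."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Small constants that stabilize the division when means or variances are near zero.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    structure = (2 * cov_xy + c2) / (var_x + var_y + c2)
    return float(luminance * structure)
```

A score of 1 means the two images are identical. In a dewarping benchmark the comparison is between the dewarped output and a flat reference scan of the same document.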

Conclusion

In this paper, we have proposed an algorithm to dewarp document images by recognizing their structure and predicting dense grids for mappings. We have demonstrated the effectiveness of a bifurcation of the traditional U-Net and the addition of a gated module and residual pathways by comparing our method with state-of-the-art methods on the DocUNet benchmark of 130 real-world images.

As discussed earlier, however, our methods fail in certain respects and future works making use of our

Declaration of Competing Interest

There is no conflict of interest for the present work.

Acknowledgments

All of the experiments reported in this paper were carried out at the Center for Microprocessor Application for Training Education and Research (CMATER), Jadavpur University, on hardware infrastructure provided by the Science and Engineering Research Board (SERB), India (Ref.# SB/S3/EECE/054/2016).

References (22)

N. Ibtehaz et al., Multiresunet: rethinking the u-net architecture for multimodal biomedical image segmentation, Neural Networks (2020)
K. Ma et al., Docunet: Document Image Unwarping via a Stacked U-net, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
H. Bandyopadhyay, T. Dasgupta, N. Das, M. Nasipuri, A gated and bifurcated stacked u-net module for document image...
M.S. Brown et al., Document Restoration Using 3D Shape: A General Deskewing Algorithm for Arbitrarily Warped Documents, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001 (2001)
S. Das et al., Dewarpnet: Single-image Document Unwarping with Stacked 3D and 2D Regression Networks, Proceedings of the IEEE International Conference on Computer Vision (2019)
S. Das et al., The Common Fold: Utilizing the Four-fold to Dewarp Printed Documents from a Single Image, Proceedings of the 2017 ACM Symposium on Document Engineering (2017)
H. Ezaki, S. Uchida, A. Asano, H. Sakoe, Dewarping of document image by global optimization, IEEE, 2005. Eighth...
V. Frinken et al., A novel word spotting method based on recurrent neural networks, IEEE Trans Pattern Anal Mach Intell (2011)
B. Gatos, I. Pratikakis, K. Ntirogiannis, Segmentation based recovery of arbitrarily warped document images, IEEE,...
H.I. Koo, Text-line detection in camera-captured document images using the state estimation of connected components, IEEE Trans. Image Process. (2016)
H.I. Koo et al., Composition of a dewarped and enhanced document image from two view images, IEEE Trans. Image Process. (2009)
