Elsevier

Pattern Recognition Letters

Volume 155, March 2022, Pages 41-47
RectiNet-v2: A stacked network architecture for document image dewarping

https://doi.org/10.1016/j.patrec.2022.01.014

Highlights

  • A Gated and Bifurcated stacked network is proposed to rectify warped documents.

  • Residual Paths are introduced to enhance the flow of information within the network.

  • A novel boundary weighted loss is used to enable greater focus on boundaries.

  • Results indicating dewarp quality are summarized with relevant comparisons.

Abstract

With the advent of mobile and hand-held cameras, document images have found their way into almost every domain. Dewarping of these images for the removal of perspective distortions and folds is essential so that they can be understood by document recognition algorithms. For this, we propose an end-to-end CNN architecture that can produce distortion-free document images from warped documents it takes as input. We train this model on warped document images simulated synthetically to compensate for the lack of enough natural data. Our method is novel in the use of a bifurcated decoder with shared weights to prevent intermingling of grid coordinates, in the use of residual networks in the U-Net skip connections to allow the flow of data from different receptive fields in the model, and in the use of a gated network to help the model focus on structure and line-level detail of the document image. We evaluate our method on the DocUNet dataset, a benchmark in this domain, and obtain results comparable to state-of-the-art methods.

Introduction

Photographing a document with a camera is one of the most popular ways of storing it. With the large-scale popularization of mobile devices with inbuilt cameras and storage, capturing document images has become the norm for storing data. These captures, however, are more often than not done casually, resulting in distorted and warped images that humans can interpret but document recognition systems cannot, owing to large differences in illumination, placement, and condition of the documents. For machines to understand the data contained in captured document images, dewarping such images is a necessity.

A large number of classical image processing and optimization-based methods have been proposed for dewarping document images. These, however, fail when curves and folds occur simultaneously in document images, which require a more in-depth and varied analysis. To rectify these complex document images, deep learning methods have been introduced recently by [1,14,3] and [13]. These deep learning methods treat the problem of document dewarping as the prediction of a dense grid that can aid in the dewarping process. The dense grid-based approach for dewarping images is preferred to the sparse grid-based method as it can effectively capture very fine distortion that a very limited set of dewarping points or a sparse grid cannot. As a result, deep learning methods for document dewarping have been able to dewarp images of a complex nature with significantly higher precision as compared to their image processing counterparts.
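To make the dense-grid idea concrete, the following is a minimal sketch of how a predicted grid could be applied to an image. It assumes a backward-mapping convention, in which each output pixel stores the source pixel coordinates it should be read from, and it keeps the x and y coordinates in two separate grids. The function names `bilinear_sample` and `dewarp` are hypothetical and are not taken from the paper.

```python
def bilinear_sample(image, x, y):
    """Sample a 2D list `image` at a real-valued pixel position (x, y)."""
    h, w = len(image), len(image[0])
    # Clamp to the valid pixel range so border lookups stay in bounds.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def dewarp(image, grid_x, grid_y):
    """Assumed convention: output[i][j] = image at (grid_x[i][j], grid_y[i][j])."""
    return [
        [bilinear_sample(image, gx, gy) for gx, gy in zip(row_x, row_y)]
        for row_x, row_y in zip(grid_x, grid_y)
    ]


if __name__ == "__main__":
    image = [[0, 10], [20, 30]]
    # An identity grid leaves the image unchanged.
    identity_x = [[0, 1], [0, 1]]
    identity_y = [[0, 0], [1, 1]]
    print(dewarp(image, identity_x, identity_y))
```

Because every output pixel has its own source coordinate, a dense grid of this kind can represent local folds and curls that a handful of control points cannot.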

Inspired by the stacked network architecture proposed by [14], we use a similar architecture in our method, comprising a primary U-Net and a secondary U-Net. More specifically, we propose modifications to existing methods in the form of a bifurcated secondary U-Net, a gated module, and residual paths in the primary U-Net to improve dense grid predictions and thereby dewarp quality. Our contributions can be summarized as follows:

  • Use of a bifurcated network that takes in images of dimension 256×256 and regresses a dense grid that can unwarp the document represented by the image. This unwarping grid can later be interpolated so that images are dewarped at their original resolution. The bifurcation prevents the intermingling of dense-grid coordinate values.

  • Use of Residual blocks in the skip connections of the U-Nets used in the stacked module. The use of residual blocks as proposed by [8] enables us to leverage different receptive fields in the skip connections and allows us to pass on information from various levels to the decoder layers.

  • Use of Gated Convolutional Layers in the model architecture, inspired by [18]. The presence of gates in these layers helps to capture edge and line-level data and pass it on in later layers as information on which the model has to focus. In other words, the GCN (Gated Convolutional Network) acts as an attention module to the Secondary U-Net.

  • Use of a Boundary Weighted mean squared loss function that focuses more on the boundary of the dense grids predicted by the secondary U-Net. This ensures that poor detection of boundaries by the network is penalized more, and that unwarps obtained from the module contain minimal background data from the document image.
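The boundary weighted loss in the last contribution can be illustrated with a small sketch. The exact weighting used in the paper is not given in this excerpt, so the version below makes two assumptions: the boundary is the outermost ring of the predicted grid, and boundary cells receive a constant weight larger than 1 while interior cells receive weight 1. The names `boundary_weights`, `boundary_weighted_mse`, `border`, and `boundary_weight` are hypothetical.

```python
def boundary_weights(height, width, border=1, boundary_weight=2.0):
    """Weight map: `boundary_weight` on the outer `border` cells, 1.0 elsewhere."""
    return [
        [
            boundary_weight
            if (i < border or j < border
                or i >= height - border or j >= width - border)
            else 1.0
            for j in range(width)
        ]
        for i in range(height)
    ]


def boundary_weighted_mse(pred, target, weights):
    """Weighted mean squared error, normalized by the sum of the weights."""
    total = 0.0
    weight_sum = 0.0
    for p_row, t_row, w_row in zip(pred, target, weights):
        for p, t, w in zip(p_row, t_row, w_row):
            total += w * (p - t) ** 2
            weight_sum += w
    return total / weight_sum


if __name__ == "__main__":
    weights = boundary_weights(3, 3)
    target = [[0.0] * 3 for _ in range(3)]
    corner_error = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    center_error = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
    # The same error costs more on the boundary than in the interior.
    print(boundary_weighted_mse(corner_error, target, weights))
    print(boundary_weighted_mse(center_error, target, weights))
```

Under these assumptions an error of the same magnitude contributes twice as much to the loss when it falls on the grid boundary, which is the qualitative behaviour the contribution describes.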

We train our network on a randomly sampled subset of Doc3D, a synthetically warped image dataset proposed by [3] that has warps created by emulating naturally found deformations in documents. Our gated and bifurcated network achieves an SSIM value of 0.50, an MS-SSIM value of 0.45, and a Local Distortion of 10.40 on the DocUNet benchmark proposed by [14].

Previous works

In the past several years, we have seen significant progress in the domain of document image dewarping. The methods proposed in the past can be summarized into the following categories:

  1. Image processing-based methods

  2. Deep learning-based methods

Dataset

We make use of the data generation proposed by [3], as their data is significantly more realistic and generalizes better to natural images than that of [14]. In the dataset proposed by [3], 3D shapes and textures of naturally deformed documents were captured and rendered on images with the help of path tracing, using many camera positions and a variety of illumination effects and conditions. This allowed the creation of a large-scale image dataset with the data being highly

Architecture overview

The overall architecture of our method is shown in Fig. 2. We have made significant changes to the stacked U-Net architecture originally proposed by [14]. The major changes are the addition of a gated convolutional network for proper processing of line-level information and a bifurcation in the secondary U-Net of the stack. Inspired by [8], we also add residual networks in the skip connections of our model to enhance the features that are being concatenated in the later

Experiments and analysis

We sample 45,000 synthetically warped images from the dataset released by [3] and divide them into training and validation sets in a 9:1 ratio, training our model on the training set. We save the model weights at the minimum validation loss and perform experiments on the benchmark dataset proposed by [14].
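The data split and checkpoint selection described above can be sketched as follows. This is an illustration only: the shuffling seed, the function names `split_indices` and `best_epoch`, and the example loss values are assumptions and do not come from the paper.

```python
import random


def split_indices(n_samples, train_parts=9, val_parts=1, seed=0):
    """Shuffle sample indices and split them in a train_parts:val_parts ratio."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # the seed is an arbitrary choice
    n_train = n_samples * train_parts // (train_parts + val_parts)
    return indices[:n_train], indices[n_train:]


def best_epoch(validation_losses):
    """Return (epoch, loss) of the lowest validation loss seen so far.

    In a real training loop, the weights would be saved whenever the
    validation loss improves on the best value recorded here.
    """
    best_index, best_loss = None, float("inf")
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_index, best_loss = epoch, loss
    return best_index, best_loss


if __name__ == "__main__":
    train_ids, val_ids = split_indices(45000)
    print(len(train_ids), len(val_ids))
    # Made-up validation losses, used only to exercise the selection logic.
    print(best_epoch([0.9, 0.4, 0.6, 0.35, 0.5]))
```

With 45,000 samples, a 9:1 ratio gives 40,500 training images and 4,500 validation images.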

To compare the dewarp quality of our method, we use the metrics MS-SSIM (Multi-Scale Structural Similarity Index), SSIM (Structural Similarity Index), and LD (Local Distortion).
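As a reminder of what SSIM measures, the sketch below computes the standard SSIM formula over a whole image as a single window. This is a deliberate simplification: benchmark implementations typically average SSIM over local windows, and MS-SSIM additionally combines several scales, so this is not the evaluation code used for the reported numbers. The function name `global_ssim` is hypothetical; the constants follow the common choice C1 = (0.01 L)^2 and C2 = (0.03 L)^2 for dynamic range L.

```python
def global_ssim(image_a, image_b, dynamic_range=255.0):
    """Single-window SSIM between two equally sized grayscale images (2D lists)."""
    a = [float(v) for row in image_a for v in row]
    b = [float(v) for row in image_b for v in row]
    if not a or len(a) != len(b):
        raise ValueError("images must be non-empty and of equal size")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    var_a = sum((v - mean_a) ** 2 for v in a) / n
    var_b = sum((v - mean_b) ** 2 for v in b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    # Stabilizing constants from the standard SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mean_a * mean_b + c1) * (2 * cov + c2)
    denominator = (mean_a ** 2 + mean_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


if __name__ == "__main__":
    reference = [[0, 50], [100, 200]]
    inverted = [[255 - v for v in row] for row in reference]
    print(global_ssim(reference, reference))
    print(global_ssim(reference, inverted))
```

An image compared with itself scores 1, and the score drops as luminance, contrast, or structure diverge, which is why higher SSIM and MS-SSIM indicate a better dewarp.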

Conclusion

In this paper, we have proposed an algorithm to dewarp document images by recognizing their structure and predicting dense grids for mappings. We have demonstrated the effectiveness of a bifurcation of the traditional U-Net and the addition of a gated module and residual pathways by comparing our method with state-of-the-art methods on the DocUNet benchmark of 130 real-world images.

As discussed earlier, however, our methods fail in certain respects and future works making use of our

Declaration of Competing Interest

There is no conflict of interest for the present work.

Acknowledgments

All of the experiments demonstrated in this paper have been carried out in the Center for Microprocessor Application for Training Education and Research (CMATER), Jadavpur University, on hardware infrastructure provided by the Science and Engineering Research Board (SERB), India (Ref.# SB/S3/EECE/054/2016).

References (22)

  • N. Ibtehaz et al., Multiresunet: rethinking the u-net architecture for multimodal biomedical image segmentation, Neural Networks (2020)
  • K. Ma et al., Docunet: Document Image Unwarping via a Stacked U-net, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • H. Bandyopadhyay, T. Dasgupta, N. Das, M. Nasipuri, A gated and bifurcated stacked u-net module for document image...
  • M.S. Brown et al., Document Restoration Using 3D Shape: A General Deskewing Algorithm for Arbitrarily Warped Documents, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001 (2001)
  • S. Das et al., Dewarpnet: Single-image Document Unwarping with Stacked 3D and 2D Regression Networks, Proceedings of the IEEE International Conference on Computer Vision (2019)
  • S. Das et al., The Common Fold: Utilizing the Four-fold to Dewarp Printed Documents from a Single Image, Proceedings of the 2017 ACM Symposium on Document Engineering (2017)
  • H. Ezaki, S. Uchida, A. Asano, H. Sakoe, Dewarping of document image by global optimization, IEEE, 2005. Eighth...
  • V. Frinken et al., A novel word spotting method based on recurrent neural networks, IEEE Trans Pattern Anal Mach Intell (2011)
  • B. Gatos, I. Pratikakis, K. Ntirogiannis, Segmentation based recovery of arbitrarily warped document images, IEEE,...
  • H.I. Koo, Text-line detection in camera-captured document images using the state estimation of connected components, IEEE Trans. Image Process. (2016)
  • H.I. Koo et al., Composition of a dewarped and enhanced document image from two view images, IEEE Trans. Image Process. (2009)