RectiNet-v2: A stacked network architecture for document image dewarping
Introduction
Photographing a document with a camera is the most popular method of storing it. With the widespread adoption of mobile devices that have built-in cameras and storage, capturing document images has become the norm for storing data. These captures, however, are more often than not taken casually, resulting in distorted and warped images that humans can interpret but document recognition systems cannot, owing to large differences in illumination, placement, and condition of the documents. For machines to understand the data contained in captured document images, dewarping such images is a necessity.
A large number of classical image processing and optimization-based methods have been proposed for dewarping document images. These, however, fail when curves and folds occur simultaneously in a document image, a case that requires a more in-depth and varied analysis. To rectify such complex document images, deep learning methods have recently been introduced by [1,14,3] and [13]. These methods treat document dewarping as the prediction of a dense grid that drives the dewarping process. The dense grid-based approach is preferred to the sparse grid-based one because it can capture very fine distortions that a limited set of dewarping points or a sparse grid cannot. As a result, deep learning methods have been able to dewarp complex images with significantly higher precision than their image processing counterparts.
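To make the dense-grid idea concrete, the following is a minimal NumPy sketch of how a predicted grid could be used to resample a warped image. It assumes a backward-mapping convention, in which each output pixel stores the (y, x) location to read from in the input; the function name `dewarp_with_dense_grid` and the grid layout are illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np

def dewarp_with_dense_grid(image, grid):
    """Resample a warped image with a dense backward-mapping grid.

    image: (H, W) float array holding the warped photograph.
    grid:  (h, w, 2) array; grid[i, j] = (y, x) is the position in `image`
           (in pixel units) that output pixel (i, j) is read from.
           This layout is an assumption made for the sketch.
    """
    H, W = image.shape
    # Keep sampling positions inside the image.
    y = np.clip(grid[..., 0], 0, H - 1)
    x = np.clip(grid[..., 1], 0, W - 1)
    # Integer corners surrounding each sampling position.
    y0 = np.floor(y).astype(int)
    x0 = np.floor(x).astype(int)
    y1 = np.minimum(y0 + 1, H - 1)
    x1 = np.minimum(x0 + 1, W - 1)
    # Bilinear interpolation weights.
    wy = y - y0
    wx = x - x0
    top = image[y0, x0] * (1 - wx) + image[y0, x1] * wx
    bottom = image[y1, x0] * (1 - wx) + image[y1, x1] * wx
    return top * (1 - wy) + bottom * wy

# An identity grid leaves the image unchanged.
image = np.arange(12, dtype=float).reshape(3, 4)
ys, xs = np.meshgrid(np.arange(3), np.arange(4), indexing="ij")
identity = np.stack([ys, xs], axis=-1).astype(float)
restored = dewarp_with_dense_grid(image, identity)

# Shifting every sampling position half a pixel to the right blends neighbours.
shifted = identity.copy()
shifted[..., 1] += 0.5
half_shift = dewarp_with_dense_grid(image, shifted)
```

Because every output pixel has its own sampling position, a grid of this kind can represent a fold and a curl in the same page, which a handful of control points cannot.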
Inspired by the stacked network architecture proposed by [14], we use a similar architecture in our methods, comprising a primary U-Net and a secondary U-Net. More specifically, we propose modifications to existing methods with a bifurcated secondary U-Net, a gated module, and residual paths in the primary U-Net to improve dense grid predictions and thereby dewarp quality. Our contributions can be summarized as follows:
- Use of a bifurcated network that takes in images of dimension 256x256 and regresses a dense grid that can unwarp the document represented by the image. This unwarping grid can be interpolated later so that the images are dewarped at their original resolution. The bifurcated network allows us to prevent the intermingling of dense-grid values.
- Use of Residual blocks in the skip connections of the U-Nets used in the stacked module. The use of residual blocks as proposed by [8] enables us to leverage different receptive fields in the skip connections and allows us to pass on information from various levels to the decoder layers.
- Use of Gated Convolutional Layers in the model architecture, inspired by [18]. The presence of gates in these layers helps to capture edge and line-level data and pass it on in later layers as information on which the model has to focus. In other words, the GCN (Gated Convolutional Network) acts as an attention module to the Secondary U-Net.
- Use of a Boundary Weighted mean squared loss function that focuses more on the boundary of the dense-grids predicted by a Secondary U-Net. This ensures that poor detection of boundaries by the network is penalized more, and unwarps obtained from the module contain minimal background data of the document image.
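As an illustration of the last contribution, a boundary-weighted mean squared error could be sketched as below. The exact weighting used in the paper is not given in this excerpt, so the scheme here is an assumption: pixels on the edge of a binary document mask receive a larger weight (`boundary_weight`, a hypothetical parameter) than all other pixels, and the loss is a weighted mean over pixels and grid channels.

```python
import numpy as np

def boundary_mask(doc_mask):
    """Mark mask pixels with at least one 4-neighbour outside the mask."""
    padded = np.pad(doc_mask, 1, constant_values=False)
    # A pixel is interior when all four neighbours belong to the mask.
    interior = (
        padded[:-2, 1:-1] & padded[2:, 1:-1]
        & padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    return doc_mask & ~interior

def boundary_weighted_mse(pred, target, doc_mask, boundary_weight=5.0):
    """Weighted MSE between two (H, W, C) grids.

    doc_mask is a boolean (H, W) array marking the document region.
    boundary_weight is a hypothetical value chosen for illustration.
    """
    weights = np.where(boundary_mask(doc_mask), boundary_weight, 1.0)
    sq_err = (pred - target) ** 2
    total = np.sum(weights[..., None] * sq_err)
    return float(total / (np.sum(weights) * pred.shape[-1]))

# A 5x5 map whose central 3x3 block is the document.
doc_mask = np.zeros((5, 5), dtype=bool)
doc_mask[1:4, 1:4] = True
edges = boundary_mask(doc_mask)

target = np.zeros((5, 5, 2))
pred_edge_error = target.copy()
pred_edge_error[1, 1] = 1.0      # unit error on a boundary pixel
pred_inner_error = target.copy()
pred_inner_error[2, 2] = 1.0     # same error on an interior pixel

loss_edge = boundary_weighted_mse(pred_edge_error, target, doc_mask)
loss_inner = boundary_weighted_mse(pred_inner_error, target, doc_mask)
```

With these toy inputs the same error costs more on the document edge than inside it, which is the behaviour the loss is meant to encourage.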
We train our network on a randomly sampled subset of Doc3D, a synthetically warped image dataset proposed by [3] that has warps created by emulating naturally found deformations in documents. Our gated and bifurcated network achieves an SSIM value of 0.50, an MS-SSIM value of 0.45, and a Local Distortion of 10.40 on the DocUNet benchmark proposed by [14].
Previous works
In the past several years, we have seen significant progress in the domain of document image dewarping. The methods proposed in the past can be grouped into the following categories:
1. Image Processing based methods
2. Deep Learning based methods
Dataset
We make use of the data generation proposed by [3], as their data is significantly more realistic and generalizes better to natural images than that of [14]. In the dataset proposed by [3], 3D shapes and textures of naturally deformed documents were captured and rendered on images with the help of path tracing, using many camera positions and a variety of illumination effects and conditions. This allowed the creation of a large-scale image dataset with the data being highly
Architecture overview
The overall architecture of our method is shown in Fig. 2. We have made significant changes to the stacked U-Net architecture originally proposed by [14]. The major changes are the addition of a gated convolutional network for proper processing of line-level information and a bifurcation in the secondary U-Net of the stack. Inspired by [8], we also add residual networks in the skip connections of our model to enhance the features that are being concatenated in the later
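The gating idea can be illustrated with a generic sketch. The layer definitions of the actual model are not given in this excerpt, so the following is an assumption: a 1x1 convolution over a separate gate input (for example an edge map) is squashed by a sigmoid into a single-channel attention map, which then scales the feature maps. The names `gated_features`, `gate_input`, and `gate_weights` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_features(features, gate_input, gate_weights, gate_bias=0.0):
    """Scale feature maps by a learned-style attention gate.

    features:     (C, H, W) feature maps to be modulated.
    gate_input:   (G, H, W) maps the gate is computed from.
    gate_weights: (G,) weights of a 1x1 convolution with one output channel.
    """
    # 1x1 convolution: a weighted sum over the channel axis at every pixel.
    logits = np.tensordot(gate_weights, gate_input, axes=(0, 0)) + gate_bias
    gate = sigmoid(logits)                    # (H, W), values in (0, 1)
    return features * gate[None, :, :], gate

# Toy input: the gate responds strongly along one column (a "line").
features = np.ones((2, 3, 3))
gate_input = np.full((1, 3, 3), -10.0)
gate_input[0, :, 1] = 10.0
gated, gate = gated_features(features, gate_input, np.array([1.0]))
```

Features along the highlighted column pass almost unchanged while the rest are suppressed, which is the sense in which a gate behaves like an attention module.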
Experiments and analysis
We sample 45,000 synthetically warped images from the dataset released by [3]. We divide these into training and validation sets in a 9:1 ratio and train our model on the training set. We keep the model weights that give the minimum validation loss and perform experiments on the benchmark dataset proposed by [14].
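The arithmetic of this split can be sketched as follows: 45,000 samples at a 9:1 ratio give 40,500 training and 4,500 validation images. The helper `split_train_val`, the use of integer sample IDs, and the fixed seed are assumptions made for illustration; the sampling procedure actually used is not described in this excerpt.

```python
import random

def split_train_val(items, val_fraction=0.1, seed=0):
    """Shuffle items reproducibly and hold out a fraction for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

# Hypothetical sample IDs standing in for the 45,000 sampled images.
sample_ids = list(range(45_000))
train_ids, val_ids = split_train_val(sample_ids)
```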
To compare the dewarp quality of our methods, we use metrics such as MS-SSIM (Multi-Scale Structural Similarity Index), SSIM (Structural Similarity Index), and LD (Local Distortion).
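For readers unfamiliar with SSIM, the sketch below computes a simplified, single-window version of the index from global image statistics, using the conventional constants C1 = (0.01 L)^2 and C2 = (0.03 L)^2 for dynamic range L. This is only an illustration of the formula: standard SSIM averages the same expression over local windows, MS-SSIM repeats it over several scales, and the benchmark's own evaluation code should be used for reported numbers.

```python
import numpy as np

def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM computed from whole-image statistics."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

# A smooth gradient compared with itself and with its inverse.
reference = np.arange(64, dtype=np.float64).reshape(8, 8) * 4.0
score_same = global_ssim(reference, reference)
score_inverted = global_ssim(reference, 255.0 - reference)
```

An image compared with itself scores 1, and a structurally inverted image scores far lower, which is why higher SSIM and MS-SSIM indicate a better dewarp.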
Conclusion
In this paper, we have proposed an algorithm to dewarp document images by recognizing their structure and predicting dense grids for mappings. We have demonstrated the effectiveness of a bifurcation of the traditional U-Net and the addition of a gated module and residual pathways by comparing our method with state-of-the-art methods on the DocUNet benchmark of 130 real-world images.
As discussed earlier, however, our methods fail in certain respects and future works making use of our
Declaration of Competing Interest
There is no conflict of interest for the present work.
Acknowledgments
All of the experiments demonstrated in this paper have been carried out in the Center for Microprocessor Application for Training Education and Research (CMATER), Jadavpur University, on hardware infrastructure provided by the Science and Engineering Research Board (SERB), India (Ref.# SB/S3/EECE/054/2016).
References (22)
- et al., Multiresunet: rethinking the u-net architecture for multimodal biomedical image segmentation, Neural Networks (2020)
- et al., Docunet: Document Image Unwarping via a Stacked U-net, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
- H. Bandyopadhyay, T. Dasgupta, N. Das, M. Nasipuri, A gated and bifurcated stacked u-net module for document image...
- et al., Document Restoration Using 3D Shape: A General Deskewing Algorithm for Arbitrarily Warped Documents, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001 (2001)
- et al., Dewarpnet: Single-image Document Unwarping with Stacked 3D and 2D Regression Networks, Proceedings of the IEEE International Conference on Computer Vision (2019)
- et al., The Common Fold: Utilizing the Four-fold to Dewarp Printed Documents from a Single Image, Proceedings of the 2017 ACM Symposium on Document Engineering (2017)
- H. Ezaki, S. Uchida, A. Asano, H. Sakoe, Dewarping of document image by global optimization, IEEE, 2005. Eighth...
- et al., A novel word spotting method based on recurrent neural networks, IEEE Trans Pattern Anal Mach Intell (2011)
- B. Gatos, I. Pratikakis, K. Ntirogiannis, Segmentation based recovery of arbitrarily warped document images, IEEE,...
- Text-line detection in camera-captured document images using the state estimation of connected components, IEEE Trans. Image Process. (2016)
- Composition of a dewarped and enhanced document image from two view images, IEEE Trans. Image Process.
Cited by (4)
- Behavioral analysis of bar charts in documents via stochastic petri-net modeling, Pattern Recognition Letters (2023)
- Image projective transformation rectification with synthetic data for smartphone-captured chest X-ray photos classification, Computers in Biology and Medicine (2023)
- Two Image Rectification Networks for Distorted and Warped Documents, Shuju Caiji Yu Chuli/Journal of Data Acquisition and Processing (2024)