Abstract
In this paper we study document layout recognition for digital documents, which requires the model to detect the exact physical region of each object without missing any text or including any redundant text outside the object. This is a vital step in supporting high-quality information extraction, table understanding, and knowledge base construction over documents from various vertical domains (e.g., the financial, legal, and government fields). We consider digital documents, where, in contrast to image documents, the exact text and position of every character and graphic element within the page are given. To achieve document layout recognition with pinpoint accuracy, we cast the problem as a document panoptic segmentation task, in which each token on the page must be assigned a class label and an instance id. Under traditional visual panoptic segmentation methods such as Mask R-CNN, two predicted objects may intersect; document objects, however, never intersect, because most document pages follow a Manhattan layout. We therefore propose a novel framework, the document panoptic segmentation (DPS) model. It first splits the document page into column regions and groups tokens into line regions, then extracts textual and visual features, and finally assigns a class label and an instance id to each line region. Additionally, we propose a novel metric based on the intersection over union (IoU) between the tokens contained in a predicted object and those in the ground-truth object, which is more suitable than a metric based on the area IoU between predicted and ground-truth bounding boxes. Finally, empirical experiments on the PubLayNet, ArXiv, and Financial datasets show that the proposed DPS model obtains mAP scores of 0.8833, 0.9205, and 0.8530 on the three datasets, a substantial improvement over the Faster R-CNN and Mask R-CNN models.
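The token-based IoU metric described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and it assumes each token carries a unique id so that predicted and ground-truth objects can be represented as sets of token ids.

```python
def token_iou(pred_tokens, gt_tokens):
    """IoU over sets of token ids, rather than over bounding-box areas."""
    pred, gt = set(pred_tokens), set(gt_tokens)
    if not pred and not gt:
        return 0.0
    return len(pred & gt) / len(pred | gt)


def match_objects(pred_objects, gt_objects, threshold=0.5):
    """Greedily match each predicted object to an unmatched ground-truth
    object whose token IoU exceeds the threshold (true positives)."""
    matched, used = [], set()
    for i, pred in enumerate(pred_objects):
        best_j, best_iou = None, threshold
        for j, gt in enumerate(gt_objects):
            if j in used:
                continue
            iou = token_iou(pred, gt)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:
            used.add(best_j)
            matched.append((i, best_j, best_iou))
    return matched
```

For example, a predicted object covering tokens {1, 2, 3} scored against a ground-truth object covering tokens {2, 3, 4} yields a token IoU of 2/4 = 0.5, since the objects share two tokens out of four in total.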
References
Object detection. https://en.wikipedia.org/wiki/Object_detection
Bauguess, S.W.: The role of machine readability in an AI world (2018). https://www.sec.gov/news/speech/speech-bauguess-050318
Cao, R., Cao, Y., Zhou, G., Luo, P.: Extracting variable-depth logical document hierarchy from long documents: method, evaluation, and application. J. Comput. Sci. Technol. (2021)
Cao, Y., Li, H., Luo, P., Yao, J.: Towards automatic numerical cross-checking: extracting formulas from text. In: WWW (2018)
Fang, J., Tao, X., Tang, Z., Qiu, R., Liu, Y.: Dataset, ground-truth and performance metrics for table detection evaluation. In: DAS (2012)
Gilani, A., Qasim, S.R., Malik, I., Shafait, F.: Table detection using deep learning. In: ICDAR (2017)
Girshick, R.: Fast R-CNN. In: ICCV (2015)
Göbel, M., Hassan, T., Oro, E., Orsi, G.: ICDAR 2013 table competition. In: ICDAR (2013)
He, D., Cohen, S., Price, B., Kifer, D., Giles, C.L.: Multi-scale multi-task FCN for semantic page segmentation and table detection. In: ICDAR (2018)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)
Katti, A.R., et al.: Chargrid: towards understanding 2D documents. In: EMNLP (2018)
Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: ICLR (2017)
Kirillov, A., He, K., Girshick, R., Rother, C., Dollar, P.: Panoptic segmentation. In: CVPR (2019)
Koci, E., Thiele, M., Lehner, W., Romero, O.: Table recognition in spreadsheets via a graph representation. In: DAS (2018)
Li, H., Yang, Q., Cao, Y., Yao, J., Luo, P.: Cracking tabular presentation diversity for automatic cross-checking over numerical facts. In: KDD (2020)
Li, K., et al.: Cross-domain document object detection: benchmark suite and method. In: CVPR (2020)
Li, M., Cui, L., Huang, S., Wei, F., Zhou, M., Li, Z.: TableBank: table benchmark for image-based table detection and recognition (2019)
Li, M., et al.: DocBank: a benchmark dataset for document layout analysis. arXiv (2020)
Li, X.H., Yin, F., Liu, C.L.: Page object detection from pdf document images by deep structured prediction and supervised clustering. In: ICPR (2018)
Li, Y., Tarlow, D., Brockschmidt, M., Zemel, R.: Gated graph sequence neural networks. In: ICLR (2016)
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
Luong, M.T., Nguyen, T.D., Kan, M.Y.: Logical structure recovery in scholarly articles with rich document features. Int. J. Digit. Libr. Syst. (2010)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: ICLR (2013)
Nagy, G., Seth, S.C.: Hierarchical representation of optically scanned documents. In: Conference on Pattern Recognition (1984)
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: CVPR (2016)
Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: CVPR (2017)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS (2015)
Riba, P., Dutta, A., Goldmann, L., Fornés, A., Ramos, O., Lladós, J.: Table detection in invoice documents by graph neural networks. In: ICDAR (2019)
Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: MICCAI (2015)
Shahab, A., Shafait, F., Kieninger, T., Dengel, A.: An open approach towards the benchmarking of table structure recognition systems. In: DAS (2010)
Siegel, N., Lourie, N., Power, R., Ammar, W.: Extracting scientific figures with distantly supervised neural networks. In: JCDL (2018)
Smith, R.: An overview of the tesseract OCR engine. In: ICDAR (2007)
Tensmeyer, C., Morariu, V.I., Price, B., Cohen, S., Martinez, T.: Deep splitting and merging for table structure decomposition. In: ICDAR (2019)
Wu, S., et al.: Fonduer: knowledge base construction from richly formatted data. In: SIGMOD (2018)
Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R.: Detectron2 (2019). https://github.com/facebookresearch/detectron2
Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: pre-training of text and layout for document image understanding. In: KDD (2020)
Yang, X., Yumer, E., Asente, P., Kraley, M., Kifer, D., Giles, C.L.: Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In: CVPR (2017)
Zhong, X., Tang, J., Yepes, A.J.: PubLayNet: largest dataset ever for document layout analysis. In: ICDAR (2019)
Acknowledgements
This research was supported by the National Key Research and Development Program of China under Grant No. 2017YFB1002104, and by the National Natural Science Foundation of China under Grant Nos. 62076231 and U1811461. We thank Xu Wang and Jie Luo (from P.A.I Tech) for their kind help. We also thank the anonymous reviewers for their valuable comments and suggestions.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Cao, R., Li, H., Zhou, G., Luo, P. (2021). Towards Document Panoptic Segmentation with Pinpoint Accuracy: Method and Evaluation. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science(), vol 12822. Springer, Cham. https://doi.org/10.1007/978-3-030-86331-9_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86330-2
Online ISBN: 978-3-030-86331-9