Towards Document Panoptic Segmentation with Pinpoint Accuracy: Method and Evaluation

  • Conference paper
Document Analysis and Recognition – ICDAR 2021 (ICDAR 2021)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12822)


Abstract

In this paper, we study the task of document layout recognition for digital documents, which requires the model to detect the exact physical region of each object without missing any text or including any redundant text from outside the object. This is a vital step for supporting high-quality information extraction, table understanding, and knowledge base construction over documents from various vertical domains (e.g., the financial, legal, and government fields). Here, we consider digital documents, in which, unlike image documents, characters and graphic elements are given with their exact text and positions within document pages. Towards document layout recognition with pinpoint accuracy, we cast this problem as a document panoptic segmentation task, in which each token in the document page must be assigned a class label and an instance id. Under traditional visual panoptic segmentation methods, such as Mask R-CNN, two predicted objects may intersect; document objects, however, never intersect, because most document pages follow a Manhattan layout. Therefore, we propose a novel framework, the document panoptic segmentation (DPS) model. It first splits the document page into column regions and groups tokens into line regions, then extracts textual and visual features, and finally assigns a class label and an instance id to each line region. Additionally, we propose a novel metric based on the intersection over union (IoU) between the tokens contained in the predicted object and those in the ground-truth object, which is more suitable for documents than a metric based on the area IoU between predicted and ground-truth bounding boxes. Finally, empirical experiments on the PubLayNet, ArXiv, and Financial datasets show that the proposed DPS model obtains mAP scores of 0.8833, 0.9205, and 0.8530, respectively, a substantial improvement over the Faster R-CNN and Mask R-CNN models.
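
To make the token-level metric concrete, the short Python sketch below illustrates the idea of computing IoU over the sets of tokens contained in a predicted object and a ground-truth object, rather than over bounding-box areas. This is only an illustration of the metric as described in the abstract, not the authors' implementation; the token identifiers and the 0.5 matching threshold in the example are assumptions.

```python
# Illustrative sketch of the token-level IoU described above; not the
# authors' implementation. Token ids and the 0.5 matching threshold are
# assumptions introduced only for this example.

def token_iou(pred_tokens, gt_tokens):
    """IoU over the sets of tokens covered by a predicted object and a
    ground-truth object (rather than the area IoU of their bounding boxes)."""
    pred, gt = set(pred_tokens), set(gt_tokens)
    if not pred and not gt:
        return 1.0
    return len(pred & gt) / len(pred | gt)

# A prediction that misses one token of the ground-truth paragraph and
# includes one redundant token from outside the object.
gt_paragraph = {"t1", "t2", "t3", "t4"}
pred_paragraph = {"t2", "t3", "t4", "t5"}

iou = token_iou(pred_paragraph, gt_paragraph)  # 3 shared / 5 in the union = 0.6
print(iou, iou >= 0.5)  # hypothetical threshold for counting a match
```

A box-area IoU for the same prediction would also depend on the whitespace and layout surrounding the tokens, which is one intuition for why the paper argues a token-based formulation fits documents better.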

Notes

  1. https://icdar2021.org/competitions/competition-on-scientific-literature-parsing/.

  2. https://cocodataset.org/.

  3. http://pdflux.com/.

References

  1. Object detection. https://en.wikipedia.org/wiki/Object_detection
  2. Bauguess, S.W.: The role of machine readability in an AI world (2018). https://www.sec.gov/news/speech/speech-bauguess-050318
  3. Cao, R., Cao, Y., Zhou, G., Luo, P.: Extracting variable-depth logical document hierarchy from long documents: method, evaluation, and application. J. Comput. Sci. Technol. (2021)
  4. Cao, Y., Li, H., Luo, P., Yao, J.: Towards automatic numerical cross-checking: extracting formulas from text. In: WWW (2018)
  5. Fang, J., Tao, X., Tang, Z., Qiu, R., Liu, Y.: Dataset, ground-truth and performance metrics for table detection evaluation. In: DAS (2012)
  6. Gilani, A., Qasim, S.R., Malik, I., Shafait, F.: Table detection using deep learning. In: ICDAR (2017)
  7. Girshick, R.: Fast R-CNN. In: ICCV (2015)
  8. Göbel, M., Hassan, T., Oro, E., Orsi, G.: ICDAR 2013 table competition. In: ICDAR (2013)
  9. He, D., Cohen, S., Price, B., Kifer, D., Giles, C.L.: Multi-scale multi-task FCN for semantic page segmentation and table detection. In: ICDAR (2018)
  10. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)
  11. Katti, A.R., et al.: Chargrid: towards understanding 2D documents. In: EMNLP (2018)
  12. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: ICLR (2017)
  13. Kirillov, A., He, K., Girshick, R., Rother, C., Dollár, P.: Panoptic segmentation. In: CVPR (2019)
  14. Koci, E., Thiele, M., Lehner, W., Romero, O.: Table recognition in spreadsheets via a graph representation. In: DAS (2018)
  15. Li, H., Yang, Q., Cao, Y., Yao, J., Luo, P.: Cracking tabular presentation diversity for automatic cross-checking over numerical facts. In: KDD (2020)
  16. Li, K., et al.: Cross-domain document object detection: benchmark suite and method. In: CVPR (2020)
  17. Li, M., Cui, L., Huang, S., Wei, F., Zhou, M., Li, Z.: TableBank: table benchmark for image-based table detection and recognition (2019)
  18. Li, M., et al.: DocBank: a benchmark dataset for document layout analysis. arXiv (2020)
  19. Li, X.H., Yin, F., Liu, C.L.: Page object detection from PDF document images by deep structured prediction and supervised clustering. In: ICPR (2018)
  20. Li, Y., Tarlow, D., Brockschmidt, M., Zemel, R.: Gated graph sequence neural networks. In: ICLR (2016)
  21. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
  22. Luong, M.T., Nguyen, T.D., Kan, M.Y.: Logical structure recovery in scholarly articles with rich document features. Int. J. Digit. Libr. Syst. (2010)
  23. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: ICLR (2013)
  24. Nagy, G., Seth, S.C.: Hierarchical representation of optically scanned documents. In: Conference on Pattern Recognition (1984)
  25. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: CVPR (2016)
  26. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: CVPR (2017)
  27. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS (2015)
  28. Riba, P., Dutta, A., Goldmann, L., Fornés, A., Ramos, O., Lladós, J.: Table detection in invoice documents by graph neural networks. In: ICDAR (2019)
  29. Riba, P., Dutta, A., Goldmann, L., Fornés, A., Ramos, O., Lladós, J.: Table detection in invoice documents by graph neural networks (2019)
  30. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: MICCAI (2015)
  31. Shahab, A., Shafait, F., Kieninger, T., Dengel, A.: An open approach towards the benchmarking of table structure recognition systems. In: DAS (2010)
  32. Siegel, N., Lourie, N., Power, R., Ammar, W.: Extracting scientific figures with distantly supervised neural networks. In: JCDL (2018)
  33. Smith, R.: An overview of the Tesseract OCR engine. In: ICDAR (2007)
  34. Tensmeyer, C., Morariu, V.I., Price, B., Cohen, S., Martinez, T.: Deep splitting and merging for table structure decomposition. In: ICDAR (2019)
  35. Wu, S., et al.: Fonduer: knowledge base construction from richly formatted data. In: SIGMOD (2018)
  36. Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R.: Detectron2 (2019). https://github.com/facebookresearch/detectron2
  37. Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: pre-training of text and layout for document image understanding. In: KDD (2020)
  38. Yang, X., Yumer, E., Asente, P., Kraley, M., Kifer, D., Giles, C.L.: Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In: CVPR (2017)
  39. Zhong, X., Tang, J., Yepes, A.J.: PubLayNet: largest dataset ever for document layout analysis. In: ICDAR (2019)

Acknowledgements

This research work was supported by the National Key Research and Development Program of China under Grant No. 2017YFB1002104 and by the National Natural Science Foundation of China under Grant Nos. 62076231 and U1811461. We thank Xu Wang and Jie Luo (from P.A.I Tech) for their kind help. We also thank the anonymous reviewers for their valuable comments and suggestions.

Author information

Corresponding author

Correspondence to Rongyu Cao.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Cao, R., Li, H., Zhou, G., Luo, P. (2021). Towards Document Panoptic Segmentation with Pinpoint Accuracy: Method and Evaluation. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol. 12822. Springer, Cham. https://doi.org/10.1007/978-3-030-86331-9_1

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-86331-9_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86330-2

  • Online ISBN: 978-3-030-86331-9

  • eBook Packages: Computer Science, Computer Science (R0)
