A Benchmark of Named Entity Recognition Approaches in Historical Documents: Application to 19th Century French Directories

  • Conference paper
  • In: Document Analysis Systems (DAS 2022)

Abstract

Named entity recognition (NER) is a necessary step in many pipelines targeting historical documents. This natural language processing technique identifies the class each text token belongs to, e.g. “person name”, “location”, “number”. Introducing a new public dataset built from 19th century French directories, we first assess how noisy the output of modern, off-the-shelf OCR systems is. We then compare modern CNN- and Transformer-based NER techniques that can reasonably be used for historical document analysis. We measure their training data requirements and the effect of OCR noise on their performance, and we show how Transformer-based NER can benefit from unsupervised pre-training and supervised fine-tuning on noisy data. Results can be reproduced using resources available at https://github.com/soduco/paper-ner-bench-das22 and https://zenodo.org/record/6394464.
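As an illustration of what "assessing OCR noise" can mean in practice, the sketch below computes a character error rate (CER) as the Levenshtein edit distance between a reference transcription and an OCR output, divided by the reference length. This is a common way to quantify OCR noise, but it is an assumption here: the abstract does not state which metric the authors use. The function names and the sample directory entry are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length (assumed metric, see lead-in)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical directory entry and a noisy OCR reading of it.
    reference = "Dupont, boulanger, rue de la Paix, 12"
    hypothesis = "Dupont, boulanqer, rue de la Paix, l2"
    print(f"CER: {character_error_rate(reference, hypothesis):.3f}")
```

A CER of 0 means the OCR output matches the reference exactly; values can exceed 1 when the output contains many spurious characters.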

N. Abadie, E. Carlinet, J. Chazalon and B. Duménieu: all authors contributed equally.

Notes

  1. https://huggingface.co/Jean-Baptiste/camembert-ner

  2. https://huggingface.co/HueyNemud/das22-10-camembert_pretrained

Acknowledgments

This work is supported by the French National Research Agency (ANR), as part of the SODUCO project, under Grant ANR-18-CE38-0013. The authors want to thank S. Bacciochi, P. Cristofoli and J. Perret for helping to create the reference dataset, L. Morice for annotating data, as well as G. Thomas, P. Abi Saad, R. Lelièvre, D. Mignon, T. Cavaciuti and P. Sadki for contributing to the annotation platform.

Author information

Corresponding author

Correspondence to J. Chazalon.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Abadie, N., Carlinet, E., Chazalon, J., Duménieu, B. (2022). A Benchmark of Named Entity Recognition Approaches in Historical Documents: Application to 19th Century French Directories. In: Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. DAS 2022. Lecture Notes in Computer Science, vol 13237. Springer, Cham. https://doi.org/10.1007/978-3-031-06555-2_30

  • DOI: https://doi.org/10.1007/978-3-031-06555-2_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-06554-5

  • Online ISBN: 978-3-031-06555-2

  • eBook Packages: Computer Science, Computer Science (R0)
