Abstract
Named entity recognition (NER) is a necessary step in many pipelines targeting historical documents. Such natural language processing techniques identify which class each text token belongs to, e.g. “person name”, “location”, “number”. Introducing a new public dataset built from 19th-century French directories, we first assess how noisy the output of modern, off-the-shelf OCR systems is. Then, we compare modern CNN- and Transformer-based NER techniques that can reasonably be used in the context of historical document analysis. We measure their training data requirements and the effects of OCR noise on their performance, and show how Transformer-based NER can benefit from unsupervised pre-training and supervised fine-tuning on noisy data. Results can be reproduced using resources available at https://github.com/soduco/paper-ner-bench-das22 and https://zenodo.org/record/6394464.
N. Abadie, E. Carlinet, J. Chazalon and B. Duménieu contributed equally.
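As an illustration of what "how noisy" an OCR output is can mean in practice, the sketch below computes a character error rate (CER): the Levenshtein edit distance between a reference transcription and an OCR output, divided by the reference length. This is a common way to quantify OCR noise, but it is an assumption here that it matches the metric used in the paper. The function names and the sample directory entry are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


# Hypothetical directory entry and a noisy OCR reading of it (illustrative only).
reference = "Dupont, rue de la Paix, 12"
ocr_output = "Dupout, rue de la Paix, 1Z"
print(f"CER = {character_error_rate(reference, ocr_output):.3f}")
```

A CER of 0 means the OCR output matches the reference exactly. Because insertions are counted, the value can exceed 1 when the output is much longer than the reference.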
Acknowledgments
This work is supported by the French National Research Agency (ANR), as part of the SODUCO project, under Grant ANR-18-CE38-0013. The authors want to thank S. Bacciochi, P. Cristofoli and J. Perret for helping to create the reference dataset, L. Morice for annotating data, as well as G. Thomas, P. Abi Saad, R. Lelièvre, D. Mignon, T. Cavaciuti and P. Sadki for contributing to the annotation platform.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Abadie, N., Carlinet, E., Chazalon, J., Duménieu, B. (2022). A Benchmark of Named Entity Recognition Approaches in Historical Documents Application to 19th Century French Directories. In: Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. DAS 2022. Lecture Notes in Computer Science, vol 13237. Springer, Cham. https://doi.org/10.1007/978-3-031-06555-2_30
DOI: https://doi.org/10.1007/978-3-031-06555-2_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06554-5
Online ISBN: 978-3-031-06555-2
eBook Packages: Computer Science (R0)