A Benchmark of Named Entity Recognition Approaches in Historical Documents Application to 19th Century French Directories

Abadie, N.; Carlinet, E.; Chazalon, J.; Duménieu, B.

doi:10.1007/978-3-031-06555-2_30

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13237)

Included in the following conference series:

International Workshop on Document Analysis Systems

Abstract

Named entity recognition (NER) is a necessary step in many pipelines targeting historical documents. Indeed, such natural language processing techniques identify which class each text token belongs to, e.g. “person name”, “location”, “number”. Introducing a new public dataset built from 19th century French directories, we first assess how noisy the output of modern, off-the-shelf OCR systems is. Then, we compare modern CNN- and Transformer-based NER techniques which can be reasonably used in the context of historical document analysis. We measure their requirements in terms of training data, the effects of OCR noise on their performance, and show how Transformer-based NER can benefit from unsupervised pre-training and supervised fine-tuning on noisy data. Results can be reproduced using resources available at https://github.com/soduco/paper-ner-bench-das22 and https://zenodo.org/record/6394464.
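
As an illustration of what "assessing OCR noise" can mean in practice, the sketch below computes a character error rate (CER): the Levenshtein edit distance between a reference transcription and an OCR output, divided by the reference length. This is an assumption made for illustration only; the abstract does not state which metric the authors use, and the function names and the sample directory entry are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `ref` into `hyp`."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical directory entry and a made-up noisy OCR reading of it.
    reference = "Dupont, rue de la Paix, 12"
    ocr_output = "Dupout, rue de la Paix, l2"
    print(f"CER = {character_error_rate(reference, ocr_output):.3f}")
```

A CER computed this way can exceed 1.0 when the OCR output is much longer than the reference, which is one reason evaluation tools sometimes normalise differently.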

N. Abadie, E. Carlinet, J. Chazalon and B. Duménieu—All authors contributed equally.

Acknowledgments

This work is supported by the French National Research Agency (ANR), as part of the SODUCO project, under Grant ANR-18-CE38-0013. The authors want to thank S. Bacciochi, P. Cristofoli and J. Perret for helping to create the reference dataset, L. Morice for annotating data, as well as G. Thomas, P. Abi Saad, R. Lelièvre, D. Mignon, T. Cavaciuti and P. Sadki for contributing to the annotation platform.

Author information

Authors and Affiliations

LASTIG, Univ. Gustave Eiffel, IGN-ENSG, 94160, Saint-Mandé, France
N. Abadie
EPITA Research & Development Laboratory (LRDE), Le Kremlin-Bicêtre, France
E. Carlinet & J. Chazalon
CRH-EHESS, Paris, France
B. Duménieu

Corresponding author

Correspondence to J. Chazalon.

Editor information

Editors and Affiliations

Kyushu University, Fukuoka, Japan
Seiichi Uchida
Boise State University, Boise, ID, USA
Elisa Barney
LIRIS UMR CNRS, Villeurbanne, France
Véronique Eglin

About this paper

Cite this paper

Abadie, N., Carlinet, E., Chazalon, J., Duménieu, B. (2022). A Benchmark of Named Entity Recognition Approaches in Historical Documents Application to 19th Century French Directories. In: Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. DAS 2022. Lecture Notes in Computer Science, vol 13237. Springer, Cham. https://doi.org/10.1007/978-3-031-06555-2_30

DOI: https://doi.org/10.1007/978-3-031-06555-2_30
Published: 18 May 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06554-5
Online ISBN: 978-3-031-06555-2
eBook Packages: Computer Science, Computer Science (R0)

Societies and partnerships

The International Association for Pattern Recognition
