Assessing the Impact of OCR Noise on Multilingual Event Detection over Digitised Documents

doi:10.5281/zenodo.6369941

Published March 19, 2022 | Version v1

Journal article | Open access

Event detection (ED) is a crucial task in natural language processing (NLP) that involves identifying instances of specified types of events in text and classifying them into event types. The detection of events from digitised documents could enable historians to gather and combine a large amount of information into an integrated whole, a panoramic interpretation of the past. However, the level of degradation of digitised documents and the quality of the optical character recognition (OCR) tools might hinder the performance of an event detection system. While several studies have addressed detecting events in historical documents, the transcribed documents needed to be hand-validated, which required considerable human expertise and labor-intensive manual work. Thus, in this study, we explore the robustness to OCR noise of two different language-independent event detection models over two datasets that cover different event types and multiple languages. We aim to analyse their ability to mitigate problems caused by the low quality of digitised documents, and we simulate the existence of transcribed data, synthesised from clean annotated text, by injecting synthetic noise. To create the noisy synthetic data, we use four main types of noise that commonly occur after the digitisation process: Character Degradation, Bleed Through, Blur, and Phantom Character. Finally, we conclude that the imbalance of the datasets, the richness of the different annotation styles, and the language characteristics are the most important factors that can influence event detection in digitised documents.

Files

Name	Size
IJDL__2022____29_March___10___30_pages__Event_Extraction_over_Digitised_Documents.pdf md5:eaf51845305ae50d322b502da92a4a7e	1.4 MB

Additional details

Funding

European Commission: NewsEye – NewsEye: A Digital Investigator for Historical Newspapers (grant 770299)
European Commission: EMBEDDIA – Cross-Lingual Embeddings for Less-Represented Languages in European News Media (grant 825153)

	All versions	This version
Views	60	59
Downloads	89	88
Data volume	133.3 MB	131.9 MB
