ABSTRACT
Digitized document collections often suffer from OCR errors that may impact a document's readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the library's search log, and found that the document OCR character error rate and retrievability score are strongly correlated. We computed retrievability scores for manually corrected versions of the same documents, and report on differences in their total sum, the overall retrievability bias, and the distribution of these changes over the documents, queries and query terms. For large collections, often only a fraction of the corpus is manually corrected. Using a mixed corpus, we assess how this mix affects the retrievability of the corrected and uncorrected documents. The correction of OCR errors increased the number of documents retrieved in all conditions. The increase contributed to a less biased retrieval, even when taking the potential lower ranking of uncorrected documents into account.
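The abstract does not spell out how the retrievability scores or the bias figure are computed. As a minimal sketch, assuming the commonly used cumulative retrievability measure (a document scores one point for every query that ranks it within a cutoff) and the Gini coefficient as the inequality measure, the two quantities could be computed as follows. The function names, the `cutoff` value, and the toy rankings are illustrative assumptions, not details taken from the study.

```python
from collections import Counter


def retrievability_scores(rankings, doc_ids, cutoff=10):
    """Cumulative retrievability: for each document, count the queries
    that rank it within the top `cutoff` results.

    `rankings` maps a query string to its ranked list of document ids.
    `doc_ids` lists every document in the collection, so documents that
    are never retrieved still receive a score of 0.
    """
    counts = Counter()
    for ranked in rankings.values():
        for doc in ranked[:cutoff]:
            counts[doc] += 1
    return {doc: counts[doc] for doc in doc_ids}


def gini(values):
    """Gini coefficient of non-negative scores: 0 means every document
    is equally retrievable; values near 1 mean a few documents dominate."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Hypothetical toy data: three queries over a four-document collection.
rankings = {"q1": ["a", "b"], "q2": ["a", "c"], "q3": ["a"]}
scores = retrievability_scores(rankings, ["a", "b", "c", "d"], cutoff=2)
bias = gini(scores.values())
```

Under these assumptions, comparing `bias` for the uncorrected and the corrected versions of the same documents gives the kind of before-and-after bias comparison the abstract describes.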