research-article

Impact of Crowdsourcing OCR Improvements on Retrievability Bias

Authors:
Myriam C. Traub, Centrum Wiskunde & Informatica, Amsterdam, Netherlands
Thaer Samar, Centrum Wiskunde & Informatica, Amsterdam, Netherlands
Jacco van Ossenbruggen, Centrum Wiskunde & Informatica, Amsterdam, Netherlands
Lynda Hardman, Centrum Wiskunde & Informatica, Amsterdam, Netherlands

JCDL '18: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, May 2018, Pages 29–36, https://doi.org/10.1145/3197026.3197046

Published: 23 May 2018

ABSTRACT

Digitized document collections often suffer from OCR errors that may impact a document's readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the library's search log, and found that the document OCR character error rate and retrievability score are strongly correlated. We computed retrievability scores for manually corrected versions of the same documents, and report on differences in their total sum, the overall retrievability bias, and the distribution of these changes over the documents, queries and query terms. For large collections, often only a fraction of the corpus is manually corrected. Using a mixed corpus, we assess how this mix affects the retrievability of the corrected and uncorrected documents. The correction of OCR errors increased the number of documents retrieved in all conditions. The increase contributed to a less biased retrieval, even when taking the potential lower ranking of uncorrected documents into account.
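The abstract does not spell out how retrievability scores or retrievability bias are computed. As a rough illustration only, the sketch below assumes the cumulative retrievability measure commonly used in the literature (a document scores one point for every query that ranks it within a cutoff) and uses the Gini coefficient as the inequality measure for bias. The function names, the cutoff value, and the toy rankings are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Dict, Iterable, List


def retrievability_scores(
    ranked_results: Dict[str, List[str]], cutoff: int
) -> Counter:
    """Count, per document, the queries that rank it within the top `cutoff`."""
    scores: Counter = Counter()
    for ranking in ranked_results.values():
        for doc_id in ranking[:cutoff]:
            scores[doc_id] += 1
    return scores


def gini(values: Iterable[float]) -> float:
    """Gini coefficient: 0 means perfectly equal, values near 1 mean highly unequal."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted form over the ascending-sorted values.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Hypothetical toy data: ranked result lists per query, before and after correction.
collection = ["d1", "d2", "d3", "d4"]
results_uncorrected = {
    "q1": ["d1", "d2"],
    "q2": ["d1"],
    "q3": ["d1", "d3"],
}
results_corrected = {
    "q1": ["d1", "d2", "d4"],
    "q2": ["d1", "d4"],
    "q3": ["d3", "d1"],
}

for label, results in [
    ("uncorrected", results_uncorrected),
    ("corrected", results_corrected),
]:
    scores = retrievability_scores(results, cutoff=10)
    # Include documents that were never retrieved, with a score of zero.
    per_doc = [scores.get(doc_id, 0) for doc_id in collection]
    print(label, per_doc, round(gini(per_doc), 3))
```

In this toy setting, document d4 is never retrieved from the uncorrected rankings and becomes retrievable after correction, so the Gini coefficient drops from 0.45 to 0.25. A real study would take the rankings from a retrieval system run over the query log.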

Index Terms

Impact of Crowdsourcing OCR Improvements on Retrievability Bias
1. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results
    2. Information retrieval query processing
      1. Query log analysis


Published in
JCDL '18: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries
May 2018
453 pages
ISBN:9781450351782
DOI:10.1145/3197026
General Chairs: Jiangping Chen (College of Information, UNT, USA), Marcos André Gonçalves (Brazil), Jeff M. Allen (College of Information, UNT, USA)
Program Chairs: Edward A. Fox (Virginia Tech, USA), Min-Yen Kan (National University of Singapore, Singapore), Vivien Petras (Humboldt-Universität zu Berlin, Germany)
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data quality
digital library
ocr
retrievability bias
Qualifiers
- research-article
Acceptance Rates
JCDL '18 Paper Acceptance Rate: 26 of 71 submissions, 37%
Overall Acceptance Rate: 415 of 1,482 submissions, 28%