Impact of Crowdsourcing OCR Improvements on Retrievability Bias

Research article
DOI: 10.1145/3197026.3197046
Published: 23 May 2018

ABSTRACT

Digitized document collections often suffer from OCR errors that may impact a document's readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the library's search log, and found that a document's OCR character error rate and its retrievability score are strongly correlated. We then computed retrievability scores for manually corrected versions of the same documents, and report on the differences in their total sum, in the overall retrievability bias, and in how these changes are distributed over documents, queries and query terms. For large collections, often only a fraction of the corpus is manually corrected. Using such a mixed corpus, we assessed how this mix affects the retrievability of the corrected and uncorrected documents. The correction of OCR errors increased the number of documents retrieved in all conditions. This increase contributed to less biased retrieval, even when the potentially lower ranking of uncorrected documents is taken into account.
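The retrievability scores mentioned above are commonly computed with the cumulative measure of Azzopardi and Vinay: a document d accumulates credit for every query q from the log that ranks d within a rank cutoff c, and the resulting distribution of scores is then summarized with an inequality measure such as the Gini coefficient to quantify retrievability bias. The following Python sketch illustrates only that general idea, not the paper's actual implementation; run_query (a hypothetical search function returning ranked document ids), the per-query weights, and the cutoff c = 100 are assumptions for illustration.

from collections import defaultdict

def retrievability_scores(weighted_queries, run_query, c=100):
    """Accumulate r(d): a document receives the query's weight each time
    it appears within the top-c results for that query."""
    r = defaultdict(float)
    for query, weight in weighted_queries:      # weight = frequency of the query in the search log
        for rank, doc_id in enumerate(run_query(query), start=1):
            if rank > c:
                break
            r[doc_id] += weight                 # f(k_dq, c) = 1 inside the cutoff, 0 outside
    return r

def gini(scores):
    """Gini coefficient over retrievability scores
    (0 = all documents equally retrievable, values near 1 = strongly biased)."""
    values = sorted(scores)
    n, total = len(values), sum(values)
    if n == 0 or total == 0:
        return 0.0
    weighted_sum = sum((i + 1) * v for i, v in enumerate(values))
    return (2.0 * weighted_sum) / (n * total) - (n + 1.0) / n

Documents that are never retrieved keep a score of zero; including those zeros when computing the Gini coefficient is what lets the measure reflect documents that OCR errors push entirely out of the top-c results for every query.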


Published in

JCDL '18: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries
May 2018, 453 pages
ISBN: 9781450351782
DOI: 10.1145/3197026
Copyright © 2018 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

JCDL '18 paper acceptance rate: 26 of 71 submissions (37%). Overall acceptance rate: 415 of 1,482 submissions (28%).
