Probabilistic Automaton Model for Fuzzy English-Text Retrieval

  • Conference paper
Research and Advanced Technology for Digital Libraries (ECDL 2000)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1923)

Abstract

Optical character reader (OCR) misrecognition is a serious problem when searching OCR-scanned documents in databases such as digital libraries. This paper proposes fuzzy retrieval methods for English text that tolerate errors in the recognized text without requiring manual correction, thereby reducing costs. The proposed methods generate multiple search terms for each input query term, based on probabilistic automata that reflect both error-occurrence probabilities and character-connection probabilities. Experimental results on test-set retrieval indicate that, with 20 expanded search terms, one of the proposed methods improves the recall rate from 95.56% to 97.88%, at the cost of a decrease in precision rate from 100.00% to 95.52%.
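
As a rough illustration of the query-expansion idea described in the abstract, the sketch below turns one query term into its most probable OCR-corrupted variants. It is not the authors' automaton. The confusion table `CONFUSION`, the bigram table `BIGRAM`, the fallback value `DEFAULT_BIGRAM`, and the function `expand_term` are all hypothetical, the probabilities are made up, and only single-character substitutions are modeled (no insertions, deletions, or multi-character confusions). Each candidate is scored by the product of per-character error probabilities and character-connection (bigram) probabilities, and a beam search keeps the k best candidates at each step, so the result is approximate rather than exhaustive.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical OCR confusion model: P(recognized char | true char).
# Characters not listed are assumed to be recognized correctly.
CONFUSION: Dict[str, Dict[str, float]] = {
    "l": {"l": 0.90, "1": 0.06, "i": 0.04},
    "o": {"o": 0.95, "0": 0.05},
    "m": {"m": 0.97, "n": 0.03},
}

# Hypothetical character-connection model:
# P(next recognized char | previous recognized char).
# Pairs not listed fall back to DEFAULT_BIGRAM.
BIGRAM: Dict[Tuple[str, str], float] = {
    ("m", "o"): 0.30, ("o", "d"): 0.20, ("d", "e"): 0.40, ("e", "l"): 0.25,
    ("e", "1"): 0.01, ("e", "i"): 0.10, ("m", "0"): 0.01, ("0", "d"): 0.01,
    ("n", "o"): 0.30,
}
DEFAULT_BIGRAM = 0.05


def expand_term(term: str, k: int = 5) -> List[Tuple[str, float]]:
    """Return up to k (variant, score) pairs, best first, via beam search."""
    beam: List[Tuple[float, str]] = [(1.0, "")]
    for ch in term:
        outputs = CONFUSION.get(ch, {ch: 1.0})
        candidates: List[Tuple[float, str]] = []
        for score, prefix in beam:
            for out, p_err in outputs.items():
                # The first character has no predecessor, so no bigram factor.
                if prefix:
                    p_conn = BIGRAM.get((prefix[-1], out), DEFAULT_BIGRAM)
                else:
                    p_conn = 1.0
                candidates.append((score * p_err * p_conn, prefix + out))
        # Keep only the k highest-scoring partial strings.
        beam = heapq.nlargest(k, candidates)
    return [(variant, score) for score, variant in beam]


if __name__ == "__main__":
    for variant, score in expand_term("model", k=5):
        print(f"{variant}\t{score:.3e}")
```

With these toy tables, `expand_term("model", k=5)` ranks the unchanged term `model` first, followed by lower-scoring variants such as `nodel` and `mode1`. A retrieval system along these lines would search for every returned variant, which is why a larger k can raise recall while admitting false matches that lower precision.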

Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ohta, M., Takasu, A., Adachi, J. (2000). Probabilistic Automaton Model for Fuzzy English-Text Retrieval. In: Borbinha, J., Baker, T. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2000. Lecture Notes in Computer Science, vol 1923. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45268-0_4

  • DOI: https://doi.org/10.1007/3-540-45268-0_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-41023-2

  • Online ISBN: 978-3-540-45268-3

  • eBook Packages: Springer Book Archive
