Hypergeometric Language Model and Zipf-Like Scoring Function for Web Document Similarity Retrieval

  • Conference paper
String Processing and Information Retrieval (SPIRE 2010)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6393)

Abstract

Retrieving Web documents similar to a given document differs in many respects from information retrieval based on queries written by regular search engine users. This work proposes a new method for Web document similarity retrieval based on generative language models and meta search engines. Probabilistic language models serve as a random query generator for the given document, and the generated queries are submitted to a customizable set of Web search engines. Once all results are gathered, they are evaluated by a proposed scoring function based on Zipf's law. The results show that the proposed query generation methodology and scoring procedure solve the problem with acceptable levels of precision.
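
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, because the abstract gives no formulas. The sketch makes two assumptions: that "hypergeometric" means drawing query terms without replacement from the document's tokens, and that "Zipf-like" means weighting a result at rank r by 1/r^s. The function names `generate_query` and `zipf_score` and the exponent `s` are hypothetical, and the search engine results are supplied as plain lists rather than fetched from real engines.

```python
import random
from collections import defaultdict


def generate_query(tokens, query_len, rng):
    """Draw up to query_len tokens without replacement and keep distinct terms.

    Sampling without replacement from the token multiset means the number of
    times a term is drawn follows a hypergeometric distribution (assumed model).
    """
    sample = rng.sample(tokens, min(query_len, len(tokens)))
    # Remove repeated terms while preserving the order in which they were drawn.
    return list(dict.fromkeys(sample))


def zipf_score(result_lists, s=1.0):
    """Aggregate ranked result lists with an assumed Zipf-like weight 1 / rank**s."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / rank ** s
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical usage: one generated query, two engines' result lists.
tokens = "web document similarity retrieval web search".split()
query = generate_query(tokens, 3, random.Random(0))
ranked = zipf_score([["a", "b"], ["b", "c"]])
```

In this toy input, a document returned by both engines accumulates weight from each list, so it outranks a document that appears first in only one list.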

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bravo-Marquez, F., L’Huillier, G., Ríos, S.A., Velásquez, J.D. (2010). Hypergeometric Language Model and Zipf-Like Scoring Function for Web Document Similarity Retrieval. In: Chavez, E., Lonardi, S. (eds) String Processing and Information Retrieval. SPIRE 2010. Lecture Notes in Computer Science, vol 6393. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16321-0_32

  • DOI: https://doi.org/10.1007/978-3-642-16321-0_32

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-16320-3

  • Online ISBN: 978-3-642-16321-0

  • eBook Packages: Computer Science, Computer Science (R0)
