DOI: 10.1145/2063576.2063592
research-article

Discovering URLs through user feedback

Published: 24 October 2011

ABSTRACT

Search engines rely upon crawling to build their Web page collections. A Web crawler typically discovers new URLs by following the link structure induced by links on Web pages. As the number of documents on the Web is large, discovering newly created URLs may take arbitrarily long, and depending on how a given page is connected to others, such a crawler may miss some pages altogether. In this paper, we evaluate the benefits of integrating a passive URL discovery mechanism into a Web crawler. This mechanism is passive in the sense that it does not require the crawler to actively fetch documents from the Web to discover URLs. We focus here on a mechanism that uses toolbar data as a representative source for new URL discovery. We use the toolbar logs of Yahoo! to characterize the URLs that are accessed by users via their browsers but not discovered by the Yahoo! Web crawler. We show that a high fraction of the URLs that appear in toolbar logs are not discovered by the crawler. We also reveal that a certain fraction of URLs are discovered by the crawler only after they are first accessed by users. One important conclusion of our work is that Web search engines can benefit greatly from user feedback in the form of toolbar logs for passive URL discovery.
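The two measurements named in the abstract (the share of toolbar URLs the crawler never discovers, and the delay for URLs the crawler finds only after users have visited them) can be illustrated with a minimal sketch. The function name `compare_discovery`, the dictionary layout, and the sample URLs and dates below are all hypothetical and chosen only for illustration; they are not the paper's data or its actual pipeline.

```python
from datetime import datetime, timedelta


def compare_discovery(toolbar_first_seen, crawler_first_seen):
    """Compare first user access times with crawler discovery times.

    Both arguments are hypothetical dicts mapping URL -> datetime.
    Returns a pair:
      - fraction of toolbar URLs the crawler never discovered
      - dict of URL -> delay, for URLs the crawler found after users did
    """
    if not toolbar_first_seen:
        return 0.0, {}

    # URLs seen in toolbar logs but absent from the crawler's records.
    missed = [u for u in toolbar_first_seen if u not in crawler_first_seen]

    # URLs the crawler did find, but later than the first user access.
    late = {
        url: crawler_first_seen[url] - accessed
        for url, accessed in toolbar_first_seen.items()
        if url in crawler_first_seen and crawler_first_seen[url] > accessed
    }
    return len(missed) / len(toolbar_first_seen), late


# Toy data (assumed, not taken from the paper).
toolbar = {
    "http://example.com/a": datetime(2011, 1, 1),
    "http://example.com/b": datetime(2011, 1, 2),
    "http://example.com/c": datetime(2011, 1, 3),
    "http://example.com/d": datetime(2011, 1, 4),
}
crawler = {
    "http://example.com/a": datetime(2010, 12, 30),  # crawler saw it first
    "http://example.com/b": datetime(2011, 1, 5),    # crawler arrived later
}

missed_fraction, late = compare_discovery(toolbar, crawler)
print(missed_fraction)
print(late)
```

In this toy input, two of the four toolbar URLs are unknown to the crawler and one is discovered late. A passive discovery mechanism in this sense would feed such URLs to the crawler's frontier without the crawler fetching any page to find them.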


    • Published in

      CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
      October 2011
      2712 pages
      ISBN: 9781450307178
      DOI: 10.1145/2063576

      Copyright © 2011 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States




      Acceptance Rates

      Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
