ABSTRACT
Search engines rely upon crawling to build their Web page collections. A Web crawler typically discovers new URLs by following the link structure induced by links on Web pages. As the number of documents on the Web is large, discovering newly created URLs may take arbitrarily long, and depending on how a given page is connected to others, such a crawler may miss the page altogether. In this paper, we evaluate the benefits of integrating a passive URL discovery mechanism into a Web crawler. This mechanism is passive in the sense that it does not require the crawler to actively fetch documents from the Web to discover URLs. We focus here on a mechanism that uses toolbar data as a representative source for new URL discovery. We use the toolbar logs of Yahoo! to characterize the URLs that are accessed by users via their browsers but not discovered by the Yahoo! Web crawler. We show that a high fraction of the URLs that appear in toolbar logs are not discovered by the crawler. We also reveal that a certain fraction of URLs are discovered by the crawler only after the time they are first accessed by users. One important conclusion of our work is that Web search engines can benefit substantially from user feedback in the form of toolbar logs for passive URL discovery.
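The comparison described above can be illustrated with a minimal sketch. It is not the method used in the paper. It assumes, purely for illustration, that both data sources have already been reduced to maps from a normalized URL to a first-seen timestamp. The names `compare_logs`, `DiscoveryStats`, `toolbar_first_access`, and `crawler_first_seen` are hypothetical, as is the toy data.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class DiscoveryStats:
    total: int            # distinct URLs seen in the toolbar log
    undiscovered: int     # URLs the crawler never found
    discovered_late: int  # URLs the crawler found only after the first user access

    @property
    def undiscovered_fraction(self) -> float:
        return self.undiscovered / self.total if self.total else 0.0

    @property
    def late_fraction(self) -> float:
        return self.discovered_late / self.total if self.total else 0.0


def compare_logs(toolbar_first_access: Dict[str, int],
                 crawler_first_seen: Dict[str, int]) -> DiscoveryStats:
    """Count toolbar URLs that the crawler missed or discovered late.

    Both arguments map a URL to a timestamp (assumed format: seconds
    since some common epoch). This input shape is an assumption.
    """
    undiscovered = 0
    late = 0
    for url, accessed_at in toolbar_first_access.items():
        discovered_at: Optional[int] = crawler_first_seen.get(url)
        if discovered_at is None:
            undiscovered += 1   # candidate for passive discovery
        elif discovered_at > accessed_at:
            late += 1           # users reached the page before the crawler did
    return DiscoveryStats(len(toolbar_first_access), undiscovered, late)


# Toy data, invented for illustration only.
toolbar = {"http://a.example/": 10, "http://b.example/": 20,
           "http://c.example/": 30, "http://d.example/": 40}
crawler = {"http://a.example/": 5, "http://b.example/": 25,
           "http://e.example/": 1}

stats = compare_logs(toolbar, crawler)
print(stats.undiscovered_fraction, stats.late_fraction)
```

In this toy input, two of the four toolbar URLs are absent from the crawler's map and one is discovered after its first user access. A real analysis would also need URL normalization and filtering of private or session-specific URLs, which this sketch leaves out.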
Index Terms
Discovering URLs through user feedback