ABSTRACT
Assigning documents accurately to sites is critical for the performance of multi-site Web search engines. In such settings, each site crawls only the documents it indexes and forwards queries to other sites to obtain their best-matching documents. Inaccurate assignments may lead to inefficiencies when crawling Web pages or processing user queries. In this work, we propose a machine-learned document assignment strategy that uses the locality of document views in search results to decide upon assignments. We evaluate the performance of our strategy using various document features extracted from a large Web collection. Our experimental setup uses query logs from a number of search front-ends spread across different geographic locations to learn the document access patterns. We compare our technique against baselines such as region- and language-based document assignment and observe that it achieves substantial improvements in recall. With our technique, we obtain a small query forwarding rate (0.04) while requiring roughly 45% less document replication than replicating all documents across all sites.
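The trade-off between forwarding and replication described in the abstract can be illustrated with a small sketch. It does not reproduce the machine-learned strategy of the paper: it swaps in a plain threshold rule on view locality, and the function names, the `threshold` parameter, the toy view counts, and the use of non-local views as a stand-in for the query forwarding rate are all assumptions made for illustration.

```python
from collections import defaultdict


def assign_documents(views, threshold=0.3):
    """Assign each document to every site whose share of its views >= threshold.

    views maps (site, doc) -> number of times doc appeared in results served
    for queries issued at that site (hypothetical log summary).
    """
    per_doc = defaultdict(dict)
    for (site, doc), count in views.items():
        per_doc[doc][site] = per_doc[doc].get(site, 0) + count

    assignment = {}
    for doc, site_counts in per_doc.items():
        total = sum(site_counts.values())
        sites = {
            site
            for site, count in site_counts.items()
            if total > 0 and count / total >= threshold
        }
        if not sites:
            # Fallback (assumption): keep at least one copy, at the site
            # that views the document most often.
            sites = {max(site_counts, key=site_counts.get)}
        assignment[doc] = sites
    return assignment


def forwarding_rate(views, assignment):
    """Fraction of views that hit a site not holding the document (a proxy)."""
    total = sum(views.values())
    remote = sum(
        count
        for (site, doc), count in views.items()
        if site not in assignment.get(doc, set())
    )
    return remote / total if total else 0.0


def replication_saving(assignment, num_sites):
    """Fraction of copies saved relative to replicating every doc on every site."""
    copies = sum(len(sites) for sites in assignment.values())
    full = num_sites * len(assignment)
    return 1 - copies / full if full else 0.0


# Toy view counts for two sites and three documents (made-up numbers).
views = {
    ("eu", "d1"): 8, ("us", "d1"): 2,
    ("eu", "d2"): 1, ("us", "d2"): 9,
    ("eu", "d3"): 5, ("us", "d3"): 5,
}
assignment = assign_documents(views, threshold=0.3)
rate = forwarding_rate(views, assignment)
saving = replication_saving(assignment, num_sites=2)
```

Raising `threshold` in this sketch stores fewer copies but sends more views to remote sites, which is the same tension the abstract reports between the forwarding rate and the amount of replication.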
Index Terms
- Document assignment in multi-site search engines