poster

Document assignment in multi-site search engines

Authors:
Ulf Brefeld (Yahoo! Research, Barcelona, Spain),
B. Barla Cambazoglu (Yahoo! Research, Barcelona, Spain),
Flavio P. Junqueira (Yahoo! Research, Barcelona, Spain)

WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining, February 2011, Pages 575–584
https://doi.org/10.1145/1935826.1935907

Published: 09 February 2011

ABSTRACT

Assigning documents accurately to sites is critical for the performance of multi-site Web search engines. In such settings, sites crawl only documents they index and forward queries to obtain best-matching documents from other sites. Inaccurate assignments may lead to inefficiencies when crawling Web pages or processing user queries. In this work, we propose a machine-learned document assignment strategy that uses the locality of document views in search results to decide upon assignments. We evaluate the performance of our strategy using various document features extracted from a large Web collection. Our experimental setup uses query logs from a number of search front-ends spread across different geographic locations and uses these logs to learn the document access patterns. We compare our technique against baselines such as region- and language-based document assignment and observe that our technique achieves substantial performance improvements with respect to recall. With our technique, we are able to obtain a small query forwarding rate (0.04) requiring roughly 45% less replication of documents compared to replicating all documents across all sites.
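The trade-off the abstract describes, between how many document copies are stored and how often a site must forward a query, can be illustrated with a toy sketch. This is not the paper's learned classifier: a simple view-share threshold stands in for the learned assignment decision, and the function names, the `threshold` parameter, and the sample site and document identifiers are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def assign_documents(views, threshold=0.3):
    """Assign each document to the sites where it is viewed often enough.

    views: iterable of (site, doc_id) pairs, one pair per result view.
    The threshold rule is a hypothetical stand-in for a learned classifier.
    """
    per_doc = defaultdict(Counter)
    for site, doc in views:
        per_doc[doc][site] += 1

    assignment = {}
    for doc, counts in per_doc.items():
        total = sum(counts.values())
        # Every document is kept on at least its most frequent viewing site.
        top_site = max(sorted(counts), key=lambda s: counts[s])
        sites = {s for s, c in counts.items() if c / total >= threshold}
        sites.add(top_site)
        assignment[doc] = sites
    return assignment


def replication_saving(assignment, num_sites):
    """Fraction of copies saved relative to storing every document on every site."""
    stored = sum(len(sites) for sites in assignment.values())
    return 1.0 - stored / (len(assignment) * num_sites)


def forwarding_rate(queries, assignment):
    """Fraction of queries with at least one result document missing locally.

    queries: list of (site, [doc_id, ...]) pairs.
    """
    forwarded = sum(
        1
        for site, docs in queries
        if any(site not in assignment.get(doc, set()) for doc in docs)
    )
    return forwarded / len(queries)


# Made-up sample data: three sites, three documents.
sample_views = [
    ("eu", "d1"), ("eu", "d1"), ("eu", "d1"), ("us", "d1"),
    ("us", "d2"), ("us", "d2"),
    ("asia", "d3"), ("eu", "d3"),
]
sample_queries = [
    ("eu", ["d1", "d3"]),
    ("us", ["d1", "d2"]),
    ("us", ["d2"]),
    ("asia", ["d3"]),
]

sample_assignment = assign_documents(sample_views)
print(sample_assignment)
print(replication_saving(sample_assignment, num_sites=3))
print(forwarding_rate(sample_queries, sample_assignment))
```

Lowering `threshold` in this sketch stores more copies and forwards fewer queries; raising it does the opposite. The figures reported in the abstract (a forwarding rate of 0.04 with roughly 45% less replication) come from the authors' experiments and are not reproduced by this toy data.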

Index Terms

Document assignment in multi-site search engines
1. Information systems
  1. Information retrieval

Published in
WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining
February 2011
870 pages
ISBN: 9781450304931
DOI: 10.1145/1935826
General Chair: Irwin King (CUHK, Hong Kong)
Program Chairs: Wolfgang Nejdl (L3S and University of Hannover, Germany), Hang Li (Microsoft Research Asia, China)
Copyright © 2011 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
classification
document replication
multi-site web search engines
Qualifiers
- poster
Acceptance Rates
WSDM '11 Paper Acceptance Rate: 83 of 372 submissions, 22%
Overall Acceptance Rate: 498 of 2,863 submissions, 17%