DOI: 10.1145/1935826.1935907
poster

Document assignment in multi-site search engines

Published: 09 February 2011

ABSTRACT

Assigning documents accurately to sites is critical for the performance of multi-site Web search engines. In such settings, sites crawl only documents they index and forward queries to obtain best-matching documents from other sites. Inaccurate assignments may lead to inefficiencies when crawling Web pages or processing user queries. In this work, we propose a machine-learned document assignment strategy that uses the locality of document views in search results to decide upon assignments. We evaluate the performance of our strategy using various document features extracted from a large Web collection. Our experimental setup uses query logs from a number of search front-ends spread across different geographic locations and uses these logs to learn the document access patterns. We compare our technique against baselines such as region- and language-based document assignment and observe that our technique achieves substantial performance improvements with respect to recall. With our technique, we are able to obtain a small query forwarding rate (0.04) requiring roughly 45% less replication of documents compared to replicating all documents across all sites.
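As a rough illustration of the quantities the abstract refers to (locality of document views, forwarding rate, replication), the following minimal sketch uses a simple view-share heuristic. It is not the machine-learned strategy proposed in the paper: the function names, the `min_share` threshold, and the treatment of each logged view as one request are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def assign_documents(views, min_share=0.25):
    """Assign each document to the sites that account for enough of its views.

    views: iterable of (site, doc_id) pairs, one per logged result view.
    min_share: hypothetical threshold on a site's share of a document's views.
    """
    per_doc = defaultdict(Counter)
    for site, doc in views:
        per_doc[doc][site] += 1

    assignment = {}
    for doc, counts in per_doc.items():
        total = sum(counts.values())
        sites = {s for s, c in counts.items() if c / total >= min_share}
        # Every document is kept on at least its most frequent viewing site.
        sites.add(max(counts, key=counts.get))
        assignment[doc] = sites
    return assignment


def forwarding_rate(views, assignment):
    """Fraction of views whose document is not held by the viewing site."""
    views = list(views)
    if not views:
        return 0.0
    missed = sum(1 for site, doc in views if site not in assignment.get(doc, set()))
    return missed / len(views)


def replication_ratio(assignment, num_sites):
    """Stored copies relative to replicating every document on every site."""
    if not assignment or num_sites == 0:
        return 0.0
    copies = sum(len(sites) for sites in assignment.values())
    return copies / (len(assignment) * num_sites)


if __name__ == "__main__":
    # Toy log: document "a" is viewed mostly from "eu", document "b" only from "us".
    log = [("eu", "a")] * 8 + [("us", "a")] * 2 + [("us", "b")] * 5
    placement = assign_documents(log)
    print(placement)
    print(forwarding_rate(log, placement))
    print(replication_ratio(placement, num_sites=2))
```

Lowering `min_share` places documents on more sites, which reduces forwarding at the cost of more replication. This is the same trade-off the abstract reports for the learned strategy, although the numbers produced by this toy heuristic say nothing about the paper's results.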


    • Published in

      WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining
      February 2011
      870 pages
      ISBN: 9781450304931
      DOI: 10.1145/1935826

      Copyright © 2011 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • poster

      Acceptance Rates

      WSDM '11 Paper Acceptance Rate: 83 of 372 submissions, 22%. Overall Acceptance Rate: 498 of 2,863 submissions, 17%.
