Abstract
The traditional search solution for large collections divides the collection into subsets (shards) and processes the query against all shards in parallel (exhaustive search). The search cost and the computational requirements of this approach are often prohibitively high for organizations with few computational resources. This article investigates and extends an alternative: selective search, an approach that partitions the dataset based on document similarity to obtain topic-based shards, and searches only the few shards that are estimated to contain relevant documents for the query. We propose shard creation techniques that are scalable, efficient, and self-reliant, and that create topic-based shards with low variance in size and a high density of relevant documents.
The experimental results demonstrate that the effectiveness of selective search is on par with that of exhaustive search, while its search costs are substantially lower. In addition, the majority of the queries perform as well as or better with selective search. An oracle experiment that uses the optimal shard ranking for each query indicates that selective search can be more effective than exhaustive search. A comparison with a query optimization technique shows larger efficiency improvements with selective search. The best overall efficiency is achieved when the two techniques are combined in an optimized selective search approach.
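The contrast between exhaustive and selective search described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the article's method: the seed-based shard allocation, the term-fraction shard ranking, the cosine scoring, and all names (`DOCS`, `SEEDS`, `build_shards`, `rank_shards`, `selective_search`) are hypothetical and chosen only to make the idea concrete.

```python
from collections import Counter
import math

def tokenize(text):
    return text.lower().split()

def cosine(a, b):
    # Cosine similarity between two term-frequency Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_shards(docs, seeds):
    # Assumption: one-pass assignment of each document to its most similar
    # seed text; a stand-in for similarity-based (topical) partitioning.
    seed_vecs = [Counter(tokenize(s)) for s in seeds]
    shards = [[] for _ in seeds]
    for doc_id, text in docs.items():
        vec = Counter(tokenize(text))
        best = max(range(len(seeds)), key=lambda i: cosine(vec, seed_vecs[i]))
        shards[best].append(doc_id)
    return shards

def rank_shards(query, docs, shards):
    # Assumption: score a shard by the fraction of its tokens that are
    # query terms; a deliberately simple stand-in for shard ranking.
    q = set(tokenize(query))
    scores = []
    for i, shard in enumerate(shards):
        tokens = [t for d in shard for t in tokenize(docs[d])]
        hits = sum(1 for t in tokens if t in q)
        scores.append((hits / len(tokens) if tokens else 0.0, i))
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]

def search(query, docs, doc_ids, k=3):
    # Score only the given documents and return the top-k matching ids.
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(docs[d]))), d) for d in doc_ids]
    scored = [(s, d) for s, d in scored if s > 0]
    return [d for _, d in sorted(scored, key=lambda x: (-x[0], x[1]))[:k]]

def selective_search(query, docs, shards, n_shards=1, k=3):
    # Search only the top-ranked shards; also report how many docs were scored.
    chosen = rank_shards(query, docs, shards)[:n_shards]
    candidates = [d for i in chosen for d in shards[i]]
    return search(query, docs, candidates, k), len(candidates)

DOCS = {
    "d1": "jaguar speed in the rainforest habitat",
    "d2": "rainforest birds and habitat loss",
    "d3": "jaguar car engine and top speed",
    "d4": "car engine repair manual",
    "d5": "stock market index falls",
    "d6": "market index funds and stock prices",
}
SEEDS = ["rainforest habitat animal", "car engine", "stock market"]

shards = build_shards(DOCS, SEEDS)
exhaustive = search("car engine", DOCS, list(DOCS))
selective, n_scored = selective_search("car engine", DOCS, shards)
print(exhaustive, selective, n_scored)
```

In this toy collection, the selective run scores only the documents in the single top-ranked shard, whereas the exhaustive run scores every document; how closely the two result lists agree on a real collection depends on the quality of the partitioning and of the shard ranking.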
Index Terms
- Selective Search: Efficient and Effective Search of Large Textual Collections