
Selective Search: Efficient and Effective Search of Large Textual Collections

Authors:
Anagha Kulkarni, San Francisco State University, San Francisco, CA
Jamie Callan, Carnegie Mellon University, Pittsburgh, PA

ACM Transactions on Information Systems, Volume 33, Issue 4, Article No. 17, pp. 1–33. https://doi.org/10.1145/2738035

Published: 23 April 2015

Abstract

The traditional search solution for large collections divides the collection into subsets (shards) and processes the query against all shards in parallel (exhaustive search). The search cost and computational requirements of this approach are often prohibitively high for organizations with few computational resources. This article investigates and extends an alternative: selective search, an approach that partitions the collection by document similarity to obtain topic-based shards and searches only the few shards estimated to contain documents relevant to the query. We propose shard creation techniques that are scalable, efficient, and self-reliant, and that create topic-based shards with low variance in size and a high density of relevant documents.

The experimental results demonstrate that the effectiveness of selective search is on par with that of exhaustive search, while its search costs are substantially lower. Moreover, the majority of queries perform as well as or better with selective search. An oracle experiment that uses the optimal shard ranking for each query indicates that selective search can exceed the effectiveness of exhaustive search. A comparison with a query optimization technique shows larger efficiency improvements with selective search. The best overall efficiency is achieved when the two techniques are combined in an optimized selective search approach.
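
The flow described in the abstract (partition into topical shards, rank the shards for a query, search only the top few) can be illustrated with a toy sketch. This is not the authors' method: the seed-term assignment stands in for similarity-based clustering, the term-count score stands in for a real resource selection algorithm, and every name here (`build_shards`, `rank_shards`, `selective_search`, `seeds`, the sample documents) is a hypothetical chosen for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; real systems use proper analyzers."""
    return text.lower().split()


def build_shards(docs, seeds):
    """Assign each document to the topic whose seed terms it overlaps most.

    Stand-in for similarity-based partitioning. `seeds` is a hypothetical
    hand-written topic vocabulary, not something the article prescribes.
    """
    shards = {name: [] for name in seeds}
    for doc in docs:
        terms = set(tokenize(doc))
        best = max(seeds, key=lambda name: len(terms & seeds[name]))
        shards[best].append(doc)
    return shards


def rank_shards(query, shards):
    """Rank shards by total occurrences of the query terms (toy score)."""
    query_terms = tokenize(query)
    scores = {}
    for name, docs in shards.items():
        counts = Counter(t for d in docs for t in tokenize(d))
        scores[name] = sum(counts[t] for t in query_terms)
    return sorted(scores, key=lambda name: scores[name], reverse=True)


def search_shard(query, docs):
    """Score each document in one shard by its query-term overlap."""
    query_terms = set(tokenize(query))
    hits = []
    for doc in docs:
        overlap = len(query_terms & set(tokenize(doc)))
        if overlap:
            hits.append((overlap, doc))
    return hits


def selective_search(query, shards, k=1):
    """Search only the k top-ranked shards and merge their results."""
    searched = rank_shards(query, shards)[:k]
    hits = [h for name in searched for h in search_shard(query, shards[name])]
    ranked = sorted(hits, key=lambda h: -h[0])
    return [doc for _, doc in ranked], searched


docs = [
    "python code compiler bug",
    "compiler optimization code",
    "soccer match goal score",
    "tennis match serve",
]
seeds = {"tech": {"code", "compiler"}, "sports": {"match", "goal"}}

shards = build_shards(docs, seeds)
results, searched = selective_search("compiler bug", shards, k=1)
# Setting k to the number of shards turns this into exhaustive search.
exhaustive, _ = selective_search("compiler bug", shards, k=len(shards))
print(searched, results)
```

In this small example the single selected shard already holds every matching document, so the selective and exhaustive result lists coincide while only half of the shards are searched. Real collections offer no such guarantee, which is why shard quality and shard ranking matter.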

Index Terms

Information systems

Published in
ACM Transactions on Information Systems, Volume 33, Issue 4, May 2015, 213 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/2766484
Editor: Maarten de Rijke, University of Amsterdam, The Netherlands
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 April 2015
- Accepted: 1 February 2015
- Revised: 1 December 2014
- Received: 1 August 2014
Author Tags
Large-scale text search
distributed information retrieval
document collection organization
partitioned search
resource selection
selective search
Qualifiers
- research-article
- Research
- Refereed