research-article

Selective Search: Efficient and Effective Search of Large Textual Collections

Published: 23 April 2015

Abstract

The traditional search solution for large collections divides the collection into subsets (shards), and processes the query against all shards in parallel (exhaustive search). The search cost and the computational requirements of this approach are often prohibitively high for organizations with few computational resources. This article investigates and extends an alternative: selective search, an approach that partitions the dataset based on document similarity to obtain topic-based shards, and searches only a few shards that are estimated to contain relevant documents for the query. We propose shard creation techniques that are scalable, efficient, self-reliant, and create topic-based shards with low variance in size, and high density of relevant documents.

The experimental results demonstrate that the effectiveness of selective search is on par with that of exhaustive search, while its search costs are substantially lower. Moreover, the majority of queries perform as well as or better with selective search. An oracle experiment that uses the optimal shard ranking for each query indicates that selective search can be more effective than exhaustive search. A comparison with a query optimization technique shows larger efficiency improvements with selective search. The best overall efficiency is achieved when the two techniques are combined in an optimized selective search approach.
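The two stages the abstract describes (partition the collection into topic-based shards, then search only the shards estimated to be relevant) can be illustrated with a minimal sketch. It is not the article's method: the k-means-style clustering, the centroid-cosine shard ranking, the toy documents, and every function name and parameter (`build_shards`, `selective_search`, `seed_ids`, `n_shards`) are assumptions chosen for illustration only.

```python
from collections import Counter
import math


def vectorize(text):
    """Bag-of-words term-frequency vector for whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Sum of member vectors; cosine only depends on the direction."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return total


def build_shards(docs, seed_ids, iterations=5):
    """Partition docs into topical shards with a small k-means-style loop.

    seed_ids (hypothetical, hand-picked here) selects the initial centroids.
    """
    vectors = {doc_id: vectorize(text) for doc_id, text in docs.items()}
    centroids = [vectors[s] for s in seed_ids]
    shards = []
    for _ in range(iterations):
        shards = [[] for _ in centroids]
        for doc_id, vec in vectors.items():
            best = max(range(len(centroids)),
                       key=lambda i: cosine(vec, centroids[i]))
            shards[best].append(doc_id)
        # Recompute each centroid; keep the old one if a shard went empty.
        centroids = [centroid([vectors[d] for d in members]) if members
                     else centroids[i]
                     for i, members in enumerate(shards)]
    return shards, centroids, vectors


def selective_search(query, shards, centroids, vectors, n_shards=1, top_k=3):
    """Rank shards for the query, then score documents in the top shards only."""
    q = vectorize(query)
    ranked = sorted(range(len(shards)),
                    key=lambda i: cosine(q, centroids[i]), reverse=True)
    searched = ranked[:n_shards]
    candidates = [d for i in searched for d in shards[i]]
    scored = sorted(((cosine(q, vectors[d]), d) for d in candidates),
                    reverse=True)
    hits = [doc_id for score, doc_id in scored if score > 0][:top_k]
    return hits, searched


# Toy collection with three obvious topics (invented data).
DOCS = {
    "d1": "python code compiler syntax",
    "d2": "compiler syntax parser code",
    "d3": "soccer goal league match",
    "d4": "league match soccer referee",
    "d5": "recipe oven flour sugar",
    "d6": "flour sugar butter recipe",
}

shards, centroids, vectors = build_shards(DOCS, seed_ids=["d1", "d3", "d5"])
hits, searched = selective_search("soccer referee", shards, centroids, vectors)
print("shards:", shards)
print("searched shard indexes:", searched)
print("hits:", hits)
```

In this toy run the query touches one of three shards, so only a third of the documents are scored. Exhaustive search corresponds to setting `n_shards` to the total number of shards.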

Published in

ACM Transactions on Information Systems, Volume 33, Issue 4
May 2015
213 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/2766484

Copyright © 2015 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 23 April 2015
• Accepted: 1 February 2015
• Revised: 1 December 2014
• Received: 1 August 2014

Qualifiers

• research-article
• Research
• Refereed
