Abstract
The proliferation of searchable text databases on corporate networks and the Internet causes a database selection problem for many people. Algorithms such as gGLOSS and CORI can automatically select which text databases to search for a given information need, but only if given a set of resource descriptions that accurately represent the contents of each database. The existing techniques for acquiring resource descriptions have significant limitations when used in wide-area networks controlled by many parties. This paper presents query-based sampling, a new technique for acquiring accurate resource descriptions. Query-based sampling does not require the cooperation of resource providers, nor does it require that resource providers use a particular search engine or representation technique. An extensive set of experimental results demonstrates that accurate resource descriptions are created, that computation and communication costs are reasonable, and that the resource descriptions do in fact enable accurate automatic database selection.
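The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of the general idea: build a resource description by issuing queries through a database's ordinary search interface and tallying terms in the documents that come back. The `search` callable, the toy database, the random choice of the next query term, and the default parameter values are all assumptions made for illustration, not the authors' implementation.

```python
import random
from collections import Counter

def query_based_sample(search, seed_term, docs_per_query=4, max_docs=300, rng=None):
    """Build a document-frequency description using only a search interface.

    `search(term, n)` is a hypothetical callable returning up to n
    (doc_id, text) pairs; it stands in for any database's query interface.
    The defaults for docs_per_query and max_docs are illustrative only.
    """
    rng = rng or random.Random(0)
    seen = set()      # ids of documents already sampled
    df = Counter()    # term -> number of sampled documents containing it
    tried = set()     # query terms already issued
    term = seed_term
    while term is not None and len(seen) < max_docs:
        tried.add(term)
        for doc_id, text in search(term, docs_per_query):
            if len(seen) >= max_docs:
                break
            if doc_id in seen:
                continue
            seen.add(doc_id)
            df.update(set(text.lower().split()))
        # Assumed strategy: draw the next query term at random from the
        # terms learned so far that have not been used as queries yet.
        candidates = sorted(t for t in df if t not in tried)
        term = rng.choice(candidates) if candidates else None
    return df, len(seen)

# Toy stand-in for a remote database; real use would wrap a search API.
TOY_DOCS = {1: "apple banana", 2: "banana cherry", 3: "cherry date"}

def toy_search(term, n):
    hits = [(i, t) for i, t in sorted(TOY_DOCS.items()) if term in t.split()]
    return hits[:n]
```

Calling `query_based_sample(toy_search, "apple", docs_per_query=2)` walks the toy database one query at a time without ever reading its index directly, which is the property the abstract emphasizes: no cooperation from the resource provider is needed beyond answering queries.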
Index Terms
- Query-based sampling of text databases