Query-based sampling of text databases

Published: 01 April 2001

Abstract

The proliferation of searchable text databases on corporate networks and the Internet causes a database selection problem for many people. Algorithms such as gGLOSS and CORI can automatically select which text databases to search for a given information need, but only if given a set of resource descriptions that accurately represent the contents of each database. The existing techniques for acquiring resource descriptions have significant limitations when used in wide-area networks controlled by many parties. This paper presents query-based sampling, a new technique for acquiring accurate resource descriptions. Query-based sampling does not require the cooperation of resource providers, nor does it require that resource providers use a particular search engine or representation technique. An extensive set of experimental results demonstrates that accurate resource descriptions are created, that computation and communication costs are reasonable, and that the resource descriptions do in fact enable accurate automatic database selection.
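The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of how a resource description might be built through a database's ordinary query interface, without any cooperation from the provider. Everything named here is an assumption made for illustration: the `search(term, n)` callable, the `docs_per_query` and `max_docs` parameters, the random choice of the next query term from the vocabulary seen so far, and the use of document frequencies as the resource description.

```python
import random
from collections import Counter


def query_based_sample(search, seed_term, docs_per_query=4, max_docs=300, rng=None):
    """Build a term -> document-frequency table from sampled documents.

    `search(term, n)` is a hypothetical interface: it returns up to `n`
    (doc_id, text) pairs that the remote database ranks for `term`.
    """
    rng = rng or random.Random(0)   # fixed seed so runs are repeatable
    seen = set()                    # ids of documents already sampled
    df = Counter()                  # document frequency per term in the sample
    tried = set()                   # query terms already issued
    term = seed_term
    while term is not None and len(seen) < max_docs:
        tried.add(term)
        for doc_id, text in search(term, docs_per_query):
            if doc_id in seen:
                continue
            seen.add(doc_id)
            # Count each term once per document.
            df.update(set(text.lower().split()))
        # Assumed strategy: pick the next query term at random from the
        # vocabulary learned so far, skipping terms already used.
        candidates = sorted(t for t in df if t not in tried)
        term = rng.choice(candidates) if candidates else None
    return df, len(seen)
```

The loop stops when the document budget is reached or when no unused terms remain. A selection algorithm would then consume the returned frequency table in place of a description supplied by the provider.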
