Query-based sampling of text databases

Authors: Jamie Callan (Carnegie Mellon Univ.), Margaret Connell (Univ. of Massachusetts)

ACM Transactions on Information Systems, Volume 19, Issue 2, pp 97–130. https://doi.org/10.1145/382979.383040

Published: 01 April 2001

Abstract

The proliferation of searchable text databases on corporate networks and the Internet causes a database selection problem for many people. Algorithms such as gGLOSS and CORI can automatically select which text databases to search for a given information need, but only if given a set of resource descriptions that accurately represent the contents of each database. The existing techniques for acquiring resource descriptions have significant limitations when used in wide-area networks controlled by many parties. This paper presents query-based sampling, a new technique for acquiring accurate resource descriptions. Query-based sampling does not require the cooperation of resource providers, nor does it require that resource providers use a particular search engine or representation technique. An extensive set of experimental results demonstrates that accurate resource descriptions are created, that computation and communication costs are reasonable, and that the resource descriptions do in fact enable accurate automatic database selection.
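
The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of the general idea: build a resource description using nothing but a database's ordinary search interface. It assumes a hypothetical `search(term, n)` callable that returns up to `n` `(doc_id, text)` pairs, picks each next query term at random from the vocabulary learned so far, and records document frequencies as the description. The names `query_based_sample` and `toy_search`, the default values for `docs_per_query` and `max_docs`, and the toy corpus are all illustrative assumptions, not details taken from the abstract.

```python
import random
from collections import Counter


def query_based_sample(search, seed_term, docs_per_query=4, max_docs=300,
                       max_queries=1000, rng=None):
    """Learn a term -> document-frequency description by querying a database.

    `search(term, n)` is a hypothetical callable returning up to n
    (doc_id, text) pairs; no other cooperation from the database is assumed.
    """
    rng = rng or random.Random(0)
    description = Counter()  # term -> number of sampled documents containing it
    seen_docs = set()        # ids of documents already sampled
    tried_terms = set()      # query terms already issued
    term = seed_term
    for _ in range(max_queries):
        tried_terms.add(term)
        for doc_id, text in search(term, docs_per_query):
            if doc_id not in seen_docs:
                seen_docs.add(doc_id)
                # Count each distinct term once per newly sampled document.
                description.update(set(text.lower().split()))
        if len(seen_docs) >= max_docs:
            break
        # Choose the next query term from the vocabulary learned so far.
        candidates = sorted(t for t in description if t not in tried_terms)
        if not candidates:
            break
        term = rng.choice(candidates)
    return description, len(seen_docs)


# Toy stand-in for a remote text database (illustrative data only).
corpus = {1: "apple banana", 2: "banana cherry", 3: "cherry date"}


def toy_search(term, n):
    """Return up to n documents whose text contains the query term."""
    hits = [(i, t) for i, t in sorted(corpus.items()) if term in t.split()]
    return hits[:n]


description, n_docs = query_based_sample(toy_search, "apple",
                                         docs_per_query=2, max_docs=10)
```

In this sketch the sampler stops when it has seen enough documents, runs out of untried terms, or reaches the query budget; a real deployment would need its own stopping rule and tokenization.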

Index Terms

Query-based sampling of text databases
1. Information systems

Published in

ACM Transactions on Information Systems Volume 19, Issue 2
April 2001
119 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/382979
Copyright © 2001 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
distributed information retrieval
query-based sampling
resource ranking
resource selection
server selection
Qualifiers
- article
Article Metrics
- Total Citations: 267
- Total Downloads: 1,693
- Downloads (Last 12 months): 22
- Downloads (Last 6 weeks): 3