DOI: 10.1145/2806416.2806606
CIKM Conference Proceedings, short paper

Pooled Evaluation Over Query Variations: Users are as Diverse as Systems

Published: 17 October 2015

ABSTRACT

Evaluation of information retrieval systems with test collections makes use of a suite of fixed resources: a document corpus; a set of topics; and associated judgments of the relevance of each document to each topic. With large modern collections, exhaustive judging is not feasible, so an approach called pooling is typically used: for example, the documents to be judged for a topic can be determined by taking the union of the documents appearing in the top positions of the answer lists produced by a range of systems. Conventionally, pooling relies on variation across systems to provide a diverse set of documents to judge for each topic; variation across the queries that different users issue for the same topic is not considered. We explore the ramifications of user query variability on pooling, and demonstrate that conventional test collections do not cover this source of variation. The effect of user query variation on the size of the judging pool is just as strong as the effect of retrieval system variation. We conclude that user query variation should be incorporated early in test collection construction, and cannot be added effectively post hoc.
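To make the pooling process described above concrete, the following minimal sketch (not taken from the paper; the function name, depth parameter, topic number, and toy run data are all illustrative assumptions) forms a depth-k pool per topic as the union of the top-ranked documents contributed by every system and every user query variation.

# Minimal sketch of depth-k pooling extended to multiple query variations
# per topic. Each run is a (system, topic, query_variation, ranked_doc_ids)
# tuple; the pool for a topic is the union of the top-k documents over all
# systems and all query variations of that topic.

from collections import defaultdict

def build_pools(runs, depth=10):
    """Return {topic: set of document ids to be judged}."""
    pools = defaultdict(set)
    for system, topic, variation, ranked_docs in runs:
        # Each (system, variation) pair contributes its top `depth` documents.
        pools[topic].update(ranked_docs[:depth])
    return dict(pools)

# Toy example: two systems, one topic, two user query variations.
runs = [
    ("sysA", 301, "hubble telescope achievements", ["d3", "d7", "d1"]),
    ("sysA", 301, "what has hubble discovered",    ["d9", "d3", "d4"]),
    ("sysB", 301, "hubble telescope achievements", ["d7", "d2", "d3"]),
    ("sysB", 301, "what has hubble discovered",    ["d5", "d9", "d8"]),
]
print(build_pools(runs, depth=2))   # {301: {'d2', 'd3', 'd5', 'd7', 'd9'}}

In this toy setting, judging only the pool formed from one canonical query per topic would miss documents such as d5 and d9 that are surfaced only by the second query variation, which is the gap in conventional pooling that the paper examines.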


Published in

CIKM '15: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, October 2015, 1998 pages
ISBN: 9781450337946
DOI: 10.1145/2806416
Copyright © 2015 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

CIKM '15 paper acceptance rate: 165 of 646 submissions (26%). Overall acceptance rate: 1,861 of 8,427 submissions (22%).
