Abstract
Effectiveness evaluation of information retrieval systems by means of a test collection is a widely used methodology. However, it is rather expensive in terms of resources, time, and money, and many researchers have therefore proposed methods for cheaper evaluation. One such approach, on which we focus in this article, is to use fewer topics: in TREC-like initiatives, system effectiveness is usually measured as the average effectiveness over a set of n topics (typically n=50, although sets of more than 1,000 topics have also been used); instead of using the full set, it has been proposed to find small subsets of good topics that evaluate the systems as similarly as possible to the full set. The computational complexity of this task has so far limited the analyses that could be performed. We develop a novel and efficient approach based on a multi-objective evolutionary algorithm. The higher efficiency of our implementation allows us to reproduce some notable results on topic set reduction, as well as to run new experiments that generalize and improve on them. We show that our approach both reproduces the main state-of-the-art results and lets us analyze the effect of the collection, metric, and pool depth used in the evaluation. Finally, unlike previous studies, which have been mainly theoretical, we also discuss practical topic selection strategies that integrate the results of automatic evaluation approaches.
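To make the idea in the abstract concrete: the goal is to find a topic subset whose induced system ranking approximates the ranking produced by the full topic set. The article's method is multi-objective, trading off subset cardinality against ranking similarity; the following is a minimal single-objective sketch for a fixed cardinality k. It assumes a systems-by-topics matrix of per-topic average-precision scores and uses Kendall's tau between the subset-induced and full-set system rankings as the fitness to maximize. All names and parameter values (subset_fitness, evolve_subset, the population size, the mutation rate) are illustrative assumptions, not the article's actual implementation.

```python
import numpy as np
from scipy.stats import kendalltau

def subset_fitness(scores, subset):
    """Kendall's tau between the system ranking induced by a topic
    subset and the ranking induced by the full topic set."""
    full_mean = scores.mean(axis=1)            # e.g., MAP over all topics
    sub_mean = scores[:, subset].mean(axis=1)  # mean AP over the subset only
    tau, _ = kendalltau(full_mean, sub_mean)
    return tau

def evolve_subset(scores, k, pop_size=50, generations=200,
                  mutation_rate=0.2, seed=42):
    """Evolve a topic subset of cardinality k whose induced system
    ranking best approximates the full-set ranking."""
    rng = np.random.default_rng(seed)
    n_topics = scores.shape[1]
    # Each individual is a set of k distinct topic indices.
    pop = [rng.choice(n_topics, size=k, replace=False)
           for _ in range(pop_size)]
    for _ in range(generations):
        fitness = [subset_fitness(scores, ind) for ind in pop]
        # Elitist selection: the better half survives as parents.
        parents = [pop[i] for i in np.argsort(fitness)[pop_size // 2:]]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.choice(len(parents), size=2, replace=False)
            # Crossover: draw k distinct topics from the parents' union.
            union = np.union1d(parents[a], parents[b])
            child = rng.choice(union, size=k, replace=False)
            # Mutation: swap one topic for a random unused one.
            if rng.random() < mutation_rate:
                unused = np.setdiff1d(np.arange(n_topics), child)
                child[rng.integers(k)] = rng.choice(unused)
            children.append(child)
        pop = parents + children
    best = max(pop, key=lambda ind: subset_fitness(scores, ind))
    return np.sort(best), subset_fitness(scores, best)

# Toy usage (synthetic data): 20 systems on 50 topics, find 8 "good" topics.
scores = np.random.default_rng(0).random((20, 50))
topics, tau = evolve_subset(scores, k=8)
print(topics, round(tau, 3))
```

A full multi-objective version would instead maintain a Pareto front over (cardinality, similarity) pairs, in the style of NSGA-II, rather than fixing k in advance.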