Abstract
Effectiveness evaluation of information retrieval systems by means of a test collection is a widely used methodology. However, it is rather expensive in terms of resources, time, and money, and many researchers have therefore proposed methods for cheaper evaluation. One such approach, on which we focus in this article, is to use fewer topics: in TREC-like initiatives, system effectiveness is usually measured as the average effectiveness over a set of n topics (typically n=50, although sets of more than 1,000 topics have also been used); instead of using the full set, it has been proposed to find small subsets of good topics that evaluate the systems as similarly as possible to the full set. The computational complexity of this task has so far limited the analyses that could be performed. We develop a novel and efficient approach based on a multi-objective evolutionary algorithm. The higher efficiency of our implementation allows us to reproduce some notable results on topic set reduction, as well as to run new experiments that generalize and improve on them. We show that our approach both reproduces the main state-of-the-art results and lets us analyze the effect of the collection, metric, and pool depth used in the evaluation. Finally, unlike previous studies, which have been mainly theoretical, we also discuss practical topic selection strategies that integrate the results of automatic evaluation approaches.
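To make the idea in the abstract concrete: the goal is to find a topic subset whose induced system ranking approximates the ranking produced by the full topic set. The article's method is multi-objective, trading off subset cardinality against ranking similarity; the following is a minimal single-objective sketch for a fixed cardinality k. It assumes a systems-by-topics matrix of per-topic average-precision scores and uses Kendall's tau between the subset-induced and full-set system rankings as the fitness to maximize. All names and parameter values (subset_fitness, evolve_subset, the population size, the mutation rate) are illustrative assumptions, not the article's actual implementation.

```python
import numpy as np
from scipy.stats import kendalltau

def subset_fitness(scores, subset):
    """Kendall's tau between the system ranking induced by a topic
    subset and the ranking induced by the full topic set."""
    full_mean = scores.mean(axis=1)            # e.g., MAP over all topics
    sub_mean = scores[:, subset].mean(axis=1)  # mean AP over the subset only
    tau, _ = kendalltau(full_mean, sub_mean)
    return tau

def evolve_subset(scores, k, pop_size=50, generations=200,
                  mutation_rate=0.2, seed=42):
    """Evolve a topic subset of cardinality k whose induced system
    ranking best approximates the full-set ranking."""
    rng = np.random.default_rng(seed)
    n_topics = scores.shape[1]
    # Each individual is a set of k distinct topic indices.
    pop = [rng.choice(n_topics, size=k, replace=False)
           for _ in range(pop_size)]
    for _ in range(generations):
        fitness = [subset_fitness(scores, ind) for ind in pop]
        # Elitist selection: the better half survives as parents.
        parents = [pop[i] for i in np.argsort(fitness)[pop_size // 2:]]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.choice(len(parents), size=2, replace=False)
            # Crossover: draw k distinct topics from the parents' union.
            union = np.union1d(parents[a], parents[b])
            child = rng.choice(union, size=k, replace=False)
            # Mutation: swap one topic for a random unused one.
            if rng.random() < mutation_rate:
                unused = np.setdiff1d(np.arange(n_topics), child)
                child[rng.integers(k)] = rng.choice(unused)
            children.append(child)
        pop = parents + children
    best = max(pop, key=lambda ind: subset_fitness(scores, ind))
    return np.sort(best), subset_fitness(scores, best)

# Toy usage (synthetic data): 20 systems on 50 topics, find 8 "good" topics.
scores = np.random.default_rng(0).random((20, 50))
topics, tau = evolve_subset(scores, k=8)
print(topics, round(tau, 3))
```

A full multi-objective version would instead maintain a Pareto front over (cardinality, similarity) pairs, in the style of NSGA-II, rather than fixing k in advance.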