ABSTRACT
We introduce a method for evaluating the relevance of all visible components of a Web search results page, in the context of that results page. In contrast to Cranfield-style evaluation methods, our approach recognizes that a user's initial search interaction is with the result page produced by a search system, not the landing pages linked from it. Our key contribution is that the method allows us to investigate aspects of component relevance that are difficult or impossible to judge in isolation. Such contextual aspects include component-level information redundancy and cross-component coherence. We report on how the method complements traditional document relevance measurement and on its support for comparative relevance assessment across multiple search engines. We also study possible issues with applying the method, including brand presentation effects, inter-judge agreement, and comparisons with document-based relevance judgments. Our findings show this is a useful method for evaluating the dominant user experience in interacting with search systems.
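One practical question the abstract raises is how inter-judge agreement can be quantified when several judges label every visible component of a result page. The sketch below is an illustration only, not the paper's actual instrument: it assumes a hypothetical three-point component relevance scale and example component judgments, and computes Fleiss' kappa across judges as one plausible agreement statistic.

```python
# Minimal sketch: measure inter-judge agreement on per-component page judgments
# with Fleiss' kappa. The label scale, component names, and data are assumptions.
from collections import Counter

def fleiss_kappa(judgments):
    """judgments: list of items (page components), each a list of category
    labels, one per judge. Every item must have the same number of judges."""
    n_items = len(judgments)
    n_judges = len(judgments[0])
    categories = sorted({label for item in judgments for label in item})

    # n_ij: how many judges assigned component i to category j
    counts = [Counter(item) for item in judgments]

    # Per-component observed agreement P_i
    p_i = [(sum(c[cat] ** 2 for cat in categories) - n_judges)
           / (n_judges * (n_judges - 1)) for c in counts]
    # Overall proportion of judgments in each category, p_j
    p_j = [sum(c[cat] for c in counts) / (n_items * n_judges)
           for cat in categories]

    p_bar = sum(p_i) / n_items       # mean observed agreement
    p_e = sum(p ** 2 for p in p_j)   # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical page: three judges label three page components in context,
# on an assumed bad/fair/good scale.
page_judgments = [
    ["good", "good", "fair"],   # top web result and its caption
    ["fair", "fair", "fair"],   # related-searches block
    ["bad",  "fair", "bad"],    # advertisement block
]
print(f"Fleiss' kappa = {fleiss_kappa(page_judgments):.3f}")
```

Under these assumed labels the statistic comes out around 0.25; the same routine extends directly to many pages by treating every judged component as an item.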