ABSTRACT
We introduce a method for evaluating the relevance of all visible components of a Web search results page, in the context of that results page. In contrast to Cranfield-style evaluation methods, our approach recognizes that a user's initial search interaction is with the result page produced by a search system, not the landing pages linked from it. Our key contribution is that the method allows us to investigate aspects of component relevance that are difficult or impossible to judge in isolation. Such contextual aspects include component-level information redundancy and cross-component coherence. We report on how the method complements traditional document relevance measurement and on its support for comparative relevance assessment across multiple search engines. We also study possible issues with applying the method, including brand presentation effects, inter-judge agreement, and comparisons with document-based relevance judgments. Our findings show this is a useful method for evaluating the dominant user experience in interacting with search systems.
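One practical question the abstract raises is how inter-judge agreement can be quantified when several judges label every visible component of a result page. The sketch below is an illustration only, not the paper's actual instrument: it assumes a hypothetical three-point component relevance scale and example component judgments, and computes Fleiss' kappa across judges as one plausible agreement statistic.

```python
# Minimal sketch: measure inter-judge agreement on per-component page judgments
# with Fleiss' kappa. The label scale, component names, and data are assumptions.
from collections import Counter

def fleiss_kappa(judgments):
    """judgments: list of items (page components), each a list of category
    labels, one per judge. Every item must have the same number of judges."""
    n_items = len(judgments)
    n_judges = len(judgments[0])
    categories = sorted({label for item in judgments for label in item})

    # n_ij: how many judges assigned component i to category j
    counts = [Counter(item) for item in judgments]

    # Per-component observed agreement P_i
    p_i = [(sum(c[cat] ** 2 for cat in categories) - n_judges)
           / (n_judges * (n_judges - 1)) for c in counts]
    # Overall proportion of judgments in each category, p_j
    p_j = [sum(c[cat] for c in counts) / (n_items * n_judges)
           for cat in categories]

    p_bar = sum(p_i) / n_items       # mean observed agreement
    p_e = sum(p ** 2 for p in p_j)   # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical page: three judges label three page components in context,
# on an assumed bad/fair/good scale.
page_judgments = [
    ["good", "good", "fair"],   # top web result and its caption
    ["fair", "fair", "fair"],   # related-searches block
    ["bad",  "fair", "bad"],    # advertisement block
]
print(f"Fleiss' kappa = {fleiss_kappa(page_judgments):.3f}")
```

Under these assumed labels the statistic comes out around 0.25; the same routine extends directly to many pages by treating every judged component as an item.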