Research article · DOI: 10.1145/1840784.1840801 · IIiX Conference Proceedings

Evaluating search systems using result page context

Published: 18 August 2010

ABSTRACT

We introduce a method for evaluating the relevance of all visible components of a Web search results page, in the context of that results page. In contrast to Cranfield-style evaluation methods, our approach recognizes that a user's initial search interaction is with the results page produced by a search system, not with the landing pages linked from it. Our key contribution is that the method allows us to investigate aspects of component relevance that are difficult or impossible to judge in isolation. Such contextual aspects include component-level information redundancy and cross-component coherence. We report on how the method complements traditional document relevance measurement and how it supports comparative relevance assessment across multiple search engines. We also study possible issues with applying the method, including brand presentation effects, inter-judge agreement, and comparisons with document-based relevance judgments. Our findings show that this is a useful method for evaluating the dominant user experience in interacting with search systems.
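The abstract notes inter-judge agreement as one issue studied when applying the method. As a minimal sketch of how such agreement is commonly quantified for categorical judgments, the snippet below computes Cohen's kappa over two assessors' labels. The label set and judgments shown are hypothetical, and the paper may use a different agreement measure (for example, Fleiss' kappa when more than two judges are involved).

```python
# Minimal sketch: Cohen's kappa for chance-corrected agreement between two
# assessors over categorical component-level relevance labels.
# The label set and the example judgments below are hypothetical.
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two raters over the same items, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each rater's marginal label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical component-level judgments from two assessors.
judge_1 = ["relevant", "relevant", "redundant", "off-topic", "relevant"]
judge_2 = ["relevant", "redundant", "redundant", "off-topic", "relevant"]
print(cohens_kappa(judge_1, judge_2))  # ~0.69, "substantial" on the Landis & Koch scale
```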


Published in
IIiX '10: Proceedings of the third symposium on Information interaction in context
August 2010, 408 pages
ISBN: 9781450302470
DOI: 10.1145/1840784
Copyright © 2010 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance rate: 21 of 45 submissions, 47%
