Improving Retrievability and Recall by Automatic Corpus Partitioning

Bashir, Shariq; Rauber, Andreas

doi:10.1007/978-3-642-16175-9_5

Part of the book series: Lecture Notes in Computer Science ((TLDKS,volume 6380))

Abstract

With increasing volumes of data, much effort has been devoted to finding the most suitable answer to an information need. In many domains, however, the essential question is whether a specific information item can be found at all via a reasonable set of queries. This concept of retrievability of information has evolved into an important evaluation measure for IR systems in recall-oriented application domains. While several studies have evaluated retrieval bias in systems, solid validation of the impact of retrieval bias, and the development of methods to counter the low retrievability of certain document types, remain desirable.
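As a rough illustration of the measure discussed above, the following minimal sketch counts, for each document, how many queries rank it within a rank cutoff, and summarizes the resulting inequality with a Gini coefficient. It assumes the common cumulative form of retrievability with equally weighted queries; the function names, the `cutoff` parameter, and the input layout (a mapping from query to ranked document ids) are illustrative assumptions, not the chapter's implementation.

```python
def retrievability(ranked_results, all_doc_ids, cutoff):
    """Cumulative retrievability: for each document, the number of queries
    that rank it within the top `cutoff` positions.

    ranked_results: dict mapping a query to its ranked list of doc ids
                    (hypothetical input layout).
    all_doc_ids:    every document in the corpus, so that documents which
                    are never retrieved still receive a score of 0.
    """
    scores = dict.fromkeys(all_doc_ids, 0)
    for ranking in ranked_results.values():
        for doc_id in ranking[:cutoff]:
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return scores


def gini(values):
    """Gini coefficient of non-negative values.

    0 means all documents are equally retrievable; values closer to 1 mean
    retrievability is concentrated in few documents (stronger bias).
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)
```

In use, one would run a large query set against the retrieval system, pass the rankings to `retrievability`, and compare `gini(scores.values())` across systems or cutoffs.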

This paper provides an in-depth study of retrievability characteristics over queries of different lengths in a large benchmark corpus, validating previous studies. It analyzes the possibility of automatically categorizing documents into low and high retrievable documents based on document properties rather than on complex retrievability analysis. We further show that this classification can be used to improve the overall retrievability of documents by treating these classes as separate document corpora and combining their individual retrieval results. Experiments are validated on 1.2 million patents of the TREC Chemical Retrieval Track.
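The partition-and-combine idea can be sketched as follows. The abstract does not specify the document properties used for classification, the retrieval model, or the merging strategy, so everything here is an assumption made for illustration: `is_low` is a hypothetical stand-in classifier, `search` is a toy term-count ranker, and `interleave` merges the two result lists round-robin.

```python
from itertools import zip_longest


def partition(corpus, is_low):
    """Split a corpus (doc id -> text) into low and high retrievable parts.

    `is_low` is a hypothetical predicate standing in for a classifier
    trained on document properties.
    """
    low = {d: t for d, t in corpus.items() if is_low(d, t)}
    high = {d: t for d, t in corpus.items() if d not in low}
    return low, high


def search(query, corpus, k):
    """Toy ranker (assumption): score = occurrences of query terms."""
    terms = query.lower().split()
    scored = []
    for doc_id, text in corpus.items():
        tokens = text.lower().split()
        score = sum(tokens.count(t) for t in terms)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by doc id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def interleave(first, second, k):
    """Round-robin merge of two rankings (one assumed way to combine them)."""
    merged = []
    for pair in zip_longest(first, second):
        for doc_id in pair:
            if doc_id is not None and doc_id not in merged:
                merged.append(doc_id)
    return merged[:k]


def partitioned_search(query, corpus, is_low, k):
    """Search each partition as its own corpus, then combine the results."""
    low, high = partition(corpus, is_low)
    return interleave(search(query, high, k), search(query, low, k), k)
```

With a ranker that favors long, term-rich documents, a short document may never reach the top `k` of a single combined index, whereas searching the partitions separately gives it a chance to surface in the merged list.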

Author information

Authors and Affiliations

Institute of Software Technology and Interactive Systems, Vienna University of Technology, Austria
Shariq Bashir & Andreas Rauber

Editor information

Editors and Affiliations

Institut de Recherche en Informatique de Toulouse (IRIT), Paul Sabatier University, 118, route de Narbonne, 31062, Toulouse Cedex, France
Abdelkader Hameurlain
University of Linz, FAW, Altenbergerstraße 69, 4040, Linz, Austria
Josef Küng & Roland Wagner
Department of Computer Science, Aalborg University, Selma Lagerløfs Vej 300, 9220, Aalborg, Denmark
Torben Bach Pedersen
Institute of Software Technology, Vienna University of Technology, Favoritenstr. 9-11/188, 1040, Vienna, Austria
A. Min Tjoa

About this chapter

Cite this chapter

Bashir, S., Rauber, A. (2010). Improving Retrievability and Recall by Automatic Corpus Partitioning. In: Hameurlain, A., Küng, J., Wagner, R., Bach Pedersen, T., Tjoa, A.M. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems II. Lecture Notes in Computer Science, vol 6380. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16175-9_5

DOI: https://doi.org/10.1007/978-3-642-16175-9_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16174-2
Online ISBN: 978-3-642-16175-9
eBook Packages: Computer Science, Computer Science (R0)
