Concept-Based Information Retrieval Using Explicit Semantic Analysis

Published: 01 April 2011

Abstract

Information retrieval systems traditionally rely on textual keywords to index and retrieve documents. Keyword-based retrieval may return inaccurate and incomplete results when different keywords are used to describe the same concept in the documents and in the queries. Furthermore, the relationship between these related keywords may be semantic rather than syntactic, and capturing it thus requires access to comprehensive human world knowledge. Concept-based retrieval methods have attempted to tackle these difficulties by using manually built thesauri, by relying on term co-occurrence data, or by extracting latent word relationships and concepts from a corpus. In this article we introduce a new concept-based retrieval approach based on Explicit Semantic Analysis (ESA), a recently proposed method that augments keyword-based text representation with concept-based features, automatically extracted from massive human knowledge repositories such as Wikipedia. Our approach generates new text features automatically, and we have found that high-quality feature selection becomes crucial in this setting to make the retrieval more focused. However, because labeled data is lacking, traditional feature selection methods cannot be used; hence, we propose new methods that use self-generated labeled training data. The resulting system is evaluated on several TREC datasets, showing superior performance over previous state-of-the-art results.
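
The idea of representing text by concepts rather than keywords can be illustrated with a minimal sketch. It is not the system described in the article. It assumes a three-article toy stand-in for Wikipedia (`CONCEPTS`, hypothetical data), plain TF-IDF weighting, and a crude top-k cut-off in place of the article's feature selection, which relies on self-generated labeled training data and is not reproduced here. All function names are illustrative.

```python
import math
from collections import Counter

# Hypothetical toy stand-in for a knowledge repository: concept title -> article text.
CONCEPTS = {
    "Automobile": "car engine wheel road driver vehicle",
    "Jaguar (animal)": "cat predator jungle animal prey",
    "Stock market": "share price trader market investor",
}

def build_concept_index(concepts):
    """Build an inverted index: word -> {concept: TF-IDF weight}."""
    n = len(concepts)
    term_counts = {c: Counter(text.lower().split()) for c, text in concepts.items()}
    # Number of concept articles in which each word occurs.
    doc_freq = Counter(w for counts in term_counts.values() for w in counts)
    index = {}
    for concept, counts in term_counts.items():
        for word, tf in counts.items():
            idf = math.log(n / doc_freq[word])
            if idf > 0:  # words present in every article carry no signal
                index.setdefault(word, {})[concept] = tf * idf
    return index

def esa_vector(text, index):
    """Map a text to a weighted concept vector by summing its words' concept weights."""
    vec = Counter()
    for word, tf in Counter(text.lower().split()).items():
        for concept, weight in index.get(word, {}).items():
            vec[concept] += tf * weight
    return dict(vec)

def top_concepts(vec, k):
    """Crude stand-in for feature selection: keep the k highest-weighted concepts."""
    return sorted(vec, key=vec.get, reverse=True)[:k]

def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

index = build_concept_index(CONCEPTS)
query = esa_vector("vehicle driver", index)
document = esa_vector("car road", index)
# The query and the document share no keyword, yet both map to the same concept.
print(top_concepts(query, 1), cosine(query, document))
```

In this toy setting a keyword matcher would score the query "vehicle driver" against the document "car road" as unrelated, while their concept vectors coincide. With a real repository each text activates many concepts, most of them noisy, which is where the feature selection emphasized in the abstract matters.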


        • Published in

          ACM Transactions on Information Systems, Volume 29, Issue 2
          April 2011
          193 pages
          ISSN: 1046-8188
          EISSN: 1558-2868
          DOI: 10.1145/1961209

          Copyright © 2011 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 1 April 2011
          • Accepted: 1 January 2011
          • Revised: 1 October 2010
          • Received: 1 February 2010
          Qualifiers

          • research-article
          • Research
          • Refereed
