research-article

Concept-Based Information Retrieval Using Explicit Semantic Analysis

Authors:
Ofer Egozi, Technion - Israel Institute of Technology
Shaul Markovitch, Technion - Israel Institute of Technology
Evgeniy Gabrilovich, Technion - Israel Institute of Technology

ACM Transactions on Information Systems, Volume 29, Issue 2, Article No. 8, pp. 1–34. https://doi.org/10.1145/1961209.1961211

Published: 01 April 2011

Abstract

Information retrieval systems traditionally rely on textual keywords to index and retrieve documents. Keyword-based retrieval may return inaccurate and incomplete results when different keywords are used to describe the same concept in the documents and in the queries. Furthermore, the relationship between these related keywords may be semantic rather than syntactic, and capturing it thus requires access to comprehensive human world knowledge. Concept-based retrieval methods have attempted to tackle these difficulties by using manually built thesauri, by relying on term co-occurrence data, or by extracting latent word relationships and concepts from a corpus. In this article, we introduce a new concept-based retrieval approach based on Explicit Semantic Analysis (ESA), a recently proposed method that augments keyword-based text representation with concept-based features, automatically extracted from massive human knowledge repositories such as Wikipedia. Our approach generates new text features automatically, and we have found that high-quality feature selection becomes crucial in this setting to make the retrieval more focused. However, due to the lack of labeled data, traditional feature selection methods cannot be used; we therefore propose new methods that use self-generated labeled training data. The resulting system is evaluated on several TREC datasets, showing superior performance over previous state-of-the-art results.
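
To make the idea of concept-based features concrete, the following is a minimal sketch of an ESA-style mapping from text to weighted concepts. It is not the authors' implementation: the three-entry `CONCEPTS` dictionary is a hypothetical stand-in for a repository such as Wikipedia, the TF-IDF weighting is a common simplification, and `top_concepts` is a naive top-k cut rather than the feature selection methods proposed in the article. All function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy repository: concept title -> article text.
CONCEPTS = {
    "Automobile": "car engine wheel road driver vehicle",
    "Jaguar (animal)": "jaguar cat predator jungle animal",
    "Jaguar Cars": "jaguar car luxury vehicle engine",
}

def tokenize(text):
    """Lowercase and split on whitespace (no stemming or stop words)."""
    return text.lower().split()

def build_inverted_index(concepts):
    """Map each word to {concept: TF-IDF weight of the word in that concept}."""
    n = len(concepts)
    tf = {name: Counter(tokenize(body)) for name, body in concepts.items()}
    # Document frequency: in how many concept articles each word occurs.
    df = Counter(word for counts in tf.values() for word in counts)
    index = defaultdict(dict)
    for name, counts in tf.items():
        for word, count in counts.items():
            index[word][name] = count * math.log(n / df[word])
    return index

def concept_vector(text, index):
    """Represent text as the sum of the concept weights of its words."""
    vec = defaultdict(float)
    for word in tokenize(text):
        for concept, weight in index.get(word, {}).items():
            vec[concept] += weight
    return dict(vec)

def top_concepts(vec, k):
    """Naive selection step: keep the k highest-weighted concepts."""
    return sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:k]

index = build_inverted_index(CONCEPTS)
query_vec = concept_vector("jaguar engine", index)
print(top_concepts(query_vec, 1)[0][0])  # prints "Jaguar Cars"
```

In this toy setting, the query "jaguar engine" shares no single keyword that disambiguates it, yet the summed concept weights rank "Jaguar Cars" above both the animal and the generic automobile concept. The cut to the top k concepts hints at why selection matters: without it, weakly related concepts remain in the representation and can dilute the match.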

Index Terms

1. Information systems
  1. Information retrieval

Published in

ACM Transactions on Information Systems Volume 29, Issue 2
April 2011
193 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/1961209

Copyright © 2011 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 April 2011
- Accepted: 1 January 2011
- Revised: 1 October 2010
- Received: 1 February 2010
Author Tags
Concept-based retrieval
explicit semantic analysis
feature selection
semantic search
Qualifiers
- research-article
- Research
- Refereed

Article Metrics
- Total Citations: 185
- Total Downloads: 2,473
- Downloads (Last 12 months): 45
- Downloads (Last 6 weeks): 8