Abstract
Information retrieval systems traditionally rely on textual keywords to index and retrieve documents. Keyword-based retrieval may return inaccurate and incomplete results when different keywords are used to describe the same concept in the documents and in the queries. Furthermore, the relationship between these related keywords may be semantic rather than syntactic, and capturing it thus requires access to comprehensive human world knowledge. Concept-based retrieval methods have attempted to tackle these difficulties by using manually built thesauri, by relying on term co-occurrence data, or by extracting latent word relationships and concepts from a corpus. In this article we introduce a new concept-based retrieval approach based on Explicit Semantic Analysis (ESA), a recently proposed method that augments keyword-based text representation with concept-based features, automatically extracted from massive human knowledge repositories such as Wikipedia. Our approach generates new text features automatically, and we have found that high-quality feature selection becomes crucial in this setting to make the retrieval more focused. However, because labeled data is lacking, traditional feature selection methods cannot be used; hence, we propose new methods that use self-generated labeled training data. The resulting system is evaluated on several TREC datasets, showing superior performance over previous state-of-the-art results.
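As a rough illustration of the concept-based representation described above, the following Python sketch maps a text to a vector of concept weights through a TF-IDF inverted index, then compares texts in that concept space. It is a minimal sketch under stated assumptions: the three-article `CONCEPTS` dictionary, the function names, and the exact weighting formula are hypothetical and chosen for brevity. It is not the system evaluated in the article, which draws its concepts from Wikipedia and adds feature selection.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-in for a knowledge repository: concept title -> article text.
CONCEPTS = {
    "Automobile": "car automobile vehicle engine road wheels driver",
    "Banking": "bank money loan account interest deposit",
    "Cooking": "recipe oven kitchen food flour bake",
}


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems do more)."""
    return text.lower().split()


def build_inverted_index(concepts):
    """Map each word to {concept: tf-idf weight} over the concept articles."""
    n = len(concepts)
    tf = {c: Counter(tokenize(t)) for c, t in concepts.items()}
    # Document frequency: number of concept articles containing each word.
    df = Counter(w for counts in tf.values() for w in counts)
    index = defaultdict(dict)
    for concept, counts in tf.items():
        for word, freq in counts.items():
            # Smoothed idf (assumed variant) keeps every weight positive.
            index[word][concept] = freq * math.log(1 + n / df[word])
    return index


def esa_vector(text, index):
    """Sum the concept weights of every word in the text."""
    vec = defaultdict(float)
    for word in tokenize(text):
        for concept, weight in index.get(word, {}).items():
            vec[concept] += weight
    return dict(vec)


def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


INDEX = build_inverted_index(CONCEPTS)
query = esa_vector("car", INDEX)
doc = esa_vector("automobile engine", INDEX)
other = esa_vector("loan interest", INDEX)

print(cosine(query, doc))    # high: both texts activate the same concept
print(cosine(query, other))  # no shared concept
```

The query "car" and the text "automobile engine" share no keyword, yet both activate the same toy concept, which is the kind of vocabulary mismatch that keyword matching alone would miss.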