ABSTRACT
Information Retrieval (IR) is an important application area of Natural Language Processing (NLP) where one encounters the genuine challenge of processing large quantities of unrestricted natural language text. While much effort has been made to apply NLP techniques to IR, very few NLP techniques have been evaluated on a document collection larger than several megabytes. Many NLP techniques are simply not efficient enough, and not robust enough, to handle a large amount of text. This paper proposes a new probabilistic model for noun phrase parsing, and reports on the application of such a parsing technique to enhance document indexing. The effectiveness of using syntactic phrases provided by the parser to supplement single words for indexing is evaluated with a 250-megabyte document collection. The experimental results show that supplementing single words with syntactic phrases for indexing consistently and significantly improves retrieval performance.
- Fast statistical parsing of noun phrases for document indexing