DOI: 10.3115/974557.974603

Fast statistical parsing of noun phrases for document indexing

Published: 31 March 1997

ABSTRACT

Information Retrieval (IR) is an important application area of Natural Language Processing (NLP) where one encounters the genuine challenge of processing large quantities of unrestricted natural language text. While much effort has been made to apply NLP techniques to IR, very few NLP techniques have been evaluated on a document collection larger than several megabytes. Many NLP techniques are simply neither efficient enough nor robust enough to handle a large amount of text. This paper proposes a new probabilistic model for noun phrase parsing and reports on the application of such a parsing technique to enhance document indexing. The effectiveness of using syntactic phrases provided by the parser to supplement single words for indexing is evaluated with a 250-megabyte document collection. The experimental results show that supplementing single words with syntactic phrases for indexing consistently and significantly improves retrieval performance.
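
As a rough illustration of what supplementing single words with phrases for indexing can look like, the following is a minimal sketch and not the paper's implementation. It assumes the noun phrases have already been produced by some parser; the function name `index_terms`, the underscore-joined phrase terms, and the sample data are all hypothetical choices made for this example.

```python
from collections import Counter

def index_terms(tokens, phrases):
    """Count index terms for one document: every single word, plus one
    extra term per phrase (phrases are assumed to come from a parser)."""
    terms = Counter(t.lower() for t in tokens)
    for phrase in phrases:
        # Join the words of a phrase into one term so it is indexed as a unit.
        terms["_".join(w.lower() for w in phrase)] += 1
    return terms

# Hypothetical document and hypothetical parser output.
doc_tokens = ["fast", "statistical", "parsing", "of", "noun", "phrases"]
doc_phrases = [("statistical", "parsing"), ("noun", "phrases")]
terms = index_terms(doc_tokens, doc_phrases)
```

In this sketch the single-word terms are kept unchanged and the phrase terms are added alongside them, so a query can match on either level.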

Published in: ANLC '97: Proceedings of the fifth conference on Applied natural language processing, March 1997, 417 pages
Publisher: Association for Computational Linguistics, United States
