Fast statistical parsing of noun phrases for document indexing

Author:
Chengxiang Zhai

Carnegie Mellon University, Pittsburgh, PA


ANLC '97: Proceedings of the fifth conference on Applied natural language processing, March 1997, Pages 312–319. https://doi.org/10.3115/974557.974603

Published: 31 March 1997

ABSTRACT

Information Retrieval (IR) is an important application area of Natural Language Processing (NLP) where one encounters the genuine challenge of processing large quantities of unrestricted natural language text. While much effort has been made to apply NLP techniques to IR, very few NLP techniques have been evaluated on a document collection larger than several megabytes. Many NLP techniques are simply not efficient enough, and not robust enough, to handle a large amount of text. This paper proposes a new probabilistic model for noun phrase parsing, and reports on the application of such a parsing technique to enhance document indexing. The effectiveness of using syntactic phrases provided by the parser to supplement single words for indexing is evaluated with a 250-megabyte document collection. The experimental results show that supplementing single words with syntactic phrases for indexing consistently and significantly improves retrieval performance.
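
To make the idea of supplementing single words with phrases concrete, here is a minimal sketch of an inverted index that stores both kinds of terms. It is an illustration only, not the paper's probabilistic parsing model: the phrases are assumed to be supplied by hand in place of parser output, and the names `index_terms` and `build_index`, as well as the underscore-joined phrase term format, are hypothetical.

```python
from collections import defaultdict


def index_terms(tokens, phrases):
    """Return the single-word terms of a document plus its phrase terms.

    `phrases` is a list of word tuples. In this sketch they are given by
    hand (an assumption); a noun phrase parser would normally produce them.
    """
    terms = [t.lower() for t in tokens]
    # Join each phrase into one term so it is indexed as a single unit.
    terms += ["_".join(w.lower() for w in p) for p in phrases]
    return terms


def build_index(docs):
    """Map each term to the set of document ids that contain it.

    `docs` maps a document id to a (tokens, phrases) pair.
    """
    index = defaultdict(set)
    for doc_id, (tokens, phrases) in docs.items():
        for term in index_terms(tokens, phrases):
            index[term].add(doc_id)
    return index


# Toy collection: both documents share the same single words, but only
# d1 contains them as a phrase.
docs = {
    "d1": (["information", "retrieval", "system"], [("information", "retrieval")]),
    "d2": (["retrieval", "of", "information"], []),
}
index = build_index(docs)
```

In this toy collection the single-word terms `information` and `retrieval` match both documents, while the phrase term `information_retrieval` matches only `d1`, which is the kind of extra discrimination phrase terms are meant to add.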

Published in
ANLC '97: Proceedings of the fifth conference on Applied natural language processing
March 1997
417 pages
Program Chair: Ralph Grishman, New York University, New York, NY
Publisher: Association for Computational Linguistics, United States