Building a large-scale annotated Chinese corpus

Authors:
Nianwen Xue, University of Pennsylvania, Philadelphia, PA
Fu-Dong Chiou, University of Pennsylvania, Philadelphia, PA
Martha Palmer, University of Pennsylvania, Philadelphia, PA
COLING '02: Proceedings of the 19th international conference on Computational linguistics - Volume 1, August 2002, Pages 1–8. https://doi.org/10.3115/1072228.1072373

Published: 24 August 2002

ABSTRACT

In this paper we address issues related to building a large-scale Chinese corpus. We try to answer four questions: (i) how to speed up annotation, (ii) how to maintain high annotation quality, (iii) for what purposes the corpus is applicable, and (iv) what future work we anticipate.

Published in

COLING '02: Proceedings of the 19th international conference on Computational linguistics - Volume 1
August 2002, 1184 pages
Publisher: Association for Computational Linguistics, United States
