ABSTRACT
In this paper we address issues related to building a large-scale Chinese corpus. We try to answer four questions: (i) how to speed up annotation, (ii) how to maintain high annotation quality, (iii) for what purposes the corpus is applicable, and finally (iv) what future work we anticipate.
- Building a large-scale annotated Chinese corpus