Preferred Document Classification for a Highly Inflectional/Derivational Language

Min, Kyongho; Wilson, William H.; Moon, Yoo-Jin

doi:10.1007/3-540-36187-1_2

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2557)

Included in the following conference series:

Australian Joint Conference on Artificial Intelligence

Abstract

This paper describes methods of document classification for highly inflectional/derivational languages that form monolithic compound noun terms, such as Dutch and Korean. The system has three phases: (1) morphological analysis with HAM, a Korean morphological analyzer (Kang, 1993); (2) compound noun phrase analysis applied to the HAM output, followed by extraction of terms whose syntactic categories are noun, name (proper noun), verb, or adjective; and (3) document classification with an algorithm based on preferred class score heuristics. The paper focuses on comparing document classification methods: a simple heuristic method, and preferred class score heuristics that employ two factors, ICF (inverted class frequency) and IDF (inverted document frequency), each with or without a term frequency weight. In addition, the paper describes a simple classification approach that needs no learning algorithm, in contrast to a vector space model with complex training and classification algorithms such as cosine similarity measurement. In the experiments, the system correctly classified 95.7% of the 720 training documents and, depending on the method, 63.8% to 71.3% of 80 randomly chosen test documents.
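The abstract does not give the scoring formulas, so the following is only a minimal sketch of what an ICF-based preferred class score could look like. It assumes that ICF is defined by analogy with IDF as log(number of classes / number of classes containing the term), that a document's score for a class is the sum of the ICF values of its terms found in that class, and that the optional term frequency weight multiplies each term's ICF by its count in the document. The function names `build_class_stats` and `preferred_class` and the toy English terms are hypothetical; in the described system the terms would come from the HAM and compound noun analysis phases.

```python
import math
from collections import Counter, defaultdict


def build_class_stats(labelled_docs):
    """Collect per-class vocabularies and ICF values by simple counting.

    labelled_docs: iterable of (class_label, list_of_terms).
    ICF here is an assumed definition: log(n_classes / class_frequency).
    """
    class_terms = defaultdict(set)
    for label, terms in labelled_docs:
        class_terms[label].update(terms)
    n_classes = len(class_terms)
    # Number of classes in which each term occurs at least once.
    class_freq = Counter(t for vocab in class_terms.values() for t in vocab)
    icf = {t: math.log(n_classes / cf) for t, cf in class_freq.items()}
    return dict(class_terms), icf


def preferred_class(terms, class_terms, icf, use_tf=True):
    """Return (best_label, scores) for one document given as a list of terms."""
    tf = Counter(terms)
    scores = {}
    for label, vocab in class_terms.items():
        # Sum ICF over the document's terms seen in this class,
        # optionally weighted by the term's frequency in the document.
        scores[label] = sum(
            (count if use_tf else 1) * icf[t]
            for t, count in tf.items()
            if t in vocab
        )
    return max(scores, key=scores.get), scores


# Toy data (hypothetical): three classes with partly shared terms.
training = [
    ("sports", ["goal", "team", "match"]),
    ("economy", ["market", "stock", "team"]),
    ("science", ["cell", "gene", "match"]),
]
class_terms, icf = build_class_stats(training)
label, scores = preferred_class(["stock", "market", "team", "stock"], class_terms, icf)
print(label)  # economy
```

Under these assumptions a term that occurs in every class gets an ICF of zero and so cannot influence the decision, which is the intuition behind weighting by class rarity; an IDF-based variant would count documents instead of classes in `build_class_stats`.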

References

Allan, J., Leuski, A., Swan, R., Byrd, D.: Evaluating combinations of ranked lists and visualizations of inter-document similarity. Information Processing and Management. 37 (2001) 435–458
Apte, C., Damerau, F., Weiss, M.: Automated Learning of Decision Rules for Text Categorization. ACM Transactions on Information Systems. 12(3) (1994) 233–251
Arppe, A.: Term Extraction from Unrestricted Text. http://www.lingsoft.fi/doc/nptool/term-extraction. (1995)
Brasethvik, T., Gulla, J.: Natural Language Analysis for Semantic Document Modeling. Data & Knowledge Engineering. 38 (2001) 45–62
Cohen, W., Singer, Y.: Context-Sensitive Learning Methods for Text Categorization. ACM Transactions on Information Systems. 17(2) (1999) 141–173
Earley, J.: An Efficient Context-Free Parsing Algorithm. CACM. 13(2) (1970) 94–102
Fuketa, M., Lee, S., Tsuji, T., Okada, M., Aoe, J.: A Document Classification Method by Using Field Association Words. Information Sciences. 126 (2000) 57–70
Han, K., Sun, B., Han, S., Rim, K.: A Study on Development of Automatic Categorization System for Internet Documents. KIPS Journal. 7(9) (2000) 2867–2875
Hirschberg, D.S.: Algorithms for the Longest Common Subsequence Problem. Journal of the ACM. 24(4) (1977) 664–675
Joachims, T.: A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Proceedings of International Conference on Machine Learning (ICML97). (1997) 143–151
Kang, S.: Korean Morphological Analysis Using Syllable Information and Multi-word Unit Information. Ph.D. thesis. Seoul National University (1993)
Kang, S.: Korean Morphological Analysis Program for Linux OS. http://nlp.kookmin.ac.kr. (2001)
Lewis, D., Jones, K.S.: Natural Language Processing for Information Retrieval. Communications of the ACM. 39(1) (1996) 92–101
Li, Y., Jain, A.: Classification of Text Documents. The Computer Journal. 41(8) (1998) 537–546
Moon, Y., Min, K.: Verifying Appropriateness of the Semantic Networks and Integration for the Selectional Restriction Relation. Proceedings of the 2000 MIS/OA International Conference. Seoul, Korea (2000) 535–539
Mostafa, J., Lam, W.: Automatic classification using supervised learning in a medical document filtering application. Information Processing and Management. 36 (2000) 415–444
Salton, G., Singhal, A., Mitra, M., Buckley, C.: Automatic Text Structuring and Summarization. Information Processing and Management. 33(2) (1997) 193–207
Yang, Y., Liu, X.: A Re-examination of Text Categorization Methods. Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval. (1999) 42–49

Author information

Authors and Affiliations

School of Information Technology, Auckland University of Technology, Private Bag 92006, 1020, Auckland, New Zealand
Kyongho Min
School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
William H. Wilson
Dept of Management Information Systems, Hankook University of Foreign Studies, Yongin, Korea
Yoo-Jin Moon

Editor information

Editors and Affiliations

Australian Defence Force Academy, University of New South Wales, ACT 2600, Canberra, Australia
Bob McKay
Computer Science Laboratory, Australian National University, RSISE Building, ACT 0200, Canberra, Australia
John Slaney

About this paper

Cite this paper

Min, K., Wilson, W.H., Moon, Y.-J. (2002). Preferred Document Classification for a Highly Inflectional/Derivational Language. In: McKay, B., Slaney, J. (eds) AI 2002: Advances in Artificial Intelligence. AI 2002. Lecture Notes in Computer Science, vol 2557. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36187-1_2

DOI: https://doi.org/10.1007/3-540-36187-1_2
Published: 08 November 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00197-3
Online ISBN: 978-3-540-36187-9
eBook Packages: Springer Book Archive
