Preferred Document Classification for a Highly Inflectional/Derivational Language

  • Conference paper
AI 2002: Advances in Artificial Intelligence (AI 2002)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2557)

Abstract

This paper describes methods of document classification for highly inflectional/derivational languages that form monolithic compound noun terms, such as Dutch and Korean. The system comprises three phases: (1) morphological analysis by a Korean morphological analyzer called HAM (Kang, 1993); (2) compound noun phrase analysis applied to the HAM output, together with extraction of terms whose syntactic categories are noun, name (proper noun), verb, and adjective; and (3) an effective document classification algorithm based on preferred class score heuristics. The paper focuses on comparing document classification methods, including a simple heuristic method and preferred class score heuristics that employ two factors, ICF (inverted class frequency) and IDF (inverted document frequency), with and without term frequency weighting. In addition, the paper describes a simple classification approach that requires no learning algorithm, in contrast to a vector space model with a complex training and classification algorithm such as cosine similarity measurement. The experimental results show 95.7% correct classification on the 720 training documents and 63.8%–71.3% on 80 randomly chosen test documents, depending on the method.
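
The abstract does not give the scoring formulas, so the following is only a minimal sketch of how a class score built on ICF might look. It assumes ICF is defined by analogy with IDF, as log(number of classes / number of classes containing the term), and that a document is assigned to the class with the highest summed score. The function names `build_class_index` and `preferred_class`, the `use_tf` flag, and the toy terms are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def build_class_index(labelled_docs):
    """Collect per-class term counts and an ICF table.

    labelled_docs: iterable of (class_label, list_of_terms).
    ICF here is an ASSUMED definition: log(n_classes / classes_containing_term).
    """
    class_terms = defaultdict(Counter)
    for label, terms in labelled_docs:
        class_terms[label].update(terms)

    n_classes = len(class_terms)
    class_freq = Counter()
    for counts in class_terms.values():
        class_freq.update(counts.keys())  # each class counts a term once

    icf = {term: math.log(n_classes / cf) for term, cf in class_freq.items()}
    return class_terms, icf


def preferred_class(terms, class_terms, icf, use_tf=True):
    """Return the class with the highest summed ICF score, or None."""
    scores = defaultdict(float)
    for term, count in Counter(terms).items():
        if term not in icf:
            continue  # term never seen in any class
        weight = count if use_tf else 1  # optional term frequency weight
        for label, counts in class_terms.items():
            if term in counts:
                scores[label] += icf[term] * weight
    if not scores:
        return None
    return max(scores, key=scores.get)


# Toy data: terms stand in for the output of morphological analysis.
docs = [
    ("sports", ["soccer", "match", "player"]),
    ("economy", ["stock", "market", "match"]),
]
class_terms, icf = build_class_index(docs)
label = preferred_class(["soccer", "match"], class_terms, icf)
```

In this sketch no training step beyond counting is needed, which is in the spirit of the "simple classification approach without a learning algorithm" that the abstract contrasts with a vector space model. A term that appears in every class gets an ICF of zero and so contributes nothing to any class score.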


References

  1. Allan, J., Leuski, A., Swan, R., Byrd, D.: Evaluating combinations of ranked lists and visualizations of inter-document similarity. Information Processing and Management. 37 (2001) 435–458


  2. Apte, C., Damerau, F., Weiss, M.: Automated Learning of Decision Rules for Text Categorization. ACM Transactions on Information Systems. 12(3) (1994) 233–251

  3. Arppe, A.: Term Extraction from Unrestricted Text. http://www.lingsoft.fi/doc/nptool/term-extraction. (1995)

  4. Brasethvik, T., Gulla, J.: Natural Language Analysis for Semantic Document Modeling. Data & Knowledge Engineering. 38 (2001) 45–62

  5. Cohen, W., Singer, Y.: Context-Sensitive Learning Methods for Text Categorization. ACM Transactions on Information Systems. 17(2) (1999) 141–173

  6. Earley, J.: An Efficient Context-Free Parsing Algorithm. CACM. 13(2) (1970) 94–102


  7. Fuketa, M., Lee, S., Tsuji, T., Okada, M., Aoe, J.: A Document Classification Method by Using Field Association Words. Information Sciences. 126 (2000) 57–70

  8. Han, K., Sun, B., Han, S., Rim, K.: A Study on Development of Automatic Categorization System for Internet Documents. KIPS Journal. 7(9) (2000) 2867–2875


  9. Hirschberg, D.S.: Algorithms for the Longest Common Subsequence Problem. Journal of the ACM. 24(4) (1977) 664–675

  10. Joachims, T.: A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Proceedings of the International Conference on Machine Learning (ICML97). (1997) 143–151

  11. Kang, S.: Korean Morphological Analysis Using Syllable Information and Multi-word Unit Information. Ph.D. thesis. Seoul National University (1993)

  12. Kang, S.: Korean Morphological Analysis Program for Linux OS, http://nlp.kookmin.ac.kr. (2001)

  13. Lewis, D., Jones, K.S.: Natural Language Processing for Information Retrieval. Communications of the ACM. 39(1) (1996) 92–101

  14. Li, Y., Jain, A.: Classification of Text Documents. The Computer Journal. 41(8) (1998) 537–546


  15. Moon, Y., Min, K.: Verifying Appropriateness of the Semantic Networks and Integration for the Selectional Restriction Relation. Proceedings of the 2000 MIS/OA International Conference. Seoul, Korea (2000) 535–539

  16. Mostafa, J., Lam, W.: Automatic classification using supervised learning in a medical document filtering application. Information Processing and Management. 36 (2000) 415–444


  17. Salton, G., Singhal, A., Mitra, M., Buckley, C.: Automatic Text Structuring and Summarization. Information Processing and Management. 33(2) (1997) 193–207

  18. Yang, Y., Liu, X.: A Re-examination of Text Categorization Methods. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval. (1999) 42–49



Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Min, K., Wilson, W.H., Moon, YJ. (2002). Preferred Document Classification for a Highly Inflectional/Derivational Language. In: McKay, B., Slaney, J. (eds) AI 2002: Advances in Artificial Intelligence. AI 2002. Lecture Notes in Computer Science, vol 2557. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36187-1_2


  • DOI: https://doi.org/10.1007/3-540-36187-1_2


  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-00197-3

  • Online ISBN: 978-3-540-36187-9

  • eBook Packages: Springer Book Archive
