Document-Base Extraction for Single-Label Text Classification

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5182)

Abstract

Many text mining applications, especially those investigating Text Classification (TC), require experiments to be performed on common text-collections so that results can be compared with alternative approaches. With regard to single-label TC, most text-collections (textual data-sources) in their original form have at least one of the following limitations: the overall volume of textual data is too large for ease of experimentation; there are many predefined classes; most of the classes contain only a few documents; some documents are labeled with a single class whereas others have multiple classes; and some documents have little or no actual text-content. In this paper, we propose a standard approach to automatically extract “qualified” document-bases from a given textual data-source that can be used more effectively and reliably in single-label TC experiments. The experimental results demonstrate that document-bases extracted using our approach can be used effectively in single-label TC experiments.
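
The limitations listed in the abstract suggest a simple filtering pipeline. The following is a minimal sketch under stated assumptions, not the authors' procedure, which the abstract does not detail. It assumes documents arrive as (text, labels) pairs, and the function name `extract_document_base` and the thresholds `min_words`, `min_docs_per_class`, and `max_classes` are hypothetical choices made for illustration.

```python
from collections import Counter


def extract_document_base(docs, min_words=5, min_docs_per_class=2, max_classes=10):
    """Hypothetical filter: docs is an iterable of (text, labels) pairs."""
    # Keep only documents with exactly one label and enough text content.
    single = [
        (text, labels[0])
        for text, labels in docs
        if len(labels) == 1 and len(text.split()) >= min_words
    ]
    # Count documents per class, keep at most max_classes of the largest
    # classes, and drop any class with too few documents.
    counts = Counter(label for _, label in single)
    kept = {
        label
        for label, n in counts.most_common(max_classes)
        if n >= min_docs_per_class
    }
    return [(text, label) for text, label in single if label in kept]


# Toy corpus (invented for illustration only).
corpus = [
    ("wheat prices rose sharply this quarter again", ["grain"]),
    ("corn and wheat exports fell in the spring", ["grain"]),
    ("the central bank raised interest rates today", ["money", "interest"]),
    ("", ["grain"]),
    ("crude oil output was cut by producers", ["crude"]),
]
base = extract_document_base(corpus)
print(len(base), sorted({label for _, label in base}))
```

On this toy corpus the multi-label document, the empty document, and the one-document class are all filtered out, leaving the two single-label "grain" documents. Real thresholds would depend on the collection and the experiment.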

Editor information

Il-Yeol Song, Johann Eder, Tho Manh Nguyen

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, Y.J., Sanderson, R., Coenen, F., Leng, P. (2008). Document-Base Extraction for Single-Label Text Classification. In: Song, IY., Eder, J., Nguyen, T.M. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2008. Lecture Notes in Computer Science, vol 5182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85836-2_34

  • DOI: https://doi.org/10.1007/978-3-540-85836-2_34

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-85835-5

  • Online ISBN: 978-3-540-85836-2

  • eBook Packages: Computer Science, Computer Science (R0)
