Document-Base Extraction for Single-Label Text Classification

Wang, Yanbo J.; Sanderson, Robert; Coenen, Frans; Leng, Paul

doi:10.1007/978-3-540-85836-2_34

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5182)

Abstract

Many text mining applications, especially when investigating Text Classification (TC), require experiments to be performed using common text-collections, such that results can be compared with alternative approaches. With regard to single-label TC, most text-collections (textual data-sources) in their original form have at least one of the following limitations: the overall volume of textual data is too large for ease of experimentation; there are many predefined classes; most of the classes consist of only a very few documents; some documents are labeled with a single class whereas others have multiple classes; and there are documents found with little or no actual text-content. In this paper, we propose a standard approach to automatically extract “qualified” document-bases from a given textual data-source that can be used more effectively and reliably in single-label TC experiments. The experimental results demonstrate that document-bases extracted based on our approach can be used effectively in single-label TC experiments.
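
The abstract does not spell out the extraction procedure, so the following is only a minimal sketch of one way the listed limitations could be filtered out, not the authors' method. It drops multi-label documents and documents with too little text, then keeps only the largest classes that still hold enough documents. The `Document` type, the function name `extract_document_base`, and the thresholds `top_k_classes`, `min_docs_per_class`, and `min_words` are all hypothetical names chosen for illustration.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical record type: an identifier, its class labels, and its text."""
    doc_id: str
    labels: tuple[str, ...]
    text: str


def extract_document_base(
    docs: list[Document],
    top_k_classes: int = 10,       # assumed cap on the number of classes kept
    min_docs_per_class: int = 50,  # assumed minimum class size
    min_words: int = 5,            # assumed minimum text length, in words
) -> list[Document]:
    """Return a single-label subset of `docs` restricted to the largest classes."""
    # Keep documents with exactly one label and enough whitespace-separated words.
    candidates = [
        d for d in docs
        if len(d.labels) == 1 and len(d.text.split()) >= min_words
    ]

    # Count the surviving documents per class.
    class_sizes = Counter(d.labels[0] for d in candidates)

    # Keep at most `top_k_classes` of the largest classes, and only those
    # that still contain at least `min_docs_per_class` documents.
    kept_classes = {
        label
        for label, size in class_sizes.most_common(top_k_classes)
        if size >= min_docs_per_class
    }

    return [d for d in candidates if d.labels[0] in kept_classes]
```

Counting class sizes only after the per-document filters is a deliberate choice in this sketch: a class that looks large in the raw collection may shrink once its multi-label and near-empty documents are removed.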

Author information

Authors and Affiliations

Department of Computer Science, The University of Liverpool, Ashton Building, Ashton Street, Liverpool, L69 3BX, UK
Yanbo J. Wang, Robert Sanderson, Frans Coenen & Paul Leng

Editor information

Il-Yeol Song, Johann Eder, Tho Manh Nguyen

About this paper

Cite this paper

Wang, Y.J., Sanderson, R., Coenen, F., Leng, P. (2008). Document-Base Extraction for Single-Label Text Classification. In: Song, IY., Eder, J., Nguyen, T.M. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2008. Lecture Notes in Computer Science, vol 5182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85836-2_34

DOI: https://doi.org/10.1007/978-3-540-85836-2_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85835-5
Online ISBN: 978-3-540-85836-2
eBook Packages: Computer Science, Computer Science (R0)
