Text Categorization Using Hyper Rectangular Keyword Extraction: Application to News Articles Classification

Hassaine, Abdelaali; Mecheter, Souad; Jaoua, Ali

doi:10.1007/978-3-319-24704-5_19


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9348)

Included in the following conference series:

International Conference on Relational and Algebraic Methods in Computer Science


Abstract

Automatic text categorization remains an important research topic. Typical applications include helping end users archive existing documents or browse an existing corpus of documents hierarchically. Text categorization usually consists of two main steps: keyword extraction and classification. In this paper, a corpus of documents is represented by a binary relation linking each document to the words it contains. From this relation, the Hyper Rectangle Algorithm extracts the most representative words in a hierarchical way. The hyper-rectangle associated with an element of the range of a binary relation is the union of all non-enlargeable rectangles containing it. The extracted keywords are fed into a random forest classifier to predict the category of each document. The method has been validated on the widely used Reuters-21578 news articles database. The results are promising and show the effectiveness of the Hyper Rectangular method in extracting relevant keywords.
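To make the definitions in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It builds the document-to-word binary relation and computes the hyper-rectangle of a word. It relies on one observation: a pair (d, w) belongs to some non-enlargeable rectangle containing a given word exactly when document d also contains that word, so the union of those rectangles is the relation restricted to the documents containing the word. The function names, the whitespace tokenizer, and the coverage-based greedy ranking in `rank_keywords` are assumptions made for illustration; the abstract does not state the weighting the paper uses to order hyper-rectangles.

```python
from typing import Dict, List, Set, Tuple

# A binary relation is a set of (document id, word) pairs.
Relation = Set[Tuple[str, str]]


def build_relation(docs: Dict[str, str]) -> Relation:
    """Link each document to every distinct word it contains.

    Assumption: lowercase whitespace tokenization, chosen only for brevity.
    """
    return {
        (doc_id, word)
        for doc_id, text in docs.items()
        for word in text.lower().split()
    }


def hyper_rectangle(relation: Relation, word: str) -> Relation:
    """Union of all non-enlargeable rectangles whose word set contains `word`.

    This equals the relation restricted to the documents containing `word`.
    """
    docs_with_word = {d for d, w in relation if w == word}
    return {(d, w) for d, w in relation if d in docs_with_word}


def rank_keywords(relation: Relation, top_k: int) -> List[str]:
    """Hypothetical greedy ranking (an assumption, not the paper's weighting).

    Repeatedly pick the word whose hyper-rectangle covers the most remaining
    pairs, then remove those pairs. Ties go to the alphabetically first word.
    """
    remaining = set(relation)
    keywords: List[str] = []
    while remaining and len(keywords) < top_k:
        candidates = sorted({w for _, w in remaining})
        best = max(candidates, key=lambda w: len(hyper_rectangle(remaining, w)))
        keywords.append(best)
        remaining -= hyper_rectangle(remaining, best)
    return keywords


if __name__ == "__main__":
    toy_docs = {
        "d1": "oil prices rise",
        "d2": "oil exports fall",
        "d3": "wheat harvest",
    }
    rel = build_relation(toy_docs)
    print(sorted(hyper_rectangle(rel, "oil")))
    print(rank_keywords(rel, top_k=2))
```

In a full pipeline, the selected keywords would serve as binary features (present or absent in each document) for a classifier such as a random forest, as the abstract describes.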



Author information

Authors and Affiliations

Computer Science and Engineering Department, College of Engineering, Qatar University, Doha, Qatar
Abdelaali Hassaine, Souad Mecheter & Ali Jaoua

Corresponding author

Correspondence to Abdelaali Hassaine.

Editor information

Editors and Affiliations

McMaster University, Hamilton, Ontario, Canada
Wolfram Kahl
Brock University, St. Catharines, Ontario, Canada
Michael Winter
Universidade do Minho, Braga, Portugal
José Oliveira

About this paper

Cite this paper

Hassaine, A., Mecheter, S., Jaoua, A. (2015). Text Categorization Using Hyper Rectangular Keyword Extraction: Application to News Articles Classification. In: Kahl, W., Winter, M., Oliveira, J. (eds) Relational and Algebraic Methods in Computer Science. RAMiCS 2015. Lecture Notes in Computer Science, vol 9348. Springer, Cham. https://doi.org/10.1007/978-3-319-24704-5_19

DOI: https://doi.org/10.1007/978-3-319-24704-5_19
Published: 08 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24703-8
Online ISBN: 978-3-319-24704-5
eBook Packages: Computer Science, Computer Science (R0)
