Abstract
Automatic text categorization remains an active research topic. Typical applications include helping end users archive existing documents or browse an existing corpus of documents hierarchically. Text categorization usually consists of two main steps: keyword extraction and classification. In this paper, a corpus of documents is represented by a binary relation linking each document to the words it contains. From this relation, the Hyper Rectangle Algorithm extracts the most representative words in a hierarchical way. The hyper-rectangle associated with an element of the range of a binary relation is the union of all non-enlargeable rectangles containing it. The extracted keywords are fed into a random forest classifier to predict the category of each document. The method has been validated on the popular Reuters-21578 news articles database. The results are promising and indicate that the hyper-rectangular method is effective at extracting relevant keywords.
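The hyper-rectangle definition above can be illustrated with a minimal sketch. It relies on one observation that follows from the definition: every non-enlargeable rectangle containing a word w has its documents among those that contain w, and every pair (d, w') of the relation with d containing w lies in at least one such rectangle, so the union reduces to the relation restricted to the documents containing w. The toy corpus, the function names, and the selection criterion (number of still-uncovered pairs, applied greedily in a flat rather than hierarchical manner) are assumptions made for illustration only; the abstract does not specify the weighting used by the Hyper Rectangle Algorithm.

```python
def hyper_rectangle(relation, word):
    """Union of all non-enlargeable rectangles containing `word`.

    Computed as the restriction of `relation` (a set of (document, word)
    pairs) to the documents that contain `word`.
    """
    docs = {d for d, w in relation if w == word}
    return {(d, w) for d, w in relation if d in docs}


def greedy_keywords(relation, k):
    """Pick up to k words, each covering the most remaining pairs.

    Coverage is a hypothetical stand-in for the paper's weight, which the
    abstract does not give. Ties are broken alphabetically.
    """
    remaining = set(relation)
    keywords = []
    while remaining and len(keywords) < k:
        candidates = sorted({w for _, w in remaining})
        best = max(candidates, key=lambda w: len(hyper_rectangle(remaining, w)))
        keywords.append(best)
        remaining -= hyper_rectangle(remaining, best)
    return keywords


# Hypothetical toy corpus: document id -> set of words it contains.
corpus = {
    "d1": {"oil", "price", "barrel", "crude"},
    "d2": {"oil", "opec"},
    "d3": {"wheat", "grain", "harvest"},
    "d4": {"wheat", "export"},
}

# Binary relation linking each document to the words it contains.
relation = {(d, w) for d, ws in corpus.items() for w in ws}

keywords = greedy_keywords(relation, 2)

# Binary keyword features per document, in sorted document order.
features = [[int(k in corpus[d]) for k in keywords] for d in sorted(corpus)]
```

Feature vectors of this kind are what a classifier such as a random forest would take as input; the classification step itself is left out of the sketch.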
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Hassaine, A., Mecheter, S., Jaoua, A. (2015). Text Categorization Using Hyper Rectangular Keyword Extraction: Application to News Articles Classification. In: Kahl, W., Winter, M., Oliveira, J. (eds) Relational and Algebraic Methods in Computer Science. RAMICS 2015. Lecture Notes in Computer Science, vol 9348. Springer, Cham. https://doi.org/10.1007/978-3-319-24704-5_19
DOI: https://doi.org/10.1007/978-3-319-24704-5_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24703-8
Online ISBN: 978-3-319-24704-5
eBook Packages: Computer Science, Computer Science (R0)