A new neutrosophic TF-IDF term weighting for text mining tasks: text classification use case

Mariem Bounabi (Computer Science, Signals, Automation and Cognitivism Laboratory (LISAC), Faculty of Sciences Dhar El Mahraz-Fes, Universite Sidi Mohamed Ben Abdellah, Fes, Morocco)
Karim Elmoutaouakil (Engineering Sciences Laboratory, Multidisciplinary Faculty of Taza, Sidi Mohamed Ben Abdallah University, Taza, Morocco)
Khalid Satori (Computer Science, Signals, Automation and Cognitivism Laboratory (LISAC), Faculty of Sciences Dhar El Mahraz-Fes, Universite Sidi Mohamed Ben Abdellah, Fes, Morocco)

International Journal of Web Information Systems

ISSN: 1744-0084

Article publication date: 8 April 2021

Issue publication date: 27 July 2021

Abstract

Purpose

This paper aims to present a new term weighting approach for text classification as a text mining task. The proposed method, neutrosophic term frequency-inverse document frequency (NTF-IDF), is an extended version of the popular fuzzy TF-IDF (FTF-IDF) and uses neutrosophic reasoning to analyze and generate weights for terms in natural languages. The paper also proposes a comparative study of FTF-IDF and NTF-IDF and their impact on different machine learning (ML) classifiers for document categorization.

Design/methodology/approach

After preprocessing the textual data, NTF-IDF applies a neutrosophic inference system (NIS) to produce weights for the terms representing a document. Using the local frequency (TF), the global frequency (IDF) and the text length N as NIS inputs, this study generates two neutrosophic weights for a given term. The first measures the term's degree of relevance, and the second its degree of ambiguity. Next, the Zhang combination function merges the two neutrosophic outputs into the final term weight, which is inserted into the document's representative vector. To analyze the impact of NTF-IDF on the classification phase, this study uses a set of ML algorithms.
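As a rough illustration of the inputs described above, the following Python sketch computes the three crisp values (TF, IDF and text length N) for every term of every document. It is not the authors' implementation: the function name `nis_inputs`, the whitespace tokenization, the length-normalized TF and the natural-log IDF are all assumptions made for the example. The NIS rule base and the Zhang combination function are not specified in this abstract and are therefore not reproduced.

```python
import math
from collections import Counter


def nis_inputs(docs):
    """Return, for each document, a dict mapping term -> (tf, idf, n).

    Hypothetical helper: tf is the relative frequency of the term in the
    document, idf is log(number of documents / document frequency), and
    n is the number of tokens in the document. These exact formulas are
    assumptions, not taken from the paper.
    """
    # Assumed preprocessing: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    result = []
    for tokens in tokenized:
        n = len(tokens)
        counts = Counter(tokens)
        result.append({
            term: (count / n, math.log(n_docs / df[term]), n)
            for term, count in counts.items()
        })
    return result


docs = ["the cat sat", "the dog barked loudly"]
inputs = nis_inputs(docs)
tf, idf, n = inputs[1]["dog"]
print(f"tf={tf:.2f} idf={idf:.3f} n={n}")
```

In a full pipeline, each `(tf, idf, n)` triple would be passed to the inference system to obtain the relevance and ambiguity degrees, which would then be combined into the single weight stored in the document vector.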

Findings

By exploiting the characteristics of neutrosophic logic (NL), the authors were able to study the ambiguity of terms and their degree of relevance for representing a document. The choice of NL proved effective in defining significant text vectorization weights, especially for text classification tasks. The experiments demonstrate that the new method positively impacts categorization. Moreover, the adopted system's recognition rate is higher than 91%, an accuracy score not attained using FTF-IDF. Also, on benchmark data sets from different text mining fields and with several ML classifiers, e.g. SVM and feed-forward networks, applying the proposed NTF-IDF term scores improves accuracy by 10%.

Originality/value

The novelty of this paper lies in two aspects. First, it introduces a new term weighting method that uses term frequencies as components to define the relevance and the ambiguity of a term. Second, it applies NL to infer the weights, an original model that also aims to correct the shortcomings of FTF-IDF, which inherits the drawbacks of fuzzy logic. The introduced technique was combined with different ML models to improve the accuracy and relevance of the feature vectors fed to the classification mechanism.

Citation

Bounabi, M., Elmoutaouakil, K. and Satori, K. (2021), "A new neutrosophic TF-IDF term weighting for text mining tasks: text classification use case", International Journal of Web Information Systems, Vol. 17 No. 3, pp. 229-249. https://doi.org/10.1108/IJWIS-11-2020-0067

Publisher: Emerald Publishing Limited

Copyright © 2021, Emerald Publishing Limited
