High Relevance Keyword Extraction facility for Bayesian text classification on different domains of varying characteristics

https://doi.org/10.1016/j.eswa.2011.07.116

Abstract

The High Relevance Keyword Extraction (HRKE) facility is introduced to Bayesian text classification to perform feature/keyword extraction during the classifying stage, without requiring extensive pre-classification processing. To perform keyword extraction, the HRKE facility uses the posterior probability values of keywords within the categories associated with a text document. The experimental results show that the HRKE facility ensures promising classification performance for a Bayesian classifier dealing with different text classification domains of varying characteristics. The method yields an effective and efficient Bayesian text classifier that handles different domains of varying characteristics with high accuracy, while maintaining the simplicity and low processing cost of the conventional Bayesian classification approach.

Introduction

Text document classification denotes the task of assigning raw text documents to one or more pre-defined categories. It is a direct application of machine learning: a set of labelled categories is declared as a way to represent the documents, and a statistical classifier is trained on a labelled training set. Classification is the process by which objects are recognized, differentiated, and understood, and implies that objects are grouped into categories, usually for specific purposes. Ideally, a category represents a relationship between the subject and the object of knowledge. Classification is fundamental to prediction, inference, and decision making. There are, however, a variety of ways to approach the classification task. An increasing number of supervised classification approaches have been developed for various types of classification tasks, such as rule induction (Apte et al., 1994, Provost, 1999), k-nearest neighbour classification (Han, Karypis, & Kumar, 1999), maximum entropy (Nigam, Lafferty, & McCallum, 1999), artificial neural networks (Diligenti et al., 2003a, Diligenti et al., 2003b), support vector machines (Isa et al., 2008a, Isa et al., 2008b, Joachims, 1998, Lin, 1999), and Bayesian classification (Domingos and Pazzani, 1997, Eyheramendy et al., 2003, Kim et al., 2002, McCallum and Nigam, 2003, O’Brien and Vogel, 2003, Provost, 1999, Rish, 2001). Besides the supervised approaches, unsupervised clustering approaches such as the self-organizing map (Adami et al., 2005, Hartley et al., 2006, Isa et al., 2009, Wang, 2001) have also been widely implemented to segment data into groups for further analysis and processing. Among these approaches, Bayesian classification has been widely implemented in many real-world applications due to its relatively simple training and classifying algorithms.

One of the outstanding features of Bayesian classification compared to other classification approaches is its simplicity in handling raw text data directly, without requiring any pre-process to transform the text into a suitable representation format (typically numerical), as required by most of the successful and highly accurate text classification approaches, such as the k-nearest neighbour (k-NN) and support vector machine (SVM) classifiers. As a trade-off for this simplicity, Bayesian classification has been reported as one of the poorest-performing classification approaches by many research groups through extensive experiments and evaluations (Brücher et al., 2002, Yang and Liu, 1999). To enhance the performance of the Bayesian classifier, researchers have proposed several pre-processes such as stop word elimination, word stemming, and feature selection methods (Al-Mubaid and Umair, 2006, Apte et al., 1994, Chen et al., 2009, Dhillon et al., 2002, Eyheramendy et al., 2003, Han et al., 1999, Isa et al., 2008a, Isa et al., 2008b, Joachims, 1998, Joachims, 1999, Kim et al., 2002, McCallum and Nigam, 2003, Ozgur et al., 2005, Yang and Pedersen, 1997, Yang and Liu, 1999). However, implementing these pre-processes sacrifices the simplicity and low cost of the Bayesian training and classifying algorithms, because the pre-processing stages consume considerable time, memory, and CPU resources, and also require extensive human expert interaction. The goal of this paper is to enhance the classification effectiveness of Bayesian text classification while maintaining the simplicity and low cost of its training and classification processes.
In this paper, we introduce the High Relevance Keyword Extraction (HRKE) facility which uses a simple algorithm to perform stop word elimination and feature selection during the classifying stage, without the need for any pre-process and additional human experts’ involvement.

Conventional Bayesian classification takes the entire body of a text document into account for training and classifying purposes. Because text documents contain irrelevant words, the accuracy of Bayesian classification is severely degraded by the presence of noisy or irrelevant features. Extensive research has addressed this problem by introducing pre-processes such as stop word elimination and feature selection, applied to the training and testing sets in order to eliminate irrelevant and low-informative features, that is, in the context of text classification, the low relevance keywords.

Stop word elimination is the procedure whereby common words that appear in many documents, such as “a”, “an”, “the”, “to”, “for”, and “be”, are eliminated from the documents in the dataset (Baeza-Yates & Ribeiro-Neto, 1999). To perform stop word elimination, each individual word from the text documents is matched against a list of stop words. Words that match any entry in the stop word list are discarded from both the training and classifying processes. A potential drawback of stop word elimination is that certain words considered stop words for one dataset (domain) can be highly informative features for another dataset (domain) (Takamura, 2003).
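As a minimal sketch of this procedure, stop word elimination reduces to a membership test per token. The stop word list below is a small illustrative set, not the one used by the authors:

```python
# Sketch of stop word elimination as a pre-process.
# STOP_WORDS is a small illustrative list, not the authors' actual list.
STOP_WORDS = {"a", "an", "the", "to", "for", "be", "is", "of", "and", "in"}

def remove_stop_words(document):
    """Tokenize on whitespace and drop any word found in the stop list."""
    return [w for w in document.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The classifier is trained for a specific domain"))
# ['classifier', 'trained', 'specific', 'domain']
```

A set lookup keeps the test O(1) per word, so the cost of this step is linear in document length.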

Besides the simple stop word elimination technique, several statistical methods for feature selection have been introduced as pre-processes for Bayesian text classification. These methods provide a measure of the usefulness of each individual word in the classification task. Some of the common statistical feature selection methods are document frequency thresholding, information gain, mutual information, the χ² statistic, and term strength. These feature selection methods are discussed and compared in Yang and Pedersen (1997). In addition, some feature selection methods have been designed specifically for Bayesian classification, such as the Multi-class Odds Ratio (MOR) and the Class Discrimination Measure (CDM), which have been experimentally shown to outperform other feature selection approaches (Chen et al., 2009).
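As an illustration of one such statistic, the χ² score for a term/category pair can be computed from a 2×2 contingency table of document counts. This is a generic textbook sketch, not the exact formulation evaluated in the cited studies:

```python
# Sketch of chi-square (χ²) feature scoring for one term/category pair,
# computed from a 2x2 contingency table of document counts.
def chi_square(n11, n10, n01, n00):
    """n11: docs in the category containing the term; n10: docs outside the
    category containing the term; n01: docs in the category lacking the term;
    n00: docs outside the category lacking the term."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den if den else 0.0

# A term occurring only inside the category scores high; a term spread
# evenly across categories scores zero (independence).
print(chi_square(10, 0, 0, 10))  # 20.0
print(chi_square(5, 5, 5, 5))    # 0.0
```

Terms are then ranked by score and only the top-scoring ones are retained as features.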

Notably, most feature selection methods are carried out as pre-processes prior to the classification process. These pre-processes consume additional time, memory, and CPU resources, degrading the efficiency of the classification task. The goal of the work presented in this paper is to introduce a keyword extraction facility that performs feature selection during the classifying stage of the classifier, without needing any extensive and costly pre-process. This keyword extraction facility implements a Bayesian probabilistic algorithm to determine the “importance” of keywords based on their posterior probability values with respect to each of the available categories, during the classifying stage performed by the Bayesian classifier. Only the “important” keywords are taken into consideration as features for the classification, and performing feature selection during the classification process in this way contributes to an effective and efficient classifier.

Section snippets

Bayesian classification approach

The conventional Bayesian classification approach performs its classification tasks starting with the initial step of analyzing a text document by extracting the words it contains to generate a list of words (Isa, Lee, & Kallimani, 2008). The list of words is constructed with the assumption that the input document consists of words w1, w2, w3, …, wn−1, wn, where the length of the document (in number of words) is n.
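A minimal sketch of this classifying step is given below, with illustrative training documents, Laplace smoothing, and uniform category priors omitted for brevity; the helper names and counts are assumptions for illustration, not the paper's implementation:

```python
import math
from collections import Counter

# Sketch of the conventional Bayesian (naive Bayes) classifying step:
# the document is reduced to its word list w1..wn, and each category's
# log-likelihood is accumulated from per-word probabilities
# (Laplace-smoothed). Training data below is illustrative only.

def train(docs_by_category):
    model = {}
    for cat, docs in docs_by_category.items():
        counts = Counter(w for d in docs for w in d.split())
        model[cat] = (counts, sum(counts.values()))
    return model

def classify(model, document, vocab_size):
    words = document.split()  # the word list w1, ..., wn
    best, best_score = None, -math.inf
    for cat, (counts, total) in model.items():
        score = sum(math.log((counts[w] + 1) / (total + vocab_size))
                    for w in words)
        if score > best_score:
            best, best_score = cat, score
    return best

model = train({"sport": ["match goal team", "team win match"],
               "tech": ["cpu memory code", "code compile cpu"]})
print(classify(model, "team match", vocab_size=8))  # "sport"
```

Note that every word in the document contributes to the score, which is exactly why irrelevant words degrade accuracy, as discussed above.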

Based on the list of words, the trained Bayesian classifier

High Relevance Keywords Extraction facility

Conventional pattern recognition systems consist of a data acquisition module, a feature selection and transformation mechanism, and a machine learning classification scheme, either supervised or unsupervised. The ordinary Bayesian classification approach takes the entire text body of a document into account to identify the right category of the document (Sahami, Dumais, Heckerman, & Horvitz, 1998). Improving on this classification approach, we have proposed, for our Bayesian text

Experiments and evaluations

The proposed High Relevance Keywords Extraction (HRKE) facility generally improves the classification accuracy of the ordinary Bayesian classifier, which takes the whole body of the text document into account in the classification process. However, whether a Bayesian text classifier equipped with the HRKE facility achieves optimal performance depends on the degree of relevance of the extracted keywords to the classification task, which is controlled by setting a threshold in the HRKE facility.
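The threshold idea can be sketched as follows: during classification, only words whose highest per-category posterior P(category | word) clears the threshold are passed to the Bayesian scoring step. The counts, helper names, and threshold value below are illustrative assumptions, not the paper's exact algorithm:

```python
# Hedged sketch of the HRKE idea: keep only words whose highest
# per-category posterior P(category | word) exceeds a threshold, so
# low-relevance words never reach the Bayesian scoring step.
# Counts, names, and threshold below are illustrative assumptions.

def word_posteriors(word, counts_by_cat):
    """P(category | word) estimated from per-category word counts."""
    totals = {c: counts.get(word, 0) for c, counts in counts_by_cat.items()}
    s = sum(totals.values())
    return {c: (t / s if s else 0.0) for c, t in totals.items()}

def extract_keywords(words, counts_by_cat, threshold=0.7):
    kept = []
    for w in words:
        post = word_posteriors(w, counts_by_cat)
        if max(post.values()) >= threshold:
            kept.append(w)  # "important": strongly tied to one category
    return kept

counts = {"sport": {"goal": 9, "team": 8, "the": 50},
          "tech":  {"code": 9, "cpu": 7,  "the": 50}}
print(extract_keywords(["the", "goal", "cpu"], counts, threshold=0.7))
# ['goal', 'cpu']
```

Words spread evenly across categories (like "the") fall below the threshold and are discarded automatically, which is consistent with the paper's claim that a separate stop word list is not needed. Raising the threshold extracts fewer, more discriminative keywords.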

In order to

Conclusion

A Bayesian text classification enhancing technique, the HRKE facility, is presented and described here. The enhancement, an improvement in classification accuracy, is achieved by applying a unique feature selection method based on the occurrence of keywords in documents from a specified category, comparing that occurrence across each of the competing categories. Besides the improvement in classification accuracy, the implementation of

References (33)

• G. Adami et al. Clustering documents into web directory for bootstrapping a supervised classification. Data and Knowledge Engineering (2005)
• D. Isa et al. Using the self-organizing map for clustering of text documents. Expert Systems With Applications (ESWA) (2009)
• H. Al-Mubaid et al. A new text categorization technique using distributional clustering and learning logic. IEEE Transactions on Knowledge and Data Engineering (2006)
• C. Apte et al. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems (TOIS) (1994)
• R. Baeza-Yates et al. Modern information retrieval (1999)
• Brücher, H., Knolmayer, G., & Mittermayer, M. A. (2002). Document classification methods for organizing explicit...
• J.N. Chen et al. Feature selection for text classification with Naïve Bayes. Expert Systems With Applications (ESWA) (2009)
• Dhillon, I. S., Mallela, S., & Kumar, R. (2002). Enhanced word clustering for hierarchical text classification. In...
• Diligenti, M., Maggini, M., & Rigutini, L. (2003a). Automatic text categorization using neural network. In Proceedings...
• Diligenti, M., Maggini, M., & Rigutini, L. (2003b). Learning similarities for text documents using neural networks. In...
• P. Domingos et al. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning (1997)
• Eyheramendy, S., Genkin, A., Ju, W. H., Lewis, D., & Madigan, D. (2003). Sparse Bayesian classifiers for text...
• Han, E. H., Karypis, G., & Kumar, V. (1999). Text categorization using weighted adjusted k-nearest neighbor...
• Hartley, M., Isa, D., Kallimani, V. P., & Lee, L. H. (2006). A domain knowledge preserving in process engineering using...
• D. Isa et al. Polychotomiser for case-based reasoning beyond the traditional Bayesian classification approach. Journal of Computer and Information Science, Canadian Center of Science and Education (CCSE) (2008)
• D. Isa et al. Text document pre-processing with the Bayes formula for classification using the support vector machine. IEEE Transactions on Knowledge and Data Engineering (TKDE) (2008)