Feature selection based on a normalized difference measure for text classification

https://doi.org/10.1016/j.ipm.2016.12.004

Highlights

  • We analyzed the balanced accuracy (ACC2) feature ranking metric and identified its drawbacks.

  • We proposed to normalize balanced accuracy by the minimum of the tpr and fpr values.

  • We compared the results of the proposed feature ranking metric with seven well-known feature ranking metrics on seven datasets.

  • The newly proposed metric outperforms the others in more than 60% of our experimental trials.

Abstract

The goal of feature selection in text classification is to choose highly distinguishing features for improving the performance of a classifier. The well-known text classification feature selection metric named balanced accuracy measure (ACC2) (Forman, 2003) evaluates a term by taking the difference of its document frequency in the positive class (also known as true positives) and its document frequency in the negative class (also known as false positives). This, however, results in assigning equal ranks to terms having equal differences, ignoring their relative document frequencies in the classes. In this paper we propose a new feature ranking (FR) metric, called the normalized difference measure (NDM), which takes into account the relative document frequencies. The performance of NDM is investigated against seven well-known feature ranking metrics, namely odds ratio (OR), chi squared (CHI), information gain (IG), distinguishing feature selector (DFS), gini index (GINI), balanced accuracy measure (ACC2) and Poisson ratio (POIS), on seven datasets, namely WebACE (WAP, K1a, K1b), Reuters (RE0, RE1), a spam email dataset and 20 Newsgroups, using the multinomial naive Bayes (MNB) and support vector machine (SVM) classifiers. Our results show that the NDM metric outperforms the seven metrics in 66% of cases in terms of the macro-F1 measure and in 51% of cases in terms of the micro-F1 measure in our experimental trials on these datasets.

Introduction

We are living in an era of fast-paced information technology, where large amounts of data are being generated every minute in audio, visual or text form. In one minute, Twitter users post 300,000 tweets, the Google search engine receives more than 4 million queries, and email users send 240,000,000 messages (Data Never Sleeps 2.0, 2014). A significant amount of the data available over the Internet is in text form (The Internet, 2009). It is a big challenge to search for information in such a large amount of data in a timely manner. Arranging documents into different categories reduces the search space for a user query (Chen, Schuffels, & Orwig, 1996).

Text classification (TC), or text categorization, is the task of assigning one or more categories, from a set of known categories, to the documents in a collection (Sebastiani, 2002). The collection of documents under consideration is called a corpus. Text classification has found applications in a number of domains, such as text mining and information retrieval (Aggarwal & Zhai, 2012). Separating spam emails from legitimate emails, placing documents in relevant folders, attaching comments to customer complaints and finding user interests based on their comments in social media are some examples (Marin, Holenstein, Sarikaya, & Ostendorf, 2014).

Text classification is a three-stage process: feature extraction or preprocessing, feature selection and classification (Marin et al., 2014). Feature extraction generates features, also known as terms, from the documents in a corpus; feature selection selects discriminating features; and classification takes documents containing the selected features as input and assigns them labels from a set of known classes. Text data contains a few very frequently occurring terms and a large number of rarely occurring terms (Grimmer & Stewart, 2013). Words like "is", "the", "was" etc., which serve grammatical structure and do not convey any meaning, are called stop words (Joshi, Pareek, Patel, & Chauhan, 2012), and are removed using a list of stop words. Removal of too frequent and infrequent terms is a necessary preprocessing step before feature selection (Srividhya & Anitha, 2011). Topic-specific frequent terms and rarely occurring terms are removed using a process called pruning (Aggarwal & Zhai, 2012), which removes terms occurring above an upper threshold or below a lower threshold.
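
To make the preprocessing concrete, the following minimal Python sketch removes stop words and prunes terms by document frequency. It is our illustration only: the stop-word list and the thresholds are placeholders, not values used in the paper.

    from collections import Counter

    STOP_WORDS = {"is", "the", "was"}  # illustrative stop-word list

    def preprocess(docs, lower=3, upper=0.5):
        # docs: list of token lists. Count the document frequency (df)
        # of each term, i.e. the number of documents it occurs in.
        df = Counter()
        for doc in docs:
            df.update(set(doc))
        n = len(docs)
        # Pruning: keep terms that are not stop words and whose df lies
        # between an absolute lower threshold and an upper fraction of
        # the corpus size.
        vocab = {t for t, f in df.items()
                 if t not in STOP_WORDS and lower <= f <= upper * n}
        return [[t for t in doc if t in vocab] for doc in docs]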

The most commonly used representation for text documents is the "Bag of Words" (BoW) representation, which is borrowed from information retrieval (IR) (Lan, Tan, Su, & Low, 2007). BoW completely ignores the order of words in a document and considers only word occurrences (Wallach, 2006), called term count (tc) or term frequency (tf). A document is represented in the form of a vector D = {tw_1, tw_2, tw_3, ..., tw_v} (Aggarwal & Zhai, 2012; Lan, Tan, Su, & Lu, 2009), where tw_i is the weight of the i-th term in a vocabulary containing v terms.
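
As a concrete illustration of the BoW representation, the short Python sketch below (function names are ours) maps each tokenized document to a vector of raw term frequencies over the corpus vocabulary.

    from collections import Counter

    def bow_vectors(docs):
        # Fix a term order for the corpus vocabulary.
        vocabulary = sorted({t for doc in docs for t in doc})
        # Each document D becomes (tw_1, ..., tw_v); here the weight
        # tw_i is simply the term frequency (tf) of term i in D.
        vectors = []
        for doc in docs:
            tf = Counter(doc)
            vectors.append([tf[t] for t in vocabulary])
        return vocabulary, vectors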

Text classification is inherently high dimensional: a moderately sized dataset can contain tens of thousands of unique words (Joachims, 2002; Wang, Zhang, Liu, Lv, & Wang, 2014), and its vector representation is highly sparse, with most entries being zero (Su, Shirab, & Matwin, 2011). High dimensional data degrades classification performance in terms of both training time and accuracy (Wang, Zhang, Liu, Liu, & Wang, 2016; Wu & Zhang, 2004). Classifiers should therefore be provided with only the features relevant to the classification task, to reduce execution time and boost accuracy. The task of choosing only relevant features for a classification task is called feature selection.

The goal of feature selection is to provide data free from irrelevant and redundant features to the classifier. Many feature selection algorithms select features by using a feature ranking metric as a primary or auxiliary mechanism (Guyon & Elisseeff, 2003). Feature ranking algorithms determine the strength of a feature to discriminate instances into different classes (Van Hulse, Khoshgoftaar, & Napolitano, 2011), and choose top ranked features.

Features are ranked according to their values in the positive and negative classes: the farther apart the values of a feature in the two classes, the higher its rank. The feature values for text documents are term frequencies, i.e., the number of occurrences of a term in a document. Feature ranking metrics, however, use document frequency to determine a term's rank. The document frequency of a term in the positive class is its number of true positives (tp), while its document frequency in the negative class is its number of false positives (fp).

Accuracy (ACC) (Forman, 2003), an intuitively simple feature ranking metric, considers only the difference between the true positives and false positives of a term, and thus favors strong positive features. A variant of it, termed balanced accuracy (ACC2) (Forman, 2003), ranks features by the absolute difference of the true positive rate (tpr) and false positive rate (fpr), where tpr = tp / (tp + fn) and fpr = fp / (fp + tn) (Dasgupta, Drineas, Harb, Josifovski, & Mahoney, 2007).
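
The following Python sketch (our illustration, not the authors' code) computes ACC2 scores from document frequencies: tp and fp are the per-class document frequencies of a term, and tpr and fpr follow because tp + fn and fp + tn equal the positive and negative class sizes, respectively.

    from collections import Counter

    def acc2_scores(docs, labels):
        # docs: list of token lists; labels: 1 = positive, 0 = negative.
        pos = sum(labels)
        neg = len(labels) - pos
        tp, fp = Counter(), Counter()
        for doc, y in zip(docs, labels):
            for term in set(doc):
                (tp if y == 1 else fp)[term] += 1
        scores = {}
        for term in set(tp) | set(fp):
            tpr = tp[term] / pos           # tp / (tp + fn)
            fpr = fp[term] / neg           # fp / (fp + tn)
            scores[term] = abs(tpr - fpr)  # ACC2
        return scores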

We observe that considering only the difference between tp and fp can be misleading for text data. Two terms having the same difference between tp and fp are treated equally by ACC2. We argue that a term whose tp or fp is close to zero, along with a high |tp - fp| value, is relatively more important; we illustrate this behavior with an example in Section 3.1. In this paper we introduce a new feature ranking measure, the normalized difference measure (NDM), which elevates the rank of a term having either its tpr or fpr value closer to zero among the terms having equal |tpr - fpr| values. We compare NDM with seven well-known feature ranking metrics, namely information gain (IG), odds ratio (OR), chi squared (CHI), Poisson ratio (POIS), gini index (GINI), distinguishing feature selector (DFS) and ACC2, on seven datasets, using the naive Bayes (Stigler, 1983) and SVM (Cortes & Vapnik, 1995) classifiers.
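
A minimal sketch of the normalization idea described above: the |tpr - fpr| difference is divided by min(tpr, fpr), so that among terms with equal difference, the one with a rate closer to zero scores higher. Clamping a zero denominator to a small constant eps is our illustrative guard; the paper specifies the exact rule.

    def ndm_score(tpr, fpr, eps=1e-4):
        # Among terms with equal |tpr - fpr|, a smaller min(tpr, fpr)
        # yields a higher NDM score.
        denom = min(tpr, fpr)
        if denom == 0:
            denom = eps  # illustrative guard against division by zero
        return abs(tpr - fpr) / denom

For example, the pairs (tpr, fpr) = (0.5, 0.1) and (0.45, 0.05) both receive ACC2 = 0.4, while this score ranks them 4.0 and 8.0 respectively, elevating the term whose fpr is closer to zero.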

The remainder of this paper is organized in four sections. Section 2 covers the related work. Section 3 explains the working of the newly proposed feature ranking metric. Experimental setup and results are shown in Section 4. Conclusions are drawn in Section 5.

Section snippets

Related work

In this section we discuss some existing feature selection methods used for ranking terms in text data. Feature selection methods are divided into three classes: filters, wrappers and embedded methods (Lal, Chapelle, Weston, & Elisseeff, 2006b). Filter methods select features independently of any classification algorithm (Dash & Liu, 1997). Wrappers select features with the support of a learning algorithm or classifier (Kohavi & John, 1997). Embedded methods perform feature selection as part of the training of a classification algorithm.

Normalized Difference Measure (NDM): proposed feature ranking measure

In this section, we propose normalized difference measure (NDM) as a new feature ranking metric for text classification. NDM introduces minimum document frequency as a regularizer to the balanced accuracy measure (ACC2) (Forman, 2003).

Let us now look at the working of ACC2. Fig. 1 shows the contour lines for the balanced accuracy (ACC2) (Forman, 2003) measure. The contour lines run parallel to the diagonal. Terms located in the top left and bottom right corners are the most discriminative; these terms have large |tpr - fpr| values.

Experimental setup

In this section we explain the experimental setup and report the results. After providing a brief introduction of the datasets used, we explain the pre-processing performed on them. Eight feature selection metrics (CHI, NDM, DFS, IG, OR, GINI, ACC2 and POIS) are used for feature selection. We evaluate the quality of the terms selected by the FS algorithms using the naive Bayes and SVM classifiers, and report and compare macro- and micro-averaged F1 values for the eight feature selection algorithms.
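
For readers who want to reproduce a comparable pipeline, the sketch below uses scikit-learn (our choice, not the authors' exact setup): it selects the top-k terms by a ranking metric's scores and reports macro- and micro-averaged F1 for multinomial naive Bayes and a linear SVM.

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.metrics import f1_score

    def evaluate_top_k(X_train, y_train, X_test, y_test, scores, k=500):
        # scores: one ranking score per vocabulary term (e.g. NDM, ACC2).
        top_k = np.argsort(scores)[::-1][:k]   # indices of top-ranked terms
        Xtr, Xte = X_train[:, top_k], X_test[:, top_k]
        for clf in (MultinomialNB(), LinearSVC()):
            pred = clf.fit(Xtr, y_train).predict(Xte)
            print(type(clf).__name__,
                  "macro-F1: %.3f" % f1_score(y_test, pred, average="macro"),
                  "micro-F1: %.3f" % f1_score(y_test, pred, average="micro"))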

Conclusions

Most of the FR metrics for text classification use document frequency to determine the goodness of a term. Balanced accuracy (ACC2) is a simple measure used for feature selection on text data. ACC2 uses |tpr - fpr| as the criterion to determine the discrimination power of a term, assigning equal ranks to terms having equal |tpr - fpr| values. This leads to an incorrect assessment of two terms having the same |tpr - fpr| but different tpr and fpr values. In this paper we introduced a new feature ranking metric, the normalized difference measure (NDM).

References (42)

  • Cardoso-Cachopo, A. (2007). Improving Methods for Single-label Text Categorization. PhD Thesis, Instituto Superior...
  • C.-C. Chang et al. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (2011).
  • H. Chen et al. Internet categorization and search: A self-organizing approach. Journal of Visual Communication and Image Representation, Special Issue on Digital Libraries (1996).
  • C. Cortes et al. Support-vector networks. Machine Learning (1995).
  • A. Dasgupta et al. Feature selection methods for text classification. Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2007).
  • Data Never Sleeps 2.0 (2014). http://www.domo.com/blog/2014/04/data-never-sleeps-2-0. Accessed: January 02,...
  • G. Forman. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research (2003).
  • G. Forman. A pitfall and solution in multi-class feature selection for text classification. Proceedings of the Twenty-First International Conference on Machine Learning (2004).
  • J. Grimmer et al. Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis (2013).
  • I. Guyon et al. An introduction to variable and feature selection. The Journal of Machine Learning Research (2003).
  • M. Hall et al. The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter (2009).