Feature selection based on a normalized difference measure for text classification
Introduction
We are living in an era of fast-paced information technology, where large amounts of data are generated every minute in audio, visual or text form. In one minute, Twitter users post 300,000 tweets, the Google search engine receives more than 4 million queries, and email users send 240,000,000 messages (Data Never Sleeps 2.0, 2014). A significant amount of the data available over the Internet is in text form (The Internet, 2009). Searching for information in such a large amount of data in a timely manner is a big challenge. Arranging documents into different categories reduces the search space for a user query (Chen, Schuffels, & Orwig, 1996).
Text classification (TC), or text categorization, is the task of assigning one or more categories from a set of known categories to the documents in a collection (Sebastiani, 2002). The collection of documents under consideration is called a corpus. Text classification has found applications in a number of domains, such as text mining and information retrieval (Aggarwal & Zhai, 2012). Separating spam emails from legitimate emails, placing documents in relevant folders, attaching comments to customer complaints and finding user interests based on their comments in social media are some examples (Marin, Holenstein, Sarikaya, & Ostendorf, 2014).
Text classification is a three-stage process: feature extraction or preprocessing, feature selection, and classification (Marin et al., 2014). Feature extraction generates features, also known as terms, from the documents in a corpus. Feature selection selects the discriminating features. Classification takes documents containing the selected features as input and assigns them labels from a set of known classes. Text data contains a few very frequently occurring terms and a large number of rarely occurring terms (Grimmer & Stewart, 2013). Words like “is”, “the”, “was” etc., which are used for grammatical structure and do not convey any meaning, are called stop words (Joshi, Pareek, Patel, & Chauhan, 2012). Stop words are removed using a list of stop words. Removal of too frequent and too infrequent terms is necessary as a preprocessing step to feature selection (Srividhya & Anitha, 2011). Topic-specific frequent terms and rarely occurring terms are removed using a process called pruning (Aggarwal & Zhai, 2012). Pruning removes terms that occur more often than an upper threshold or less often than a lower threshold.
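The stop word removal and pruning steps described above can be sketched as follows. This is only an illustration: the stop word list, the whitespace tokenizer, and the thresholds `min_df` and `max_df_ratio` are assumptions made for the example, not values taken from the paper.

```python
from collections import Counter

# Hypothetical, tiny stop word list; real lists are much longer.
STOP_WORDS = {"is", "the", "was", "a", "of"}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def prune_vocabulary(docs, min_df=2, max_df_ratio=0.9):
    """Return the terms that survive stop word removal and pruning.

    min_df and max_df_ratio are assumed thresholds on document frequency.
    """
    df = Counter()
    for doc in docs:
        # Count each term at most once per document (document frequency).
        df.update(set(tokenize(doc)) - STOP_WORDS)
    max_df = max_df_ratio * len(docs)
    return {term for term, n in df.items() if min_df <= n <= max_df}


docs = [
    "the match was a draw",
    "the match is postponed",
    "the election was a landslide",
    "match report of the election",
]
vocab = prune_vocabulary(docs, min_df=2, max_df_ratio=0.9)
strict_vocab = prune_vocabulary(docs, min_df=2, max_df_ratio=0.7)
```

With the looser upper threshold, only the terms that appear in a single document are dropped. With the stricter one, the most frequent term is pruned as well.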
The most commonly used representation for text documents is the “Bag of Words” (BoW) representation, which is borrowed from information retrieval (IR) (Lan, Tan, Su, & Low, 2007). BoW completely ignores the order of words in a document and considers only word occurrences (Wallach, 2006), called term count (tc) or term frequency (tf). A document is represented in the form of a vector $D = (tw_1, tw_2, \ldots, tw_v)$, where $tw_i$ is the weight of the $i$th term in a vocabulary containing $v$ terms.
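As a minimal sketch of the BoW representation, assuming raw term frequency as the term weight and a hypothetical three-term vocabulary, two documents with the same words in a different order map to the same vector:

```python
from collections import Counter


def bag_of_words(doc, vocabulary):
    """Map a document to a term-frequency vector over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    # Counter returns 0 for terms that do not occur in the document.
    return [counts[term] for term in vocabulary]


vocabulary = ["ball", "goal", "vote"]  # v = 3 terms, chosen for illustration
d1 = bag_of_words("goal goal ball", vocabulary)
d2 = bag_of_words("ball goal goal", vocabulary)
```

Both `d1` and `d2` hold one count for "ball", two for "goal" and zero for "vote", which illustrates that word order is discarded.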
Text classification is inherently high dimensional: a moderately sized dataset can contain tens of thousands of unique words (Joachims, 2002; Wang, Zhang, Liu, Lv, & Wang, 2014). The training time and classification accuracy of a classifier are greatly affected by high dimensional data (Wang, Zhang, Liu, Liu, & Wang, 2016). Representation in vector form makes text data highly sparse, with most of the entries equal to zero (Su, Shirab, & Matwin, 2011). High dimensional data degrades classification performance in terms of running time and accuracy (Wu & Zhang, 2004). Classifiers should therefore be provided with only the features relevant to a classification task, to reduce execution time and boost accuracy. The task of choosing only relevant features for a classification task is called feature selection.
The goal of feature selection is to provide data free from irrelevant and redundant features to the classifier. Many feature selection algorithms select features by using a feature ranking metric as a primary or auxiliary mechanism (Guyon & Elisseeff, 2003). Feature ranking algorithms determine the strength of a feature to discriminate instances into different classes (Van Hulse, Khoshgoftaar, & Napolitano, 2011), and choose top ranked features.
Features are ranked according to their values in the positive and negative classes. The further apart the values of a feature are in the positive and negative classes, the higher its rank. Feature values for text documents are term frequencies, that is, the number of occurrences of a term in a document. Feature ranking metrics use document frequency to determine the rank of a term. The document frequency of a term in the positive class is the number of true positives (tp), while its document frequency in the negative class is the number of false positives (fp).
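The tp and fp counts of a term can be obtained by counting the documents of each class that contain it. The following sketch assumes binary labels (1 for the positive class, 0 for the negative class) and whitespace tokenization; the function name and the toy corpus are hypothetical.

```python
def document_frequencies(docs, labels, term):
    """Return (tp, fp): documents containing `term` in the positive and negative class."""
    tp = fp = 0
    for doc, label in zip(docs, labels):
        # A document counts once, no matter how often the term occurs in it.
        if term in set(doc.lower().split()):
            if label == 1:
                tp += 1
            else:
                fp += 1
    return tp, fp


docs = ["goal goal scored", "late goal", "vote counted", "goal of the campaign"]
labels = [1, 1, 0, 0]  # 1 = positive class, 0 = negative class
goal_counts = document_frequencies(docs, labels, "goal")
vote_counts = document_frequencies(docs, labels, "vote")
```

Note that the first document contributes only one to tp for "goal", even though the term occurs twice in it.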
Accuracy (ACC) (Forman, 2003), an intuitively simple feature ranking metric, considers only the difference between the true positives and false positives of a term. ACC favors strong positive features. A variant of it, termed balanced accuracy (ACC2) (Forman, 2003), ranks features by the absolute difference of the true positive rate (tpr) and the false positive rate (fpr), where $tpr = tp / (tp + fn)$ and $fpr = fp / (fp + tn)$ (Dasgupta, Drineas, Harb, Josifovski, & Mahoney, 2007).
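A small sketch may help contrast the two metrics. It assumes the standard definitions of the rates, with `pos = tp + fn` and `neg = fp + tn` denoting the number of positive and negative documents; the counts below are made up for illustration.

```python
def acc(tp, fp):
    """Accuracy: plain difference between true positives and false positives."""
    return tp - fp


def acc2(tp, fp, pos, neg):
    """Balanced accuracy: absolute difference between tpr and fpr.

    pos and neg are the class sizes (tp + fn and fp + tn respectively).
    """
    tpr = tp / pos
    fpr = fp / neg
    return abs(tpr - fpr)


pos, neg = 100, 400  # hypothetical class sizes

# Term A occurs mostly in the positive class, term B mostly in the negative class.
acc_a, acc2_a = acc(80, 40), acc2(80, 40, pos, neg)
acc_b, acc2_b = acc(10, 300), acc2(10, 300, pos, neg)
```

Under ACC, term B receives a large negative score and falls to the bottom of the ranking. Under ACC2, both terms score highly, because the absolute value treats strong negative features like strong positive ones.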
We observe that considering only the difference between tp and fp can be misleading for text data. Two terms having the same difference between tpr and fpr are treated equally by ACC2. We argue that a term whose tpr or fpr is close to zero, along with a high $|tpr - fpr|$ value, is relatively more important. We illustrate this behavior through an example in Section 3.1. In this paper we introduce a new feature ranking measure, namely the Normalized Difference Measure (NDM), which elevates the rank of a term having either its tpr or fpr value closer to zero among the terms having equal $|tpr - fpr|$ values. We compare NDM with seven well-known feature ranking metrics, namely information gain (IG), odds ratio (OR), chi squared (CHI), Poisson ratio (POIS), Gini index (GINI), distinguishing feature selector (DFS) and ACC2, on seven datasets, using naive Bayes (Stigler, 1983) and SVM (Cortes & Vapnik, 1995) classifiers.
The remainder of this paper is organized in four sections. Section 2 covers the related work. Section 3 explains the working of the newly proposed feature ranking metric. Experimental setup and results are shown in Section 4. Conclusions are drawn in Section 5.
Section snippets
Related work
In this section we discuss some existing feature selection methods used for ranking terms in text data. Feature selection methods are divided into three classes: filters, wrappers and embedded methods (Lal, Chapelle, Weston, & Elisseeff, 2006b). Filter methods select features independent of any classification algorithm (Dash & Liu, 1997). Wrappers select features with the support of a learning algorithm or classifier (Kohavi & John, 1997). Embedded methods work as part of a classification
Normalized Difference Measure (NDM): proposed feature ranking measure
In this section, we propose the normalized difference measure (NDM) as a new feature ranking metric for text classification. NDM introduces the minimum document frequency as a regularizer to the balanced accuracy measure (ACC2) (Forman, 2003).
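This excerpt does not state the NDM formula itself. The sketch below therefore rests on an assumption: that the regularization divides $|tpr - fpr|$ by the smaller of the two rates, with a small hypothetical constant `eps` substituted when that rate is zero. It is meant only to illustrate the intended effect, not to reproduce the authors' exact definition.

```python
def acc2_from_rates(tpr, fpr):
    """Balanced accuracy computed directly from the two rates."""
    return abs(tpr - fpr)


def ndm(tpr, fpr, eps=1e-6):
    """Assumed form of NDM: |tpr - fpr| divided by min(tpr, fpr).

    eps is a hypothetical floor that avoids division by zero.
    """
    return abs(tpr - fpr) / max(min(tpr, fpr), eps)


# Two terms with (nearly) the same |tpr - fpr| of about 0.4.
term_a = (0.5, 0.1)  # fpr close to zero
term_b = (0.9, 0.5)  # neither rate close to zero

acc2_a, acc2_b = acc2_from_rates(*term_a), acc2_from_rates(*term_b)
ndm_a, ndm_b = ndm(*term_a), ndm(*term_b)
```

ACC2 cannot separate the two terms, whereas the normalized score is roughly five times larger for the term whose fpr is close to zero.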
Let us now look at the working of ACC2. Fig. 1 shows contour lines for the balanced accuracy (ACC2) measure (Forman, 2003). The contour lines run parallel to the diagonal. Terms located in the top left and bottom right corners are the most discriminative. These terms have
Experimental setup
In this section we explain the experimental setup and report the results. After providing a brief introduction of the datasets used, we explain pre-processing performed on the datasets. Eight feature selection metrics CHI, NDM, DFS, IG, OR, GINI, ACC2 and POIS are used for feature selection. We evaluate the quality of the terms selected by FS algorithms using naive Bayes and SVM classifiers and report and compare macro and micro averaged F1 values for the eight feature selection algorithms.
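For readers unfamiliar with the two averaging schemes, the following sketch computes macro and micro averaged F1 from per-class counts of true positives, false positives and false negatives. The counts are invented for illustration and are not results from the paper.

```python
def f1(tp, fp, fn):
    """F1 score from raw counts; defined as 0.0 when there are no counts at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(counts):
    """counts: list of (tp, fp, fn) tuples, one per class."""
    # Macro: average the per-class F1 scores, so every class weighs the same.
    macro = sum(f1(*c) for c in counts) / len(counts)
    # Micro: pool the counts over all classes, so frequent classes dominate.
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    micro = f1(tp, fp, fn)
    return macro, micro


# Hypothetical counts: one large, well-classified class and one small, poorly classified class.
counts = [(90, 10, 10), (5, 5, 15)]
macro, micro = macro_micro_f1(counts)
```

In this example the macro average is pulled down by the small class, while the micro average stays close to the score of the large class, which is why both values are usually reported together.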
Conclusions
Most of the feature ranking (FR) metrics for text classification use document frequency to determine the goodness of a term. Balanced accuracy (ACC2) is a simple measure used for feature selection of text data. ACC2 uses $|tpr - fpr|$ as the criterion to determine the discrimination power of a term, assigning equal ranks to terms having equal $|tpr - fpr|$ values. This leads to an incorrect assessment of two terms having the same $|tpr - fpr|$ but different tpr and fpr values. In this paper we introduced a new feature ranking metric,
References (42)
- et al. Feature selection for classification. Intelligent Data Analysis (1997)
- et al. A case-based technique for tracking concept drift in spam filtering. Knowledge-Based Systems (2005)
- et al. Wrappers for feature subset selection. Artificial Intelligence (1997)
- et al. A novel probabilistic feature selection method for text classification. Knowledge-Based Systems (2012)
- et al. Unsupervised feature selection through Gram-Schmidt orthogonalization: a word co-occurrence perspective. Neurocomputing (2016)
- et al. t-test feature selection approach based on term frequency for text categorization. Pattern Recognition Letters (2014)
- et al. A survey of text classification algorithms. Mining Text Data (2012)
- et al. The odds ratio. BMJ (2000)
- et al. Feature selection using linear support vector machines. Proceedings of the 3rd International Conference on Data Mining Methods and Databases for Engineering (2002)
- et al. Interaction of feature selection methods and linear classification models. Workshop on Text Learning held at ICML (2002)