Knowledge-Based Systems

Volume 64, July 2014, Pages 22–31

Binary PSO with mutation operator for feature selection using decision tree applied to spam detection

https://doi.org/10.1016/j.knosys.2014.03.015

Abstract

In this paper, we proposed a novel spam detection method focused on reducing the false positive error of mislabeling nonspam as spam. First, we used a wrapper-based feature selection method to extract crucial features. Second, the decision tree was chosen as the classifier model, with C4.5 as the training algorithm. Third, a cost matrix was introduced to give different weights to the two error types, i.e., the false positive and the false negative errors; we defined a weight parameter α to adjust their relative importance. Fourth, K-fold cross validation was employed to reduce out-of-sample error. Finally, binary PSO with a mutation operator (MBPSO) was used as the subset search strategy. Our experimental dataset contains 6000 emails collected during 2012. We conducted a Kolmogorov–Smirnov hypothesis test on the capital-run-length related features and found that all the p values were less than 0.001. We then found α = 7 to be the most appropriate value for our model. Among seven meta-heuristic algorithms, we demonstrated that MBPSO is superior to GA, RSA, PSO, and BPSO in terms of classification performance. The sensitivity, specificity, and accuracy of the decision tree with feature selection by MBPSO were 91.02%, 97.51%, and 94.27%, respectively. We also compared MBPSO with conventional feature selection methods such as SFS and SBS; the results showed that MBPSO performs better than both. We further demonstrated that wrappers are more effective than filters with regard to classification performance indexes. The results clearly show that the proposed method is effective and can reduce the false positive error without compromising the sensitivity and accuracy values.

Introduction

It is a common occurrence for a user to receive hundreds of emails daily. Nearly 92% of these emails are spam [1]. They include advertisements for a variety of products and services, such as pharmaceuticals, electronics, software, jewelry, stocks, gambling, loans, pornography, phishing, and malware attempts [2]. Spam not only consumes users' time by forcing them to identify the undesired messages, but also wastes mailbox space and network bandwidth. Therefore, spam detection is becoming an increasingly pressing challenge for individuals and organizations.

Researchers have proposed various feature extraction techniques. The TF–IDF (Term Frequency and Inverse Document Frequency) method extracts features by splitting each message into tokens based on spaces, tabs, and symbols [3]. A simpler model considers only individual keywords [4]. Other more complex models include tag-based features [5] and behavior-based features [6]. In this paper, we found that spam is likely to contain more capital letters, as in the following examples: “85% DISCOUNT ONLY for YOU” and “The LAST DAY to”. Therefore, it is natural to consider statistical measures of capital letters.
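Since the capital-run-length statistics play a central role later in the paper, the following is a minimal sketch of how such features could be computed; the function name and return keys are illustrative rather than taken from the paper, and the definitions follow the Spambase conventions (a "capital run" is a maximal sequence of consecutive uppercase letters).

```python
import re

def capital_run_features(text: str) -> dict:
    """Capital-run-length statistics of an email body (Spambase-style).

    A capital run is a maximal sequence of consecutive uppercase letters.
    """
    runs = [len(m.group()) for m in re.finditer(r"[A-Z]+", text)]
    if not runs:
        return {"capital_run_length_average": 0.0,
                "capital_run_length_longest": 0,
                "capital_run_length_total": 0}
    return {"capital_run_length_average": sum(runs) / len(runs),
            "capital_run_length_longest": max(runs),
            "capital_run_length_total": sum(runs)}

print(capital_run_features("85% DISCOUNT ONLY for YOU"))
# {'capital_run_length_average': 5.0, 'capital_run_length_longest': 8,
#  'capital_run_length_total': 15}
```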

Classification has then been performed with machine learning approaches, including the artificial immune system [7], the support vector machine (SVM) [8], artificial neural networks (ANN) [9], and case-based techniques [10]. However, these methods neither achieve high classification accuracy nor provide physical interpretability. The naive Bayes classifier [11] assumes that the features contribute independently, which is an oversimplification for this problem [12].

In this paper, we proposed a hybrid system that combines a feature selection method with a decision tree. The key advantage of our method is that it achieves high classification accuracy while also giving physical meaning to the users. Another important advantage is that it discriminates between the two types of errors. The first error is predicting spam as a normal message, so that the message passes the server-side filter and reaches the inbox. This is not very problematic, because users can simply delete such messages manually. On the other hand, the error of predicting a normal message as spam can be very harmful: such messages are automatically transferred to the spam box without the user being informed, so a very important message may be mistakenly lost there. These two errors should therefore be weighted differently.
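To make the asymmetric treatment of the two errors concrete, here is a minimal sketch of a cost-sensitive error measure in which a false positive (nonspam labeled as spam) is penalized α times more heavily than a false negative. The exact fitness function optimized in the paper is not reproduced here, so the function name and normalization below are assumptions.

```python
import numpy as np

def weighted_error(y_true, y_pred, alpha=7.0):
    """Cost-sensitive error with labels 1 = spam, 0 = nonspam.

    A false positive (nonspam labeled spam) costs `alpha` times as much
    as a false negative (spam labeled nonspam). The normalization is
    illustrative, not the paper's exact fitness.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fp = np.sum((y_true == 0) & (y_pred == 1))  # harmful: real mail lost to spam box
    fn = np.sum((y_true == 1) & (y_pred == 0))  # mild: spam reaches the inbox
    return (alpha * fp + fn) / len(y_true)
```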

The rest of the paper is organized as follows. Section 2 describes the methodology: we first present the complete original feature set, then introduce the wrapper-based feature selection, the classifier model, and the search strategy. Section 3 contains the experiments on 6000 emails collected during 2012: we demonstrate the effectiveness of the capital-run-length related features, and compare the proposed method with other meta-heuristic algorithms, other feature selection algorithms, and other spam detection algorithms in terms of classification performance and computation time. In addition, we show how to choose the optimal weight parameter in our model, and demonstrate the superiority of wrappers over filters. Section 4 discusses the results and analyzes their underlying reasons. Finally, Section 5 concludes the paper.


Complete feature set

In this section, we discuss how to establish the original feature set from the emails. We collected 6000 emails using the same standard as the UCI machine learning repository (http://archive.ics.uci.edu/ml/datasets/Spambase), except that we changed “1999” to “2012”: their dataset was obtained in 1999, whereas ours consists of emails received and sent in 2012. The features fall into three types (Table 1). The first is the frequency of 48 common words. The second is the
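As an illustration of the first feature type, below is a hedged sketch of Spambase-style word-frequency features, i.e., the percentage of words in an email that match a given keyword. Only a few of the 48 words are shown, and the helper name is hypothetical.

```python
import re

# Illustrative subset of the 48 common words (full list per Spambase).
COMMON_WORDS = ["free", "money", "credit", "order", "your"]

def word_freq_features(text: str) -> dict:
    """word_freq_W = 100 * count(W) / total words, as in Spambase."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    total = max(len(words), 1)  # guard against empty messages
    return {f"word_freq_{w}": 100.0 * words.count(w) / total
            for w in COMMON_WORDS}
```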

Experiments and results

The computer programs were developed in-house and ran on an HP laptop with an Intel Core i3 3.2 GHz processor and 2 GB RAM. Matlab 2013a and Sipina 3.3 served as the software platform. The dataset contained 6000 emails collected during 2012, of which 3000 were manually labeled “spam” and the other 3000 “nonspam”.

Discussions and analysis

The results in Table 6 showed that all 6 p values were less than the significance level (set at 0.001). This demonstrated that the average capital-run-length, the longest capital-run-length, and the total capital-run-length of nonspam were distinctly less than those of their counterparts (spam). The results validated the conclusion drawn from the spam database in the UCI machine learning repository collected in 1999. It also demonstrated that the common words selected in 1999 were still moderately
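For reference, a two-sample Kolmogorov–Smirnov test of this kind can be run with scipy.stats.ks_2samp; the data below are random placeholders standing in for the per-email capital-run-length values of the spam and nonspam groups.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Placeholder feature vectors: one capital-run-length value per email.
spam_runs = rng.lognormal(mean=5.0, sigma=1.0, size=3000)
nonspam_runs = rng.lognormal(mean=3.5, sigma=1.0, size=3000)

stat, p = ks_2samp(spam_runs, nonspam_runs)
print(f"KS statistic = {stat:.3f}, p = {p:.3g}")
# Reject H0 (identical distributions) when p < 0.001, the paper's threshold.
```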

Conclusions and further study

The main contributions and technical innovations of the paper fall within the following four points: (1) We performed a Kolmogorov–Smirnov hypothesis test on the capital-run-length related features and found all p values less than 0.001. (2) We used a wrapper-based feature selection method that achieves high classification accuracy while selecting important features. (3) We used the C4.5 decision tree as the classifier of the wrapper, and binary PSO with mutation operator (MBPSO) as the search strategy of the
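For readers who want to reproduce the search strategy, here is a compact sketch of binary PSO with a bit-flip mutation operator. The sigmoid transfer function, the mutation scheme, and all parameter values follow common BPSO conventions and are assumptions rather than the paper's exact settings.

```python
import numpy as np

def mbpso(fitness, n_bits, n_particles=30, n_iter=100,
          w=0.9, c1=2.0, c2=2.0, p_mut=0.01, seed=0):
    """Binary PSO with a bit-flip mutation operator (MBPSO sketch).

    `fitness` maps a 0/1 feature mask to a value to MINIMIZE.
    Transfer function, mutation scheme, and parameters are common
    BPSO conventions, not the paper's exact settings.
    """
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, (n_particles, n_bits))      # positions: bit masks
    v = rng.uniform(-4.0, 4.0, (n_particles, n_bits))  # velocities
    pbest = x.copy()
    pbest_f = np.array([fitness(p) for p in x])
    g = pbest[np.argmin(pbest_f)].copy()               # global best mask
    g_f = pbest_f.min()
    for _ in range(n_iter):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = (rng.random(x.shape) < 1.0 / (1.0 + np.exp(-v))).astype(int)  # sigmoid transfer
        flips = rng.random(x.shape) < p_mut            # mutation: random bit flips
        x[flips] ^= 1
        f = np.array([fitness(p) for p in x])
        improved = f < pbest_f
        pbest[improved] = x[improved]
        pbest_f[improved] = f[improved]
        if pbest_f.min() < g_f:
            g_f = pbest_f.min()
            g = pbest[np.argmin(pbest_f)].copy()
    return g, g_f
```

In the wrapper setting, `fitness` would evaluate the cost-sensitive K-fold cross-validation error of the C4.5 classifier restricted to the features selected by the bit mask.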

Conflict of interest

We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Acknowledgments

We would like to express our gratitude to Nanjing Normal University Research Foundation for Talented Scholars (No. 2013119XGQ0061) and National Natural Science Foundation of China (No. 610011024).

References (48)

  • D. Puig et al., Automatic texture feature selection for image pixel classification, Pattern Recogn. (2006)
  • R. Sikora et al., Framework for efficient feature selection in genetic algorithm based data mining, Eur. J. Oper. Res. (2007)
  • Y. Zhang et al., A rule-based model for bankruptcy prediction based on an improved genetic ant colony algorithm, Math. Probl. Eng. (2013)
  • M. Ture et al., Using Kaplan–Meier analysis together with decision tree methods (C&RT, CHAID, QUEST, C4.5 and ID3) in determining recurrence-free survival of breast cancer patients, Expert Syst. Appl. (2009)
  • K. Polat et al., A novel hybrid intelligent method based on C4.5 decision tree classifier and one-against-all approach for multi-class classification problems, Expert Syst. Appl. (2009)
  • M. Bramer, Using J-pruning to reduce overfitting in classification trees, Knowl.-Based Syst. (2002)
  • K. Polat et al., Classification of epileptiform EEG using a hybrid system based on decision tree classifier and fast Fourier transform, Appl. Math. Comput. (2007)
  • K.-M. Osei-Bryson, Post-pruning in decision tree induction using multiple performance measures, Comput. Oper. Res. (2007)
  • C.-F. Tsai, Feature selection in bankruptcy prediction, Knowl.-Based Syst. (2009)
  • H.R. Kanan et al., GA-based optimal selection of PZMI features for face recognition, Appl. Math. Comput. (2008)
  • S.-W. Lin, Parameter determination of support vector machine and feature selection using simulated annealing approach, Appl. Soft Comput. (2008)
  • Y. Zhang et al., UCAV path planning by fitness-scaling adaptive chaotic particle swarm optimization, Math. Probl. Eng. (2013)
  • L. Lamberti, An efficient simulated annealing algorithm for design optimization of truss structures, Comput. Struct. (2008)
  • Y. Zhang, An MR brain images classifier system via particle swarm optimization and kernel support vector machine, Sci. World J. (2013)