Binary PSO with mutation operator for feature selection using decision tree applied to spam detection
Introduction
It is common for a user to receive hundreds of emails daily. Nearly 92% of these emails are spam [1]. They include advertisements for a variety of products and services, such as pharmaceuticals, electronics, software, jewelry, stocks, gambling, loans, and pornography, as well as phishing and malware attempts [2]. Spam not only consumes users’ time by forcing them to identify the undesired messages, but also wastes mailbox space and network bandwidth. Therefore, spam detection is becoming a bigger challenge for individuals and organizations.
Researchers have proposed different feature extraction techniques. The TF–IDF (Term Frequency and Inverse Document Frequency) method extracts features by splitting each message into tokens based on spaces, tabs, and symbols [3]. A simpler model considers only individual keywords [4]. More complex models include tag-based features [5] and behavior-based features [6]. In this paper, we found that spam is likely to contain more capital letters, as in the following examples: “85% DISCOUNT ONLY for YOU” and “The LAST DAY to”. Therefore, it is natural to consider statistical measures of capital letters.
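As a minimal sketch of such statistical measures, the Python snippet below computes the average, longest, and total capital-run-length of a message, where a run is an uninterrupted sequence of capital letters. The function name `capital_run_features` and the restriction to ASCII capitals are assumptions made for illustration; the definitions follow the attribute descriptions of the UCI Spambase dataset rather than any code from this paper.

```python
import re

def capital_run_features(text):
    """Return (average, longest, total) length of capital-letter runs.

    A run is an uninterrupted sequence of ASCII capital letters
    (an assumption; the paper's exact tokenization is not shown).
    """
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        # No capital letters at all: every measure is zero.
        return 0.0, 0, 0
    total = sum(runs)
    return total / len(runs), max(runs), total

# The runs here are DISCOUNT (8), ONLY (4), and YOU (3).
print(capital_run_features("85% DISCOUNT ONLY for YOU"))  # (5.0, 8, 15)
```

An ordinary sentence typically yields runs of length 1 (sentence-initial capitals), which is why these measures tend to separate promotional text from regular correspondence.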
Afterwards, classification is performed using machine learning approaches, including artificial immune systems [7], support vector machines (SVM) [8], artificial neural networks (ANN) [9], and case-based techniques [10]. However, these methods neither achieve high classification accuracy nor give physical meanings. The naive Bayes classifier [11] assumes that the features contribute independently, which is an oversimplification for this study [12].
In this paper, we propose a hybrid system that combines a feature selection method with a decision tree. The key advantage of our method is that it achieves high classification accuracy and also gives physical meanings to the users. Another important advantage is that it discriminates between the two types of errors. The first error is predicting a spam message as normal, so that it passes the filter and reaches the inbox. This is not very problematic, because users can simply delete such messages manually. The second error, predicting a normal message as spam, can be very harmful: such messages are automatically transferred to the spam box, and the user is not informed about the transfer. It is therefore possible for a very important message to be mistakenly moved to the spam box. These two errors should be weighted differently.
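One way to treat the two errors differently is a weighted cost, sketched below in Python. The convex combination, the function name `weighted_error`, the default weight `w=0.9`, and the label encoding (1 for spam, 0 for nonspam) are all assumptions for illustration; the exact cost function used in the paper does not appear in this excerpt.

```python
def weighted_error(y_true, y_pred, w=0.9):
    """Weighted misclassification cost in [0, 1].

    Labels: 1 = spam, 0 = nonspam (assumed encoding).
    w weights the harmful error (nonspam flagged as spam);
    1 - w weights the milder error (spam passed as nonspam).
    """
    n_spam = sum(1 for t in y_true if t == 1)
    n_nonspam = len(y_true) - n_spam
    # False positive: a normal message sent to the spam box.
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # False negative: a spam message delivered to the inbox.
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp_rate = fp / n_nonspam if n_nonspam else 0.0
    fn_rate = fn / n_spam if n_spam else 0.0
    return w * fp_rate + (1 - w) * fn_rate

labels = [1, 1, 0, 0]
missed_spam = weighted_error(labels, [1, 0, 0, 0])  # one false negative
lost_mail = weighted_error(labels, [1, 1, 1, 0])    # one false positive
print(missed_spam < lost_mail)  # True
```

With `w` above 0.5, a single lost normal message costs more than a single missed spam, which matches the asymmetry described above.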
We start the paper by describing the methodology in Section 2. We first present the complete original feature set, then introduce the wrapper-based feature selection, bring in the classifier model, and discuss the search strategy. Section 3 contains the experiments on 6000 emails collected during 2012. We demonstrate the effectiveness of capital-run-length related features, and compare the proposed method with other meta-heuristic algorithms, other feature selection algorithms, and other spam detection algorithms in terms of classification performance and computation time. In addition, we show how to choose the optimal weight parameter in our model, and demonstrate the superiority of wrappers over filters. Section 4 discusses the results and analyzes their underlying reasons. Finally, Section 5 concludes the paper.
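To make the search strategy concrete, the following Python sketch shows a generic binary PSO with a bit-flip mutation step searching over feature masks. It is a textbook-style illustration under stated assumptions, not the authors' implementation: the parameter values, the per-bit mutation probability `p_mut`, and the toy fitness function are all hypothetical. In a wrapper, the fitness would instead be the classifier's performance on the selected features.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def mbpso(fitness, n_features, n_particles=20, n_iter=50,
          w=0.7, c1=1.5, c2=1.5, v_max=4.0, p_mut=0.02, seed=0):
    """Maximize fitness(mask) over binary masks of length n_features."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(n_features):
                # Standard velocity update, clamped to [-v_max, v_max].
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-v_max, min(v_max, v))
                # Binary PSO: the velocity sets the probability of a 1 bit.
                pos[i][d] = 1 if rng.random() < sigmoid(vel[i][d]) else 0
                # Mutation (assumed form): flip the bit with small probability.
                if rng.random() < p_mut:
                    pos[i][d] = 1 - pos[i][d]
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit

def toy_fitness(mask):
    # Hypothetical objective: the first 3 features help, the rest add cost.
    return sum(mask[:3]) - 0.5 * sum(mask[3:])

best_mask, best_fit = mbpso(toy_fitness, n_features=10)
print(best_mask, best_fit)
```

The mutation step is intended to keep some diversity in the swarm once velocities saturate; how strongly it should act is a tuning choice that this sketch does not attempt to settle.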
Complete feature set
In this section, we discuss how to establish the original feature set from emails. We collected 6000 emails following the same standard as the UCI machine learning repository (http://archive.ics.uci.edu/ml/datasets/Spambase), except that we changed “1999” to “2012”, because their dataset was obtained in 1999 whereas ours consists of emails received and sent in 2012. The features are of three different types (Table 1). The first is the frequency of 48 common words. The second is the
Experiments and results
The computer programs were developed in-house and ran on an HP laptop with an Intel Core i3 3.2 GHz processor and 2 GB of RAM. Matlab 2013a and Sipina 3.3 served as the software platform. The data set contained 6000 emails collected during 2012, of which 3000 were manually labeled “spam” and the other 3000 “nonspam”.
Discussions and analysis
The results in Table 6 showed that 6 p values were less than the significance level (set to 0.001). This demonstrated that the average capital-run-length, the longest capital-run-length, and the total capital-run-length of nonspam were distinctly less than those of spam. The results validated the conclusion drawn from the spam database in the UCI machine learning repository, collected in 1999. They also demonstrated that the common words selected in 1999 were still moderately
Conclusions and further study
The main contributions and technical innovations of the paper fall within the following four points: (1) We performed a Kolmogorov–Smirnov hypothesis test on capital-run-length related features and obtained p values less than 0.001. (2) We used a wrapper-based feature selection method that achieves high classification accuracy while selecting important features. (3) We used the C4.5 decision tree as the classifier of the wrapper, and binary PSO with mutation operator (MBPSO) as the search strategy of the
Conflict of interest
We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
Acknowledgments
We would like to express our gratitude to Nanjing Normal University Research Foundation for Talented Scholars (No. 2013119XGQ0061) and National Natural Science Foundation of China (No. 610011024).
References (48)
- et al. Spam detection using random boost. Pattern Recogn. Lett. (2012)
- Technologies for spam detection. Netw. Secur. (2009)
- Ranking of field association terms using Co-word analysis. Inf. Process. Manage. (2008)
- Searching strategies for the Hungarian language. Inf. Process. Manage. (2008)
- et al. An anti-spam scheme using pre-challenges. Comput. Commun. (2006)
- Behavior-based spam detection using a hybrid method of rule-based techniques and neural networks. Expert Syst. Appl. (2009)
- Identification of SPAM messages using an approach inspired on the immune system. Biosystems (2008)
- et al. A discrete mixture-based kernel for SVMs: application to spam and image categorization. Inf. Process. Manage. (2009)
- et al. Adaptive anti-spam filtering for agglutinative languages: a special case for Turkish. Pattern Recogn. Lett. (2004)
- SpamHunting: an instance-based reasoning system for spam labelling and filtering. Decis. Support Syst. (2007)