Review
A review of machine learning approaches to Spam filtering
Introduction
In recent years, the increasing use of e-mail has led to the emergence and escalation of problems caused by unsolicited bulk e-mail messages, commonly referred to as Spam. Having evolved from a minor nuisance into a major concern, given the high volume in circulation and the offensive content of some of these messages, Spam is beginning to diminish the reliability of e-mail (Hoanca, 2006). Personal users and companies are affected by Spam through the network bandwidth wasted receiving these messages and the time users spend distinguishing Spam from normal (legitimate or ham) messages. A business model relying on Spam marketing is usually advantageous because the costs for the sender are small, so a large number of messages can be sent to maximize returns; this aggressive behavior is one of the defining characteristics of Spammers (those who send Spam messages) (Martin-Herran, Rubel, & Zaccour, 2008). The economic impact of Spam has led some countries to adopt legislation (e.g., Carpinter and Hunt, 2006; Hoanca, 2006; Stern, 2008), although its reach is limited because such messages are sent from many different countries (Talbot, 2008). Difficulties in tracking the actual senders of these messages can also limit the application of such laws. In addition to legislation, some authors have proposed changes in protocols and operation models (discussed in Hoanca (2006)).
Another approach is the use of Spam filters, which attempt to identify Spam messages based on an analysis of the message contents and additional information. The action taken once messages are identified usually depends on the setting in which the filter is applied. If the filter is employed by a single user, as a client-side filter, identified messages are usually sent to a folder that contains only Spam-labeled messages, making them easier to find. In contrast, if the filter operates in a mail server, handling messages from several users, identified messages may either be labeled as Spam or deleted. Another possibility is a collaborative setting, in which filters running on different machines share information on the messages received in order to improve their performance.
However, the use of filters has created an evolutionary scenario (Goodman et al., 2007; Hayes, 2007), in which Spammers employ tools (Stern, 2008) with various techniques specifically tailored to minimize the number of messages identified. Initially, Spam filters were based on user-defined rules, designed from knowledge of regularities easily observed in such messages. In response, Spammers began employing content “obfuscation” (or obscuring), disguising certain terms that are very common in Spam messages (e.g., by writing “f r 3 3” instead of “free”) in an attempt to prevent Spam filters from correctly identifying these terms. Nowadays, Spam filtering is usually tackled by machine learning (ML) algorithms that discriminate between legitimate and Spam messages, providing an automated, adaptive approach; these algorithms are the focus of this review. Instead of relying on hand-coded rules, which are vulnerable to the constantly changing nature of Spam messages, ML approaches are capable of extracting knowledge from a supplied set of messages and using the obtained information to classify newly received messages. Given a collection of training documents labeled as legitimate or Spam, these algorithms can be described as learning a function that labels an instance (document or message) as legitimate or Spam, the two labels being referred to as the classes. Another interesting feature of these algorithms is the ability to improve their performance through experience (Mitchell, 1997). Because most practical filters employ a combination of ML and application-specific knowledge, in the form of hand-coded rules, understanding the changing characteristics of Spam is also important, and has been considered by some researchers (Gomes et al., 2007; Pu and Webb, 2006; Wang and Chen, 2007).
Nevertheless, despite the growing research on Spam filtering, the evolution of Spam messages is still occurring, which can be seen in the development of new techniques for evading recognition, such as messages with contents embedded in images.
Text-based Spam filtering can be considered a text categorization problem, and some works have been designed not only for Spam filtering but also for sorting messages into folders (e-mail categorization). However, Spam filtering has several distinguishing characteristics, which a system should incorporate to ensure its applicability. As discussed by Fawcett (2003), these include skewed and changing class distributions, unequal and uncertain misclassification costs of Spam and legitimate messages, complex text patterns and concept drift (a change in a target concept, such as the terms indicative of Spam messages). These characteristics provide the opportunity for the development and application of new algorithms that explore them (Fawcett, 2003). Moreover, special attention should be given to the role of user feedback, in the form of immediate or delayed corrections for updating the classification model, as a way to deal with concept drift. In particular, user feedback is a growing theme not only in Spam filtering, but also in other areas of text processing (Culotta, Kristjansson, McCallum, & Viola, 2006). Spam filtering is also increasingly used as a benchmark for testing newly developed machine learning algorithms that were not specifically designed for this problem (e.g., Camastra and Verri, 2005; Gadat and Younes, 2007; Qin and Zhang, 2008).
In line with the growing concerns regarding Spam messages, there has been an increasing number of works dedicated to the problem. Wang and Cloete (2005) surveyed some approaches for e-mail classification, including Spam filtering and e-mail categorization. A relatively recent overview of approaches aimed at Spam filtering was presented by Carpinter and Hunt (2006), which focused on more general aspects of the problem. A more recent review has been conducted by Blanzieri and Bryl (2008). However, it did not discuss several of the more recent works, such as Case-Based Reasoning models and Artificial Immune Systems, which are included in this paper. Moreover, in the present review, we discuss two important aspects not widely considered in the literature: the bias imposed by the commonly used bag-of-words representation and an important difference between naive Bayes models. We also discuss the need to evaluate a filter in a realistic setting, using some of the recently available corpora. Emphasis is given to recent works, minimizing the overlap with other reviews (Blanzieri and Bryl, 2008; Carpinter and Hunt, 2006; Wang and Cloete, 2005), although some early works that first proposed certain approaches are also discussed to outline the evolution of their use. Finally, although unsolicited content currently affects not only e-mail, but also search engines (Gyongyi & Garcia-Molina, 2005) and blogs (Kolari, Java, Finin, Oates, & Joshi, 2006), this survey focuses solely on e-mail Spam.
This paper is organized as follows. Section 2 presents background on Spam filtering, discussing the typical steps involved in most filters, the representation of messages, the datasets used for evaluation and the performance measures usually adopted. Sections 3 (Naive Bayes), 4 (Support Vector Machines), 5 (Artificial Neural Networks), 6 (Logistic regression), 7 (Lazy learning), 8 (Artificial Immune Systems), 9 (Boosting, ensembles and related approaches) and 10 (Hybrid methods and others) discuss different families of algorithms applied to the textual analysis of message contents, while Section 11 presents works dedicated to comparing filters under the same experimental setup. Section 12 focuses on approaches developed for dealing with image Spam. We attempt to present only the most distinguishing characteristics of each algorithm, and focus on the application-specific aspects and experimental scenarios considered. Special attention is given to the datasets used, as different corpora have different numbers of messages and different characteristics. Finally, Section 13 presents an overall discussion of the methods cited in this review, along with the final conclusions of this work.
Section snippets
Structure of a usual Spam filter
The information contained in a message is divided into the header (fields containing general information on the message, such as the subject, sender and recipient) and body (the actual contents of the message). Before the available information can be used by a classifier in a filter, appropriate pre-processing steps are required. The steps involved in the extraction of data from a message are illustrated in Fig. 1, and can be grouped into:
- (1) tokenization, which extracts the words in the message
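The tokenization step, followed by the bag-of-words (BoW) representation mentioned elsewhere in this review, can be sketched as follows. This is only an illustrative sketch: the function names `tokenize` and `bag_of_words` are hypothetical, and the rule used to split words (lowercased runs of letters and digits) is an assumption, since real filters differ in how they treat punctuation, headers and markup.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text, binary=False):
    """Map a message to token counts, or to 0/1 presence flags if binary=True."""
    counts = Counter(tokenize(text))
    if binary:
        return {token: 1 for token in counts}
    return dict(counts)


message = "Free offer! Claim your FREE prize now"
tokens = tokenize(message)
frequency_features = bag_of_words(message)
binary_features = bag_of_words(message, binary=True)
print(tokens)
print(frequency_features)
```

Word order is discarded in both variants; only the presence or the count of each token is kept.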
Naive Bayes
The application of the naive Bayes classifier to Spam filtering was initially proposed by Sahami, Dumais, Heckerman, and Horvitz (1998), who considered the problem in a decision theoretic framework given the confidence in the classification of a message. A particularly appealing characteristic of a Bayesian framework is its suitability for integrating evidence from different sources. In this sense, Sahami et al. investigated the use not only of the message words, but also application-specific
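As a rough illustration of how a naive Bayes filter combines word evidence, the sketch below assumes a multinomial event model with Laplace smoothing. This is one of several naive Bayes variants and is not necessarily the one used by Sahami et al.; the class name `NaiveBayesFilter`, the smoothing constant `alpha` and the toy messages are all hypothetical. The `threshold` argument reflects the decision-theoretic view: a message is labeled Spam only when the estimated Spam probability exceeds it.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Multinomial naive Bayes with Laplace smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}
        self.vocab = set()

    def train(self, tokens, label):
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def _log_joint(self, tokens, label):
        # log P(class) + sum of log P(token | class); assumes both classes
        # have at least one training message.
        total_docs = sum(self.doc_counts.values())
        log_p = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        denom = total_words + self.alpha * len(self.vocab)
        for token in tokens:
            if token in self.vocab:  # tokens never seen in training are ignored
                count = self.word_counts[label][token]
                log_p += math.log((count + self.alpha) / denom)
        return log_p

    def spam_probability(self, tokens):
        log_spam = self._log_joint(tokens, "spam")
        log_ham = self._log_joint(tokens, "ham")
        top = max(log_spam, log_ham)  # subtract the maximum for stability
        e_spam = math.exp(log_spam - top)
        e_ham = math.exp(log_ham - top)
        return e_spam / (e_spam + e_ham)

    def classify(self, tokens, threshold=0.5):
        return "spam" if self.spam_probability(tokens) > threshold else "ham"


nb = NaiveBayesFilter()
nb.train(["free", "money", "now"], "spam")
nb.train(["free", "prize"], "spam")
nb.train(["meeting", "tomorrow", "morning"], "ham")
nb.train(["project", "meeting", "notes"], "ham")
print(nb.classify(["free", "prize", "money"]))
print(nb.classify(["meeting", "notes"]))
```

Raising `threshold` makes the filter more conservative: fewer legitimate messages are flagged, at the price of letting more Spam through.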
Support Vector Machines (SVM)
Support Vector Machines (SVM) (Scholkopf and Smola, 2002; Vapnik, 1998) were initially applied by Drucker, Wu, and Vapnik (1999), using the BoW representation with binary, frequency or tf-idf features, selected according to the information gain, and two private corpora. It was verified that boosting with decision trees achieved a slightly lower false positive rate than SVM, but the latter was more robust to different datasets and pre-processing procedures, and much more efficient for
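To make the tf-idf feature type concrete, here is a small sketch of one common weighting, raw term count multiplied by log(N / document frequency). Several tf-idf variants exist and the source does not say which one Drucker et al. used, so both the formula and the function name `tf_idf` should be read as assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict each."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in documents:
        term_freq = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


corpus = [["free", "free", "offer"], ["meeting", "offer"]]
vectors = tf_idf(corpus)
print(vectors[0])
```

A term that appears in every document, such as "offer" here, receives weight zero, while a term concentrated in few documents is weighted up. Binary and frequency features correspond to dropping the logarithmic factor and using presence or raw counts instead.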
Artificial Neural Networks
Clark, Koprinska, and Poon (2003) proposed LINGER, which uses a multi-layer perceptron (Haykin, 1998) for e-mail categorization and Spam filtering. Messages are represented as BoW, with feature selection based on information gain or term-frequency variance. Experiments were conducted using the LingSpam and PU1 corpora, in addition to a private dataset, with 256 features and 10-fold cross-validation. Using the information gain, LINGER obtained perfect results, and outperformed a naive Bayes
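Feature selection by information gain, used in several of the works reviewed here, can be sketched as follows. The score of a term is the reduction in class entropy obtained by knowing whether the term is present in a message. The helper names (`entropy`, `information_gain`, `select_features`) and the toy data are hypothetical, and ties are broken alphabetically only to keep the output deterministic.

```python
import math


def entropy(n_spam, n_ham):
    """Binary class entropy in bits; an empty group has entropy 0."""
    total = n_spam + n_ham
    h = 0.0
    for count in (n_spam, n_ham):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def _label_entropy(labels):
    n_spam = sum(1 for label in labels if label == "spam")
    return entropy(n_spam, len(labels) - n_spam)


def information_gain(docs, labels, term):
    """docs: list of token sets; labels: parallel list of 'spam' or 'ham'."""
    n = len(docs)
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]
    return (_label_entropy(labels)
            - len(with_term) / n * _label_entropy(with_term)
            - len(without_term) / n * _label_entropy(without_term))


def select_features(docs, labels, k):
    """Return the k terms with the highest information gain."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


docs = [{"free", "offer"}, {"free", "prize"},
        {"meeting", "notes"}, {"meeting", "offer"}]
labels = ["spam", "spam", "ham", "ham"]
print(select_features(docs, labels, 2))
```

In this toy corpus "free" and "meeting" separate the classes perfectly, whereas "offer" appears equally often in both classes and carries no information.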
Logistic regression
Goodman and Yih (2006) applied a logistic regression model (e.g., Hastie, Tibshirani, & Friedman, 2001), which is simple and can be easily updated. It uses binary features, distinguished according to whether they occur in the message headers or in the body, without the application of feature selection. It was verified that, in experiments with the TREC2005 and Enron datasets, in addition to a private corpus, the results obtained were competitive or even superior to some of the best known filters for each
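The sketch below illustrates why such a model is easy to update: each new labeled message triggers one small gradient step. The update rule shown is plain stochastic gradient ascent on the log-likelihood, which is an assumption, since the source does not give the training procedure of Goodman and Yih. The names `OnlineLogisticRegression` and `extract_features`, the learning rate and the toy messages are likewise hypothetical.

```python
import math


def extract_features(header_tokens, body_tokens):
    """Binary features, kept distinct for header and body occurrences."""
    return ({"header:" + t for t in header_tokens}
            | {"body:" + t for t in body_tokens})


class OnlineLogisticRegression:
    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = {}
        self.bias = 0.0

    def predict_proba(self, features):
        """Estimated probability that the message is Spam."""
        z = self.bias + sum(self.weights.get(f, 0.0) for f in features)
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, is_spam):
        """One gradient step on a single labeled message."""
        error = (1.0 if is_spam else 0.0) - self.predict_proba(features)
        self.bias += self.learning_rate * error
        for f in features:
            self.weights[f] = self.weights.get(f, 0.0) + self.learning_rate * error


spam_msg = extract_features(["free"], ["free", "offer"])
ham_msg = extract_features(["meeting"], ["agenda"])

model = OnlineLogisticRegression()
for _ in range(50):
    model.update(spam_msg, True)
    model.update(ham_msg, False)

print(round(model.predict_proba(spam_msg), 3))
print(round(model.predict_proba(ham_msg), 3))
```

Because the word "free" yields separate `header:free` and `body:free` features, the model can learn a different weight for each location.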
Lazy learning
Sakkis et al. (2003) studied the performance of a lazy learning algorithm (Aha, 1997), the k-NN (nearest neighbors) classifier, on the LingSpam corpus in a cost-sensitive setting, where the similarity between two instances is given by the number of matching binary features. The experiments were conducted using the information gain for feature selection and 10-fold cross-validation. In comparison with naive Bayes, it obtained better results for , with a similar performance otherwise, although
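A k-NN classifier with this kind of similarity can be sketched as below. The sketch interprets "matching binary features" as positions of the binary vectors that hold the same value, which is an assumption, and it uses a plain majority vote rather than the cost-sensitive decision rule studied by Sakkis et al. All names and the toy data are hypothetical.

```python
from collections import Counter


def to_binary_vector(tokens, features):
    """1 if the selected feature occurs in the message, else 0."""
    present = set(tokens)
    return tuple(1 if f in present else 0 for f in features)


def similarity(a, b):
    """Number of positions at which two binary vectors agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


def knn_classify(query, training, k=3):
    """training: list of (vector, label) pairs; majority vote of the k nearest."""
    neighbours = sorted(training,
                        key=lambda item: similarity(query, item[0]),
                        reverse=True)[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


features = ["free", "offer", "meeting", "agenda"]
training = [
    (to_binary_vector(["free", "offer"], features), "spam"),
    (to_binary_vector(["free"], features), "spam"),
    (to_binary_vector(["meeting", "agenda"], features), "ham"),
    (to_binary_vector(["meeting"], features), "ham"),
]

query = to_binary_vector(["free", "offer", "now"], features)
print(knn_classify(query, training, k=3))
```

The method is "lazy" in the sense that no model is built in advance: all work happens when a new message is compared against the stored training instances.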
Artificial Immune Systems
Oda and White (2003a) proposed an Artificial Immune System (see, e.g., de Castro & Timmis, 2002) for Spam filtering, where detectors, represented as regular expressions, are used for pattern matching in a message being analyzed. It assigns a weight to each detector, which is incremented (decremented) when it recognizes an expression in a Spam (legitimate) message, with the thresholded sum of the weights of the matching detectors being used to determine the classification of a message. The
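The weighting and thresholding mechanism described above can be sketched in a few lines. Only that mechanism is illustrated: how detectors are generated, aged or replaced is not covered here, and the unit step size, the threshold value, the regular expressions and the class names `Detector` and `ImmuneFilter` are all assumptions made for the example.

```python
import re


class Detector:
    """A regular-expression detector with an adjustable weight."""

    def __init__(self, pattern):
        self.regex = re.compile(pattern, re.IGNORECASE)
        self.weight = 0

    def matches(self, message):
        return self.regex.search(message) is not None


class ImmuneFilter:
    def __init__(self, patterns, threshold=1):
        self.detectors = [Detector(p) for p in patterns]
        self.threshold = threshold

    def train(self, message, is_spam):
        # Matching detectors gain weight on Spam and lose weight on legitimate mail.
        for detector in self.detectors:
            if detector.matches(message):
                detector.weight += 1 if is_spam else -1

    def score(self, message):
        """Sum of the weights of all detectors that match the message."""
        return sum(d.weight for d in self.detectors if d.matches(message))

    def is_spam(self, message):
        return self.score(message) >= self.threshold


ais = ImmuneFilter([
    r"f\s*r\s*[e3]\s*[e3]",   # "free", including obfuscations such as "f r 3 3"
    r"click\s+here",
    r"\bmeeting\b",
])
ais.train("Get your f r 3 3 prize, click here", True)
ais.train("FREE offer inside", True)
ais.train("Meeting moved, click here for the agenda", False)

print(ais.score("fr33 money, click here"))
print(ais.is_spam("Agenda for the meeting: click here"))
```

Expressing detectors as regular expressions lets a single detector cover several obfuscated spellings of the same term.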
Boosting, ensembles and related approaches
Carreras and Marquez (2001) used an AdaBoost (Hastie et al., 2001) variant, with decision trees as the base classifiers, in experiments with the PU1 corpus, 10-fold cross-validation and binary features. Given a sufficient number of training iterations, AdaBoost outperformed naive Bayes and decision trees, in terms of the F1 measure. When filters with low false positive rates were desired, AdaBoost obtained lower false positive rates, maintaining high true positive rates when base learners of a
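For readers unfamiliar with boosting, the following sketch shows standard discrete AdaBoost over binary features. It is not the variant used by Carreras and Marquez: the base learners here are decision stumps (one-level trees that test a single feature) rather than deeper decision trees, and the function names, the number of rounds and the toy data are hypothetical.

```python
import math


def stump_predict(x, j, polarity):
    """Predict `polarity` if feature j is present, otherwise the opposite class."""
    return polarity if x[j] == 1 else -polarity


def train_adaboost(X, y, n_rounds=10):
    """X: binary feature vectors; y: +1 (Spam) or -1 (legitimate).
    Returns a list of (alpha, feature_index, polarity) weighted stumps."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        # Pick the stump with the lowest weighted training error.
        best = None
        for j in range(len(X[0])):
            for polarity in (1, -1):
                err = sum(w for w, x, label in zip(weights, X, y)
                          if stump_predict(x, j, polarity) != label)
                if best is None or err < best[0]:
                    best = (err, j, polarity)
        err, j, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, polarity))
        # Increase the weight of misclassified messages, then renormalize.
        weights = [w * math.exp(-alpha * label * stump_predict(x, j, polarity))
                   for w, x, label in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    score = sum(alpha * stump_predict(x, j, p) for alpha, j, p in ensemble)
    return 1 if score >= 0 else -1


# Columns: "free", "offer", "meeting"
X = [(1, 1, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 1), (0, 0, 0)]
y = [1, 1, 1, -1, -1, -1]

ensemble = train_adaboost(X, y, n_rounds=3)
print([predict(ensemble, x) for x in X])
```

No single stump classifies this toy set without error, but the weighted vote of three stumps does, which is the basic idea behind combining weak base classifiers.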
Hybrid methods and others
In this section, we focus on Spam filters which integrate different machine learning paradigms. Models relatively unique in terms of their formulation, which cannot be easily classified according to the categories previously discussed, are also discussed.
Aiming at lower false positive rates, Zhao and Zhang (2005) applied Rough Set Theory (RST), a mathematical approach for approximate reasoning, to categorize messages into three classes: Spam, legitimate or suspicious. Features are first
Comparative studies
With the increasing development of Spam filters based on various learning paradigms, and difficulties in comparing filters based solely on performance figures, several works have been dedicated to comparing different filters under the same conditions. These works can provide not only an understanding of the best performing algorithms in certain cases, but also illuminate other aspects of the problem, such as the importance of considering additional message features besides the body, such as the
Image analysis
Spam filters need not concentrate only on the textual content of messages. Given the increasing numbers of Spam messages containing images, which usually contain no text on the body or subject, or only random words aimed at biasing the classification of the message, some researchers have considered how to detect these messages based on image analysis. These approaches are discussed in this section.
Aradhye, Myers, and Herson (2005) designed a method for identifying images typical of Spam
Discussion and conclusions
In this paper, a comprehensive review of recent machine learning approaches to Spam filters was presented. A quantitative analysis of the use of feature selection algorithms and datasets was conducted. It was verified that the information gain is the most commonly used method for feature selection, although it has been suggested that others (e.g., the term-frequency variance, in Koprinska et al. (2007)) may lead to improved results when used with certain machine learning algorithms. Among the
Acknowledgements
This work was supported by grants from UOL, through its Bolsa Pesquisa program (process number 20060519110414a), FAPEMIG and CNPq.
References (124)
- et al. (2006). Tightening the net: A review of current and next generation spam filtering tools. Computers and Security.
- et al. (2008). Time-efficient spam e-mail filtering using n-gram models. Pattern Recognition Letters.
- et al. (2006). Corrective feedback and persistent learning for information extraction. Artificial Intelligence.
- et al. (2005). A case-based technique for tracking concept drift in spam filtering. Knowledge-Based Systems.
- (2006). An introduction to ROC analysis. Pattern Recognition Letters.
- et al. (2007). SpamHunting: An instance-based reasoning system for spam labelling and filtering. Decision Support Systems.
- et al. (2007). Applying lazy learning algorithms to tackle concept drift in spam filtering. Expert Systems with Applications.
- et al. (2007). Workload models of spam and legitimate e-mails. Performance Evaluation.
- et al. (2007). An HMM for detecting spam mail. Expert Systems with Applications.
- et al. (2008). Identification of spam messages using an approach inspired on the immune system. Biosystems.
- An incremental cluster-based approach to spam filtering. Expert Systems with Applications.
- Learning to classify e-mail. Information Sciences.
- An empirical study of three machine learning methods for spam filtering. Knowledge-Based Systems.
- Competing for consumer’s attention. Automatica.
- Managing irrelevant knowledge in CBR models for unsolicited e-mail classification. Expert Systems with Applications.
- Adaptive anti-spam filtering for agglutinative languages: A special case for Turkish. Pattern Recognition Letters.
- Empirical likelihood confidence intervals for differences between two datasets with missing data. Pattern Recognition Letters.
- Collaborative spam filtering with heterogeneous agents. Expert Systems with Applications.
- Case-based reasoning: Foundational issues, methodological variations, and system approaches. Artificial Intelligence Communication.
- Lazy learning. Artificial Intelligence Review.
- An immunological filter for spam. Lecture Notes in Computer Science.
- Dirichlet-enhanced spam filtering based on biased samples. Advances in Neural Information Processing Systems.
- Spam filtering using statistical data compression models. Journal of Machine Learning Research.
- A novel kernel method for clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Online supervised spam filter evaluation. ACM Transactions on Information Systems.
- Artificial immune systems: A new computational intelligence approach.
- Textual case-based reasoning for spam filtering: A comparison of feature-based and feature-free approaches. Artificial Intelligence Review.
- An assessment of case-based reasoning for spam filtering. Artificial Intelligence Review.
- Support vector machines for spam categorization. IEEE Transactions on Neural Networks.
- “In vivo” spam filtering: A challenge problem for KDD. SIGKDD Explorations.
- Spam filtering based on the analysis of text information embedded into images. Journal of Machine Learning Research.