Elsevier

Expert Systems with Applications

Volume 36, Issue 7, September 2009, Pages 10206-10222

Review
A review of machine learning approaches to Spam filtering

https://doi.org/10.1016/j.eswa.2009.02.037

Abstract

In this paper, we present a comprehensive review of recent developments in the application of machine learning algorithms to Spam filtering, focusing on both textual- and image-based approaches. Instead of considering Spam filtering as a standard classification problem, we highlight the importance of considering specific characteristics of the problem, especially concept drift, in designing new filters. Two particularly important aspects not widely recognized in the literature are discussed: the difficulties in updating a classifier based on the bag-of-words representation and a major difference between two early naive Bayes models. Overall, we conclude that while important advancements have been made in recent years, several aspects remain to be explored, especially under more realistic evaluation settings.

Introduction

In recent years, the increasing use of e-mail has led to the emergence and further escalation of problems caused by unsolicited bulk e-mail messages, commonly referred to as Spam. Evolving from a minor nuisance to a major concern, given the high circulating volume and offensive content of some of these messages, Spam is beginning to diminish the reliability of e-mail (Hoanca, 2006). Personal users and companies are affected by Spam due to the network bandwidth wasted receiving these messages and the time spent by users distinguishing between Spam and normal (legitimate or ham) messages. A business model relying on Spam marketing is usually advantageous because the costs for the sender are small, so that a large number of messages can be sent, maximizing the returns. This aggressive behavior is one of the defining characteristics of Spammers (those who send Spam messages) (Martin-Herran, Rubel, & Zaccour, 2008). The economic impact of Spam has led some countries to adopt legislation (e.g., Carpinter and Hunt, 2006, Hoanca, 2006, Stern, 2008), although its effectiveness is limited by the fact that many such messages are sent from various countries (Talbot, 2008). Besides, difficulties in tracking the actual senders of these messages can also limit the application of such laws. In addition to legislation, some authors have proposed changes in protocols and operation models (discussed in Hoanca (2006)).

Another approach is the use of Spam filters, which attempt to identify Spam messages based on analysis of the message contents and additional information. The action taken once such messages are identified usually depends on the setting in which the filter is applied. If the filter is employed by a single user, as a client-side filter, identified messages are usually sent to a folder that contains only Spam-labeled messages, making these messages easier to identify. In contrast, if the filter operates on a mail server, handling messages from several users, they may either be labeled as Spam or deleted. Another possibility is a collaborative setting, in which filters running on different machines share information on the messages received to improve their performance.

However, the use of filters has created an evolutionary scenario (Goodman et al., 2007, Hayes, 2007), in which Spammers employ tools (Stern, 2008) with various techniques specifically tailored to minimize the number of messages identified. Initially, Spam filters were based on user-defined rules, designed from knowledge of regularities easily observed in such messages. In response, Spammers began employing content “obfuscation” (or obscuring), disguising certain terms that are very common in Spam messages (e.g., by writing “f r 3 3” instead of “free”) in an attempt to prevent the correct identification of these terms by Spam filters. Nowadays, Spam filtering is usually tackled by machine learning (ML) algorithms, which aim at discriminating between legitimate and Spam messages and provide an automated, adaptive approach; these algorithms are the focus of this review. Instead of relying on hand-coded rules, which are vulnerable to the constantly changing nature of Spam messages, ML approaches are capable of extracting knowledge from a set of supplied messages and using the obtained information in the classification of newly received messages. Given a collection of training documents $D_{tr} \subset D$ labeled as legitimate or Spam, these algorithms can be described as learning a function $f: D \rightarrow \{l, s\}$ for labeling an instance (document or message) $d \in D$ as legitimate ($l$) or Spam ($s$), referred to as the classes. Another interesting feature of these algorithms is the ability to improve their performance through experience (Mitchell, 1997). Because most practical filters employ a combination of ML and application-specific knowledge, in the form of hand-coded rules, understanding the changing characteristics of Spam is also important, and has been considered by some researchers (Gomes et al., 2007, Pu and Webb, 2006, Wang and Chen, 2007). Nevertheless, despite the growing research on Spam filtering, Spam messages continue to evolve, as can be seen in the development of new techniques for evading recognition, such as messages with contents embedded in images.
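The idea of learning a labeling function from training documents can be illustrated with a minimal sketch. The token-vote learner below is a deliberately simple stand-in, not one of the algorithms reviewed in this paper; the names `train`, `d_tr` and `f` and the toy messages are assumptions made for illustration only.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

Label = str  # "l" (legitimate) or "s" (Spam)


def train(d_tr: Iterable[Tuple[str, Label]]) -> Callable[[str], Label]:
    """Learn a function f: D -> {l, s} from labeled documents (toy learner)."""
    counts = {"l": Counter(), "s": Counter()}
    for text, label in d_tr:
        # Count in how many documents of each class a token occurs.
        counts[label].update(set(text.lower().split()))

    def f(document: str) -> Label:
        tokens = set(document.lower().split())
        spam_votes = sum(counts["s"][t] for t in tokens)
        legit_votes = sum(counts["l"][t] for t in tokens)
        # Ties (including documents with no known tokens) default to legitimate.
        return "s" if spam_votes > legit_votes else "l"

    return f


# Hypothetical training collection D_tr.
d_tr = [
    ("free offer click now", "s"),
    ("free prize claim now", "s"),
    ("meeting agenda for monday", "l"),
    ("lunch on monday", "l"),
]
f = train(d_tr)
```

With this toy data, `f("claim your free prize")` is labeled as Spam, while the obfuscated message `"f r 3 3 0ffer"` shares no tokens with the training Spam and is labeled as legitimate, which is the kind of evasion that obfuscation aims for.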

Text-based Spam filtering can be considered as a text categorization problem, and some works have been designed not only for Spam filtering, but also for the problem of sorting messages into folders (e-mail categorization). However, Spam filtering has several distinguishing characteristics, which should be incorporated in a system for ensuring its applicability. As discussed by Fawcett (2003), these include skewed and changing class distributions, unequal and uncertain misclassification costs of Spam and legitimate messages, complex text patterns and concept drift (a change in a target concept, such as terms indicative of Spam messages) and provide the opportunity for the development and application of new algorithms that explore these characteristics (Fawcett, 2003). Moreover, special attention should be given to the role of user feedback, in the form of immediate or delayed corrections for updating the classification model, as a way to deal with concept drift. In particular, user feedback is a growing theme not only in Spam filtering, but also in other areas of text processing (Culotta, Kristjansson, McCallum, & Viola, 2006). Spam filtering is also increasingly considered as a benchmark for testing newly developed machine learning algorithms, not specifically designed for this problem (e.g. Camastra and Verri, 2005, Gadat and Younes, 2007, Qin and Zhang, 2008).
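The effect of unequal misclassification costs can be made concrete with a small sketch. It assumes one common formulation from the cost-sensitive literature, which is not spelled out in this section: blocking a legitimate message is taken to be λ times as costly as letting a Spam message through, so a message is labeled as Spam only when its estimated Spam probability exceeds λ/(1+λ). The function names and the probability values are hypothetical.

```python
def spam_threshold(lam: float) -> float:
    """Probability threshold implied by a cost ratio lam (assumed formulation).

    Labeling as Spam when P(s|d) / P(l|d) > lam is equivalent to
    P(s|d) > lam / (1 + lam), since P(l|d) = 1 - P(s|d).
    """
    return lam / (1.0 + lam)


def classify(p_spam: float, lam: float = 1.0) -> str:
    """Return "s" (Spam) or "l" (legitimate) for an estimated P(s|d)."""
    return "s" if p_spam > spam_threshold(lam) else "l"
```

Under this assumption, a message with an estimated Spam probability of 0.8 is labeled as Spam when λ=1 (threshold 0.5) but as legitimate when λ=9 (threshold 0.9).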

In line with the growing concerns regarding Spam messages, there has been an increasing number of works dedicated to the problem. Wang and Cloete (2005) surveyed some approaches for e-mail classification, including Spam filtering and e-mail categorization. A relatively recent overview of approaches aimed at Spam filtering was presented by Carpinter and Hunt (2006), which focused on more general aspects of the problem. A more recent review has been conducted by Blanzieri and Bryl (2008). However, it did not discuss several of the more recent works, such as Case-Based Reasoning models and Artificial Immune Systems, which are included in this paper. Moreover, in the present review, we discuss two important aspects not widely considered in the literature: the bias imposed by the commonly used bag-of-words representation and an important difference between naive Bayes models. We also discuss the need to evaluate a filter in a realistic setting, according to some recent corpora available. Emphasis is given to recent works, minimizing the overlap with other reviews (Blanzieri and Bryl, 2008, Carpinter and Hunt, 2006, Wang and Cloete, 2005), although some early works proposing the use of some approaches are also discussed to outline the evolution of their use. Finally, although unsolicited content is currently affecting not only e-mail, but also search engines (Gyongyi & Garcia-Molina, 2005) and blogs (Kolari, Java, Finin, Oates, & Joshi, 2006), this survey focuses solely on dealing with e-mail Spam.

This paper is organized as follows. Section 2 presents an initial background of Spam filtering, discussing typical steps involved in most filters, the representation of messages, datasets used for evaluation and performance measures usually adopted. Sections 3 (Naive Bayes), 4 (Support Vector Machines), 5 (Artificial Neural Networks), 6 (Logistic regression), 7 (Lazy learning), 8 (Artificial Immune Systems), 9 (Boosting, ensembles and related approaches) and 10 (Hybrid methods and others) discuss different families of algorithms applied to textual-related analysis of message contents, while Section 11 presents works dedicated to comparing filters under the same experimental setup. Section 12 focuses on approaches developed for dealing with image Spam. We attempt to present only the most distinguishing characteristics of each algorithm, and focus on the application-specific aspects and experimental scenarios considered. Special attention is given to the datasets used, as different corpora have different numbers of messages and characteristics. Finally, Section 13 presents an overall discussion of the methods cited in this review, along with the final conclusions of this work.

Section snippets

Structure of a usual Spam filter

The information contained in a message is divided into the header (fields containing general information on the message, such as the subject, sender and recipient) and body (the actual contents of the message). Before the available information can be used by a classifier in a filter, appropriate pre-processing steps are required. The steps involved in the extraction of data from a message are illustrated in Fig. 1, and can be grouped into:

  • (1) tokenization, which extracts the words in the message
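The header/body split and the tokenization step can be sketched with Python's standard library. This is a minimal illustration under stated assumptions, not the pipeline of Fig. 1: the token pattern (runs of ASCII letters and digits), the lowercasing and the sample message are all choices made here for the example.

```python
import re
from email import message_from_string

# Assumed token definition: maximal runs of ASCII letters and digits.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize(text: str) -> list:
    """Extract lowercase word tokens from a piece of text."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


# Hypothetical raw message: header fields, a blank line, then the body.
raw = (
    "From: a@example.com\n"
    "Subject: Free offer!\n"
    "\n"
    "Click now for your FREE prize.\n"
)
msg = message_from_string(raw)

subject_tokens = tokenize(msg["Subject"])   # tokens from a header field
body_tokens = tokenize(msg.get_payload())   # tokens from the message body
```

Keeping header and body tokens in separate lists allows a later step to treat them as distinct features if desired.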

Naive Bayes

The application of the naive Bayes classifier to Spam filtering was initially proposed by Sahami, Dumais, Heckerman, and Horvitz (1998), who considered the problem in a decision theoretic framework given the confidence in the classification of a message. A particularly appealing characteristic of a Bayesian framework is its suitability for integrating evidence from different sources. In this sense, Sahami et al. investigated the use not only of the message words, but also application-specific

Support Vector Machines (SVM)

Support Vector Machines (SVM) (Scholkopf and Smola, 2002, Vapnik, 1998) were initially applied by Drucker, Wu, and Vapnik (1999), using the BoW representation with binary, frequency or tf-idf features, selected according to the information gain, and two private corpora. It was verified that boosting with decision trees achieved a slightly lower false positive rate than SVM, but the latter was more robust to different datasets and pre-processing procedures, and much more efficient for

Artificial Neural Networks

Clark, Koprinska, and Poon (2003) proposed LINGER, which uses a multi-layer perceptron (Haykin, 1998) for e-mail categorization and Spam filtering. Messages are represented as BoW, with feature selection based on information gain or term-frequency variance. Experiments were conducted using the LingSpam and PU1 corpora, in addition to a private dataset, with 256 features and 10-fold cross-validation. Using the information gain, LINGER obtained perfect results, and outperformed a naive Bayes

Logistic regression

Goodman and Yih (2006) applied a logistic regression model (e.g., Hastie, Tibshirani, & Friedman, 2001), which is simple and can be easily updated. It uses binary features, distinguished based on whether they occur in the message headers or body, without the application of feature selection. It was verified that, in experiments with the TREC2005 and Enron datasets, in addition to a private corpus, the results obtained were competitive or even superior to some of the best known filters for each

Lazy learning

Sakkis et al. (2003) studied the performance of a lazy learning algorithm (Aha, 1997), the k-NN (nearest neighbors) classifier, on the LingSpam corpus in a cost-sensitive setting, where the similarity between two instances is given by the number of matching binary features. The experiments were conducted using the information gain for feature selection and 10-fold cross-validation. In comparison with naive Bayes, it obtained better results for λ=1, with a similar performance otherwise, although
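The similarity measure described in this snippet, the number of matching binary features, can be sketched as follows. This is a simplified illustration that uses a plain majority vote and omits the cost-sensitive setting mentioned above; the function names, the four-word vocabulary and the training vectors are assumptions made for the example.

```python
from collections import Counter


def similarity(a, b):
    """Number of positions at which two binary feature vectors agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


def knn_classify(query, training, k=3):
    """Label a query by majority vote among its k most similar instances."""
    neighbors = sorted(
        training, key=lambda item: similarity(query, item[0]), reverse=True
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical binary features over the vocabulary
# ("free", "prize", "meeting", "agenda"); "s" = Spam, "l" = legitimate.
training = [
    ((1, 1, 0, 0), "s"),
    ((1, 0, 0, 0), "s"),
    ((0, 0, 1, 1), "l"),
    ((0, 0, 1, 0), "l"),
    ((0, 1, 0, 0), "s"),
]
```

Being a lazy learner, the classifier stores the training instances and defers all computation to query time, for example `knn_classify((1, 1, 0, 0), training)`.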

Artificial Immune Systems

Oda and White (2003a) proposed an Artificial Immune System (see, e.g., de Castro & Timmis, 2002) for Spam filtering, where detectors, represented as regular expressions, are used for pattern matching in a message being analyzed. It assigns a weight to each detector, which is incremented (decremented) when it recognizes an expression in a Spam (legitimate) message, with the thresholded sum of the weights of the matching detectors being used to determine the classification of a message. The
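The weighting scheme described in this snippet can be sketched in a few lines. This is only an illustration of weighted regular-expression detectors with a thresholded sum, not the system of Oda and White; the `Detector` class, the patterns, the unit step size and the zero threshold are all assumptions.

```python
import re


class Detector:
    """A regular-expression detector with an adjustable weight."""

    def __init__(self, pattern):
        self.regex = re.compile(pattern, re.IGNORECASE)
        self.weight = 0.0

    def matches(self, message):
        return self.regex.search(message) is not None


def update(detectors, message, is_spam, step=1.0):
    """Increment (decrement) matching detectors on a Spam (legitimate) message."""
    for detector in detectors:
        if detector.matches(message):
            detector.weight += step if is_spam else -step


def classify(detectors, message, threshold=0.0):
    """Threshold the summed weights of the detectors that match the message."""
    score = sum(d.weight for d in detectors if d.matches(message))
    return "s" if score > threshold else "l"


# Hypothetical detectors; the first tolerates spacing and "3" for "e".
detectors = [
    Detector(r"f\s*r\s*[e3]\s*[e3]"),
    Detector(r"meeting"),
    Detector(r"prize"),
]
update(detectors, "FREE prize inside", is_spam=True)
update(detectors, "f r 3 3 prize", is_spam=True)
update(detectors, "meeting about the prize committee", is_spam=False)
```

After these updates, a message such as `"fr33 prize now"` accumulates positive weight from two detectors and is labeled as Spam, while `"meeting moved to noon"` is labeled as legitimate.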

Boosting, ensembles and related approaches

Carreras and Marquez (2001) used an AdaBoost (Hastie et al., 2001) variant, with decision trees as the base classifiers, in experiments with the PU1 corpus, 10-fold cross-validation and binary features. Given a sufficient number of training iterations, AdaBoost outperformed naive Bayes and decision trees, in terms of the F1 measure. When filters with low false positive rates were desired, AdaBoost obtained lower false positive rates, maintaining high true positive rates when base learners of a

Hybrid methods and others

In this section, we focus on Spam filters which integrate different machine learning paradigms. Models relatively unique in terms of their formulation, which cannot be easily classified according to the categories previously discussed, are also discussed.

Aiming at lower false positive rates, Zhao and Zhang (2005) applied Rough Set Theory (RST), a mathematical approach for approximate reasoning, to categorize messages into three classes: Spam, legitimate or suspicious. Features are first

Comparative studies

With the increasing development of Spam filters based on various learning paradigms, and difficulties in comparing filters based solely on performance figures, several works have been dedicated to comparing different filters under the same conditions. These works can provide not only an understanding of the best performing algorithms in certain cases, but also illuminate other aspects of the problem, such as the importance of considering additional message features besides the body, such as the

Image analysis

Spam filters need not concentrate only on the textual content of messages. Given the increasing numbers of Spam messages containing images, which usually contain no text on the body or subject, or only random words aimed at biasing the classification of the message, some researchers have considered how to detect these messages based on image analysis. These approaches are discussed in this section.

Aradhye, Myers, and Herson (2005) designed a method for identifying images typical of Spam

Discussion and conclusions

In this paper, a comprehensive review of recent machine learning approaches to Spam filters was presented. A quantitative analysis of the use of feature selection algorithms and datasets was conducted. It was verified that the information gain is the most commonly used method for feature selection, although it has been suggested that others (e.g., the term-frequency variance, in Koprinska et al. (2007)) may lead to improved results when used with certain machine learning algorithms. Among the

Acknowledgements

This work was supported by grants from UOL, through its Bolsa Pesquisa program (process number 20060519110414a), FAPEMIG and CNPq.

References (124)

  • W.-F. Hsiao et al. (2008). An incremental cluster-based approach to spam filtering. Expert Systems with Applications.
  • I. Koprinska et al. (2007). Learning to classify e-mail. Information Sciences.
  • C.-C. Lai (2007). An empirical study of three machine learning methods for spam filtering. Knowledge-Based Systems.
  • G. Martin-Herran et al. (2008). Competing for consumer’s attention. Automatica.
  • J.R. Mendez et al. (2009). Managing irrelevant knowledge in CBR models for unsolicited e-mail classification. Expert Systems with Applications.
  • L. Ozgur et al. (2004). Adaptive anti-spam filtering for agglutinative languages: A special case for Turkish. Pattern Recognition Letters.
  • Y. Qin et al. (2008). Empirical likelihood confidence intervals for differences between two datasets with missing data. Pattern Recognition Letters.
  • D.-H. Shih et al. (2008). Collaborative spam filtering with heterogeneous agents. Expert Systems with Applications.
  • A. Aamodt et al. (1994). Case-based reasoning: Foundational issues, methodological variations, and system approaches. Artificial Intelligence Communication.
  • Abi-Haidar, A., & Rocha, L. M. (2008). Adaptive spam detection inspired by a cross-regulation model of immune dynamics:...
  • D.W. Aha (1997). Lazy learning. Artificial Intelligence Review.
  • Androutsopoulos, I., Koutsias, J., Chandrinos, K. V., Paliouras, G., & Spyropoulos, C. (2000). An evaluation of naive...
  • Androutsopoulos, I., Koutsias, J., Chandrinos, K. V., & Spyropoulos, C. D. (2000). An experimental comparison of naïve...
  • Androutsopoulos, I., Paliouras, G., & Michelakis, E. (2004). Learning to filter unsolicited commercial e-mail. Tech....
  • Aradhye, H., Myers, G., & Herson, J. (2005). Image analysis for efficient categorization of image-based spam e-mail. In...
  • Asuncion, A., & Newman, D. (2007). UCI machine learning repository....
  • Bekkerman, R. (2005). Email classification on enron dataset. <http://www.cs.umass.edu/ronb/enron_dataset.html> (visited...
  • G.B. Bezerra et al. (2006). An immunological filter for spam. Lecture Notes in Computer Science.
  • S. Bickel et al. (2007). Dirichlet-enhanced spam filtering based on biased samples. Advances in Neural Information Processing Systems.
  • Biggio, B., Fumera, G., Pillai, I., & Roli, F. (2007). Image spam filtering using visual information. In Proc int conf...
  • Biggio, B., Fumera, G., Pillai, I., & Roli, F. (2008). Improving image spam filtering using image text features. In...
  • Blanzieri, E., & Bryl, A. (2008). A survey of learning-based techniques of email spam filtering. Tech. rep. DIT-06-056,...
  • A. Bratko et al. (2006). Spam filtering using statistical data compression models. Journal of Machine Learning Research.
  • Byun, B., Lee, C.-H., Webb, S., & Pu, C. (2007). A discriminative classifier learning approach to image modeling and...
  • F. Camastra et al. (2005). A novel kernel method for clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Carreras, X., & Marquez, L. (2001). Boosting trees for anti-spam email filtering. In Proc of the fourth int conf on...
  • Clark, J., Koprinska, I., & Poon, J. (2003). A neural network based approach to automated e-mail classification. In...
  • Cormack, G. V. (2006). TREC 2006 spam track overview. In Proc of TREC 2006: The 15th text retrieval...
  • Cormack, G. V. (2007). TREC 2007 spam track overview. In Proc of TREC 2007: The 16th text retrieval...
  • Cormack, G. V., & Lynam, T. (2005). TREC 2005 spam track overview. In: Proc of TREC 2005: The 14th text retrieval...
  • G.V. Cormack et al. (2007). Online supervised spam filter evaluation. ACM Transactions on Information Systems.
  • L.N. de Castro et al. (2002). Artificial immune systems: A new computational intelligence approach.
  • S.J. Delany et al. (2006). Textual case-based reasoning for spam filtering: A comparison of feature-based and feature-free approaches. Artificial Intelligence Review.
  • S.J. Delany et al. (2005). An assessment of case-based reasoning for spam filtering. Artificial Intelligence Review.
  • Denis, F., Gilleron, R., & Tommasi, M. (2002). Text classification from positive and unlabeled examples. In Proc of the...
  • Dornbos, J. (2002). Spam: What can you do about it? <http://www.dornbos.com/spam01.shtml> (visited on June...
  • Dredze, M., Gevaryahu, R., & Elias-Bachrach, A. (2007). Learning fast classifiers for image spam. In Proc of the fourth...
  • H. Drucker et al. (1999). Support vector machines for spam categorization. IEEE Transactions on Neural Networks.
  • T. Fawcett (2003). “In vivo” spam filtering: A challenge problem for KDD. SIGKDD Explorations.
  • G. Fumera et al. (2006). Spam filtering based on the analysis of text information embedded into images. Journal of Machine Learning Research.