A survey of learning-based techniques of email spam filtering

Artificial Intelligence Review

Abstract

Email spam is one of the major problems of today’s Internet, causing financial damage to companies and annoying individual users. Among the approaches developed to stop spam, filtering is an important and popular one. In this paper we give an overview of the state of the art in machine learning applications for spam filtering, and of the ways of evaluating and comparing different filtering methods. We also briefly describe other branches of anti-spam protection and discuss the use of various approaches in commercial and non-commercial anti-spam software solutions.

Correspondence to Anton Bryl.

Blanzieri, E., Bryl, A. A survey of learning-based techniques of email spam filtering. Artif Intell Rev 29, 63–92 (2008). https://doi.org/10.1007/s10462-009-9109-6

  • DOI: https://doi.org/10.1007/s10462-009-9109-6