Abstract
Email spam is one of the major problems of today’s Internet, causing financial damage to companies and annoying individual users. Among the approaches developed to stop spam, filtering is an important and popular one. In this paper we give an overview of the state of the art in machine learning applications for spam filtering, and of the ways in which different filtering methods are evaluated and compared. We also briefly describe other branches of anti-spam protection and discuss the use of various approaches in commercial and non-commercial anti-spam software solutions.
Cite this article
Blanzieri, E., Bryl, A. A survey of learning-based techniques of email spam filtering. Artif Intell Rev 29, 63–92 (2008). https://doi.org/10.1007/s10462-009-9109-6
DOI: https://doi.org/10.1007/s10462-009-9109-6