DOI: 10.1145/1081870.1081885

Article

Combining email models for false positive reduction

Published: 21 August 2005

ABSTRACT

Machine learning and data mining can be used effectively to model, classify, and discover interesting information in a wide variety of data, including email. The Email Mining Toolkit (EMT) has been designed to provide a wide range of analyses for arbitrary email sources. Depending upon the task, one can usually achieve very high accuracy, but at the cost of some false positives. False positives are generally prohibitively expensive in the real world: in spam detection, for example, misclassifying even a single email may be unacceptable if it is a very important message. Much work has been done to improve specific algorithms for detecting unwanted messages, but less work has been reported on leveraging multiple algorithms and correlating models in this particular domain of email analysis. EMT has been updated with new correlation functions that allow the analyst to integrate a number of the user behavior models available in EMT's core technology. We present results of combining classifier outputs to improve accuracy and reduce false positives for the problem of spam detection. We apply these methods to a very large email data set and show results of different combination methods on these corpora. We also introduce a new method for comparing multiple and combined classifiers, and show how it differs from past work: it analyzes the relative gain and the maximum possible accuracy achievable for certain combinations of classifiers in order to automatically choose the best combination.
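To give a concrete flavor of the kind of classifier combination the abstract describes, the sketch below fuses the scores of several spam classifiers and computes an oracle-style upper bound on the accuracy a combination could reach. The specific fusion rules (score averaging, majority vote) and the oracle definition are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
# Hedged sketch: fusing spam-classifier outputs and bounding combined accuracy.
# All function names and rules here are illustrative, not EMT's actual API.

def average_fusion(scores):
    """Average per-classifier spam scores; flag as spam if the mean exceeds 0.5."""
    return sum(scores) / len(scores) > 0.5

def majority_vote(scores):
    """Each classifier votes spam if its score > 0.5; a strict majority wins."""
    votes = sum(1 for s in scores if s > 0.5)
    return votes * 2 > len(scores)

def oracle_accuracy(per_classifier_preds, labels):
    """Upper bound on combined accuracy: count a message as classifiable
    if at least one constituent classifier predicts it correctly."""
    correct = 0
    for preds, label in zip(zip(*per_classifier_preds), labels):
        if any(p == label for p in preds):
            correct += 1
    return correct / len(labels)

# Three hypothetical classifiers' predictions over four messages (True = spam).
preds = [
    [True,  False, True,  False],
    [True,  True,  False, False],
    [False, True,  True,  False],
]
labels = [True, True, True, False]
print(oracle_accuracy(preds, labels))  # every message is caught by some classifier
```

The oracle bound is useful for the selection problem the abstract raises: if the bound for a candidate combination is barely above the best single classifier's accuracy, the combination cannot pay for its added complexity, so it can be discarded automatically.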


• Published in

  KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
  August 2005, 844 pages
  ISBN: 1-59593-135-X
  DOI: 10.1145/1081870
  Copyright © 2005 ACM

  Publisher: Association for Computing Machinery, New York, NY, United States
