Combining email models for false positive reduction

Authors:
Shlomo Hershkop, Columbia University, New York, NY
Salvatore J. Stolfo, Columbia University, New York, NY

KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
August 2005, Pages 98–107
https://doi.org/10.1145/1081870.1081885

Published: 21 August 2005

ABSTRACT

Machine learning and data mining can be used effectively to model, classify, and discover interesting information in a wide variety of data, including email. The Email Mining Toolkit (EMT) has been designed to provide a wide range of analyses for arbitrary email sources. Depending upon the task, one can usually achieve very high accuracy, but at the cost of some false positives. False positives are generally prohibitively expensive in the real world. In spam detection, for example, misclassifying even a single email may be unacceptable if that email is very important. Much work has been done to improve specific algorithms for detecting unwanted messages, but less work has been reported on leveraging multiple algorithms and correlating models in this particular domain of email analysis.

EMT has been updated with new correlation functions that allow the analyst to integrate a number of the user behavior models available in EMT's core technology. We present results of combining classifier outputs both to improve accuracy and to reduce false positives for the problem of spam detection. We apply these methods to a very large email data set and show results of different combination methods on these corpora. We introduce a new method to compare multiple and combined classifiers, and show how it differs from past work. The method analyzes the relative gain and the maximum possible accuracy that can be achieved for certain combinations of classifiers in order to automatically choose the best combination.
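
The abstract does not specify EMT's correlation functions, so the following is only a generic sketch of what combining classifier outputs can look like, not the authors' method. It assumes two hypothetical classifiers that each emit a spam score in [0, 1], combines them by unweighted averaging, and measures the false positive rate on made-up toy data. All names and numbers are illustrative assumptions.

```python
from typing import Sequence


def average_scores(scores: Sequence[Sequence[float]]) -> list[float]:
    """Combine per-classifier spam scores by unweighted averaging.

    scores[i][j] is classifier i's spam score in [0, 1] for message j.
    """
    n_classifiers = len(scores)
    return [sum(column) / n_classifiers for column in zip(*scores)]


def classify(scores: Sequence[float], threshold: float = 0.5) -> list[bool]:
    """Flag a message as spam when its score reaches the threshold."""
    return [score >= threshold for score in scores]


def false_positive_rate(predicted: Sequence[bool], is_spam: Sequence[bool]) -> float:
    """Fraction of legitimate (non-spam) messages that were flagged as spam."""
    flags_on_legit = [p for p, spam in zip(predicted, is_spam) if not spam]
    return sum(flags_on_legit) / len(flags_on_legit) if flags_on_legit else 0.0


# Toy data (fabricated for illustration): two spam and two legitimate messages.
is_spam = [True, True, False, False]
scores_a = [0.9, 0.8, 0.7, 0.1]  # hypothetical classifier A, wrong on message 3
scores_b = [0.8, 0.6, 0.2, 0.6]  # hypothetical classifier B, wrong on message 4

combined = average_scores([scores_a, scores_b])

for name, scores in [("A", scores_a), ("B", scores_b), ("A+B", combined)]:
    fpr = false_positive_rate(classify(scores), is_spam)
    print(f"{name}: false positive rate = {fpr:.2f}")
```

In this toy case each classifier alone flags one legitimate message, while the averaged score flags none, because the two classifiers err on different messages. Averaging is only one of several possible combination rules, and real gains depend on how correlated the individual classifiers' errors are.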

Index Terms

Combining email models for false positive reduction
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
  2. Modeling and simulation
    1. Model development and analysis
      1. Model verification and validation
2. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
    2. Retrieval tasks and goals
  2. World Wide Web
    1. Web applications
      1. Internet communications tools
        Email

Published in
KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
August 2005
844 pages
ISBN: 159593135X
DOI: 10.1145/1081870
General Chair: Robert Grossman, University of Illinois at Chicago & Open Data Partners, USA
Program Chairs: Roberto Bayardo, IBM Almaden Research, USA; Kristin Bennett, RPI, USA
Copyright © 2005 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
aggregators
data mining
email mining
false positive reduction
model combination
multiple classifiers
spam
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%

Article Metrics
- Total Citations: 40
- Total Downloads: 1,384
- Downloads (Last 12 months): 10
- Downloads (Last 6 weeks): 1