ABSTRACT
Machine learning and data mining can be used effectively to model, classify, and discover interesting information in a wide variety of data, including email. The Email Mining Toolkit (EMT) has been designed to provide a wide range of analyses for arbitrary email sources. Depending upon the task, one can usually achieve very high accuracy, but at the cost of some false positives. In the real world, false positives are generally prohibitively expensive. In spam detection, for example, misclassifying even one email may be unacceptable if that email is very important. Much work has been done to improve specific algorithms for detecting unwanted messages, but less work has been reported on leveraging multiple algorithms and correlating models in this particular domain of email analysis. EMT has been updated with new correlation functions that allow the analyst to integrate a number of the user behavior models available in EMT's core technology. We present results of combining classifier outputs to both improve accuracy and reduce false positives for the problem of spam detection. We apply these methods to a very large email data set and show the results of different combination methods on this corpus. We introduce a new method for comparing multiple and combined classifiers, and show how it differs from past work. The method analyzes the relative gain and the maximum possible accuracy that can be achieved by certain combinations of classifiers in order to automatically choose the best combination.
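The idea of comparing combinations by their relative gain and maximum possible accuracy can be illustrated with a small sketch. The abstract does not give the paper's exact definitions, so everything below is an assumption made for illustration: the combination rule is a plain majority vote, "relative gain" is taken as the combined accuracy minus the best single classifier's accuracy, and "maximum possible accuracy" is taken as the fraction of messages that at least one classifier in the group labels correctly. The function names and the toy data are hypothetical and are not part of EMT.

```python
from itertools import combinations

# Labels: 1 = spam, 0 = legitimate. All definitions here are illustrative
# assumptions, not the paper's or EMT's actual formulas.

def accuracy(preds, labels):
    """Fraction of messages labelled correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def false_positive_rate(preds, labels):
    """Fraction of legitimate messages that were flagged as spam."""
    on_legit = [p for p, y in zip(preds, labels) if y == 0]
    return sum(on_legit) / len(on_legit) if on_legit else 0.0

def majority_vote(pred_lists):
    """Flag spam only on a strict majority; ties fall back to legitimate."""
    n = len(pred_lists)
    return [1 if 2 * sum(votes) > n else 0 for votes in zip(*pred_lists)]

def ceiling_accuracy(pred_lists, labels):
    """Assumed upper bound: a message counts if any classifier gets it right."""
    hits = sum(any(p == y for p in votes)
               for votes, y in zip(zip(*pred_lists), labels))
    return hits / len(labels)

def rank_combinations(models, labels, size):
    """Score every group of `size` classifiers; lowest false positive rate first."""
    rows = []
    for names in combinations(sorted(models), size):
        pred_lists = [models[name] for name in names]
        combined = majority_vote(pred_lists)
        best_single = max(accuracy(p, labels) for p in pred_lists)
        rows.append({
            "names": names,
            "accuracy": accuracy(combined, labels),
            "fpr": false_positive_rate(combined, labels),
            "gain": accuracy(combined, labels) - best_single,
            "ceiling": ceiling_accuracy(pred_lists, labels),
        })
    return sorted(rows, key=lambda r: (r["fpr"], -r["accuracy"]))

# Toy data: three hypothetical classifiers that err on different messages.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
models = {
    "a": [1, 1, 1, 0, 0, 0, 0, 1],
    "b": [1, 1, 0, 1, 0, 0, 1, 0],
    "c": [1, 0, 1, 1, 0, 1, 0, 0],
}

if __name__ == "__main__":
    for row in rank_combinations(models, labels, size=3):
        print(row)
```

In this toy setup each classifier alone makes two mistakes, but because the mistakes fall on different messages the vote corrects all of them. Sorting by false positive rate before accuracy reflects the abstract's emphasis on the cost of false positives; a different ordering or a weighted cost would be an equally reasonable choice.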
Index Terms
- Combining email models for false positive reduction