ABSTRACT
A popular approach in current commercial anti-malware software detects malicious programs by searching program code for scan strings, byte sequences that are indicative of malicious code. The scan strings, also known as the signatures of existing malware, are extracted by malware analysts from known malware samples and stored in a database often referred to as a virus dictionary. This process often requires a significant amount of human effort. In addition, the technique has two major limitations. First, not all malicious programs have bit patterns that are evidence of their malicious nature, so some malware is not recorded in the virus dictionary and cannot be detected through signature matching. Second, searching for specific bit patterns does not work on obfuscated malware, which can take many forms. Signature matching has been shown to be incapable of identifying new malware patterns and fails to recognize obfuscated malware. This paper presents a malware detection technique that discovers malware by means of a learning engine trained on a set of malware instances and a set of benign code instances. The learning engine uses an adaptive data compression model, prediction by partial matching (PPM), to build two compression models, one from the malware instances and the other from the benign code instances. A code instance is classified as either "malware" or "benign" by minimizing its estimated cross entropy. Our preliminary results are very promising: we achieved a true positive rate of about 0.94 with a false positive rate as low as 0.016. Our experiments also demonstrate that this technique can effectively detect unknown and obfuscated malware.
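The classification rule described in the abstract (build one model per class, then pick the class whose model yields the lower estimated cross entropy) can be illustrated with a minimal sketch. This is not the paper's implementation: a fixed-order byte context model with add-one smoothing stands in for PPM, which additionally blends several context orders through an escape mechanism. The names `ContextModel` and `classify`, the default order of 2, and the toy byte strings are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class ContextModel:
    """Fixed-order byte model with add-one smoothing.

    A simplified stand-in for PPM (assumption): real PPM mixes
    several context orders using escape probabilities.
    """

    def __init__(self, order: int = 2) -> None:
        self.order = order
        # counts[context][symbol]: times `symbol` followed `context`
        self.counts = defaultdict(lambda: defaultdict(int))
        # totals[context]: times `context` was seen in training
        self.totals = defaultdict(int)

    def train(self, data: bytes) -> None:
        """Update the model with one training instance."""
        for i in range(len(data)):
            ctx = data[max(0, i - self.order):i]
            self.counts[ctx][data[i]] += 1
            self.totals[ctx] += 1

    def cross_entropy(self, data: bytes) -> float:
        """Estimated bits per byte needed to encode `data` under this model."""
        if not data:
            return 0.0
        bits = 0.0
        for i in range(len(data)):
            ctx = data[max(0, i - self.order):i]
            followers = self.counts.get(ctx)  # .get avoids mutating the model
            seen = followers.get(data[i], 0) if followers else 0
            total = self.totals.get(ctx, 0)
            # Add-one smoothing over the 256 possible byte values.
            p = (seen + 1) / (total + 256)
            bits -= math.log2(p)
        return bits / len(data)


def classify(sample: bytes, malware_model: ContextModel,
             benign_model: ContextModel) -> str:
    """Label `sample` with the class whose model gives lower cross entropy."""
    h_malware = malware_model.cross_entropy(sample)
    h_benign = benign_model.cross_entropy(sample)
    return "malware" if h_malware < h_benign else "benign"


if __name__ == "__main__":
    # Toy byte strings, purely hypothetical; not real malware or benign code.
    malware_model = ContextModel(order=2)
    benign_model = ContextModel(order=2)
    malware_model.train(b"\x90\x90\xeb\xfe" * 50)
    benign_model.train(b"hello world " * 50)
    print(classify(b"\x90\x90\xeb\xfe\x90\x90", malware_model, benign_model))
```

In practice each model would be trained by calling `train` once per instance in the corresponding training set. Ties are resolved here in favor of "benign", which is an arbitrary choice for the sketch.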
Index Terms
- Malware detection using adaptive data compression