research-article

Malware detection using adaptive data compression

Authors:
Yan Zhou, University of South Alabama, Mobile, AL, USA
W. Meador Inge, University of South Alabama, Mobile, AL, USA

AISec '08: Proceedings of the 1st ACM workshop on Workshop on AISec, October 2008, Pages 53–60, https://doi.org/10.1145/1456377.1456393

Published: 27 October 2008

ABSTRACT

A popular approach in current commercial anti-malware software detects malicious programs by searching program code for scan strings, which are byte sequences indicative of malicious code. The scan strings, also known as the signatures of existing malware, are extracted by malware analysts from known malware samples and stored in a database often referred to as a virus dictionary. This process often requires a significant amount of human effort. In addition, the technique has two major limitations. First, not all malicious programs have bit patterns that are evidence of their malicious nature. Therefore, some malware is not recorded in the virus dictionary and cannot be detected through signature matching. Second, searching for specific bit patterns does not work on obfuscated malware, which can take many forms. Signature matching has thus been shown to be incapable of identifying new malware patterns and of recognizing obfuscated malware. This paper presents a malware detection technique that discovers malware by means of a learning engine trained on a set of malware instances and a set of benign code instances. The learning engine uses an adaptive data compression model, prediction by partial matching (PPM), to build two compression models, one from the malware instances and the other from the benign code instances. A code instance is classified as either "malware" or "benign" by minimizing its estimated cross entropy, that is, it is assigned to the class whose model yields the lower cross entropy. Our preliminary results are very promising: we achieved a true positive rate of about 0.94 with a false positive rate as low as 0.016. Our experiments also demonstrate that this technique can effectively detect unknown and obfuscated malware.
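
The classification rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simplified, static PPM-style byte model (back-off with escape method C, no symbol exclusions, no adaptation while scoring), and the names `ContextModel`, `classify`, the order of 2, and the toy training bytes are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


class ContextModel:
    """Simplified PPM-style byte model (assumption: escape method C, no exclusions)."""

    def __init__(self, order=2):
        self.order = order
        # counts[k][context][symbol]: times `symbol` followed a k-byte `context`
        self.counts = [defaultdict(lambda: defaultdict(int)) for _ in range(order + 1)]

    def train(self, data: bytes) -> None:
        for i, sym in enumerate(data):
            for k in range(self.order + 1):
                if i >= k:
                    self.counts[k][data[i - k:i]][sym] += 1

    def prob(self, context: bytes, sym: int) -> float:
        """P(sym | context): try the longest context first, escape to shorter ones."""
        p_escape = 1.0
        for k in range(min(self.order, len(context)), -1, -1):
            seen = self.counts[k].get(context[len(context) - k:])
            if not seen:
                continue  # context never observed: fall through at no cost
            total, distinct = sum(seen.values()), len(seen)
            if sym in seen:
                return p_escape * seen[sym] / (total + distinct)
            p_escape *= distinct / (total + distinct)  # pay for the escape
        return p_escape / 256.0  # final fallback: uniform over all byte values

    def cross_entropy(self, data: bytes) -> float:
        """Estimated bits per byte needed to encode `data` under this model."""
        if not data:
            return 0.0
        bits = 0.0
        for i, sym in enumerate(data):
            bits -= math.log2(self.prob(data[max(0, i - self.order):i], sym))
        return bits / len(data)


def classify(sample: bytes, malware_model: ContextModel, benign_model: ContextModel) -> str:
    """Pick the class whose model gives the lower estimated cross entropy."""
    h_mal = malware_model.cross_entropy(sample)
    h_ben = benign_model.cross_entropy(sample)
    return "malware" if h_mal < h_ben else "benign"


# Toy stand-ins for the two training sets (not real malware or real programs).
malware_model, benign_model = ContextModel(), ContextModel()
malware_model.train(b"\x90\x90\xeb\xfe" * 50)
benign_model.train(b"hello world, " * 50)

print(classify(b"\x90\x90\xeb\xfe\x90\x90", malware_model, benign_model))
print(classify(b"hello world", malware_model, benign_model))
```

Scoring against two separately trained models keeps the decision rule to a single comparison. A full PPM implementation would also update its counts while processing the instance, which this sketch omits for brevity.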

Index Terms

Malware detection using adaptive data compression
1. Computing methodologies
  1. Machine learning

Published in
AISec '08: Proceedings of the 1st ACM workshop on Workshop on AISec
October 2008
84 pages
ISBN: 9781605582917
DOI: 10.1145/1456377
Program Chairs: Dirk Balfanz (Google, USA), Jessica Staddon (PARC, USA)
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
machine learning
malware detection
statistical data compression
Qualifiers
- research-article
Acceptance Rates
AISec '08 Paper Acceptance Rate: 9 of 20 submissions, 45%
Overall Acceptance Rate: 94 of 231 submissions, 41%