DOI: 10.1145/1456377.1456393
research-article

Malware detection using adaptive data compression

Published: 27 October 2008

ABSTRACT

A popular approach in current commercial anti-malware software detects malicious programs by searching program code for scan strings, byte sequences that are indicative of malicious code. The scan strings, also known as the signatures of existing malware, are extracted by malware analysts from known malware samples and stored in a database often referred to as a virus dictionary. This process often involves a significant amount of human effort. In addition, the technique has two major limitations. First, not all malicious programs have bit patterns that are evidence of their malicious nature. Therefore, some malware is not recorded in the virus dictionary and cannot be detected through signature matching. Second, searching for specific bit patterns does not work on obfuscated malware, which can take many forms. Signature matching has been shown to be incapable of identifying new malware patterns, and it fails to recognize obfuscated malware. This paper presents a malware detection technique that discovers malware by means of a learning engine trained on a set of malware instances and a set of benign code instances. The learning engine uses an adaptive data compression model, prediction by partial matching (PPM), to build two compression models, one from the malware instances and the other from the benign code instances. A code instance is classified as either "malware" or "benign" by minimizing its estimated cross entropy. Our preliminary results are very promising: we achieved a true positive rate of about 0.94 with a false positive rate as low as 0.016. Our experiments also demonstrate that this technique can effectively detect unknown and obfuscated malware.
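The classification rule in the abstract (two compression models, pick the class with the lower estimated cross entropy) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `SimplePPM`, the function `classify`, the model order of 2, the use of escape method C without symbol exclusion, and the choice to keep the models fixed while scoring are all assumptions made for illustration. Without exclusion, the estimates slightly overstate the code length a full PPM coder would achieve.

```python
import math
from collections import defaultdict


class SimplePPM:
    """Simplified PPM-style byte model (hypothetical helper, not the paper's code).

    Assumptions: escape method C, no symbol exclusion, and the model is
    not updated while a sample is being scored.
    """

    def __init__(self, order=2):
        self.order = order
        # counts[context][symbol] = times `symbol` followed `context` in training
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, data: bytes) -> None:
        for i in range(len(data)):
            # Update every context length from 0 up to `order` that fits.
            for k in range(self.order + 1):
                if i - k < 0:
                    break
                self.counts[data[i - k:i]][data[i]] += 1

    def _prob(self, context: bytes, symbol: int) -> float:
        escape = 1.0
        # Try the longest context first, then back off to shorter ones.
        for k in range(min(self.order, len(context)), -1, -1):
            ctx = context[len(context) - k:]
            seen = self.counts.get(ctx)
            if not seen:
                continue  # context never observed: back off at no cost
            total = sum(seen.values())
            distinct = len(seen)
            if symbol in seen:
                return escape * seen[symbol] / (total + distinct)
            # Method C escape probability: distinct / (total + distinct)
            escape *= distinct / (total + distinct)
        # Order -1 fallback: uniform distribution over all 256 byte values.
        return escape / 256.0

    def cross_entropy(self, data: bytes) -> float:
        """Estimated bits per byte needed to encode `data` with this model."""
        bits = 0.0
        for i in range(len(data)):
            ctx = data[max(0, i - self.order):i]
            bits -= math.log2(self._prob(ctx, data[i]))
        return bits / max(1, len(data))


def classify(sample: bytes, malware_model: SimplePPM, benign_model: SimplePPM) -> str:
    """Label `sample` with the class whose model yields the lower cross entropy."""
    h_malware = malware_model.cross_entropy(sample)
    h_benign = benign_model.cross_entropy(sample)
    return "malware" if h_malware < h_benign else "benign"


# Toy training data, invented purely for demonstration.
malware_model = SimplePPM(order=2)
malware_model.train(b"\x90\x90\xeb\xfe" * 50)
benign_model = SimplePPM(order=2)
benign_model.train(b"hello world " * 50)

label_a = classify(b"\x90\x90\xeb\xfe\x90\x90", malware_model, benign_model)
label_b = classify(b"hello world", malware_model, benign_model)
```

In this toy setup each sample is cheap to encode under the model trained on similar bytes and expensive under the other, so the rule separates them. Real executables would need far larger training sets and likely a higher model order.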

Published in

AISec '08: Proceedings of the 1st ACM Workshop on AISec
October 2008
84 pages
ISBN: 9781605582917
DOI: 10.1145/1456377

Copyright © 2008 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

AISec '08 paper acceptance rate: 9 of 20 submissions, 45%
Overall acceptance rate: 94 of 231 submissions, 41%
