Malware classification based on call graph clustering

  • Original Paper
  • Journal in Computer Virology

Abstract

Each day, anti-virus companies receive tens of thousands of samples of potentially harmful executables. Many of the malicious samples are variations of previously encountered malware, created by their authors to evade pattern-based detection. Dealing with these large amounts of data requires robust, automatic detection approaches. This paper studies malware classification based on call graph clustering. By representing malware samples as call graphs, it is possible to abstract certain variations away, enabling the detection of structural similarities between samples. The ability to cluster similar samples together makes more generic detection techniques possible, since such techniques can target the commonalities of the samples within a cluster. To compare call graphs with one another, we compute pairwise graph similarity scores via graph matchings which approximately minimize the graph edit distance. Next, to facilitate the discovery of similar malware samples, we employ several clustering algorithms, including k-medoids and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Clustering experiments are conducted on a collection of real malware samples, and the results are evaluated against manual classifications provided by human malware analysts. Experiments show that it is indeed possible to accurately detect malware families via call graph clustering. We anticipate that in the future, call graphs can be used to analyse the emergence of new malware families, and ultimately to automate the implementation of generic detection schemes.
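
To illustrate the clustering step named in the abstract, the following is a minimal sketch of DBSCAN operating on a precomputed pairwise distance matrix. It is not the authors' implementation. The function name `dbscan`, the parameters `eps` and `min_pts`, and the toy distance values are assumptions chosen for illustration; in the setting of the paper, each distance would be derived from the approximate graph edit distance between two call graphs.

```python
from typing import List, Optional, Sequence

NOISE = -1  # label for samples that belong to no cluster


def dbscan(dist: Sequence[Sequence[float]], eps: float, min_pts: int) -> List[int]:
    """Cluster samples given a symmetric pairwise distance matrix.

    Returns one label per sample: 0, 1, ... for clusters, -1 for noise.
    """
    n = len(dist)
    labels: List[Optional[int]] = [None] * n  # None means "not yet visited"

    def neighbours(i: int) -> List[int]:
        # eps-neighbourhood of sample i (includes i itself, since dist[i][i] == 0)
        return [j for j in range(n) if dist[i][j] <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:  # j is a core point, so grow the cluster from it
                queue.extend(nb)
        cluster += 1
    return [NOISE if lab is None else lab for lab in labels]


# Hypothetical distances between six samples: two tight groups and one outlier.
dist = [
    [0.0, 0.1, 0.2, 0.9, 0.9, 0.9],
    [0.1, 0.0, 0.1, 0.9, 0.9, 0.9],
    [0.2, 0.1, 0.0, 0.9, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.0, 0.1, 0.9],
    [0.9, 0.9, 0.9, 0.1, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.9, 0.9, 0.0],
]
labels = dbscan(dist, eps=0.3, min_pts=2)
print(labels)
```

Working from a distance matrix rather than coordinates matters here because call graphs have no natural vector representation; only pairwise comparisons are available. Unlike k-medoids, this procedure does not need the number of clusters in advance, and samples that match no dense group are reported as noise.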

Author information

Corresponding author

Correspondence to Joris Kinable.

Additional information

This research has been supported by TEKES—the Finnish Funding Agency for Technology and Innovation as part of its ICT SHOK Future Internet research programme, grant 40212/09.

About this article

Cite this article

Kinable, J., Kostakis, O. Malware classification based on call graph clustering. J Comput Virol 7, 233–245 (2011). https://doi.org/10.1007/s11416-011-0151-y

  • DOI: https://doi.org/10.1007/s11416-011-0151-y
