Abstract
Each day, anti-virus companies receive tens of thousands of samples of potentially harmful executables. Many of the malicious samples are variations of previously encountered malware, created by their authors to evade pattern-based detection. Dealing with these large amounts of data requires robust, automatic detection approaches. This paper studies malware classification based on call graph clustering. By representing malware samples as call graphs, it is possible to abstract certain variations away, enabling the detection of structural similarities between samples. The ability to cluster similar samples together makes more generic detection techniques possible, since detection can then target the commonalities of the samples within a cluster. To compare call graphs with one another, we compute pairwise graph similarity scores via graph matchings which approximately minimize the graph edit distance. Next, to facilitate the discovery of similar malware samples, we employ several clustering algorithms, including k-medoids and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Clustering experiments are conducted on a collection of real malware samples, and the results are evaluated against manual classifications provided by human malware analysts. Experiments show that it is indeed possible to accurately detect malware families via call graph clustering. We anticipate that in the future, call graphs can be used to analyse the emergence of new malware families, and ultimately to automate the implementation of generic detection schemes.
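To make the clustering step concrete, the following is a minimal sketch of DBSCAN run on a precomputed pairwise distance matrix. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the matrix `DIST`, the parameter values `eps` and `min_pts`, and the convention that a distance is one minus a similarity score are all hypothetical and chosen only for this example.

```python
from typing import List, Optional

NOISE = -1  # label for samples that belong to no cluster


def dbscan(dist: List[List[float]], eps: float, min_pts: int) -> List[int]:
    """Cluster samples given a symmetric pairwise distance matrix.

    A sample is a core point if at least `min_pts` samples (itself
    included) lie within distance `eps`. Returns one label per sample:
    0, 1, 2, ... for clusters and NOISE for outliers.
    """
    n = len(dist)
    labels: List[Optional[int]] = [None] * n  # None means "not visited yet"
    cluster = 0

    def region(p: int) -> List[int]:
        # All samples within eps of p, including p itself.
        return [q for q in range(n) if dist[p][q] <= eps]

    for p in range(n):
        if labels[p] is not None:
            continue
        neighbours = region(p)
        if len(neighbours) < min_pts:
            labels[p] = NOISE  # may later be relabelled as a border point
            continue
        labels[p] = cluster
        queue = [q for q in neighbours if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == NOISE:
                labels[q] = cluster  # border point reached from a core point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbours = region(q)
            if len(q_neighbours) >= min_pts:
                queue.extend(q_neighbours)  # q is a core point: keep expanding
        cluster += 1
    return [label if label is not None else NOISE for label in labels]


# Hypothetical distances between six samples (assumed: 1 - similarity score).
DIST = [
    [0.00, 0.10, 0.20, 0.90, 0.80, 0.90],
    [0.10, 0.00, 0.15, 0.85, 0.90, 0.95],
    [0.20, 0.15, 0.00, 0.90, 0.85, 0.90],
    [0.90, 0.85, 0.90, 0.00, 0.10, 0.80],
    [0.80, 0.90, 0.85, 0.10, 0.00, 0.85],
    [0.90, 0.95, 0.90, 0.80, 0.85, 0.00],
]

if __name__ == "__main__":
    print(dbscan(DIST, eps=0.25, min_pts=2))
```

In this toy matrix, samples 0 to 2 and samples 3 to 4 lie close together, while sample 5 is far from everything and is reported as noise. Unlike k-medoids, DBSCAN does not need the number of clusters in advance, which is convenient when the number of malware families in a collection is unknown.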
Additional information
This research has been supported by TEKES—the Finnish Funding Agency for Technology and Innovation as part of its ICT SHOK Future Internet research programme, grant 40212/09.
Cite this article
Kinable, J., Kostakis, O. Malware classification based on call graph clustering. J Comput Virol 7, 233–245 (2011). https://doi.org/10.1007/s11416-011-0151-y
DOI: https://doi.org/10.1007/s11416-011-0151-y