Malware classification based on call graph clustering

Kinable, Joris; Kostakis, Orestis

doi:10.1007/s11416-011-0151-y

Original Paper
Published: 03 February 2011

Volume 7, pages 233–245, (2011)

Journal in Computer Virology

Joris Kinable^1,2 & Orestis Kostakis^1

Abstract

Each day, anti-virus companies receive tens of thousands of samples of potentially harmful executables. Many of the malicious samples are variations of previously encountered malware, created by their authors to evade pattern-based detection. Dealing with these large amounts of data requires robust, automatic detection approaches. This paper studies malware classification based on call graph clustering. By representing malware samples as call graphs, it is possible to abstract certain variations away, enabling the detection of structural similarities between samples. The ability to cluster similar samples together will make more generic detection techniques possible, thereby targeting the commonalities of the samples within a cluster. To compare call graphs with one another, we compute pairwise graph similarity scores via graph matchings which approximately minimize the graph edit distance. Next, to facilitate the discovery of similar malware samples, we employ several clustering algorithms, including k-medoids and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Clustering experiments are conducted on a collection of real malware samples, and the results are evaluated against manual classifications provided by human malware analysts. Experiments show that it is indeed possible to accurately detect malware families via call graph clustering. We anticipate that in the future, call graphs can be used to analyse the emergence of new malware families, and ultimately to automate implementation of generic detection schemes.
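
The pipeline outlined in the abstract (pairwise similarity scores followed by density-based clustering) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy call graphs and their function names are invented, vertices are matched simply by equal names instead of by a matching that approximately minimizes the graph edit distance, the normalisation of the edit cost is one plausible choice, and the `eps` and `min_pts` values are arbitrary.

```python
def vertices(graph):
    """All function names in a call graph given as {caller: set_of_callees}."""
    return set(graph) | {callee for callees in graph.values() for callee in callees}


def edges(graph):
    """All (caller, callee) pairs in the call graph."""
    return {(caller, callee) for caller, callees in graph.items() for callee in callees}


def similarity(g, h):
    """Similarity in [0, 1] from an edit cost under a name-based matching.

    Assumption: functions with equal names are matched; unmatched vertices
    and edges each cost 1 (one insertion or deletion).
    """
    vg, vh, eg, eh = vertices(g), vertices(h), edges(g), edges(h)
    cost = len(vg ^ vh) + len(eg ^ eh)
    size = len(vg) + len(vh) + len(eg) + len(eh)
    return 1.0 - cost / size if size else 1.0


def dbscan(dist, eps, min_pts):
    """Plain DBSCAN on a precomputed distance matrix; -1 marks noise."""
    n = len(dist)
    labels = [None] * n  # None means "not visited yet"
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisionally noise
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(n) if dist[j][k] <= eps]
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
        cluster += 1
    return labels


# Hypothetical toy samples: two variants each of two invented families.
g_a1 = {"start": {"decrypt", "CreateFileA"}, "decrypt": {"xor_loop"}}
g_a2 = {"start": {"decrypt", "CreateFileA", "Sleep"}, "decrypt": {"xor_loop"}}
g_b1 = {"main": {"connect", "send"}}
g_b2 = {"main": {"connect", "send", "recv"}}
graphs = [g_a1, g_a2, g_b1, g_b2]

# Distance is taken as 1 - similarity for every pair of samples.
dist = [[1.0 - similarity(g, h) for h in graphs] for g in graphs]
labels = dbscan(dist, eps=0.2, min_pts=2)
print(labels)
```

With these toy inputs the two variants of each invented family end up in the same cluster. A name-based matching only works when function names survive in the binary, so a realistic implementation would need an approximate graph matching step in its place.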

Author information

Authors and Affiliations

Department of Information and Computer Science, Helsinki Institute for Information Technology Aalto University, P. O. Box 15400, 00076, Aalto, Finland
Joris Kinable & Orestis Kostakis
Department of Computer Science, Katholieke Universiteit Leuven (Kortrijk), Etienne Sabbelaan 53, 8500, Kortrijk, Belgium
Joris Kinable

Corresponding author

Correspondence to Joris Kinable.

Additional information

This research has been supported by TEKES—the Finnish Funding Agency for Technology and Innovation as part of its ICT SHOK Future Internet research programme, grant 40212/09.

About this article

Cite this article

Kinable, J., Kostakis, O. Malware classification based on call graph clustering. J Comput Virol 7, 233–245 (2011). https://doi.org/10.1007/s11416-011-0151-y

Received: 06 August 2010
Accepted: 07 January 2011
Published: 03 February 2011
Issue Date: November 2011
DOI: https://doi.org/10.1007/s11416-011-0151-y
