research-article

Semi-Supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection

Published: 13 November 2023

Abstract

Identifying the family to which a malware specimen belongs is essential to understanding the behavior of the malware and developing mitigation strategies. Solutions proposed by prior work, however, are often impractical because their evaluations lack realistic factors. These factors include learning under class imbalance, the ability to identify new malware, and the cost of production-quality labeled data. In practice, deployed models face prominent, rare, and new malware families. At the same time, obtaining a large quantity of up-to-date labeled malware for training a model can be expensive. In this article, we address these problems and propose a novel hierarchical semi-supervised algorithm, which we call the HNMFk Classifier, that can be used in the early stages of the malware family labeling process. Our method is based on non-negative matrix factorization with automatic model selection, that is, with an estimation of the number of clusters. With the HNMFk Classifier, we exploit the hierarchical structure of the malware data together with a semi-supervised setup, which enables us to classify malware families under conditions of extreme class imbalance. Our solution supports abstaining predictions (a reject option), which yields promising results in the identification of novel malware families and helps maintain the performance of the model when only a small quantity of labeled data is used. We perform bulk classification of nearly 2,900 malware families, both rare and prominent, through static analysis, using nearly 388,000 samples from the EMBER-2018 corpus. In our experiments, we surpass both supervised and semi-supervised baseline models with an F1 score of 0.80.
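
As a rough illustration of the ideas named in the abstract (NMF-based clustering, a semi-supervised setup, and abstaining predictions), the following Python sketch clusters samples with plain NMF and propagates known labels within each cluster, abstaining where a cluster has no labels or mixed labels. It is not the HNMFk Classifier: the function names, the fixed `k`, the `min_purity` threshold, and the use of `-1` for "unlabeled" and "abstain" are assumptions made for illustration, and both the hierarchical factorization and the automatic estimation of the number of clusters are omitted.

```python
import numpy as np


def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate X (features x samples) as W @ H with non-negative factors.

    Uses the classic multiplicative updates for the Frobenius-norm objective.
    """
    rng = np.random.default_rng(seed)
    n_features, n_samples = X.shape
    W = rng.random((n_features, k)) + eps
    H = rng.random((k, n_samples)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def classify_with_abstention(X, labels, k, min_purity=1.0):
    """Propagate known labels inside NMF clusters; return -1 to abstain.

    labels: integer array, one entry per sample (column of X), -1 if unlabeled.
    A cluster passes its majority label to all members only if the labeled
    members agree at a rate of at least min_purity (hypothetical threshold).
    """
    _, H = nmf(X, k)
    clusters = H.argmax(axis=0)          # assign each sample to its dominant factor
    preds = np.full(labels.shape, -1)    # start by abstaining everywhere
    for c in range(k):
        members = clusters == c
        known = labels[members & (labels >= 0)]
        if known.size == 0:
            continue                     # no labeled sample in this cluster: abstain
        values, counts = np.unique(known, return_counts=True)
        if counts.max() / known.size >= min_purity:
            preds[members] = values[counts.argmax()]
    return preds
```

In this sketch, samples that fall into a cluster without any labeled member keep the prediction `-1`, which loosely mirrors how a reject option can flag candidates for novel families instead of forcing a known label onto them.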

Published in

ACM Transactions on Privacy and Security, Volume 26, Issue 4
November 2023
260 pages
ISSN: 2471-2566
EISSN: 2471-2574
DOI: 10.1145/3614236
Editor: Ninghui Li

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 13 November 2023
• Online AM: 18 September 2023
• Accepted: 4 September 2023
• Revised: 20 October 2022
• Received: 1 January 2022