research-article

Semi-Supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection

Authors:

Maksim E. Eren
Advanced Research in Cyber Systems, Los Alamos National Laboratory, USA
ORCID: 0000-0002-4362-0256

Manish Bhattarai
Theoretical Division, Los Alamos National Laboratory, USA
ORCID: 0000-0002-1421-3643

Robert J. Joyce
Machine Learning Research Group, Booz Allen Hamilton, USA
ORCID: 0009-0003-7168-1237

Edward Raff
Machine Learning Research Group, Booz Allen Hamilton, USA
ORCID: 0000-0002-9900-1972

Charles Nicholas
Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County, USA
ORCID: 0000-0001-9494-7139

Boian S. Alexandrov
Theoretical Division, Los Alamos National Laboratory, USA
ORCID: 0000-0001-8636-4603

ACM Transactions on Privacy and Security, Volume 26, Issue 4, Article No. 48, pp. 1–27
https://doi.org/10.1145/3624567

Published: 13 November 2023

Abstract

Identifying the family to which a malware specimen belongs is essential to understanding its behavior and developing mitigation strategies. Solutions proposed by prior work, however, are often impractical because their evaluations omit realistic factors: learning under class imbalance, the ability to identify new malware, and the cost of production-quality labeled data. In practice, deployed models face prominent, rare, and new malware families, while obtaining a large quantity of up-to-date labeled malware for training can be expensive. In this article, we address these problems and propose a novel hierarchical semi-supervised algorithm, which we call the HNMFk Classifier, that can be used in the early stages of the malware family labeling process. Our method is based on non-negative matrix factorization with automatic model selection, that is, with an estimation of the number of clusters. With the HNMFk Classifier, we exploit the hierarchical structure of the malware data together with a semi-supervised setup, which enables us to classify malware families under conditions of extreme class imbalance. Our solution can make abstaining predictions (a reject option), which yields promising results in the identification of novel malware families and helps maintain model performance when only a small quantity of labeled data is available. We perform bulk classification of nearly 2,900 malware families, both rare and prominent, through static analysis, using nearly 388,000 samples from the EMBER-2018 corpus. In our experiments, we surpass both supervised and semi-supervised baseline models with an F1 score of 0.80.
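
The general idea of clustering with non-negative matrix factorization, labeling clusters from a few known samples, and abstaining otherwise can be illustrated with a minimal single-level sketch. This is not the HNMFk Classifier itself: the number of clusters `k` is fixed by hand rather than estimated, there is no hierarchy, and the abstention rule (predict only when all known labels in a cluster agree) as well as the names `nmf`, `propagate_labels`, `classify_with_abstention`, and `ABSTAIN` are assumptions made purely for illustration.

```python
import numpy as np

ABSTAIN = -1  # hypothetical marker meaning "unlabeled" on input and "no prediction" on output


def nmf(X, k, n_iter=300, seed=0, eps=1e-9):
    """Approximate non-negative X (features x samples) as W @ H.

    Uses the classic multiplicative updates for the Frobenius objective.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def propagate_labels(clusters, labels):
    """Give every sample in a cluster the cluster's label, or abstain.

    A cluster receives a label only if it contains at least one known
    label and all of its known labels agree.
    """
    clusters = np.asarray(clusters)
    labels = np.asarray(labels)
    preds = np.full(labels.shape, ABSTAIN)
    for c in np.unique(clusters):
        members = np.flatnonzero(clusters == c)
        known = labels[members]
        known = known[known != ABSTAIN]
        if known.size > 0 and np.all(known == known[0]):
            preds[members] = known[0]
    return preds


def classify_with_abstention(X, labels, k):
    """Cluster samples by their dominant NMF component, then propagate labels."""
    _, H = nmf(X, k)
    clusters = H.argmax(axis=0)  # each sample joins its largest component
    return propagate_labels(clusters, labels)
```

In this sketch, samples that fall into a cluster with no known labels, or with conflicting known labels, receive `ABSTAIN` rather than a forced guess, which mirrors the reject option described above only in spirit.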


Index Terms

1. Computing methodologies
  1. Machine learning
2. Security and privacy
  1. Intrusion/anomaly detection and malware mitigation
    1. Malware and its mitigation

Published in
ACM Transactions on Privacy and Security, Volume 26, Issue 4
November 2023
260 pages
ISSN: 2471-2566
EISSN: 2471-2574
DOI: 10.1145/3614236
Editor: Ninghui Li, Purdue University, USA
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Publisher
Association for Computing Machinery, New York, NY, United States
Publication History
- Published: 13 November 2023
- Online AM: 18 September 2023
- Accepted: 4 September 2023
- Revised: 20 October 2022
- Received: 1 January 2022

Author Tags
Malware
malware families
non-negative matrix factorization
semi-supervised
hierarchical
model selection
class imbalance
abstaining prediction
reject-option