Detection of unknown computer worms based on behavioral classification of the host

doi:10.1016/j.csda.2008.01.028

Computational Statistics & Data Analysis

Volume 52, Issue 9, 15 May 2008, Pages 4544-4566

https://doi.org/10.1016/j.csda.2008.01.028

Abstract

Machine learning techniques are widely used in many fields. One application of machine learning in information security is the classification of computer behavior as malicious or benign. Antivirus tools that rely on signature-based methods are helpless against new (unknown) computer worms. This paper focuses on the feasibility of accurately detecting unknown worm activity in individual computers while minimizing the required set of features collected from the monitored computer. A comprehensive experiment for testing the feasibility of detecting unknown computer worms, employing several computer configurations, background applications, and user activity, was performed. During the experiments, 323 computer features were monitored by an agent developed for this purpose. Four feature selection methods were used to reduce the number of features, and four learning algorithms were applied to the resulting feature subsets. The evaluation results suggest that, using classification algorithms applied to only 20 features, the mean detection accuracy exceeded 90%, and for specific unknown worms accuracy reached above 99%, while maintaining a low false positive rate.

Introduction

The detection of malicious code (malcode) transmitted over computer networks has been researched intensively in recent years (Kabiri and Ghorbani, 2005). One abundant type of malcode is the worm, which proactively propagates across networks while exploiting vulnerabilities in operating systems and programs. Other types of malcode include computer viruses, Trojan horses, spyware, and adware. In this study we focus on worms, though we plan to extend the proposed approach to other types of malcode.

Nowadays, excellent technology (i.e., antivirus software packages) exists for detecting and eliminating known malicious code. Typically, antivirus software packages inspect each file that enters the system, looking for known signs (signatures) that uniquely identify an instance of known malcode. Nevertheless, antivirus technology is based on prior explicit knowledge of malcode signatures and cannot be used for detecting unknown malcode. Following the appearance of a new worm, the operating system provider issues a patch (if needed) and the antivirus vendors update their signature bases accordingly. This solution is not perfect, since worms propagate very rapidly, and by the time the local antivirus software tools have been updated, costly damage has already been inflicted by the worm (Fosnock, 2005).

Intrusion detection at the network level, called network-based intrusion detection (NIDS), has been researched substantially (Kabiri and Ghorbani, 2005). However, NIDS are limited in their detection capabilities (like any detection system). In order to detect malcode that slips through the NIDS at the network level, detection operations are performed locally at the host level. Detection systems at the host level, called host-based intrusion detection systems (HIDS), are currently very limited in their ability to detect unknown malcode.

Recent studies have proposed methods for detecting unknown malcode using machine learning techniques. Given a training set of malicious and benign executable binary code, a classifier is trained to identify unknown malicious executables (Schultz et al., 2001; Abou-Assaleh et al., 2004; Kolter and Maloof, 2006; Caia et al., 2007).

Existing methods rely on analysis of the binary to detect unknown malcode, which leaves some less typical worms undetected. An additional detection layer at runtime is therefore required. The proposed approach assumes that malicious actions are reflected in the general behavior of the host. Thus, by monitoring the host, one can implicitly identify malcode. This property can serve as an additional protection layer.

In this study, we focus on detecting the presence of a worm based on the computer’s (host) behavior. Our suggested approach can be classified under HIDS. The main contribution of our approach is that the knowledge is acquired automatically using inductive learning from a dataset of known worms, which avoids the need for manual knowledge acquisition. While the new approach does not prevent infection, it enables fast detection of an infection and can raise an alert for the system administrator to investigate. Further reasoning based on the network topology can be performed by a network and system administration function, and relevant decisions and policies, such as disconnecting a single computer or a cluster, can be applied.

Generally speaking, malcodes within the same category (e.g., worms, Trojans, spyware, adware) share similar characteristics and behavior patterns. These patterns are reflected in the infected computer’s behavior. Thus, we hypothesize that it is feasible to learn the computer behavior in the presence of a certain type of malcode, which can be measured by collecting various parameters over time (CPU, memory, etc.). In the proposed approach, a classifier is trained with computer measurements from infected and uninfected computers. Based on the generalization capability of the learning algorithm, we argue that a classifier can further detect previously unknown worm activity. Nevertheless, this approach may be affected by the variance in computer and application configurations as well as user activity (running and using various applications) on each computer. In this study, we investigate whether unknown worm activity can be detected, at a high level of accuracy, given the variation in hardware and software environmental conditions on individual computers, while minimizing the set of monitored features.
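The train-on-known, test-on-unknown idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two feature names, the synthetic Gaussian measurements, the worm labels, and the choice of a Gaussian naive Bayes classifier (this excerpt does not name the four learning algorithms that were used). It is not the authors' implementation.

```python
import math
import random
from statistics import mean, pvariance

random.seed(0)  # fixed seed so the synthetic data is reproducible

def sample(center, n):
    """Draw n synthetic measurement vectors around a class centre (invented data)."""
    return [[random.gauss(c, 1.0) for c in center] for _ in range(n)]

# Hypothetical features: (cpu_load, outbound_connections). All values are made up.
benign = sample([2.0, 1.0], 200)
worms = {
    "worm_a": sample([6.0, 8.0], 100),
    "worm_b": sample([5.0, 9.0], 100),
    "worm_c": sample([7.0, 7.0], 100),  # never seen in training: the "unknown" worm
}

def fit_class(rows):
    """Estimate a per-feature mean and variance for one class."""
    return [(mean(col), pvariance(col) + 1e-9) for col in zip(*rows)]

def log_likelihood(x, params):
    """Log density of x under independent Gaussians, one per feature."""
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        for v, (mu, var) in zip(x, params)
    )

def train(benign_rows, infected_rows):
    """Build a two-class model: (log prior, per-feature Gaussian parameters)."""
    total = len(benign_rows) + len(infected_rows)
    return {
        "benign": (math.log(len(benign_rows) / total), fit_class(benign_rows)),
        "infected": (math.log(len(infected_rows) / total), fit_class(infected_rows)),
    }

def predict(model, x):
    """Pick the class with the highest log posterior."""
    return max(model, key=lambda c: model[c][0] + log_likelihood(x, model[c][1]))

# Train on known worms only; hold out worm_c and a slice of benign data for testing.
model = train(benign[:150], worms["worm_a"] + worms["worm_b"])
test_set = [(x, "benign") for x in benign[150:]] + [(x, "infected") for x in worms["worm_c"]]

predictions = [(predict(model, x), truth) for x, truth in test_set]
accuracy = sum(p == t for p, t in predictions) / len(predictions)
false_positives = sum(p == "infected" and t == "benign" for p, t in predictions)
false_positive_rate = false_positives / sum(t == "benign" for _, t in predictions)

print(f"accuracy on unknown worm: {accuracy:.3f}, false positive rate: {false_positive_rate:.3f}")
```

The synthetic classes are deliberately well separated, so the printed numbers say nothing about real worms; the point is only the evaluation protocol, in which the held-out worm contributes no samples to training.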

In this paper we introduce three main contributions: (1) We show that current machine learning techniques are capable of detecting and classifying worms solely by monitoring the host activity. (2) Using feature selection techniques, we show that a relatively small set of features is sufficient for solving the problem without sacrificing accuracy. (3) We present empirical results from an extensive study of various machine configurations suggesting that the proposed methods achieve high detection rates on previously unseen worms.
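As a rough illustration of the feature selection contribution, the sketch below ranks features by how well each one separates benign from infected measurements and keeps the top k. The Fisher score used here is a stand-in chosen for brevity; this excerpt does not name the four feature selection methods that were actually applied. The feature names and the synthetic data are likewise hypothetical.

```python
import random
from statistics import mean, pvariance

random.seed(1)  # fixed seed so the synthetic data is reproducible

def fisher_score(benign_values, infected_values):
    """Squared gap between class means divided by the summed class variances."""
    gap = mean(benign_values) - mean(infected_values)
    spread = pvariance(benign_values) + pvariance(infected_values)
    return gap * gap / (spread + 1e-12)

def top_k_features(benign_rows, infected_rows, names, k):
    """Return the k feature names with the highest Fisher score."""
    benign_cols = list(zip(*benign_rows))
    infected_cols = list(zip(*infected_rows))
    scores = {
        name: fisher_score(benign_cols[i], infected_cols[i])
        for i, name in enumerate(names)
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical feature names; the real agent monitored 323 features.
names = ["cpu_load", "outbound_connections", "free_memory", "disk_queue"]

def row(shift):
    """Synthetic sample: the first two features move under infection, the last two are noise."""
    return [
        random.gauss(2.0 + shift, 1.0),
        random.gauss(1.0 + 2.0 * shift, 1.0),
        random.gauss(5.0, 1.0),
        random.gauss(3.0, 1.0),
    ]

benign_rows = [row(0.0) for _ in range(300)]
infected_rows = [row(3.0) for _ in range(300)]

selected = top_k_features(benign_rows, infected_rows, names, k=2)
print(selected)
```

A filter-style ranking like this scores each feature independently of any classifier, which keeps selection cheap when the starting set is large; whether the study's methods were of this kind is not stated in this excerpt.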

The rest of the paper is structured as follows: in Section 2, a survey of the relevant background for this study is presented. The methods used in this study are described in Section 3, followed by the description of the experimental design in Section 4. In Section 5 we present the evaluation results, and in Section 6 we conclude with a summary and conclusions.

Section snippets

Malicious code and worms

The term ‘malicious code’ (malcode) refers to a piece of code, not necessarily an executable file, intended to harm, whether generally or in particular, a specific owner (host). The approach suggested in this study aims at detecting any malcode activity, whether known or unknown. However, since our preliminary research is on worms, we will focus on them in this section.

Kienzle and Elder (2003) define a worm by several aspects through which it can be distinguished from other types of malcodes:

Methods

The goal of this study was to assess the feasibility of detecting unknown malicious code, in particular computer worms, based on the computer’s behavior (measurements), using machine learning techniques, and the potential accuracy of such methods. In order to create the datasets we built an isolated local network of computers, simulating a real Internet network which allows worms to propagate. This setup enabled us to inject worms into a controlled environment, while monitoring the computer

Experimental design

Our main goal in this study was to investigate whether the approach presented here, in which unknown malicious code is detected, based on the computer behavior (measurements), is feasible and enables a high level of accuracy when applied to a variety of computers. We defined four research questions accordingly:

Q₁: In the detection of known malicious code, based on a computer’s measurements, using machine learning techniques, what is the achievable level of accuracy?
Q₂: Is it possible to reduce the

Experiment I

Our objective in $e_1$ was to determine the best feature selection approach, feature selection method, number of top features, and classification algorithm. We ran 132 evaluations (four classification algorithms applied to 33 feature sets), each comprising 64 runs, for a total of 8448 evaluation runs.
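The evaluation counts for Experiment I can be checked with a few lines of Python. The classifier and feature-set names below are placeholders; only the counts (4, 33, and 64) come from the text.

```python
from itertools import product

# Placeholder names; only the counts are taken from the text.
classifiers = [f"classifier_{i}" for i in range(1, 5)]     # 4 classification algorithms
feature_sets = [f"feature_set_{i}" for i in range(1, 34)]  # 33 feature sets
runs_per_evaluation = 64

# One evaluation per (classifier, feature set) pair.
evaluations = list(product(classifiers, feature_sets))
total_runs = len(evaluations) * runs_per_evaluation

print(len(evaluations), total_runs)  # prints: 132 8448
```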

Fig. 4 shows the mean accuracy (of all the classification algorithms) achieved for each environment feature consolidation (unified or averaged), each feature selection method, top 5, 10, 20, 30

Discussion and conclusions

In this paper we explored the feasibility of detecting unknown worm activity in individual computers, at a high level of accuracy, given the variation in hardware and software environmental conditions, while minimizing the set of features collected from the monitored computer. Four research questions were investigated, referring to the feasibility of the approach, the best settings, and the level of achieved accuracy, for which a dataset was created and several corresponding experiments were

Acknowledgments

This work was supported by Deutsche Telekom Co. We would like to thank the undergraduate students Shai and Ido and Clint Feher who contributed to the preparation of the dataset.

References (30)

M. Botha et al. Utilising fuzzy logic and trend analysis for effective intrusion detection. Computers & Security (2003)
J. Pearl. Fusion, propagation, and structuring in belief networks. Artificial Intelligence (1986)
Abou-Assaleh, T., Cercone, N., Keselj, V., Sweidan, R., 2004. N-gram based detection of new malicious code. In:...
Barbara, D., Wu, N., Jajodia, S., 2001. Detecting novel network intrusions using Bayes estimators. In: Proceedings of...
C. Bishop. Neural Networks for Pattern Recognition (1995)
S.M. Bridges et al. Fuzzy data mining and genetic algorithms applied to intrusion detection
D.M. Caia et al. Comparison of feature selection and classification algorithms in identifying malicious executables. Computational Statistics & Data Analysis (2007)
CERT. CERT Advisory CA-2000-04, Love letter worm....
H. Demuth et al. Neural Network Toolbox for Use with Matlab (1998)
Dickerson, J.E., Dickerson, J.A., 2000. Fuzzy network profiling for intrusion detection. In: Proceedings of NAFIPS 19th...
P. Domingos et al. On the optimality of simple Bayesian classifier under zero-one loss. Machine Learning (1997)
Fosnock, C., 2005. Computer worms: Past, present and future. East Carolina...
P.Z. Hu et al.
P. Kabiri et al. Research on intrusion detection and response: A survey. International Journal of Network Security (2005)
H.G. Kayacik et al. On the capability of a Som based intrusion detection system

