Detection of unknown computer worms based on behavioral classification of the host

https://doi.org/10.1016/j.csda.2008.01.028

Abstract

Machine learning techniques are widely used in many fields. One application of machine learning in information security is the classification of a computer’s behavior as malicious or benign. Antivirus tools that rely on signature-based methods are helpless against new (unknown) computer worms. This paper focuses on the feasibility of accurately detecting unknown worm activity in individual computers while minimizing the required set of features collected from the monitored computer. A comprehensive experiment for testing the feasibility of detecting unknown computer worms, employing several computer configurations, background applications, and user activity, was performed. During the experiments, 323 computer features were monitored by an agent developed for this purpose. Four feature selection methods were used to reduce the number of features, and four learning algorithms were applied to the resulting feature subsets. The evaluation results suggest that with classification algorithms applied to only 20 features, the mean detection accuracy exceeded 90%, and for specific unknown worms accuracy exceeded 99%, while maintaining a low false positive rate.

Introduction

The detection of malicious code (malcode) transmitted over computer networks has been researched intensively in recent years (Kabiri and Ghorbani, 2005). One abundant type of malcode is worms, which proactively propagate across networks while exploiting vulnerabilities in operating systems and programs. Other types of malcodes include computer viruses, Trojan horses, spyware, and adware. In this study we focus on worms, though we plan to extend the proposed approach to other types of malcodes.

Nowadays, excellent technology (i.e., antivirus software packages) exists for detecting and eliminating known malicious code. Typically, antivirus software packages inspect each file that enters the system, looking for known signs (signatures) which uniquely identify an instance of known malcode. Nevertheless, antivirus technology is based on prior explicit knowledge of malcode signatures and cannot be used for detecting unknown malcode. Following the appearance of a new worm, a patch is provided by the operating system provider (if needed) and the antivirus vendors update their signature base accordingly. This solution is not perfect, since worms propagate very rapidly and, by the time the local antivirus software tools have been updated, costly damage has already been inflicted by the worm (Fosnock, 2005).

Intrusion detection at the network level, known as network-based intrusion detection (NIDS), has been researched substantially (Kabiri and Ghorbani, 2005). However, like any detection system, NIDS are limited in their detection capabilities. To detect malcodes that slip through the NIDS at the network level, detection operations are performed locally at the host level. Detection systems at the host level, known as host-based intrusion detection systems (HIDS), are currently very limited in their ability to detect unknown malcode.

Recent studies have proposed methods for detecting unknown malcode using machine learning techniques. Given a training set of malicious and benign executable binary code, a classifier is trained to identify unknown executables and classify them as malicious or benign (Schultz et al., 2001; Abou-Assaleh et al., 2004; Kolter and Maloof, 2006; Caia et al., 2007).

Existing methods rely on analysis of the binary to detect unknown malcode. Some less typical worms are left undetected. An additional detection layer at runtime is therefore required. The proposed approach assumes that malicious actions are reflected in the general behavior of the host. Thus, by monitoring the host, one can implicitly identify malcodes. This property can be used as an additional protection layer.

In this study, we focus on detecting the presence of a worm based on the behavior of the computer (host). Our suggested approach can be classified under HIDS. The main contribution of our approach is that the knowledge is acquired automatically using inductive learning, given a dataset of known worms, which avoids the need for manual knowledge acquisition. While the new approach does not prevent infection, it enables fast detection of an infection, which may result in an alert that the system administrator can reason about further. Further reasoning based on the network topology can be performed by a network and system administration function, and relevant decisions and policies, such as disconnecting a single computer or a cluster, can be applied.

Generally speaking, malcodes within the same category (e.g., worms, Trojans, spyware, adware) share similar characteristics and behavior patterns. These patterns are reflected in the infected computer’s behavior. Thus, we hypothesize that it is feasible to learn the computer behavior in the presence of a certain type of malcode, which can be measured through the collection of various parameters over time (CPU, memory, etc.). In the proposed approach, a classifier is trained with computer measurements from infected and uninfected computers. Based on the generalization capability of the learning algorithm, we argue that a classifier can further detect previously unknown worm activity. Nevertheless, this approach may be affected by the variance in computer and application configurations as well as user activity (running and using various applications) on each computer. In this study, we investigate whether unknown worm activity can be detected, at a high level of accuracy, given the variation in hardware and software environmental conditions on individual computers, while minimizing the set of monitored features.
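The idea of training on measurements from known worms and testing on a worm the classifier has never seen can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic, the three features and the worm names (wormA, wormB, wormC) are hypothetical, and a simple nearest-centroid classifier stands in for the learning algorithms used in the study, which are not named in this section.

```python
import random
from statistics import mean

def make_samples(rng, center, n):
    """Draw n synthetic measurement vectors around a center (unit Gaussian noise)."""
    return [[rng.gauss(c, 1.0) for c in center] for _ in range(n)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train(rows, labels):
    """Nearest-centroid model: one mean vector per class label."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    return {label: [mean(col) for col in zip(*group)]
            for label, group in by_class.items()}

def predict(model, row):
    """Return the label whose centroid is closest to the row."""
    return min(model, key=lambda label: sq_dist(model[label], row))

rng = random.Random(0)
# Hypothetical 3-feature measurements (e.g. CPU, memory, network activity).
CLEAN_CENTER = [1.0, 1.0, 1.0]
WORM_CENTERS = {  # hypothetical worms, not those used in the study
    "wormA": [6.0, 5.0, 7.0],
    "wormB": [5.0, 7.0, 6.0],
    "wormC": [7.0, 6.0, 5.0],
}

def leave_one_worm_out(held_out, n=50):
    """Train on clean hosts plus all worms except held_out; test on held_out."""
    train_rows = make_samples(rng, CLEAN_CENTER, n)
    train_labels = ["clean"] * n
    for name, center in WORM_CENTERS.items():
        if name != held_out:
            train_rows += make_samples(rng, center, n)
            train_labels += ["infected"] * n
    model = train(train_rows, train_labels)

    test_rows = (make_samples(rng, CLEAN_CENTER, n)
                 + make_samples(rng, WORM_CENTERS[held_out], n))
    test_labels = ["clean"] * n + ["infected"] * n
    preds = [predict(model, row) for row in test_rows]

    accuracy = sum(p == t for p, t in zip(preds, test_labels)) / len(test_labels)
    false_pos = sum(p == "infected" and t == "clean"
                    for p, t in zip(preds, test_labels))
    return accuracy, false_pos / test_labels.count("clean")

for worm in WORM_CENTERS:
    acc, fpr = leave_one_worm_out(worm)
    print(f"held-out {worm}: accuracy={acc:.2f}, false positive rate={fpr:.2f}")
```

The synthetic classes here are deliberately well separated, so the sketch says nothing about the accuracy achievable on real hosts; it only shows the shape of the evaluation, in which the test worm contributes no samples to training.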

In this paper we introduce three main contributions: (1) We show that current machine learning techniques are capable of detecting and classifying worms solely by monitoring the host activity. (2) Using feature selection techniques, we show that a relatively small set of features is sufficient for solving the problem without sacrificing accuracy. (3) We present empirical results from an extensive study of various machine configurations, suggesting that the proposed methods achieve high detection rates on previously unseen worms.
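Reducing a large set of monitored features to a small subset can be sketched with a filter-style ranking. The example below is an illustration under stated assumptions only: the data are synthetic, the Fisher score is swapped in as the ranking criterion (the four feature selection methods used in the study are not named in this section), and the choice that exactly the first 20 synthetic features respond to infection is hypothetical. The figures 323 and 20 are taken from the abstract.

```python
import random
from statistics import mean, pvariance

def fisher_score(values, labels):
    """Squared difference of class means divided by the sum of class variances."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = pvariance(pos) + pvariance(neg)
    return (mean(pos) - mean(neg)) ** 2 / spread if spread > 0 else 0.0

def top_k_features(rows, labels, k):
    """Rank feature indices by Fisher score and return the k best."""
    columns = list(zip(*rows))
    scores = [fisher_score(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]

rng = random.Random(1)
N_FEATURES = 323               # monitored features, as reported in the abstract
INFORMATIVE = set(range(20))   # assumption: only these synthetic features shift

rows, labels = [], []
for i in range(200):
    y = i % 2                  # 1 = infected, 0 = clean (synthetic labels)
    rows.append([rng.gauss(3.0 * y if j in INFORMATIVE else 0.0, 1.0)
                 for j in range(N_FEATURES)])
    labels.append(y)

selected = top_k_features(rows, labels, k=20)
print(sorted(selected))
```

A classifier would then be trained on the selected columns only, which is what keeps the monitoring overhead on the host small.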

The rest of the paper is structured as follows: in Section 2, a survey of the relevant background for this study is presented. The methods used in this study are described in Section 3, followed by the description of the experimental design in Section 4. In Section 5 we present the evaluation results, and we conclude with a summary and conclusions in Section 6.

Malicious code and worms

The term ‘malicious code’ (malcode) refers to a piece of code, not necessarily an executable file, intended to harm, whether generally or in particular, a specific owner (host). The approach suggested in this study aims at detecting any malcode activity, whether known or unknown. However, since our preliminary research is on worms, we will focus on them in this section.

Kienzle and Elder (2003) define a worm by several aspects through which it can be distinguished from other types of malcodes:

Methods

The goal of this study was to assess the feasibility of detecting unknown malicious code, in particular computer worms, based on the computer’s behavior (measurements), using machine learning techniques, and the potential accuracy of such methods. In order to create the datasets we built an isolated local network of computers, simulating a real Internet network which allows worms to propagate. This setup enabled us to inject worms into a controlled environment, while monitoring the computer

Experimental design

Our main goal in this study was to investigate whether the approach presented here, in which unknown malicious code is detected, based on the computer behavior (measurements), is feasible and enables a high level of accuracy when applied to a variety of computers. We defined four research questions accordingly:

  • Q1:

    In the detection of known malicious code, based on a computer’s measurements, using machine learning techniques, what is the achievable level of accuracy?

  • Q2:

    Is it possible to reduce the

Experiment I

Our objective in Experiment I (e1) was to determine the best feature selection approach, feature selection method, number of top features, and classification algorithm. We ran 132 evaluations (four classification algorithms applied to 33 feature sets), each comprising 64 runs, for a total of 8448 evaluation runs.
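The run count above follows directly from the design: 4 × 33 = 132 evaluations, and 132 × 64 = 8448 runs. A minimal sketch of that enumeration, with placeholder names for the algorithms and feature sets (the actual ones are not listed in this section):

```python
from itertools import product

# Placeholder identifiers; only the counts come from the text.
classifiers = [f"classifier_{i}" for i in range(1, 5)]      # 4 algorithms
feature_sets = [f"feature_set_{i}" for i in range(1, 34)]   # 33 feature sets
RUNS_PER_EVALUATION = 64

# One evaluation per (classifier, feature set) pair.
evaluations = list(product(classifiers, feature_sets))
total_runs = len(evaluations) * RUNS_PER_EVALUATION

print(len(evaluations), total_runs)  # prints: 132 8448
```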

Fig. 4 shows the mean accuracy (of all the classification algorithms) achieved for each environment feature consolidation (unified or averaged), each feature selection method, top 5, 10, 20, 30

Discussion and conclusions

In this paper we explored the feasibility of detecting unknown worm activity in individual computers, at a high level of accuracy, given the variation in hardware and software environmental conditions, while minimizing the set of features collected from the monitored computer. Four research questions were investigated, referring to the feasibility of the approach, the best settings, and the level of achieved accuracy, for which a dataset was created and several corresponding experiments were

Acknowledgments

This work was supported by Deutsche Telekom Co. We would like to thank the undergraduate students Shai and Ido and Clint Feher who contributed to the preparation of the dataset.

References (30)

  • Botha, M., et al., 2003. Utilising fuzzy logic and trend analysis for effective intrusion detection. Computers & Security.
  • Pearl, J., 1986. Fusion, propagation, and structuring in belief networks. Artificial Intelligence.
  • Abou-Assaleh, T., Cercone, N., Keselj, V., Sweidan, R., 2004. N-gram based detection of new malicious code. In:...
  • Barbara, D., Wu, N., Jajodia, S., 2001. Detecting novel network intrusions using Bayes estimators. In: Proceedings of...
  • Bishop, C., 1995. Neural Networks for Pattern Recognition.
  • Bridges, S.M., et al. Fuzzy data mining and genetic algorithms applied to intrusion detection.
  • Caia, D.M., et al., 2007. Comparison of feature selection and classification algorithms in identifying malicious executables. Computational Statistics & Data Analysis.
  • CERT. CERT Advisory CA-2000-04, Love letter worm....
  • Demuth, H., et al., 1998. Neural Network Toolbox for Use with Matlab.
  • Dickerson, J.E., Dickerson, J.A., 2000. Fuzzy network profiling for intrusion detection. In: Proceedings of NAFIPS 19th...
  • Domingos, P., et al., 1997. On the optimality of simple Bayesian classifier under zero-one loss. Machine Learning.
  • Fosnock, C., 2005. Computer worms: Past, present and future. East Carolina...
  • Hu, P.Z., et al.
  • Kabiri, P., et al., 2005. Research on intrusion detection and response: A survey. International Journal of Network Security.
  • Kayacik, H.G., et al. On the capability of a Som based intrusion detection system.
