Efficient intrusion detection using representative instances

doi:10.1016/j.cose.2013.08.003

Computers & Security

Volume 39, Part B, November 2013, Pages 255-267

https://doi.org/10.1016/j.cose.2013.08.003

Highlights

• A new metric is proposed to measure the representative power of an instance.
• A new centroid-based partition strategy is proposed to partition data sets.
• Representative instances can be used to train a classification model.
• Training time can be greatly reduced when the representative instances are used.

Abstract

Because of their feasibility and effectiveness, artificial intelligence-based intrusion detection systems attract considerable interest from researchers. However, when confronted with large-scale data sets, many artificial intelligence-based intrusion detection systems could suffer from a high computational burden, even though the feature selection method can help to reduce the computational complexity. To improve the efficiency, we propose a representative instance selection method to preprocess the original data set before training a classifier, which is independent of the learning algorithm that is used for constructing the intrusion detection system. In this study, a new metric is introduced to measure the representative power of an instance with respect to its class. Based on an implementation of representativeness, we select the most representative instance in each subset divided by a novel centroid-based partitioning strategy, and then, we utilise the result as training data to build various intrusion detection models efficiently. Experimental results on a labelled flow-based data set introduced in 2009 show that ANN, KNN, SVM and Liblinear learning with a largely reduced set of representative instances can not only achieve high efficiency in detecting network attacks but also provide comparable detection performance in terms of the detection rate, precision, F-score and accuracy, as compared with four corresponding classifiers built with the original large data set.

Introduction

In recent decades, the explosive growth of the Internet has resulted in an increasing number of people using the Internet in their daily life and business, which has prompted researchers to work on various techniques to secure networks. However, these techniques have failed to fully protect networks from increasingly sophisticated attacks. As a result, an intrusion detection system (IDS) has become a crucial component of any network security infrastructure to detect network attacks before they induce widespread damage (Wu and Banzhaf, 2010).

IDS is a security measure that helps to identify a set of malicious actions that compromise the integrity, confidentiality, and availability of information resources (Dokas et al., 2002). From the viewpoint of classification, the main objective of building an IDS is to train a classifier that can categorise normal vs. intrusive data from the original data set (Toosi and Kahani, 2007). Two main approaches to intrusion detection are currently used, namely, misuse-based detection (detect what is known) and anomaly-based detection (detect what is different from what is known). The detailed advantages and disadvantages of these two approaches can be seen in the review in Garcia-Teodoro et al. (2009).

Recently, artificial intelligence-based IDSs have become an important part of intrusion detection research (Ying et al., 2012). To date, various artificial intelligence technologies have been applied to intrusion detection, such as naive Bayes classification (Schultz et al., 2001), K nearest neighbour (KNN) (Liao and Vemuri, 2002), decision trees (Bouzida et al., 2004), artificial neural networks (ANNs) (Yao et al., 2006), and support vector machines (SVMs) (Chen et al., 2005). Among these algorithms, SVM and KNN are the most commonly used for intrusion detection, and the SVM in particular is receiving growing attention in intrusion detection model design (Tsai et al., 2009). In addition, ANNs have been found by some researchers to be effective for intrusion detection because of their ability to produce good results in complex domains (Chen et al., 2005; Yao et al., 2006). The multilayer perceptron (MLP) is one of the most commonly used neural network architectures in many pattern recognition problems, including intrusion detection (Yao et al., 2006). Many studies have also pointed out that a hybrid approach can improve the detection capability of an IDS (Tsai et al., 2009). The idea behind a hybrid classifier is to combine several learning techniques to obtain better detection performance than a single classifier. Recently, many hybrid approaches (Chung and Wahid, 2012; Khor et al., 2012; Li et al., 2008; Tsai and Lin, 2010; Shon and Moon, 2007) have been proposed to construct IDSs, and some of them have achieved admirable detection performance.

For an IDS, effectiveness and efficiency are two key factors. Generally speaking, the effectiveness of an IDS can be evaluated by the detection rate, false positive rate, and accuracy, while the efficiency is usually measured by the response time during a network attack (Su, 2011). It is desirable to design an IDS that is both effective and computationally efficient. Moreover, the time complexity of training a classifier for intrusion detection should also be of concern, because many artificial intelligence algorithms (e.g., ANNs and SVMs) typically involve expensive computation during the training process (Horng et al., 2011; Pereira et al., 2012), especially when the amount of training data is large. As previously mentioned, much previous work has applied various artificial intelligence algorithms to intrusion detection, and some of these studies achieved favourable detection performance. However, the majority of artificial intelligence algorithms applied to intrusion detection could require a large amount of computational time when used directly on the raw data obtained from network traffic or other local or remote applications (Bouzida et al., 2004). One of the main reasons for this problem is that both the number of features extracted from raw network traffic data and the amount of network traffic data that an IDS must process are typically large.
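The effectiveness measures named above can be illustrated with a minimal Python sketch. It assumes the standard confusion-matrix definitions, with attacks as the positive class; the paper may define the measures slightly differently. The function name `detection_metrics` and the counts in the usage lines are hypothetical, and precision and F-score are included because the abstract also reports them.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(tp, fp, tn, fn):
    """Compute common IDS effectiveness measures from confusion-matrix counts.

    ASSUMPTION: attacks are the positive class.
      tp: attacks flagged as attacks    fn: attacks missed
      fp: normal traffic flagged        tn: normal traffic passed
    """
    detection_rate = _ratio(tp, tp + fn)        # also called recall
    false_positive_rate = _ratio(fp, fp + tn)
    precision = _ratio(tp, tp + fp)
    accuracy = _ratio(tp + tn, tp + fp + tn + fn)
    # F-score as the harmonic mean of precision and detection rate.
    f_score = _ratio(2 * precision * detection_rate, precision + detection_rate)
    return {
        "detection_rate": detection_rate,
        "false_positive_rate": false_positive_rate,
        "precision": precision,
        "accuracy": accuracy,
        "f_score": f_score,
    }


# Hypothetical counts, for illustration only.
metrics = detection_metrics(tp=90, fp=5, tn=895, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Returning 0.0 for an empty denominator is a design choice of this sketch, so that a test set with no attacks or no alarms does not raise an exception.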

To improve the efficiency of artificial intelligence-based IDSs, many researchers (Amiri et al., 2011; Bouzida et al., 2004; Chebrolu et al., 2005; Li et al., 2009; Sivatha Sindhu et al., 2012) have in recent years employed feature selection as a preprocessing phase to remove information redundancy and reduce the computational complexity. Essentially, feature selection can be seen as a method of data reduction (Chebrolu et al., 2005; Davis and Clark, 2011); it decreases the dimensionality of the feature space by eliminating redundant or irrelevant features. In this line of research, many feature selection algorithms have been used to optimise artificial intelligence-based IDSs, such as principal component analysis (Bouzida et al., 2004), chi-square (CS) (Tsang et al., 2007), and information gain (IG) (Kayacik et al., 2005). Indeed, using feature selection as a preprocessing phase can help to improve the efficiency of many artificial intelligence algorithms. However, because feature selection leaves the number of instances in the data set essentially unchanged, many artificial intelligence algorithms (e.g., ANN and SVM) would still incur a high computational cost when operating on data sets with a large number of instances in the intrusion detection domain, especially in the classifier training phase. Because the computational complexities (including training and testing) of many artificial intelligence algorithms are associated with the size of the training data set, and the performance unavoidably depends on the quality of the training data, it is favourable to have relatively small but high-quality training data sets.

The goal of this study is to select a small subset, which can be seen as the representative of the original training data set, to build various artificial intelligence-based IDSs that possess effectiveness as well as efficiency. For this purpose, we propose a novel method for selecting a high-quality subset that has far fewer instances from the original training data to build an intrusion detection model. This method can be considered to be a technique for data reduction; it is applied in the preprocessing phase, and thus, it is independent of the artificial intelligence algorithm that is used for building an intrusion detection model. Specifically, we first introduce a new concept called “representativeness”, to measure the representative power of an instance with respect to its class in a given data set, which can be used as a metric for selecting “representative” instances. Then, we develop a novel centroid-based strategy to partition each class of the original training data set into several subsets of equal size. Subsequently, we combine the most “representative” instance in each subset as a new training data set for building an intrusion detection model. Finally, we examine the feasibility and effectiveness of our representative selection method for intrusion detection by conducting several experiments on a labelled flow-based data set for intrusion detection, which was introduced in 2009.
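The three steps described above (partition each class, pick one instance per subset, combine the picks) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the partition here simply ranks instances by Euclidean distance to the class centroid and cuts the ranking into consecutive chunks, and the paper's representativeness metric, which is not reproduced in this text, is replaced by a stand-in (the instance with the smallest mean distance to the other members of its subset). All function names are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    dim = len(points[0])
    return [sum(p[k] for p in points) / len(points) for k in range(dim)]


def centroid_partition(points, subset_size):
    """Split one class into subsets of (nearly) equal size.

    ASSUMPTION: instances are ranked by distance to the class centroid and
    the ranking is cut into consecutive chunks; the last chunk may be smaller.
    Returns lists of indices into `points`.
    """
    c = centroid(points)
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], c))
    return [order[s:s + subset_size] for s in range(0, len(order), subset_size)]


def most_representative(points, idx):
    """Pick one index from `idx` using a stand-in representativeness score.

    ASSUMPTION: the score is the mean distance to the subset members
    (smaller is better), i.e. the subset medoid.
    """
    def mean_dist(i):
        return sum(math.dist(points[i], points[j]) for j in idx) / len(idx)
    return min(idx, key=mean_dist)


def select_representatives(X, y, subset_size):
    """Keep one instance per subset, class by class, as the reduced training set."""
    keep = []
    for label in sorted(set(y)):
        class_idx = [i for i, lab in enumerate(y) if lab == label]
        class_pts = [X[i] for i in class_idx]
        for chunk in centroid_partition(class_pts, subset_size):
            keep.append(class_idx[most_representative(class_pts, chunk)])
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: 10 instances of class 0 and 6 of class 1, two features each.
X = [[float(i), 0.0] for i in range(10)] + [[float(i), 5.0] for i in range(6)]
y = [0] * 10 + [1] * 6
X_small, y_small = select_representatives(X, y, subset_size=5)
print(len(X), "->", len(X_small), "instances")
```

Because the selection runs before any classifier is trained, the reduced pair `(X_small, y_small)` can be passed to any learning algorithm, which mirrors the algorithm-independence point made above.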

The remainder of this article is organised as follows. The formal definitions of the problem of intrusion detection and selecting a high-quality subset are given in Section 2. Section 3 presents a detailed description of each step of our method. Section 4 gives an analysis of the time complexity and memory requirement for intelligence algorithms with representative instances. The experimental setup and results of our method are given in Section 5. Finally, Section 6 draws conclusions and outlines future work.

Section snippets

Problem formulation

Similar to the studies in Benferhat et al. (2013), Khor et al. (2012), Pereira et al. (2012), Shon and Moon (2007) and Sivatha Sindhu et al. (2012), this article views intrusion detection as a supervised classification problem. Let $X = \{(x_{1}, Y_{1}), \dots, (x_{M}, Y_{M})\}$ be a labelled intrusion detection training data set with $M$ training instances, where $x_{i}$ represents an instance and can be defined, for example, over some $d$-dimensional feature space, such as $x_{i} = \{x_{i}^{1}, x_{i}^{2}, \dots, x_{i}^{d}\}$. Moreover, $\chi_{i} = \{x_{i}^{1}, x_{i}^{2}, \dots, x_{i}^{M}\}$ denotes

The proposed method

For many artificial intelligence algorithms, a larger amount of data typically leads to higher computational complexity, especially in the process of training a classifier. To decrease the size of the training data that is used directly for training the classifier, we propose a data reduction method to select a small subset from the original training data set. To successfully select a high-quality subset, a new metric is proposed to assess the capacity of the representative of each

Complexity analysis

To analyse the complexity, let $m$ be the number of classes in the training data set, and let $N_{i}$ $(1 \leq i \leq m)$ be the number of instances in class $i$. For each class, we partition the instances into $n_{i}$ subsets of equal size $N_{i}'$, where $n_{i} = N_{i} / N_{i}'$, and thus, the representative instance selection would be applied $n_{i}$ times. After calculating the distances between each instance and its centroid, $O(N_{i})$ is required to obtain the median of the distances for all of the instances; thus, this partition can be performed
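As a rough size check using the notation above, and assuming that exactly one representative is kept per subset (as the introduction describes) and that $N_{i}'$ divides $N_{i}$, the reduced training set would contain $M'$ instances, where $M'$ is a symbol introduced here for illustration:

$$M' = \sum_{i=1}^{m} n_{i} = \sum_{i=1}^{m} \frac{N_{i}}{N_{i}'}$$

For example, with hypothetical values $m = 2$, $N_{1} = 10000$, $N_{2} = 2000$ and $N_{1}' = N_{2}' = 100$, this gives $n_{1} = 100$, $n_{2} = 20$ and $M' = 120$, compared with $N_{1} + N_{2} = 12000$ instances in the original training set.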

Data sets

The data set used for evaluating our method for intrusion detection is a labelled flow-based data set (Sperotto et al., 2009) that is intended for training and evaluating IDSs and has also been widely used in recent publications (Liao et al., 2013; Sheikhan and Jadidi, 2012; Winter et al., 2011). This data set was captured in the University of Twente network, where there was a 10 Gbps optical Internet connection with an average load of 650 Mbps and peaks of up to 1.0 Gbps, and the

Conclusion

Although some artificial intelligence-based IDSs have shown good detection performance, they are not well suited to large-scale data because they require a large amount of time and memory. Many studies use feature selection for data reduction, to decrease the computational complexity. In this study, we adopt a new method to perform the task of data reduction. Here, representativeness was introduced as a new concept that is useful as a metric for measuring the representative power of an instance

Acknowledgement

The authors thank the anonymous referees for their valuable comments and suggestions, which improved the technical content and the presentation of the paper. This paper is supported by the National Natural Science Foundation of China under Grants 60972077 and 61303232, the Foundation of Henan Educational Committee under Grant Nos. 13A413750 and 13A413747, and the Natural Science Foundation of He'nan Province of China under Grant No. 132300410393.

Chun Guo received his B.S. degree in 2008 and his M.S. degree in 2011 in Communication Engineering and Communication and Information System respectively, both from Guizhou University, China. He is currently a Ph.D. candidate at Beijing University of Posts and Telecommunications, Beijing, China. His current research interests include pattern classification, machine learning, intrusion detection.

References (39)

F. Amiri et al., Mutual information-based feature selection for intrusion detection systems, J Netw Comput Appl (2011)

J.R. Cano et al., Stratification for scaling up evolutionary prototype selection, Pattern Recogn Lett (2005)

S. Chebrolu et al., Feature deduction and ensemble design of intrusion detection systems, Comput Secur (2005)

W.H. Chen et al., Application of SVM and ANN for intrusion detection, Comput Oper Res (2005)

J.J. Davis et al., Data preprocessing for anomaly based network intrusion detection: a review, Comput Secur (2011)

P. Garcia-Teodoro et al., Anomaly-based network intrusion detection: techniques, systems and challenges, Comput Secur (2009)

S.J. Horng et al., A novel intrusion detection system based on hierarchical clustering and support vector machines, Expert Syst Appl (2011)

Y. Li et al., Building lightweight intrusion detection system using wrapper-based feature selection mechanisms, Comput Secur (2009)

Y. Liao et al., Use of K-nearest neighbor classifier for intrusion detection, Comput Secur (2002)

C.R. Pereira et al., An Optimum-Path Forest framework for intrusion detection in computer networks, Eng Appl Artif Intel (2012)

T. Shon et al., A hybrid machine learning approach to network anomaly detection, Inform Sciences (2007)

S.S. Sivatha Sindhu et al., Decision tree based light weight intrusion detection using a wrapper approach, Expert Syst Appl (2012)

M.Y. Su, Using clustering to improve the KNN-based classifiers for online anomaly network traffic identification, J Netw Comput Appl (2011)

A.N. Toosi et al., A new approach to intrusion detection based on an evolutionary soft computing model using neuro-fuzzy classifiers, Comput Commun (2007)

C.F. Tsai et al., Intrusion detection by machine learning: a review, Expert Syst Appl (2009)

C.F. Tsai et al., A triangle area based nearest neighbors approach to intrusion detection, Pattern Recogn (2010)

C.H. Tsang et al., Genetic-fuzzy rule mining approach and evaluation of feature selection techniques for anomaly intrusion detection, Pattern Recogn (2007)

S.X. Wu et al., The use of computational intelligence in intrusion detection systems: a review, Appl Soft Comput (2010)

S. Benferhat et al., An intrusion detection and alert correlation approach based on revising probabilistic classifiers using expert knowledge, Appl Intell (2013)


Ya-Jian Zhou was born in China in 1971. He received his Ph.D. degree in communications engineering from Xidian University, Xi'an, in 2003. He is currently an associate professor in the School of Computer Science and Technology of Beijing University of Posts and Telecommunications. His main research interests include mobile communications, security of wireless networks, security of databases, and cryptography theory and its application.

Yuan-Ping is currently a lecturer in Xuchang University, Xuchang, China. His current research interests include pattern classification, machine learning, text categorisation, security protocol and operating system security.

Shou-Shan Luo was born in Anhui, P.R. China, in 1962. He is a professor and Ph.D. supervisor at the Information Security Center, Beijing University of Posts and Telecommunications. His research interests include information security, network security and secure multi-party computation.

Yu-Ping Lai is currently a Ph.D. candidate at Beijing University of Posts and Telecommunications, Beijing, China. His current research interests include pattern classification, machine learning, text categorisation.

Zhong-Kun Zhang is currently a graduate student at Beijing University of Posts and Telecommunications, Beijing, China. His current research interests include pattern classification, machine learning, text categorisation.
