Efficient intrusion detection using representative instances
Introduction
In recent decades, the explosive growth of the Internet has resulted in an increasing number of people using the Internet in their daily life and business, which has prompted researchers to work on various techniques to secure networks. However, these techniques have failed to fully protect networks from increasingly sophisticated attacks. As a result, an intrusion detection system (IDS) has become a crucial component of any network security infrastructure to detect network attacks before they induce widespread damage (Wu and Banzhaf, 2010).
An IDS is a security measure that helps to identify malicious actions that compromise the integrity, confidentiality, and availability of information resources (Dokas et al., 2002). From the viewpoint of classification, the main objective of building an IDS is to train a classifier that can distinguish normal from intrusive data in the original data set (Toosi and Kahani, 2007). Two main approaches to intrusion detection are currently used: misuse-based detection (detect what is known) and anomaly-based detection (detect what is different from what is known). The advantages and disadvantages of these two approaches are reviewed in detail by Garcia-Teodoro et al. (2009).
Recently, artificial intelligence-based techniques have become an important part of IDS research (Ying et al., 2012). To date, various artificial intelligence technologies have been applied to intrusion detection, such as naive Bayes classification (Schultz et al., 2001), K nearest neighbour (KNN) (Liao and Vemuri, 2002), decision trees (Bouzida et al., 2004), artificial neural networks (ANNs) (Yao et al., 2006), and support vector machines (SVMs) (Chen et al., 2005). Among these algorithms, SVM and KNN are the most commonly used for intrusion detection, and the SVM in particular is receiving increasing attention in intrusion detection model design (Tsai et al., 2009). In addition, some researchers have found ANNs to be effective for intrusion detection because of their ability to produce good results in complex domains (Chen et al., 2005, Yao et al., 2006). The multilayer perceptron (MLP) is one of the most commonly used neural network architectures in many pattern recognition problems, including intrusion detection (Yao et al., 2006). Many studies have also pointed out that a hybrid approach can improve the detection capability of an IDS (Tsai et al., 2009). The idea behind a hybrid classifier is to combine several learning techniques to obtain better detection performance than a single classifier. Recently, many hybrid approaches (Chung and Wahid, 2012, Khor et al., 2012, Li et al., 2008, Tsai and Lin, 2010, Shon and Moon, 2007) have been proposed for constructing IDSs, and some of them have achieved admirable detection performance.
With respect to IDS, effectiveness and efficiency are undoubtedly two key factors. Generally speaking, the effectiveness of an IDS can be evaluated by the detection rate, false positive rate, and accuracy, while the efficiency is usually measured by the response time during a network attack (Su, 2011). It is clearly desirable to design an IDS that is both effective and computationally efficient. Moreover, the time complexity of training a classifier for intrusion detection is also a concern, because many artificial intelligence algorithms (e.g., ANNs and SVMs) typically involve expensive computation during the training process (Horng et al., 2011, Pereira et al., 2012), especially when the amount of training data is large. As previously mentioned, much previous work has applied various artificial intelligence algorithms to intrusion detection, and some of these studies achieved favourable detection performance. However, the majority of artificial intelligence algorithms applied to intrusion detection can require a large amount of computational time when used directly on the raw data obtained from network traffic or other local or remote applications (Bouzida et al., 2004). One of the main reasons for this problem is that both the number of features extracted from raw network traffic data and the amount of network traffic data that an IDS must process are typically large.
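The three effectiveness measures named above can be made concrete with a small sketch. It assumes the common confusion-matrix definitions (detection rate as the fraction of attacks that are flagged, false positive rate as the fraction of normal instances that are flagged); the function names and the label convention (1 for intrusive, 0 for normal) are illustrative assumptions, not taken from the article.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # intrusive instances flagged as intrusive
    fp: int  # normal instances flagged as intrusive
    tn: int  # normal instances passed as normal
    fn: int  # intrusive instances missed


def confusion(y_true, y_pred, positive=1):
    """Count the four outcomes; `positive` marks the intrusive class (assumed label)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return Confusion(tp, fp, tn, fn)


def detection_rate(c):
    """Fraction of intrusive instances that were detected."""
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def false_positive_rate(c):
    """Fraction of normal instances that were wrongly flagged."""
    return c.fp / (c.fp + c.tn) if (c.fp + c.tn) else 0.0


def accuracy(c):
    """Fraction of all instances that were classified correctly."""
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


if __name__ == "__main__":
    c = confusion([1, 1, 1, 0, 0, 0, 0, 1], [1, 0, 1, 0, 1, 0, 0, 1])
    print(detection_rate(c), false_positive_rate(c), accuracy(c))
```

Reporting the detection rate and the false positive rate together matters because accuracy alone can look high on traffic that is dominated by one class.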
To improve the efficiency of artificial intelligence-based IDS, many researchers (Amiri et al., 2011, Bouzida et al., 2004, Chebrolu et al., 2005, Li et al., 2009, Sivatha Sindhu et al., 2012) have employed feature selection as a preprocessing phase to remove redundant information and reduce the computational complexity. Essentially, feature selection can be seen as a method of data reduction (Chebrolu et al., 2005, Davis and Clark, 2011); it decreases the dimensionality of the feature space by eliminating redundant or irrelevant features. In this line of research, many feature selection algorithms have been used to optimise artificial intelligence-based IDS, such as principal component analysis (Bouzida et al., 2004), chi-square (CS) (Tsang et al., 2007), and information gain (IG) (Kayacik et al., 2005). Indeed, using feature selection as a preprocessing phase can help to improve the efficiency of many artificial intelligence algorithms. However, because feature selection leaves the number of instances in the data set essentially unchanged, many artificial intelligence algorithms (e.g., ANN and SVM) still incur a high computational cost when operating on intrusion detection data sets with a large number of instances, especially in the classifier training phase. Because the computational complexity (for both training and testing) of many artificial intelligence algorithms depends on the size of the training data set, and their performance unavoidably depends on the quality of the training data, it is favourable to have relatively small but high-quality training data sets.
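As an illustration of feature selection as column-wise data reduction, the sketch below ranks discrete features by information gain, one of the criteria mentioned above. It uses the textbook definition IG = H(Y) - H(Y | X) and is not the procedure of any cited study; the function names and the toy flow records are assumptions made for the example, and continuous features would need to be discretised first.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by knowing one discrete feature."""
    n = len(labels)
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(rows, labels):
    """Return feature indices sorted by decreasing information gain, plus the scores."""
    n_features = len(rows[0])
    scores = [
        information_gain([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


if __name__ == "__main__":
    # Hypothetical toy records: (protocol, packet-count bucket).
    rows = [("tcp", "low"), ("tcp", "high"), ("udp", "low"), ("udp", "high")]
    labels = ["attack", "attack", "normal", "normal"]
    print(rank_features(rows, labels))
```

Note that dropping the low-ranked columns leaves the number of rows untouched, which is exactly the limitation the paragraph above points out.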
The goal of this study is to select a small subset, which can be seen as the representative of the original training data set, to build various artificial intelligence-based IDSs that possess effectiveness as well as efficiency. For this purpose, we propose a novel method for selecting a high-quality subset that has far fewer instances from the original training data to build an intrusion detection model. This method can be considered to be a technique for data reduction; it is applied in the preprocessing phase, and thus, it is independent of the artificial intelligence algorithm that is used for building an intrusion detection model. Specifically, we first introduce a new concept called “representativeness”, to measure the representative power of an instance with respect to its class in a given data set, which can be used as a metric for selecting “representative” instances. Then, we develop a novel centroid-based strategy to partition each class of the original training data set into several subsets of equal size. Subsequently, we combine the most “representative” instance in each subset as a new training data set for building an intrusion detection model. Finally, we examine the feasibility and effectiveness of our representative selection method for intrusion detection by conducting several experiments on a labelled flow-based data set for intrusion detection, which was introduced in 2009.
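The overall shape of such an instance-reduction step can be sketched as follows. The sketch rests on stated assumptions, because this excerpt does not give the formal definitions: "representativeness" is approximated here by closeness to the subset centroid, and the centroid-based partition is approximated by repeatedly halving each class at the median distance to its centroid. The names `partition_class`, `most_representative`, `select_representatives`, and the `depth` parameter are hypothetical.

```python
import numpy as np


def partition_class(X, depth):
    """Split the rows of one class into up to 2**depth index subsets.

    Assumption: each subset is halved at the median distance to its own centroid.
    """
    subsets = [np.arange(len(X))]
    for _ in range(depth):
        next_subsets = []
        for idx in subsets:
            if len(idx) < 2:
                next_subsets.append(idx)
                continue
            centroid = X[idx].mean(axis=0)
            dist = np.linalg.norm(X[idx] - centroid, axis=1)
            order = np.argsort(dist, kind="stable")
            half = len(idx) // 2
            next_subsets.append(idx[order[:half]])  # nearer half
            next_subsets.append(idx[order[half:]])  # farther half
        subsets = next_subsets
    return subsets


def most_representative(X, idx):
    """Stand-in metric: the instance closest to the centroid of its subset."""
    centroid = X[idx].mean(axis=0)
    dist = np.linalg.norm(X[idx] - centroid, axis=1)
    return idx[np.argmin(dist)]


def select_representatives(X, y, depth=2):
    """Keep one instance per subset per class; returns the reduced (X, y)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    chosen = []
    for label in np.unique(y):
        class_idx = np.flatnonzero(y == label)
        X_class = X[class_idx]
        for subset in partition_class(X_class, depth):
            chosen.append(class_idx[most_representative(X_class, subset)])
    chosen = np.sort(np.array(chosen))
    return X[chosen], y[chosen]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 3))
    y = np.array([0] * 16 + [1] * 24)
    X_small, y_small = select_representatives(X, y, depth=2)
    print(X_small.shape, y_small)
```

Because the reduction happens before training, the reduced set can be passed to any classifier (KNN, SVM, MLP, and so on) without changing the learning algorithm itself.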
The remainder of this article is organised as follows. The formal definitions of the problem of intrusion detection and selecting a high-quality subset are given in Section 2. Section 3 presents a detailed description of each step of our method. Section 4 gives an analysis of the time complexity and memory requirement for intelligence algorithms with representative instances. The experimental setup and results of our method are given in Section 5. Finally, Section 6 draws conclusions and outlines future work.
Section snippets
Problem formulation
Similar to the studies in Benferhat et al. (2013), Khor et al. (2012), Pereira et al. (2012), Shon and Moon (2007), and Sivatha Sindhu et al. (2012), this article views intrusion detection as a supervised classification problem. Consider a labelled intrusion detection training data set with M training instances, where xi represents an instance and can be defined, for example, over some d-dimensional feature space. Moreover, denotes
The proposed method
For many artificial intelligence algorithms, a larger amount of data typically leads to higher computational complexity, especially when training a classifier. To decrease the size of the training data that is used directly for training the classifier, we propose a data reduction method to select a small subset from the original training data set. To successfully select a high-quality subset, a new metric is proposed to assess the capacity of the representative of each
Complexity analysis
To analyse the complexity, let m be the number of classes in the training data set, and let Ni be the number of instances in each class. For each class, we partition the instances into ni subsets of equal size, and thus, the representative instance selection would be applied ni times. After calculating the distances between each instance and its centroid, O(Ni) is required to obtain the median of the distances for all of the instances; thus, this partition can be performed
Data sets
The data set used for evaluating our method for intrusion detection is a labelled flow-based data set (Sperotto et al., 2009) that is intended for training and evaluating IDS and has also been widely used in recent publications (Liao et al., 2013, Sheikhan and Jadidi, 2012, Winter et al., 2011). This data set was captured in the University of Twente network, where there was a 10 Gbps optical Internet connection with an average load of 650 Mbps and peaks of up to 1.0 Gbps, and the
Conclusion
Although some artificial intelligence-based IDSs have shown good detection performance, they are not well suited to large-scale data because they require a large amount of time and memory. Many studies use feature selection for data reduction, to decrease the computational complexity. In this study, we adopt a new method to perform the task of data reduction. Here, representativeness was introduced as a new concept that is useful as a metric for measuring the representative power of an instance
Acknowledgement
The authors thank the anonymous referees for their valuable comments and suggestions, which improved the technical content and the presentation of the paper. This paper is supported by the National Natural Science Foundation of China under Grant 60972077, 61303232, the Foundation of Henan Educational Committee under Grant No. 13A413750, 13A413747, and the Natural Science Foundation of He'nan Province of China under Grant No. 132300410393.
Chun Guo received his B.S. degree in 2008 and his M.S. degree in 2011 in Communication Engineering and Communication and Information System respectively, both from Guizhou University, China. He is currently a Ph.D. candidate at Beijing University of Posts and Telecommunications, Beijing, China. His current research interests include pattern classification, machine learning, intrusion detection.
References (39)
- et al. Mutual information-based feature selection for intrusion detection systems. J Netw Comput Appl (2011)
- et al. Stratification for scaling up evolutionary prototype selection. Pattern Recogn Lett (2005)
- et al. Feature deduction and ensemble design of intrusion detection systems. Comput Secur (2005)
- et al. Application of SVM and ANN for intrusion detection. Comput Oper Res (2005)
- et al. Data preprocessing for anomaly based network intrusion detection: a review. Comput Secur (2011)
- et al. Anomaly-based network intrusion detection: techniques, systems and challenges. Comput Secur (2009)
- et al. A novel intrusion detection system based on hierarchical clustering and support vector machines. Expert Syst Appl (2011)
- et al. Building lightweight intrusion detection system using wrapper-based feature selection mechanisms. Comput Secur (2009)
- et al. Use of K-nearest neighbor classifier for intrusion detection. Comput Secur (2002)
- et al. An Optimum-Path Forest framework for intrusion detection in computer networks. Eng Appl Artif Intel (2012)
- A hybrid machine learning approach to network anomaly detection. Inform Sciences
- Decision tree based light weight intrusion detection using a wrapper approach. Expert Syst Appl
- Using clustering to improve the KNN-based classifiers for online anomaly network traffic identification. J Netw Comput Appl
- A new approach to intrusion detection based on an evolutionary soft computing model using neuro-fuzzy classifiers. Comput Commun
- Intrusion detection by machine learning: a review. Expert Syst Appl
- A triangle area based nearest neighbors approach to intrusion detection. Pattern Recogn
- Genetic-fuzzy rule mining approach and evaluation of feature selection techniques for anomaly intrusion detection. Pattern Recogn
- The use of computational intelligence in intrusion detection systems: a review. Appl Soft Comput
- An intrusion detection and alert correlation approach based on revising probabilistic classifiers using expert knowledge. Appl Intell
Ya-Jian Zhou was born in China in 1971. He received his Ph.D. degree in communications engineering from Xidian University, Xi'an, in 2003. He is currently an associate professor in the School of Computer Science and Technology of Beijing University of Posts and Telecommunications. His main research interests include mobile communications, security of wireless networks, security of databases, and cryptography theory and its application.
Yuan-Ping is currently a lecturer in Xuchang University, Xuchang, China. His current research interests include pattern classification, machine learning, text categorisation, security protocol and operating system security.
Shou-Shan Luo was born in Anhui, P.R. China, in 1962. He is a professor and Ph.D. supervisor at the Information Security Center, Beijing University of Posts and Telecommunications. His research interests include information security, network security, and secure multi-party computation.
Yu-Ping Lai is currently a Ph.D. candidate at Beijing University of Posts and Telecommunications, Beijing, China. His current research interests include pattern classification, machine learning, text categorisation.
Zhong-Kun Zhang is currently a graduate student at Beijing University of Posts and Telecommunications, Beijing, China. His current research interests include pattern classification, machine learning, text categorisation.