Characterizing network traffic by means of the NetMine framework
Introduction
Due to the continuous growth in network speed, terabytes of data may be transferred through a network every day. Two major issues therefore hamper network data capture and analysis: (i) a huge amount of data can be collected in a very short time (e.g., a minimum-size Ethernet frame at 10 Gbps is received in less than 70 ns), and (ii) it is hard to identify correlations and detect anomalies in real time on such large network traffic traces. New, efficient techniques able to deal with huge volumes of network traffic data need to be devised. A significant effort has been devoted to the application of data mining techniques to network traffic analysis [9]. The application domains include studying correlations among data (e.g., association rule extraction for network traffic characterization [5], [17] or for router misconfiguration detection [22]), extracting information for prediction (e.g., multilevel traffic classification [19], Naive Bayes classification [27]), grouping network data with similar properties (e.g., clustering algorithms for intrusion detection [32], or for classification [12], [13], [36]), and context-specific applications (e.g., multi-level association rules in spatial databases [23]). Data mining techniques play an important role in intrusion detection systems [34], [30], where association rules are successfully exploited for anomaly identification [31].
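The 70 ns figure quoted above can be checked with a short calculation. The sketch below assumes a minimum-size Ethernet frame (64 bytes) plus the 8-byte preamble and start-of-frame delimiter and the 12-byte inter-frame gap; these sizes and all names in the code are assumptions introduced here for illustration, since the text does not state which frame size it refers to.

```python
# Back-of-the-envelope check of the per-frame time budget at 10 Gbps.
# Assumption: minimum-size Ethernet frame (64 bytes) plus the 8-byte
# preamble/start-of-frame delimiter and the 12-byte inter-frame gap.
MIN_FRAME_BYTES = 64
PREAMBLE_SFD_BYTES = 8
INTERFRAME_GAP_BYTES = 12
LINK_RATE_BPS = 10_000_000_000  # 10 Gbps


def frame_time_ns(num_bytes: int, rate_bps: int) -> float:
    """Time on the wire, in nanoseconds, for the given number of bytes."""
    return num_bytes * 8 / rate_bps * 1e9


wire_bytes = MIN_FRAME_BYTES + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES
slot_ns = frame_time_ns(wire_bytes, LINK_RATE_BPS)
print(f"{wire_bytes} bytes on the wire take {slot_ns:.1f} ns")
```

Under these assumptions a back-to-back stream of minimum-size frames leaves roughly 67 ns of processing time per frame, which is consistent with the bound given in the text.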
While supervised classification algorithms require previous knowledge of the application domain (e.g., a labeled traffic trace), association rule extraction does not. Hence, the latter is a widely used exploratory technique which may be exploited to highlight hidden knowledge in network flows. The extraction process is driven by enforcing a minimum frequency (i.e., support) constraint on the mined correlations. However, to discover potentially relevant knowledge, a very low support constraint has to be enforced, which generates an unmanageably large number of rules [5]. To address this issue, a higher-level abstraction than traditional association rules and a summarized representation of network traffic data are needed.
This work presents a framework, called NetMine, which performs network traffic analysis by means of data mining techniques to characterize traffic data and detect anomalies. NetMine performs (i) on-line stream analysis to aggregate and filter network traffic, (ii) refinement analysis to discover relationships among captured data, and (iii) rule classification into different semantic groups. On-line stream analysis runs concurrently with data capture by means of user-defined continuous queries. Continuous queries [4] are an efficient technique for real-time aggregation and filtering; they thus allow an effective reduction of the amount of network data to be analyzed. The result of this step is a meaningful summary of network traffic data appropriate for pattern discovery. The refinement analysis discovers traffic features from network data summaries by means of generalized association rule extraction. Differently from previous approaches [15], NetMine generalized rule extraction is support driven, i.e., rules are generalized only when lower levels in the taxonomy are below a frequency threshold. Thus, in NetMine generalization is performed automatically and only when needed. Finally, extracted rules may be classified into several semantic categories, depending on the traffic features they describe. NetMine's final output is a set of categorized and generalized rules [18] which characterize network traffic and show correlations and recurring patterns among data. Experiments performed on different network dumps showed the efficiency and effectiveness of the NetMine framework in characterizing traffic data and highlighting meaningful features.
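To make the aggregation and filtering step more concrete, the following is a minimal sketch of a continuous-query-style summary computed over fixed, non-overlapping time windows. It is not NetMine's query engine: the `Packet` record layout, the `windowed_flow_summary` function, the `window_s` and `min_packets` parameters, and the client addresses in the sample trace are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple


class Packet(NamedTuple):
    ts: float       # capture time in seconds
    src_ip: str
    dst_ip: str
    dst_port: int
    size: int       # bytes


FlowKey = Tuple[str, str, int]
Summary = Dict[FlowKey, Tuple[int, int]]


def windowed_flow_summary(
    packets: Iterable[Packet], window_s: float, min_packets: int
) -> Iterator[Tuple[int, Summary]]:
    """Yield (window index, {flow: (packet count, byte count)}) per window.

    Packets are assumed to arrive in timestamp order. Flows with fewer than
    min_packets packets in a window are filtered out of that window.
    """
    current = None
    acc: Dict[FlowKey, List[int]] = defaultdict(lambda: [0, 0])

    def snapshot() -> Summary:
        # Filtering step: keep only flows that are active enough.
        return {k: (c, b) for k, (c, b) in acc.items() if c >= min_packets}

    for p in packets:
        idx = int(p.ts // window_s)
        if current is not None and idx != current:
            yield current, snapshot()   # window closed: emit its summary
            acc.clear()
        current = idx
        stats = acc[(p.src_ip, p.dst_ip, p.dst_port)]
        stats[0] += 1                   # aggregation: packet count
        stats[1] += p.size              # aggregation: byte count
    if current is not None:
        yield current, snapshot()


# Illustrative trace; the client addresses are made up.
trace = [
    Packet(0.1, "10.0.0.1", "130.192.5.7", 80, 100),
    Packet(0.4, "10.0.0.1", "130.192.5.7", 80, 200),
    Packet(0.7, "10.0.0.2", "130.192.5.7", 80, 50),
    Packet(1.2, "10.0.0.2", "130.192.5.7", 80, 60),
    Packet(1.9, "10.0.0.2", "130.192.5.7", 80, 70),
]
summaries = list(windowed_flow_summary(trace, window_s=1.0, min_packets=2))
for window, flows in summaries:
    print(window, flows)
```

Each emitted summary is much smaller than the packets it covers, which is the property the refinement step relies on; a real deployment would express the same logic in the query language of the stream engine rather than in application code.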
The paper is organized as follows. Section 2 presents an overview of the NetMine framework and the main features of its building blocks. Section 3 reports the experiments performed to validate the proposed framework. Section 4 discusses related work, while Section 5 draws conclusions.
A traffic capture process produces a trace which holds information on the stream of packets. A trace is composed of a set of records, each of which is a set of tagged items (also denoted as itemset in the following). While items are values captured from the net (e.g., the IP address 130.192.3.17), their tags are the description of the represented information (e.g., the label IP destination address).
Relevant patterns may be represented by means of association rules. An association rule is represented in the form X → Y, where X and Y are disjoint conjunctions of tagged items. Each rule is usually characterized by the support and confidence quality indexes [18]. The support is the prior probability of X and Y (i.e., the observed frequency of records containing both X and Y in the data set). The confidence is the conditional probability of Y given X and characterizes the “strength” of a rule. The traditional association rule mining problem can be described as follows. Given a database of transactions (in our case a network trace), a minimum confidence threshold and a minimum support threshold, find all association rules whose confidence and support are above the given thresholds. However, the following example highlights the need for a more powerful abstraction of association rules.
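The two quality indexes can be illustrated on a toy trace. The sketch below follows the definitions given above (support as the observed frequency of X and Y together, confidence as the estimate of the probability of Y given X); the record encoding as sets of (tag, value) pairs, the function names, and the sample addresses other than those appearing in the text are assumptions made for this example.

```python
from typing import FrozenSet, List, Tuple

Item = Tuple[str, str]      # (tag, value), e.g. ("dst_port", "80")
Record = FrozenSet[Item]


def support(records: List[Record], itemset: FrozenSet[Item]) -> float:
    """Fraction of records that contain every item of the itemset."""
    if not records:
        return 0.0
    return sum(1 for r in records if itemset <= r) / len(records)


def confidence(
    records: List[Record], x: FrozenSet[Item], y: FrozenSet[Item]
) -> float:
    """Estimate of P(Y | X): support of X and Y together over support of X."""
    sx = support(records, x)
    return support(records, x | y) / sx if sx > 0 else 0.0


def record(src: str, dst: str, port: str) -> Record:
    return frozenset({("src_ip", src), ("dst_ip", dst), ("dst_port", port)})


# Toy trace of four records; the source addresses are made up.
toy_trace = [
    record("10.0.0.1", "130.192.5.7", "80"),
    record("10.0.0.1", "130.192.5.7", "80"),
    record("10.0.0.1", "130.192.3.17", "25"),
    record("10.0.0.2", "130.192.5.7", "80"),
]
x = frozenset({("src_ip", "10.0.0.1")})
y = frozenset({("dst_port", "80")})
rule_support = support(toy_trace, x | y)
rule_confidence = confidence(toy_trace, x, y)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A rule is reported only if both values exceed the user-supplied thresholds, so lowering the support threshold enlarges the set of candidate itemsets that must be counted.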
Consider a web server on port 80 having address 130.192.5.7. To describe the activity of a client connecting to this server, a rule relating the single source address of the client to the destination address 130.192.5.7 and the destination port 80 may be extracted, together with its support and confidence values. Since a single source address is a 1-itemset which is infrequent in a very large network traffic trace, extracting such a rule would require enforcing a very low support threshold, which makes the task infeasible.
However, a higher-level view of the network may be provided by a generalized association rule in which the single source address is replaced by the subnet it belongs to, and which shows a subnet generating most of the traffic. This generalized rule may provide valuable knowledge for network monitoring. The number of different tagged items in network traffic may be very large (e.g., different IP addresses), and association rules on single tagged items provide excessively detailed and hardly usable knowledge. Generalized association rules allow raising the abstraction level at which correlations are represented. Hence, they are a powerful tool to address this issue.
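The idea of generalizing only where the detailed level is infrequent can be sketched on 1-itemsets. The code below is a simplified, single-level illustration and not the NetMine extraction algorithm: it assumes a taxonomy in which the parent of an IP address is its enclosing /24 subnet, and the function names, the prefix length, and the sample addresses are all hypothetical.

```python
import ipaddress
from collections import Counter
from typing import List


def parent_subnet(ip: str, prefix_len: int = 24) -> str:
    """Taxonomy parent of an address: its enclosing subnet (assumed /24)."""
    return str(ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False))


def generalize_sources(src_ips: List[str], min_support: float) -> Counter:
    """Count source items, rolling infrequent addresses up to their subnet.

    An address whose relative frequency reaches min_support is kept as is.
    The others are replaced by their taxonomy parent, and a generalized
    item is kept only if it reaches min_support itself.
    """
    n = len(src_ips)
    frequent_items: Counter = Counter()
    rolled_up: Counter = Counter()
    for ip, count in Counter(src_ips).items():
        if count / n >= min_support:
            frequent_items[ip] = count                # frequent: no generalization
        else:
            rolled_up[parent_subnet(ip)] += count     # infrequent: climb one level
    for subnet, count in rolled_up.items():
        if count / n >= min_support:
            frequent_items[subnet] = count
    return frequent_items


# Ten illustrative records; all addresses are made up.
sources = (
    ["10.0.1.5"] * 5
    + ["10.0.2.1", "10.0.2.2", "10.0.2.3", "10.0.2.4"]
    + ["192.0.2.9"]
)
frequent = generalize_sources(sources, min_support=0.3)
print(dict(frequent))
```

In this toy input the busy host stays at the address level, the four infrequent hosts of the same subnet surface as one generalized item, and the isolated host is dropped, which mirrors the behavior described for the client subnet example above.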
Section snippets
The NetMine framework
NetMine is a framework to efficiently perform network traffic analysis. NetMine addresses two main issues: (i) data stream processing to dynamically reduce the amount of traffic data and allow a more effective use, both in time and space, of data analysis techniques, (ii) correlation extraction from traffic data to characterize network traffic, detect anomalies, and identify recurrent patterns.
Fig. 1 shows NetMine main blocks: data stream processing, refinement analysis, and rule
Experimental validation
A set of experiments has been performed by applying the proposed techniques on network traffic datasets. We analyzed the effect of the support and confidence thresholds on various classes of generalized rules (Section 3.1), the impact of the window size parameter (Section 3.2) and the effectiveness of our generalization approach in extracting hidden knowledge (Section 3.3). The selection of interesting rules by means of the lift quality index is discussed in Section 3.4. Section 3.5 presents
Related work
A significant effort has been devoted to the application of data mining techniques to the problem of network traffic analysis. The work described in [9] presents some theoretical considerations on the application of data mining techniques to network monitoring. Since the network traffic analysis domain is rather broad, research activities have addressed many different application areas, e.g., web log analysis [37], enterprise-wide management [21], and traffic classification [14], [29], [33].
Conclusions
NetMine is a framework to efficiently perform network traffic analysis. It combines two phases: data stream analysis and refinement analysis. The former reduces the amount of traffic data while the latter extracts relevant and unexpected correlations and recurrence of patterns among data. The proposed technique exploits a taxonomy to drive the extraction process by automatically aggregating extracted information according to the support threshold. Experiments proved the capability and the
Acknowledgements
We are grateful to Fulvio Risso for providing the real traffic datasets captured from the campus network, and to Claudio Testa and Felice Iasevoli for developing parts of the NetMine framework.
References (38)
- et al., TCP/IP security threats and attack methods, Computer Communications (1999)
- et al., Network anomaly detection with incomplete audit data, Computer Networks (2007)
- et al., An overview of anomaly detection techniques: existing solutions and latest technological trends, Computer Networks (2007)
- R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases, in: Proceedings of the 20th...
- et al., The CQL continuous query language: semantic foundations and query execution, The VLDB Journal: The International Journal on Very Large Data Bases (2006)
- et al., Bayesian neural networks for internet traffic classification, IEEE Transactions on Neural Networks (2007)
- et al., Continuous queries over data streams, ACM SIGMOD Record (2001)
- M. Baldi, E. Baralis, F. Risso, Data mining techniques for effective and scalable traffic analysis, in: Integrated...
- E. Baralis, T. Cerquitelli, V. D’Elia, Generalized itemset discovery by means of opportunistic aggregation, Technical...
- Bart Goethals, Frequent Pattern Mining Implementations,...
- Traffic classification on the fly, ACM SIGCOMM Computer Communication Review
- Trajectory sampling for direct traffic observation, IEEE/ACM Transactions on Networking
- Traffic classification using clustering algorithms
- Y-Means: a clustering method for intrusion detection, Proceedings of Canadian Conference on Electrical and Computer Engineering
- ACAS: automated construction of application signatures
- Mining multiple-level association rules in large databases, IEEE Transactions on Knowledge and Data Engineering
- Adaptive intrusion detection with data mining, IEEE International Conference on Systems, Man and Cybernetics
Daniele Apiletti has been a Ph.D. student of the Database and Data Mining Group at the Dipartimento di Automatica e Informatica of the Politecnico di Torino since January 2006. He holds a Master's degree in Computer Engineering from Politecnico di Torino (2005). His research interests are in the fields of bioinformatics and sensor data analysis. In the bioinformatics area, his research activities are focused on microarray data classification and feature selection techniques. In the area of sensor data analysis he is developing data mining techniques for physiological data processing on mobile devices.
Elena Baralis has been a full professor at the Dipartimento di Automatica e Informatica of the Politecnico di Torino since January 2005. She holds a Dr.Ing. degree in Electrical Engineering and a Ph.D. in Computer Engineering, both from Politecnico di Torino. Her current research interests are in the field of databases, in particular data mining, sensor databases, and bioinformatics. She has published numerous papers in journals and conference proceedings. She has served on the program committees of several international conferences and workshops, among which IEEE ICDM, VLDB, ACM CIKM, DaWak, ACM SAC, PKDD. She has managed several Italian and EU research projects.
Tania Cerquitelli has been a post-doctoral researcher in Computer Engineering at the Politecnico di Torino since January 2007. She holds a Ph.D. degree and a Master's degree in Computer Engineering, both from the Politecnico di Torino. She also obtained a Master's degree in Computer Science from Universidad De Las Américas Puebla. Her research activities are focused on the integration of data mining algorithms into database system kernels and on the exploitation of data mining techniques to analyze streaming data collected by sensor networks. She served on the program committees of DS2ME’08 and IADIS’06.
Vincenzo D’Elia has been a Ph.D. student of the Database and Data Mining Group at the Dipartimento di Automatica e Informatica of the Politecnico di Torino since January 2007. He obtained a Master's degree in Computer Engineering from Politecnico di Torino in November 2006. He is currently working in the fields of data mining and time series analysis. In particular, his activity is focused on the analysis of sensor network data to support effective data collection by means of data mining techniques. His research activities are also devoted to analyzing physiological signals to study the condition of athletes while they perform their activities.