Elsevier

Neurocomputing

Volume 156, 25 May 2015, Pages 252-267
Effective packet number for early stage internet traffic identification

https://doi.org/10.1016/j.neucom.2014.12.053

Abstract

Accurately identifying Internet traffic at an early stage is very important for applications of traffic identification. In recent years, a growing body of research has tried to build effective machine learning models that identify an Internet flow from the few packets at its early stage. However, a basic and important problem still needs to be studied in depth: how many packets are most effective for early stage Internet traffic identification? In this paper, we try to resolve this problem. Three Internet traffic data sets are used, and the sizes of the first 10 packets are extracted for study. We first apply mutual information to analyze how much information the first n packets provide about the flow type. Correlation analysis of each pair of adjacent packets is then carried out to find feature redundancies. Next, we run a number of crossover identification experiments with different numbers of packets using 11 well-known supervised learning algorithms. Finally, statistical tests are applied to the experimental results to determine which packet number performs best. Our experimental results show that 5–7 are the best packet numbers for early stage traffic identification.

Introduction

In recent years, early stage traffic identification has attracted considerable interest in the research community. Most traditional machine learning based traffic identification techniques extract features from a whole network flow [15], [29], [34]. The most widely followed feature extraction method was presented by Moore et al. in 2005 [33]. They extract 248 statistical features from a whole flow, such as the maximum, minimum and average packet sizes, and the RTT. Classifiers using such statistical features can achieve very high identification performance [36], and such features have even been applied successfully in anomaly detection [16]. However, in real circumstances, it makes little sense to recognize Internet flows after they have ended. We must instead identify them accurately in their early stages so that subsequent management and security policies can be applied. Therefore, some researchers have turned to finding effective models that are able to identify Internet flows at their early stages, which has made early stage identification a hot topic in traffic identification research [10]. Qu et al. studied the accuracy of early stage traffic identification and found that it is possible to identify traffic accurately at its early stage [40].

However, an important problem still needs further study: how many packets are most effective for early stage traffic identification? As far as we know, only a few studies address this issue. In this study, we set out to select the most effective number of packets for early stage traffic identification using both information analysis methods and empirical methods. Three traffic data sets and eleven widely used classifiers are used in our study. We use the application layer payload size of each packet as the feature. Mutual information analysis is first applied to discover how much identification information the first n packets can provide. We then look for redundancies between each pair of adjacent packets using correlation analysis. Next, all selected classifiers are applied in a group of identification experiments using different numbers of packets. Finally, the experimental results are analyzed using statistical hypothesis tests to find the best-performing packet number.
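The adjacent-packet redundancy check mentioned above can be illustrated with a minimal sketch. It assumes each flow has already been reduced to an equal-length list of payload sizes. The function names `pearson` and `adjacent_correlations` and the toy flows are hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def adjacent_correlations(flows):
    """Correlate packet position i with position i + 1 across all flows.

    `flows` is a list of equal-length lists of payload sizes (an assumed
    input layout). Returns one coefficient per adjacent pair of positions.
    """
    columns = list(zip(*flows))  # columns[i] = sizes of packet i in every flow
    return [pearson(columns[i], columns[i + 1]) for i in range(len(columns) - 1)]


# Toy data only: three flows, three packets each.
toy_flows = [[1, 2, 5], [2, 4, 3], [3, 6, 1]]
toy_result = adjacent_correlations(toy_flows)
print(toy_result)
```

A coefficient near +1 or -1 for a pair of positions suggests that the second packet size adds little beyond the first, which is the kind of redundancy the correlation analysis looks for.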

The rest of the paper is organized as follows. We first review the related work in Section 2 and then introduce the data sets used in the study in Section 3. Section 4 illustrates the framework and detailed steps of our study. The methods applied are described in Section 5, including the basic theory of mutual information and Pearson's correlation coefficient, the selected classifiers, and the statistical tests used in the study. The results and analysis are detailed in Section 6, and some discussion is presented in Section 7. Finally, we draw conclusions in Section 8.

Section snippets

Related work

It is relatively hard to recognize an Internet flow using only its few early stage packets. Thus, a key problem in early stage traffic identification is extracting effective features. Bernaille et al. presented a well-known early stage traffic identification technique in 2006 [2]. They used the sizes of the first few data packets of each TCP flow as features and, by applying the K-means clustering technique, obtained high identification rates for 10 types of application traffic. Este et al. have

Auckland II traffic traces

Auckland II is a collection of long GPS-synchronized traces taken using a pair of DAG 2 cards at the University of Auckland, available at [47]. There are 85 trace files, captured from November 1999 to July 2000. Most traces were targeted at 24 h runs, but hardware failures resulted in most traces being significantly shorter. We selected two trace files captured on February 14, 2000 (20000214-185536-0.pcap and 20000214-185536-1.pcap) for our study. The traces include only

Study framework

Fig. 1 depicts the main flow chart of the study. Note that we execute this procedure for each data set, obtain the identification results, and then carry out the statistical analysis:

  • Filter mice traffic: Mice traffic consists of flows with few packets or bytes. It is hard to identify mice traffic on the Internet because such flows are too “little” to yield effective features. Furthermore, it makes little sense to identify such traffic from the point of view of traffic identification, since

Mutual information

Mutual information is a useful measure in information theory which is widely used for feature selection [38], image processing [31], speech recognition [1] and so on. The mutual information of two random variables X and Y is a measure of the variables' mutual dependence. In information theory, mutual information is defined as

$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y) = H(X,Y) - H(X|Y) - H(Y|X)$$

where H(X) and H(Y) are the marginal entropies of X and Y, respectively, H(X|Y) and H(Y|X) are the
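As a rough illustration, the identity I(X;Y) = H(X) + H(Y) - H(X,Y) can be evaluated directly from empirical frequencies. The sketch below is a simple plug-in estimate under the assumption that X is the tuple of the first n packet sizes of a flow and Y is the flow label. The estimator actually used in the paper is not shown in this excerpt, and the names `entropy`, `mutual_information`, `mi_first_n` and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Empirical Shannon entropy, in bits, of a sequence of hashable values."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def mi_first_n(flows, labels, n):
    """Information that the first n packet sizes give about the flow label.

    Each flow is an assumed list of payload sizes; the first n sizes are
    treated as one joint discrete variable.
    """
    return mutual_information([tuple(flow[:n]) for flow in flows], labels)


# Toy data only: four flows of two packets each, one label per flow.
toy_flows = [[100, 200], [100, 300], [1460, 200], [1460, 300]]
toy_labels = ["web", "mail", "p2p", "ftp"]
mi_one = mi_first_n(toy_flows, toy_labels, 1)
mi_two = mi_first_n(toy_flows, toy_labels, 2)
print(mi_one, mi_two)
```

On real traces the joint variable becomes sparse as n grows, so a plug-in estimate like this one tends to overstate the information. Binning the packet sizes is one common way to reduce that effect.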

Results and analysis

In this section, we present the experimental results and analysis. Mutual information analysis is carried out first in Section 6.1. We then look for redundancies among the early stage packets using correlation analysis in Section 6.2. In Section 6.3, we validate the effectiveness of specific numbers of packets by applying the selected classifiers and statistical hypothesis tests. Finally, the visualizations of the early stage packet sizes of Auckland II data set

Discussions

According to the experimental results, we have made several interesting observations:

  • The sizes of the early stage packets carry enough information for traffic identification. The identification experiments show that most classifiers can achieve high performance using the early stage packet sizes, and the visualization case study also shows that different types of traffic exhibit their own behavior patterns with respect to early stage packet size.

  • All results of the information

Conclusions

In this paper, we have tried to find out how many non-zero payload packets are most effective for the early stage identification of network traffic. Three traffic data sets, including two open data sets, and eleven well-known classifiers are used. According to the experimental results, we conclude that the payload sizes of early stage packets are effective enough to achieve ideal identification performance; an effective number of non-zero packets to use for early stage identification is a

Acknowledgments

This research was partially supported by the National Natural Science Foundation of China under Grant nos. 61472164, 61373054, 61173078, 61203105 and 61173079, and the Provincial Natural Science Foundation of Shandong under Grant nos. ZR2012FM010, ZR2011FZ001, ZR2013FL002 and ZR2012FQ016.

Lizhi Peng received the B.S. and M.E. degrees in computer science from Xi'an Jiaotong University and Jinan University, China, in 1998 and 2004, respectively. He is currently an associate professor in the Department of Information Science and Engineering, Jinan University. His main research interests include computer networks, computational intelligence, data mining, and parallel computing. He has published numerous papers in these areas.

References (48)

  • D.S. Broomhead et al., Multivariable functional interpolation and adaptive networks, Complex Syst. (1998)
  • S. le Cessie et al., Ridge estimators in logistic regression, Appl. Stat. (1992)
  • C. Cortes et al., Support vector networks, Mach. Learn. (1995)
  • A. Dainotti, A. Pescapé, C. Sansone, Early classification of network traffic through multi-classification, in: Lecture...
  • A. Dainotti et al., Issues and future directions in traffic classification, IEEE Netw. (2012)
  • P. Domingos et al., On the optimality of the simple Bayesian classifier under zero-one loss, Mach. Learn. (1997)
  • J. Erman, A. Mahanti, M. Arlitt, Byte me: a case for byte accuracy in traffic classification, in: Proceedings of...
  • C. Estan et al., New directions in traffic measurement and accounting: focusing on the elephants, ignoring the mice, ACM Trans. Comput. Syst. (2003)
  • A. Este, F. Gringoli, L. Salgarelli, On the stability of the information carried by traffic flow features at the packet...
  • E. Frank, I. H. Witten, Generating accurate rule sets without global optimization, in: Proceedings of the 15th...
  • N. Friedman et al., Bayesian network classifiers, Mach. Learn. (1997)
  • R.C. Holte, Very simple classification rules perform well on most commonly used datasets, Mach. Learn. (1993)
  • N. Huang, G. Jai, H. Chao, Early identifying application traffic with application characteristics, in: Proceedings of...
  • J. Huang et al., Using AUC and accuracy in evaluating learning algorithms, IEEE Trans. Knowl. Data Eng. (2005)

    Bo Yang received the B.E. degree from Jinan University, Jinan, China, in 1984, and the M.S. degree from South China University, in 1989. He is a Professor and Vice-President at Jinan University, Jinan, China. He is the Director of the Provincial Key Laboratory of Information and Control Engineering, and also acts as the Associate Director of Shandong Computer Federation, and Member of the Technical Committee of Intelligent Control of Chinese Association of Automation. His main research interests include computer networks, artificial intelligence, machine learning, knowledge discovery, and data mining. He has published numerous papers and received several important scientific awards in these areas.

    Yuehui Chen (M'02) was born in 1964 in Shandong Province, China. He received the B.Sc. degree in mathematics/automatics from Shandong University, China, in 1985, and the M.S. and Ph.D. degrees in electrical engineering from Kumamoto University, Japan, in 1999 and 2001, respectively. From 2001 to 2003, he worked as a Senior Researcher at the Memory-Tech Corporation, Tokyo, Japan. Since 2003, he has been a member of the Faculty of Electrical Engineering, Jinan University, Jinan, China, where he is currently Head of the Laboratory of Computational Intelligence. His research interests include evolutionary computation, neural networks, fuzzy logic systems, hybrid computational intelligence and their applications in time-series prediction, system identification, intelligent control, intrusion detection systems, web intelligence and bioinformatics. He is the author or coauthor of more than 70 technical papers. He is a member of the IEEE Systems, Man and Cybernetics Society and the Computational Intelligence Society, a member of the Young Researchers Committee of the World Federation on Soft Computing, and a member of the CCF Young Computer Science and Engineering Forum of China. He is an Editor-in-Chief of the Journal of Computational Intelligence in Bioinformatics and an Associate Editor of the International Journal of Computational Intelligence Research. More information can be found at http://cilab.ujn.edu.cn.
