doi:10.1016/j.comnet.2004.11.017
Copyright © 2005 Elsevier B.V. All rights reserved.
On the wavelet spectrum diagnostic for Hurst parameter estimation in the analysis of Internet traffic
a Department of Mathematics and Statistics, Boston University, Boston, MA 02215, United States
b Statistical and Applied Mathematical Sciences Institute, 19 T.W. Alexander Drive, P.O. Box 14006, Research Triangle Park, NC 27709-4006, United States
c Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC 27599-3260, United States
Available online 6 January 2005.
Abstract
The fluctuations of Internet traffic possess an intricate structure which cannot be simply explained by long-range dependence and self-similarity. In this work, we explore the use of the wavelet spectrum, whose slope is commonly used to estimate the Hurst parameter of long-range dependence. We show that much more than simple slope estimates are needed for detecting important traffic features. In particular, the multi-scale nature of the traffic does not admit a simple description of the type attempted by the Hurst parameter.
By using simulated examples, we demonstrate the causes of a number of interesting effects in the wavelet spectrum of the data. This analysis leads us to a better understanding of several challenging phenomena observed in real network traffic. Although the wavelet analysis is robust to many smooth trends, high-frequency oscillations and non-stationarities such as abrupt changes in the mean have an important effect. In particular, the breaks and level-shifts in the local mean of the traffic rate can lead one to overestimate the Hurst parameter of the time series. Novel statistical techniques are required to address such issues in practice.
Keywords: Long-range dependence; Internet traffic; Wavelet spectrum; Hurst parameter estimation; Breaks; Non-stationarity
Fig. 1. The top plot shows the time series of the number of packets arriving on a link in 1 ms time intervals. Observe that this time scale is rather fine and hence the distribution of the time series appears to be skewed and non-Gaussian. The mean of the time series is over-plotted in white. The average utilization of the link is about 5.2%. The lower-left plot displays the aggregate number of packets (for the same trace) in 1 s intervals. In white, we show the mean of this time series. On the lower-right plot we show the log-scale wavelet spectrum of the time series displayed in the top plot. Observe that the wavelet spectrum is essentially linear at large scales (10 ≤ j). The Hurst parameter is estimated by fitting a line to the wavelet spectrum on scales j1 = 10, 11, … , 20 (for more details, see Section 3 below). This trace is effectively modeled by standard FGN.
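The slope-fitting step described in this caption can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the Haar wavelet and an ordinary least-squares fit, neither of which is confirmed by the text above, and the names `haar_log_spectrum` and `hurst_from_spectrum` are hypothetical. For FGN the log-scale spectrum grows roughly like (2H − 1)j, so the slope maps to H = (slope + 1)/2.

```python
import numpy as np

def haar_log_spectrum(x):
    """Return scales j = 1, 2, ... and S_j = log2(mean squared Haar detail at scale j)."""
    approx = np.asarray(x, dtype=float)
    scales, spectrum = [], []
    j = 0
    while len(approx) >= 4:
        approx = approx[: len(approx) // 2 * 2]   # drop a trailing odd sample
        even, odd = approx[0::2], approx[1::2]
        detail = (even - odd) / np.sqrt(2.0)      # Haar detail coefficients
        approx = (even + odd) / np.sqrt(2.0)      # Haar approximation coefficients
        j += 1
        scales.append(j)
        spectrum.append(np.log2(np.mean(detail ** 2)))
    return np.array(scales), np.array(spectrum)

def hurst_from_spectrum(scales, spectrum, j1, j2):
    """Least-squares slope over j1 <= j <= j2, mapped to H = (slope + 1) / 2."""
    mask = (scales >= j1) & (scales <= j2)
    slope, _intercept = np.polyfit(scales[mask], spectrum[mask], 1)
    return (slope + 1.0) / 2.0

# White noise is FGN with H = 0.5, so the estimate should land near 0.5.
rng = np.random.default_rng(0)
scales, spectrum = haar_log_spectrum(rng.standard_normal(2 ** 16))
print(hurst_from_spectrum(scales, spectrum, j1=1, j2=10))
```

An unweighted fit is the simplest choice here; a weighted regression that accounts for the smaller number of coefficients at large scales would be a natural refinement.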
Fig. 2. The top-left plot displays the time series of packet arrivals per 1 s obtained from the 2-h trace shown in Fig. 1. The plot on the right displays a time series of simulated fractional Gaussian noise with sample mean, sample variance and Hurst parameter equal to the estimated mean, variance and Hurst parameter from the packet trace in Fig. 1. The bottom-left plot shows the empirical quantiles of the standardized packet trace above versus the quantiles of standard normal distribution (in red). To indicate the sampling variability, we also add (in blue) 100 independent QQ-plots based on samples from the standard normal distribution. The bottom-right plot displays the wavelet spectrum of the packet trace in the top left and of the “simulated” FGN time series in the top right. Observe that the two spectra are very similar. (The vertical segments on the plot indicate 95% confidence intervals for the statistics Sj corresponding to FGN.) (For interpretation of color in this figure, the reader is referred to the web version of this article.)
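A comparison of this kind needs a way to simulate FGN with a prescribed mean, variance and Hurst parameter. The sketch below is one possible approach, not necessarily the one used for the figure: it factors the exact FGN covariance matrix with a Cholesky decomposition, which is only practical for short paths (a few thousand points). The function names are hypothetical.

```python
import numpy as np

def fgn_autocovariance(lags, hurst, variance=1.0):
    """Autocovariance of FGN: (variance / 2) * (|k+1|^2H - 2|k|^2H + |k-1|^2H)."""
    k = np.abs(np.asarray(lags, dtype=float))
    two_h = 2.0 * hurst
    return 0.5 * variance * ((k + 1) ** two_h - 2 * k ** two_h + np.abs(k - 1) ** two_h)

def simulate_fgn(n, hurst, mean=0.0, variance=1.0, seed=None):
    """Draw one FGN path of length n via a Cholesky factor of its covariance matrix."""
    rng = np.random.default_rng(seed)
    idx = np.arange(n)
    cov = fgn_autocovariance(idx[:, None] - idx[None, :], hurst, variance)
    chol = np.linalg.cholesky(cov)            # O(n^3), so keep n modest
    return mean + chol @ rng.standard_normal(n)

path = simulate_fgn(1024, hurst=0.8, mean=0.0, variance=1.0, seed=0)
print(path.mean(), path.var())
```

The `mean` and `variance` arguments are population values, so the sample mean and variance of a simulated path will only match them approximately; for long traces a faster method such as circulant embedding would be needed.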
Fig. 3. The top plot shows the time series of the number of packets arriving on a link in 1 ms time intervals. The mean of the time series is over-plotted in white. The average utilization of the link is about 6.5%. The lower-left plot displays the aggregate number of packets (for the same trace) in 1 s intervals. In white, we show the mean of this time series. On the lower-right plot we show the log-scale wavelet spectrum of the time series given in the top plot. Observe the sharp spike in the wavelet spectrum around scales j = 11 and 12. The spectrum appears linear on large scales 14 ≤ j ≤ 20. The Hurst parameter is estimated over this range of large scales (j1 = 14, 15, … , 20). The spike is uncharacteristic of FGN data, and is explained in Section 5.
Fig. 4. The top plot shows the time series of the number of packets arriving on a link in 1 ms time intervals. The mean of the time series is over-plotted in white. The average utilization of the link is about 9.4%. The lower-left plot displays the aggregate number of packets (for the same trace) in 1 s intervals. In white, we show the mean of this time series. On the lower-right plot we show the log-scale wavelet spectrum of the time series given in the top plot. Observe that the spectrum exhibits high variability on large scales. The Hurst parameter is estimated over the range of scales 10 ≤ j ≤ 20. This variability in the spectrum is explained in Section 5.
Fig. 5. The top plot shows the time series of the number of packets arriving on a link in 10 ms time intervals. Observe that this time scale is coarser than the time scale of the top plots in Fig. 1 and Fig. 3. The mean of the time series is over-plotted in white. The average utilization of the link is about 18.4%. The lower-left plot displays the aggregate number of packets (for the same trace) in 1 s intervals. In white, we show the mean of this time series. On the lower-right plot we show the log-scale wavelet spectrum of the time series given in the top plot. Note that the wavelet spectrum appears to be linear on large scales: 7 ≤ j ≤ 15. The Hurst parameter is estimated over this range of scales. Observe, however, the striking difference between the shape of the wavelet spectrum in this case and in the cases shown in Fig. 1 and Fig. 3, above, and Fig. 6, below. This unusual wavelet spectrum shape is explained in Section 5 below.
Fig. 7. Comparison between the wavelet and the local Whittle estimators for fractional Gaussian noise (FGN). The top plot on the left displays one simulated path of FGN with Hurst parameter H = 0.8. The plot on the right shows the wavelet log-scale spectrum of this path. The vertical segments on this plot are the estimated 95% confidence intervals of the log-mean-energy statistics of the wavelet spectrum. The bottom-left plot displays wavelet estimates of the Hurst parameter H obtained from 100 independently simulated paths of FGN by using different choices of the parameter j1. The variability of these estimates is summarized through 95% sample confidence intervals, based on a normal approximation. There is a small bias at small j1 (no initialization was done) and the variance increases as j1 increases. The bottom plot on the right contains estimates of H obtained via the local Whittle estimator. They were computed over the same set of 100 independent paths as the wavelet estimators, with different choices of the frequency cut-off parameter m. As for the wavelet estimators, we display 95% sample confidence intervals. Note the resemblance between the two plots, with m ≈ N/2^j1.
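For readers unfamiliar with the local Whittle estimator, the following is a minimal sketch. It assumes the standard form of the objective, R(H) = log(mean(λ^(2H−1) I(λ))) − (2H − 1) mean(log λ), evaluated over the first m Fourier frequencies and minimized on a grid; the function name and the grid search are illustrative choices, not taken from the article.

```python
import numpy as np

def local_whittle_hurst(x, m, grid=None):
    """Grid-search local Whittle estimate of H from the lowest m Fourier frequencies."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if grid is None:
        grid = np.linspace(0.01, 0.99, 197)              # candidate H values, step 0.005
    freqs = 2.0 * np.pi * np.arange(1, m + 1) / n         # lambda_1, ..., lambda_m
    dft = np.fft.fft(x - x.mean())[1 : m + 1]
    periodogram = np.abs(dft) ** 2 / (2.0 * np.pi * n)
    mean_log_freq = np.mean(np.log(freqs))
    objective = [
        np.log(np.mean(freqs ** (2 * h - 1) * periodogram)) - (2 * h - 1) * mean_log_freq
        for h in grid
    ]
    return grid[int(np.argmin(objective))]

# White noise has H = 0.5; the cut-off mimics m ~ N / 2^j1 with j1 = 4.
rng = np.random.default_rng(0)
n = 4096
print(local_whittle_hurst(rng.standard_normal(n), m=n // 2 ** 4))
```

Choosing m plays the same role as choosing j1 for the wavelet estimator: a smaller m (larger j1) uses only the lowest frequencies, which reduces bias at the cost of higher variance.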
Fig. 8. The top plot on the left displays one simulated path of fractional Gaussian noise (FGN) with H = 0.8 plus a “smooth” additive trend, over-plotted in white. The plot on the right shows the wavelet log-scale spectrum of this path and the spectrum of the corresponding path of “pure” FGN (see Fig. 7). Observe that these two spectra essentially coincide. The bottom-left plot displays wavelet estimates of the Hurst parameter H obtained from 100 independently simulated paths of FGN plus the same additive trend. As in Fig. 7, we show 95% sample confidence intervals for various choices of the scale j1. The bottom plot on the right contains estimates of H obtained via the local Whittle estimator, computed over the same set of 100 independent paths as the wavelet estimators. As for the wavelet estimators, we display 95% sample confidence intervals. Note that, in contrast to the wavelet estimator, the local Whittle estimator is greatly affected by the trend.
Fig. 9. These four plots display wavelet spectra of a FGN, perturbed by adding a high-frequency trend as in (4.1). The length of the time series is N = 300 000. We show what happens for four different values of the frequency ν = 10, 100, 1000 and 10 000. Observe that the impact of this perturbation on the spectra is well-localized around scales j ≈ log2(N/ν). The rest of the spectra essentially coincide with that of the FGN time series. If the number of scales j were significantly increased, we expect the top-left plot to look like the bottom-right plot.
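The rule of thumb j ≈ log2(N/ν) is easy to evaluate for the four frequencies in this caption. The helper name `bump_scale` is hypothetical.

```python
import math

def bump_scale(n, nu):
    """Approximate scale j at which a sinusoid with nu cycles over n samples shows up."""
    return math.log2(n / nu)

n = 300_000
for nu in (10, 100, 1000, 10_000):
    print(f"nu = {nu:>6}: j ~ {bump_scale(n, nu):.2f}")
```

Each tenfold increase in ν moves the bump down by log2(10) ≈ 3.3 scales, which is why the lowest frequency affects only the largest scales shown.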
Fig. 10. Consider the sum of GH and hν, where GH is FGN and hν(k) = sin(2πνk/N) with ν = 100. The plots in the first column show the time series of wavelet coefficients (on several different scales j) of this sum. The second column of plots displays the wavelet coefficients of the function hν(k), and the third column those of the FGN GH(k). One clearly sees that the wavelet coefficients of the function hν dominate on scales 10 and 11, giving rise to the bump in the wavelet spectrum of Fig. 9.
Fig. 11. The top-left plot displays a FGN time series perturbed by adding the deterministic breaks displayed in the bottom-right plot. The corresponding pure FGN time series is shown in the bottom-left plot. The top-right plot shows the wavelet spectra of the three time series: the data (perturbed FGN), the function (deterministic breaks) and the FGN (the original fractional Gaussian noise time series). Observe that the spectra of the FGN and of the function are both essentially linear, but the slopes of the two lines are quite different. At large scales j, the spectrum of the function dominates that of the FGN and hence determines the behavior of the spectrum of the data. Consequently, the wavelet estimator of the Hurst parameter is misleading: a linear fit starting from scale j1 = 9 and using all larger scales yields an estimate that differs from the Hurst parameter of the underlying FGN.
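The effect of level shifts on the slope estimate can be reproduced with a small experiment. The sketch below makes several simplifying assumptions that are not from the article: it uses white noise (FGN with H = 0.5) instead of FGN with H = 0.8, a Haar wavelet, an unweighted line fit, and arbitrary break positions and levels. All function names are hypothetical.

```python
import numpy as np

def haar_log_spectrum(x):
    """Scales j and S_j = log2(mean squared Haar detail coefficient at scale j)."""
    approx = np.asarray(x, dtype=float)
    scales, spectrum = [], []
    j = 0
    while len(approx) >= 4:
        approx = approx[: len(approx) // 2 * 2]
        even, odd = approx[0::2], approx[1::2]
        detail = (even - odd) / np.sqrt(2.0)
        approx = (even + odd) / np.sqrt(2.0)
        j += 1
        scales.append(j)
        spectrum.append(np.log2(np.mean(detail ** 2)))
    return np.array(scales), np.array(spectrum)

def hurst_estimate(x, j1, j2):
    """Fit a line to the spectrum on scales j1..j2 and return H = (slope + 1) / 2."""
    scales, spectrum = haar_log_spectrum(x)
    mask = (scales >= j1) & (scales <= j2)
    slope = np.polyfit(scales[mask], spectrum[mask], 1)[0]
    return (slope + 1.0) / 2.0

def level_shifts(n, break_fractions, levels):
    """Piecewise-constant mean: levels[i] holds between consecutive break points."""
    positions = (np.asarray(break_fractions) * n).astype(int)
    segment = np.searchsorted(positions, np.arange(n), side="right")
    return np.asarray(levels, dtype=float)[segment]

rng = np.random.default_rng(0)
n = 2 ** 16
noise = rng.standard_normal(n)                      # FGN with H = 0.5
breaks = level_shifts(n, [0.13, 0.37, 0.61, 0.83], [0.0, 2.0, -1.0, 1.5, 0.0])

h_noise = hurst_estimate(noise, j1=8, j2=13)
h_perturbed = hurst_estimate(noise + breaks, j1=8, j2=13)
print(f"noise only: {h_noise:.2f}, noise plus breaks: {h_perturbed:.2f}")
```

Because a step function contributes energy that grows quickly with scale, the fitted slope on large scales is driven by the breaks rather than by the noise, and the second estimate comes out well above the first.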
Fig. 12. The first column of plots shows time series of wavelet coefficients of the time series Y(k), k = 1, … , N, N = 300 000, displayed in Fig. 11. The second and third columns of plots contain the wavelet coefficients of the deterministic function and the FGN time series from Fig. 11, respectively. Observe that at scales j = 10–13 the wavelet coefficients of the function become larger in magnitude than those of the FGN, and this affects the corresponding wavelet spectra (as shown in the top-right plot in Fig. 11).
Fig. 13. The first column of plots shows realizations of binned paths of Poisson point processes with non-constant intensities Λν, defined in (4.3), with ν = 10, 100, 1000. The corresponding intensities are over-plotted in white (the white band in the bottom plot is due to the very rapid oscillations). The plots on the right show the wavelet spectrum of the corresponding binned path of the Poisson process and the wavelet spectrum of the corresponding intensity function Λν. Note that the value of the frequency ν controls the location of the spike on the wavelet spectrum of the Poisson process as in Fig. 9. The spike appears roughly around j = 13, 10 and 7, respectively.
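A rough simulation illustrates how an oscillating intensity produces a spike in the spectrum of binned Poisson counts. The intensity used here, a constant rate modulated by a sinusoid with ν cycles over the path, is a hypothetical stand-in, since definition (4.3) is not reproduced in this excerpt; the Haar wavelet, the series length and the rate parameters are likewise assumptions.

```python
import numpy as np

def haar_log_spectrum(x):
    """Scales j and S_j = log2(mean squared Haar detail coefficient at scale j)."""
    approx = np.asarray(x, dtype=float)
    scales, spectrum = [], []
    j = 0
    while len(approx) >= 4:
        approx = approx[: len(approx) // 2 * 2]
        even, odd = approx[0::2], approx[1::2]
        detail = (even - odd) / np.sqrt(2.0)
        approx = (even + odd) / np.sqrt(2.0)
        j += 1
        scales.append(j)
        spectrum.append(np.log2(np.mean(detail ** 2)))
    return np.array(scales), np.array(spectrum)

n, nu = 2 ** 17, 100
base_rate, amplitude = 5.0, 0.5                  # hypothetical intensity parameters
t = np.arange(n)
intensity = base_rate * (1.0 + amplitude * np.sin(2.0 * np.pi * nu * t / n))

rng = np.random.default_rng(0)
counts = rng.poisson(intensity)                  # one binned path of the process

scales, spectrum = haar_log_spectrum(counts)
peak_scale = int(scales[np.argmax(spectrum)])
print(peak_scale, np.log2(n / nu))
```

Away from the spike the spectrum of the counts stays close to the flat level expected for a homogeneous Poisson process, while the largest value sits near log2(N/ν).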
Fig. 14. The top plot shows one realization of a binned path of a Cox process, which is a non-homogeneous Poisson process with random intensity. The underlying intensity is over-plotted in white. The lower-left plot displays a more detailed portion of the path above. The lower-right plot shows the wavelet spectrum of the path of the Cox process and the wavelet spectrum of its intensity. Observe that the two spectra essentially coincide on large scales. At small time scales, as expected, the wavelet spectrum of the Cox process is flat, which is consistent with the spectrum of a homogeneous Poisson process or a white noise process.
Fig. 15. The top plot shows the periodogram of a 2-h packet trace. It was computed from the time series of packet arrivals per 10 ms time intervals. The bottom-left plot shows the low-frequency part of the periodogram (to the left of the vertical line in the top plot). The spike located at frequency ν ≈ 0.2 × 10^4 = 2000 may be related to the bump in the wavelet spectrum of this traffic trace in Fig. 3. Indeed, using j ≈ log2(N/ν), we get j ≈ log2(300 000/2000) ≈ 7. This corresponds to scale j ≈ 7 + log2(10) ≈ 10 in the wavelet spectrum displayed in Fig. 3 which involves the time series of packet arrivals per 1 ms rather than per 10 ms. The bottom-right plot indicates the 1/f or long-range dependence behavior of the time series of packet arrivals.
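The scale conversion in this caption has two steps: locate the spike on the 10 ms series, then shift the scale index to account for the finer 1 ms binning. A short sketch of that arithmetic, with hypothetical helper names, follows; the caption rounds both results down to whole scales.

```python
import math

def spike_scale(n, nu):
    """Scale at which a periodogram spike at frequency index nu is expected."""
    return math.log2(n / nu)

def rescale(j, coarse_bin_ms, fine_bin_ms):
    """Shift a scale index when the same trace is binned more finely."""
    return j + math.log2(coarse_bin_ms / fine_bin_ms)

j_10ms = spike_scale(300_000, 2_000)                      # scale on the 10 ms series
j_1ms = rescale(j_10ms, coarse_bin_ms=10, fine_bin_ms=1)  # same feature on the 1 ms series
print(round(j_10ms, 2), round(j_1ms, 2))
```

Binning ten times more finely multiplies the number of samples per cycle by ten, which is why the scale index grows by log2(10).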