Online wavelet-based density estimation for non-stationary streaming data

https://doi.org/10.1016/j.csda.2011.08.009

Abstract

There has been a significant emergence of applications in which data arrive in an online, time-varying fashion (e.g. computer network traffic, sensor data, web searches, ATM transactions) and in which it is not feasible to exchange or store all the arriving data in traditional database systems in order to operate on them. For these applications, as for traditional static database schemes, density estimation is a fundamental building block for data analysis. A novel online approach for probability density estimation based on wavelet bases, suitable for applications involving rapidly changing streaming data, is presented. The proposed approach is based on a recursive formulation of the wavelet-based orthogonal estimator using a sliding window and includes an optimised procedure for reevaluating only the relevant scaling and wavelet functions each time new data items arrive. The algorithm is tested and compared using both simulated and real-world data.

Introduction

Traditional database and data processing systems are based on the storage and analysis of static records which generally have no predefined notion of time (Golab and Ozsu, 2003). In those models, conceived as fixed static data archives, data take the form of persistent relations that require persistent data storage as well as complex querying operations (Babcock et al., 2002, Golab and Ozsu, 2003). Recently, a new class of applications has emerged in which information occurs in the form of a sequence and a large amount of data is generated at a rapid rate; notable examples are sensor networks, network traffic analysis, financial tickers, web clicks and transaction log analysis. What these applications have in common is that data arrive in a continuous, rapid and time-varying fashion and that immediate processing and analysis of large transient data streams is required. In order to manage the issues raised by this rapidly changing data, new data model paradigms have been introduced in the literature; see for example the work of Babcock et al. (2002), Golab and Ozsu (2003) and Domingos and Hulten (2003). These new data models, referred to by Babcock et al. (2002) as data stream models, are essentially founded on the infeasibility of storing complete streams (which are considered potentially unbounded in size); as a result, time-based processing and querying operations are required to be performed as new data items arrive. Specifically, according to Golab and Ozsu (2003), a data stream is a real-time, continuous, ordered sequence of items whose order is implicit when it is given by arrival time and explicit when it is indicated by timestamps.

In data processing and analysis systems, density estimation is a fundamental building block for both fixed static data-based and streaming data-based applications. According to the data model paradigm in which they are framed, i.e. fixed static or streaming, two main variants of density estimators can be distinguished: batch-processing techniques and online-processing methods, respectively. In batch-processing algorithms, the underlying density is obtained by processing all the data at the same time. In online-processing methods, by contrast, data items are processed as they become available, i.e. as they arrive.

The problem of estimating density functions for fixed static data models has been thoroughly studied in the literature; see for example Scott (1992), Vannucci (1995) and Silverman (1998). However, for the case of data streams and within the framework of nonparametric estimators, only the following publications have suggested possible online solutions: Wegman and Marchette (2003), Procopiuc and Procopiuc (2005), Heinz and Seeger (2005), Wegman and Caudle (2006), Heinz (2007), Boedihardjo et al. (2008) and Caudle and Wegman (2009). Note that the density estimation problem in the context of data streams differs from the fixed static case in three fundamental aspects, all of them implicitly related to the data stream paradigm described above. First, memory storage is limited, which means that the estimator cannot recompute the entire density from all the available data samples every time new data items arrive. Second, the time-based order of data items is relevant in this context, which means that the temporal relation of the data should be considered by the estimator. Third, the estimation should be fast enough to update the density estimate before new data arrive. For these three reasons, a feasible data stream density estimator should process the incoming data in a recursive fashion, considering the inclusion of new data as well as the discounting of old data.

Wavelet-based estimators belong to the class of orthogonal series estimators, a general class of nonparametric methods introduced in the literature by Céncov (1962), of which the two most representative and widely applied variants are based on Fourier and Hermite orthogonal basis functions. The most relevant characteristic of traditional orthogonal estimators is their computational simplicity. In contrast, their major drawback is their inability to estimate local properties of the underlying density, since they are based on global basis functions (i.e. Fourier and Hermite basis functions). Since wavelets are localised basis functions (in time and frequency), this problem is overcome by wavelet-based estimators, which allow local learning and local manipulation of the estimated density. Furthermore, wavelet-based estimators offer more flexibility in terms of convergence and smoothness due to the availability of several families of orthogonal wavelet functions that can be used in the estimator. In general, as Safavi et al. (2004) pointed out, since wavelet-based estimators inherit the advantages of wavelets and multiresolution analysis, they are superior to other orthogonal estimators.
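For reference, the linear wavelet-based orthogonal series estimator used throughout this literature takes the following general form (written here in generic notation; the paper's own notation is fixed in Section 2): given a sample X_1, ..., X_n,

$$
\hat f(x)=\sum_{k}\hat c_{j_0,k}\,\phi_{j_0,k}(x)+\sum_{j=j_0}^{j_1}\sum_{k}\hat d_{j,k}\,\psi_{j,k}(x),
\qquad
\hat c_{j_0,k}=\frac{1}{n}\sum_{i=1}^{n}\phi_{j_0,k}(X_i),\quad
\hat d_{j,k}=\frac{1}{n}\sum_{i=1}^{n}\psi_{j,k}(X_i),
$$

where $\phi_{j,k}(x)=2^{j/2}\phi(2^{j}x-k)$ and $\psi_{j,k}(x)=2^{j/2}\psi(2^{j}x-k)$ are the dilated and translated scaling and wavelet functions, respectively.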

Wavelet-based estimators have been intensively investigated in the literature; see for example Vannucci (1995), Donoho et al. (1996), Herrick et al. (2001) and Safavi et al. (2004). However, most of the techniques available in the literature are based on a batch-processing concept, and only a few algorithms, such as the ones investigated by Heinz and Seeger (2005), Wegman and Caudle (2006) and Caudle and Wegman (2009), are based on the online-processing framework. In particular, Wegman and Caudle (2006) and Caudle and Wegman (2009) proposed a recursive wavelet-based density estimator that consists of two main parts. First, an initial estimate of the underlying density is obtained using traditional wavelet density estimation techniques. Second, this initial estimate is recursively updated as new data arrive, considering both the addition of new data and an exponential discounting of the old data. Specifically, the key idea behind the updating procedure relies on the fact that, in orthogonal series estimators, the coefficients can be approximated by the expectation of the projections of each data item onto the orthogonal basis functions. Additionally, in Caudle and Wegman (2009), a static block discounting is also proposed in which the parameter that controls the block discounting procedure is determined according to the degree of stationarity of the data using the Kolmogorov–Smirnov test (KS-test). Although this general technique is simple and effective, it is not fully adequate for applications in which arriving blocks or subsets of data require the same level of emphasis/importance. Consider, for instance, applications such as environmental monitoring, where air quality standards take into account exposure time to pollutants and "running" estimates are required for specific periods of time (e.g. the last 8 and 24 h, depending on the pollutant). On the other hand, the algorithm presented by Heinz and Seeger (2005) is derived from the framework for maintaining nonparametric estimators over data streams initially proposed by Blohsfeld et al. (2005), whose main idea is to partition the data stream into blocks and to construct an estimator for each block. The next step is the merging of all these "block" estimators into an overall estimator. Finally, this overall estimator is compressed in order to ensure the consumption of a constant amount of memory. The major drawback of this algorithm is that data items cannot be processed as they arrive, in a real-time fashion; instead, they are handled block by block, with density estimates available only after each block is completed. Furthermore, this approach does not consider any discounting procedure for old data, and for that particular reason it is of limited use for non-stationary data streams.
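A schematic form of this kind of recursive update, written here for a generic scaling coefficient and with the discount factor denoted by $\theta$ (the exact parameterisation used by Wegman and Caudle may differ), is

$$
\hat c_{j_0,k}^{(n+1)}=(1-\theta)\,\hat c_{j_0,k}^{(n)}+\theta\,\phi_{j_0,k}(x_{n+1}),\qquad 0<\theta<1,
$$

so that the contribution of an item received m timestamps earlier is weighted by a factor proportional to $(1-\theta)^{m}$, i.e. old data are discounted exponentially.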

A novel online density estimator for non-stationary streaming data is addressed in this paper. The proposed algorithm is derived from: (1) the well documented wavelet-based density estimation framework for fixed static data studied by Vannucci (1995), Vidakovic (1999) and Herrick et al. (2001), (2) the recursive estimator proposed by Wegman and Caudle (2006) and Caudle and Wegman (2009), and (3) the data stream model concepts described by Babcock et al. (2002) and Golab and Ozsu (2003). The approach presented in this paper considers the online updating of the estimator as well as the selective reevaluation of its coefficients for newly arriving data items. The proposed estimator differs fundamentally from the methods reported by Wegman and Caudle (2006) and by Caudle and Wegman (2009) in that it is based on sliding window concepts suitable for non-stationary streaming data, instead of landmark windows, which are suitable for stationary cases. In that sense, the estimator reported here is a novel estimator whose estimation capabilities are particularly oriented towards density estimation for non-stationary streaming data.

The contribution of this paper is then twofold. First, the technique proposed for updating the estimator coefficients, which includes both the inclusion of new data items and the discounting of old information using the concept of sliding windows, has not been applied in this context before. The second contribution is the optimisation procedure for the selective reevaluation of the estimator coefficients. Note that none of the work published so far addressing wavelet-based density estimation over data streams has considered the fundamental difference between batch and online processing in the selective reevaluation of wavelet coefficients. It is important to highlight that the computational cost of such an evaluation can be substantially reduced by observing that some of the estimator coefficients/parameters remain unaltered from one timestamp to the next. This consideration, within the framework of wavelet-based orthogonal estimators, is completely new. The results reported in this work clearly show that the proposed estimator outperforms the technique proposed by Caudle and Wegman (2009) in its capability to adapt to cases in which the underlying distribution changes over time. This improved capability makes the proposed density estimation algorithm potentially applicable in non-stationary settings.
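In terms of the generic notation above, the selective reevaluation exploits the compact support of the basis functions (this is a sketch of the underlying condition rather than the paper's formal statement): when a new item $x_{\text{new}}$ enters the sliding window and an old item $x_{\text{old}}$ leaves it, a coefficient $\hat c_{j_0,k}$ (and analogously $\hat d_{j,k}$) has to be recomputed only if

$$
x_{\text{new}}\in\operatorname{supp}(\phi_{j_0,k})
\quad\text{or}\quad
x_{\text{old}}\in\operatorname{supp}(\phi_{j_0,k});
$$

all remaining coefficients carry over unchanged from one timestamp to the next.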

The rest of the paper is organised as follows. In Section 2, orthogonal wavelet decomposition is briefly reviewed from a very general point of view. Additionally, the fundamental concepts behind orthogonal series estimators and wavelet-based density estimators for static data are also introduced in this section. Afterwards, the algorithm proposed for online density estimation is presented in Section 3. Section 4 includes both the simulated and real-world data experiments performed to evaluate the proposed framework and their corresponding results. Finally, in Section 5, the conclusions of this work are presented.

Section snippets

Wavelets

Wavelet analysis is a well-established discipline that has been widely applied in a great variety of applications. Among the most relevant ones, we could cite: data compression, signal filtering and denoising, image processing, as well as pattern recognition and system identification. The basic concept in wavelet transforms, as Bruce et al. (2002) emphasise, is the projection of data onto a basis of wavelet functions in order to separate fine-scale and large-scale information. Particularly, data

Proposed online wavelet-based orthogonal series estimator

The recursive approach proposed in this paper is based on a sliding window that moves forward, replacing old items as new data items arrive. Note that, according to Babcock et al. (2002), sliding windows are a natural method for the analysis of data streams with the specific property of emphasising recent data; moreover, they are particularly useful in situations in which an excerpt of the stream is of interest at any given time (e.g. running hourly and daily data) (Golab and Ozsu, 2003).
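As an illustration of this idea (a minimal sketch under simplifying assumptions, not the paper's exact algorithm), the following Python fragment maintains the scaling coefficients of a Haar-based linear estimator over a window of the last W items: each arriving item adds its projection, the item leaving the window removes its own, and only the coefficients whose compact supports contain one of these two points are touched.

```python
from collections import deque
import numpy as np


class SlidingWindowHaarDensity:
    """Minimal sketch (not the paper's exact algorithm): a linear
    orthogonal-series estimator with Haar scaling functions
    phi_{j0,k}(x) = 2^(j0/2) on [k/2^j0, (k+1)/2^j0), maintained over a
    sliding window of the last W items. Data are assumed rescaled to [0, 1)."""

    def __init__(self, j0=5, window=1000):
        self.j0 = j0                       # resolution level of the estimate
        self.window = window               # sliding-window length W
        self.counts = np.zeros(2 ** j0)    # window counts per Haar support
        self.items = deque()               # raw items currently in the window

    def _k(self, x):
        # Haar supports are disjoint, so each sample touches exactly one k.
        return min(int(x * 2 ** self.j0), 2 ** self.j0 - 1)

    def update(self, x_new):
        """Add one arriving item and discount the item leaving the window;
        only the (at most two) affected coefficients are reevaluated."""
        self.items.append(x_new)
        self.counts[self._k(x_new)] += 1
        if len(self.items) > self.window:
            x_old = self.items.popleft()
            self.counts[self._k(x_old)] -= 1

    def density(self, x):
        """Evaluate f_hat(x) = sum_k c_{j0,k} * phi_{j0,k}(x), where
        c_{j0,k} = 2^(j0/2) * counts[k] / (current window size)."""
        n = max(len(self.items), 1)
        k = np.clip((np.asarray(x) * 2 ** self.j0).astype(int),
                    0, 2 ** self.j0 - 1)
        return (2 ** self.j0) * self.counts[k] / n


# Usage: feed a (rescaled) stream item by item and query the running estimate.
est = SlidingWindowHaarDensity(j0=5, window=1000)
for x in np.random.beta(2, 5, size=5000):
    est.update(x)
f_hat = est.density(np.linspace(0, 1, 200, endpoint=False))
```

Compactly supported wavelets such as the Daubechies families behave analogously, except that each item overlaps a small, fixed number of translates k at each level rather than exactly one.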

The

Simulation experiment

In order to compare the performance of the proposed online density estimator, we initially construct simulated streaming data using three different mixture distributions, drawing 4000, 2000 and 2000 samples from them, respectively. Note that, since the benchmark estimator (the approach proposed by Caudle and Wegman, 2009) used for performance comparisons requires an initial estimate of the underlying density, it is necessary to increase the number of timestamps evaluated for the first
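Purely as an illustration of how such a non-stationary stream can be assembled (the mixtures below are hypothetical placeholders, not the distributions actually used in the experiment), the following sketch concatenates 4000, 2000 and 2000 samples from three different Gaussian mixtures, so that the underlying density changes twice over the stream.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_mixture(n, weights, means, stds, rng):
    """Draw n samples from a univariate Gaussian mixture."""
    comp = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(np.asarray(means)[comp], np.asarray(stds)[comp])


# Hypothetical mixtures (illustration only): the underlying distribution
# changes at timestamps 4000 and 6000, as in the experimental set-up.
segments = [
    sample_mixture(4000, [0.5, 0.5], [-2.0, 2.0], [0.5, 0.5], rng),
    sample_mixture(2000, [0.3, 0.7], [-1.0, 3.0], [0.8, 0.4], rng),
    sample_mixture(2000, [1.0],      [0.0],       [1.0],      rng),
]
stream = np.concatenate(segments)   # 8000-item non-stationary data stream
```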

Final remarks

In recent years, there has been a significant emergence of applications involving streaming data. For these applications, an important issue to be addressed is the estimation of the underlying probability density of the data. In this paper, the problem of density estimation in the context of data streams is investigated following the concept of sliding windows and wavelet-based orthogonal series estimators. In that sense, a novel online algorithm for density estimation of data streams is

Acknowledgement

The first author gratefully acknowledges the financial support from the National Council on Science and Technology (CONACYT) Mexico.

References (39)

  • K.A. Caudle et al., Discounting older data, Wiley Interdisciplinary Reviews: Computational Statistics (2011)
  • N. Céncov, Evaluation of an unknown distribution density from observations, Soviet Mathematics Doklady (1962)
  • I. Daubechies et al., Two-scale difference equations II. Local regularity, infinite products of matrices and fractals, SIAM Journal on Mathematical Analysis (1992)
  • P. Domingos et al., A general framework for mining massive data streams, Journal of Computational and Graphical Statistics (2003)
  • D. Donoho et al., Minimax estimation via wavelet shrinkage, The Annals of Statistics (1998)
  • D. Donoho et al., Density estimation by wavelet thresholding, The Annals of Statistics (1996)
  • L. Golab, M.T. Ozsu, Issues in data stream management, in: ACM Special Interest Group on Management of Data... (2003)
  • P. Hall, On the rate of convergence of orthogonal series density estimators, Journal of the Royal Statistical Society. Series B (Methodological) (1986)
  • P. Hall et al., Formulae for mean integrated squared error of nonlinear wavelet-based density estimators, The Annals of Statistics (1995)