Online wavelet-based density estimation for non-stationary streaming data

https://doi.org/10.1016/j.csda.2011.08.009

Abstract

There has been a significant emergence of applications in which data arrive in an online, time-varying fashion (e.g. computer network traffic, sensor data, web searches, ATM transactions) and in which it is not feasible to exchange or store all the arriving data in traditional database systems in order to operate on them. For these applications, as for traditional static database schemes, density estimation is a fundamental building block for data analysis. A novel online approach for probability density estimation based on wavelet bases, suitable for applications involving rapidly changing streaming data, is presented. The proposed approach is based on a recursive formulation of the wavelet-based orthogonal estimator using a sliding window and includes an optimised procedure for reevaluating only the relevant scaling and wavelet functions each time new data items arrive. The algorithm is tested and compared using both simulated and real-world data.

Introduction

Traditional database and data processing systems are based on the storage and analysis of static records which generally have no predefined notion of time (Golab and Ozsu, 2003). In those models, conceived as fixed static data archives, data take the form of persistent relations that require persistent data storage as well as complex querying operations (Babcock et al., 2002, Golab and Ozsu, 2003). Recently, a new class of applications has emerged in which information occurs in the form of a sequence and a large amount of data is generated at a rapid rate; notable examples are sensor networks, network traffic analysis, financial tickers, web clicks and transaction log analysis. What these applications have in common is that data arrive in a continuous, rapid and time-varying fashion and that immediate processing and analysis of large transient data streams is required. In order to manage the issues raised by this rapidly changing data, new data model paradigms have been introduced in the literature; see for example the work of Babcock et al. (2002), Golab and Ozsu (2003) and Domingos and Hulten (2003). These new data models, referred to by Babcock et al. (2002) as data stream models, are essentially founded on the infeasibility of storing complete streams (which are considered potentially unbounded in size); as a result, time-based processing and querying operations are required to be performed as new data items arrive. Specifically, according to Golab and Ozsu (2003), a data stream is a real-time, continuous, ordered sequence of items whose order is implicit when it is given by arrival time and explicit when it is indicated by timestamps.

In data processing and analysis systems, density estimation is a fundamental building block for both fixed static data-based and streaming data-based applications. According to the data model paradigm in which they are framed, i.e. fixed static or streaming, two main variants of density estimators can be distinguished: batch-processing techniques and online-processing methods, respectively. In batch-processing algorithms, the underlying density is obtained by processing all the data at the same time. In online-processing methods, by contrast, data items are processed as they become available, i.e. as they arrive.

The problem of estimating density functions for fixed static data models has been thoroughly studied in the literature; see for example Scott (1992), Vannucci (1995) and Silverman (1998). However, for the case of data streams and within the framework of nonparametric estimators, only the following publications have suggested possible online solutions: Wegman and Marchette (2003), Procopiuc and Procopiuc (2005), Heinz and Seeger (2005), Wegman and Caudle (2006), Heinz (2007), Boedihardjo et al. (2008) and Caudle and Wegman (2009). Note that the density estimation problem in the context of data streams differs from the fixed static case in three fundamental aspects, all of them implicitly related to the data stream paradigm described above. First, memory storage is limited, which means that the estimator cannot recompute the entire density from all the available data samples every time new data items arrive. Second, the time-based order of data items is relevant in this context, which means that the temporal relation of the data should be considered by the estimator. Third, the estimation should be fast enough to update the density estimate before new data arrive. For these three reasons, a feasible data stream density estimator should process the incoming data in a recursive fashion, considering the inclusion of new data as well as the discounting of old data.

Wavelet-based estimators belong to the class of orthogonal series estimators, a general class of nonparametric methods introduced in the literature by Céncov (1962), of which the two most representative and widely applied variants are based on Fourier and Hermite orthogonal basis functions. The most relevant characteristic of traditional orthogonal estimators is their computational simplicity. In contrast, their major drawback is their inability to estimate local properties of the underlying density, since they are based on global basis functions (i.e. Fourier and Hermite basis functions). Since wavelets are localised basis functions (in time and frequency), this problem is overcome by wavelet-based estimators, which allow local learning and local manipulation of the estimated density. Furthermore, wavelet-based estimators offer more flexibility in terms of convergence and smoothness due to the availability of several families of orthogonal wavelet functions that can be used in the estimator. In general, as Safavi et al. (2004) pointed out, since wavelet-based estimators inherit the advantages of wavelets and multiresolution analysis, they are superior to other orthogonal estimators.
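For reference, the linear wavelet-based orthogonal series estimator used throughout this literature takes the following general form (written here in generic notation; the paper's own notation is fixed in Section 2): given a sample X_1, ..., X_n,

$$
\hat f(x)=\sum_{k}\hat c_{j_0,k}\,\phi_{j_0,k}(x)+\sum_{j=j_0}^{j_1}\sum_{k}\hat d_{j,k}\,\psi_{j,k}(x),
\qquad
\hat c_{j_0,k}=\frac{1}{n}\sum_{i=1}^{n}\phi_{j_0,k}(X_i),\quad
\hat d_{j,k}=\frac{1}{n}\sum_{i=1}^{n}\psi_{j,k}(X_i),
$$

where $\phi_{j,k}(x)=2^{j/2}\phi(2^{j}x-k)$ and $\psi_{j,k}(x)=2^{j/2}\psi(2^{j}x-k)$ are the dilated and translated scaling and wavelet functions, respectively.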

Wavelet-based estimators have been intensively investigated in the literature; see for example Vannucci (1995), Donoho et al. (1996), Herrick et al. (2001) and Safavi et al. (2004). However, most of the techniques available in the literature are based on a batch-processing concept, and only a few algorithms, such as the ones investigated by Heinz and Seeger (2005), Wegman and Caudle (2006) and Caudle and Wegman (2009), are based on the online-processing framework. In particular, Wegman and Caudle (2006) and Caudle and Wegman (2009) proposed a recursive wavelet-based density estimator that consists of two main parts. First, an initial estimate of the underlying density is obtained using traditional wavelet density estimation techniques. Second, this initial estimate is recursively updated as new data arrive, considering both the addition of new data and an exponential discounting of the old data. Specifically, the key idea behind the updating procedure relies on the fact that, in orthogonal series estimators, the coefficients can be approximated by the expectation of the projections of each data item onto the orthogonal basis functions. Additionally, in Caudle and Wegman (2009), a static block discounting is also proposed in which the parameter that controls the block discounting procedure is determined according to the degree of stationarity of the data using the Kolmogorov–Smirnov test (KS-test). Although this general technique is simple and effective, it is not fully adequate for applications in which arriving blocks or subsets of data require the same level of emphasis/importance. Consider, for instance, applications such as environmental monitoring, where air quality standards take into account exposure time to pollutants and "running" estimates are required for specific periods of time (e.g. the last 8 and 24 h, depending on the pollutant). On the other hand, the algorithm presented by Heinz and Seeger (2005) is derived from the framework for maintaining nonparametric estimators over data streams initially proposed by Blohsfeld et al. (2005), whose main idea is to partition the data stream into blocks and to construct an estimator for each block. The next step is the merging of all these "block" estimators into an overall estimator. Finally, this overall estimator is compressed in order to ensure the consumption of a constant amount of memory. The major drawback of this algorithm is that data items cannot be processed as they arrive, in a real-time fashion; instead, they are handled block by block, with density estimates available only after each block is completed. Furthermore, this approach does not consider any discounting procedure for old data, and for that particular reason it is of limited use for non-stationary data streams.
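A schematic form of this kind of recursive update, written here for a generic scaling coefficient and with the discount factor denoted by $\theta$ (the exact parameterisation used by Wegman and Caudle may differ), is

$$
\hat c_{j_0,k}^{(n+1)}=(1-\theta)\,\hat c_{j_0,k}^{(n)}+\theta\,\phi_{j_0,k}(x_{n+1}),\qquad 0<\theta<1,
$$

so that the contribution of an item received m timestamps earlier is weighted by a factor proportional to $(1-\theta)^{m}$, i.e. old data are discounted exponentially.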

A novel online density estimator for non-stationary streaming data is addressed in this paper. The proposed algorithm is derived from: (1) the well documented wavelet-based density estimation framework for fixed static data studied by Vannucci (1995), Vidakovic (1999) and Herrick et al. (2001), (2) the recursive estimator proposed by Wegman and Caudle (2006) and Caudle and Wegman (2009), and (3) the data stream model concepts described by Babcock et al. (2002) and Golab and Ozsu (2003). The approach presented in this paper considers the online updating of the estimator as well as the selective reevaluation of its coefficients for newly arriving data items. The proposed estimator differs fundamentally from the methods reported by Wegman and Caudle (2006) and by Caudle and Wegman (2009) in that it is based on sliding window concepts suitable for non-stationary streaming data, instead of landmark windows, which are suitable for stationary cases. In that sense, the estimator reported here is a novel estimator whose estimation capabilities are particularly oriented towards density estimation for non-stationary streaming data.

The contribution of this paper is then twofold. First, the technique proposed for updating the estimator coefficients, which includes both the inclusion of new data items and the discounting of old information using the concept of sliding windows, has not been applied in this context before. The second contribution is the optimisation procedure for the selective reevaluation of the estimator coefficients. Note that none of the work published so far addressing wavelet-based density estimation over data streams has considered the fundamental difference between batch and online processing in the selective reevaluation of wavelet coefficients. It is important to highlight that the computational cost of such an evaluation can be substantially reduced by observing that some of the estimator coefficients/parameters remain unaltered from one timestamp to the next. This consideration, within the framework of wavelet-based orthogonal estimators, is completely new. The results reported in this work clearly show that the proposed estimator outperforms the technique proposed by Caudle and Wegman (2009) in its capability to adapt to cases in which the underlying distribution changes over time. This improved capability makes the proposed density estimation algorithm potentially applicable in non-stationary settings.
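In terms of the generic notation above, the selective reevaluation exploits the compact support of the basis functions (this is a sketch of the underlying condition rather than the paper's formal statement): when a new item $x_{\text{new}}$ enters the sliding window and an old item $x_{\text{old}}$ leaves it, a coefficient $\hat c_{j_0,k}$ (and analogously $\hat d_{j,k}$) has to be recomputed only if

$$
x_{\text{new}}\in\operatorname{supp}(\phi_{j_0,k})
\quad\text{or}\quad
x_{\text{old}}\in\operatorname{supp}(\phi_{j_0,k});
$$

all remaining coefficients carry over unchanged from one timestamp to the next.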

The rest of the paper is organised as follows. In Section 2, orthogonal wavelet decomposition is briefly reviewed from a very general point of view. Additionally, the fundamental concepts behind orthogonal series estimators and wavelet-based density estimators for static data are also introduced in this section. Afterwards, the algorithm proposed for online density estimation is presented in Section 3. Section 4 includes both the simulated and real-world data experiments performed to evaluate the proposed framework and their corresponding results. Finally, in Section 5, the conclusions of this work are presented.

Section snippets

Wavelets

Wavelet analysis is a well-established discipline that has been widely applied in a great variety of applications. Among the most relevant ones, we could cite: data compression, signal filtering and denoising, image processing, as well as pattern recognition and system identification. The basic concept in wavelet transforms, as Bruce et al. (2002) emphasise, is the projection of data onto a basis of wavelet functions in order to separate fine-scale and large-scale information. Particularly, data

Proposed online wavelet-based orthogonal series estimator

The recursive approach proposed in this paper is based on a sliding window that moves forward, replacing old items as new data items arrive. Note that, according to Babcock et al. (2002), sliding windows are a natural method for the analysis of data streams with the specific property of emphasising recent data; moreover, they are particularly useful in situations in which an excerpt of the stream is of interest at any given time (e.g. running hourly and daily data) (Golab and Ozsu, 2003).
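As an illustration of this idea (a minimal sketch under simplifying assumptions, not the paper's exact algorithm), the following Python fragment maintains the scaling coefficients of a Haar-based linear estimator over a window of the last W items: each arriving item adds its projection, the item leaving the window removes its own, and only the coefficients whose compact supports contain one of these two points are touched.

```python
from collections import deque
import numpy as np


class SlidingWindowHaarDensity:
    """Minimal sketch (not the paper's exact algorithm): a linear
    orthogonal-series estimator with Haar scaling functions
    phi_{j0,k}(x) = 2^(j0/2) on [k/2^j0, (k+1)/2^j0), maintained over a
    sliding window of the last W items. Data are assumed rescaled to [0, 1)."""

    def __init__(self, j0=5, window=1000):
        self.j0 = j0                       # resolution level of the estimate
        self.window = window               # sliding-window length W
        self.counts = np.zeros(2 ** j0)    # window counts per Haar support
        self.items = deque()               # raw items currently in the window

    def _k(self, x):
        # Haar supports are disjoint, so each sample touches exactly one k.
        return min(int(x * 2 ** self.j0), 2 ** self.j0 - 1)

    def update(self, x_new):
        """Add one arriving item and discount the item leaving the window;
        only the (at most two) affected coefficients are reevaluated."""
        self.items.append(x_new)
        self.counts[self._k(x_new)] += 1
        if len(self.items) > self.window:
            x_old = self.items.popleft()
            self.counts[self._k(x_old)] -= 1

    def density(self, x):
        """Evaluate f_hat(x) = sum_k c_{j0,k} * phi_{j0,k}(x), where
        c_{j0,k} = 2^(j0/2) * counts[k] / (current window size)."""
        n = max(len(self.items), 1)
        k = np.clip((np.asarray(x) * 2 ** self.j0).astype(int),
                    0, 2 ** self.j0 - 1)
        return (2 ** self.j0) * self.counts[k] / n


# Usage: feed a (rescaled) stream item by item and query the running estimate.
est = SlidingWindowHaarDensity(j0=5, window=1000)
for x in np.random.beta(2, 5, size=5000):
    est.update(x)
f_hat = est.density(np.linspace(0, 1, 200, endpoint=False))
```

Compactly supported wavelets such as the Daubechies families behave analogously, except that each item overlaps a small, fixed number of translates k at each level rather than exactly one.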

The

Simulation experiment

In order to compare the performance of the proposed online density estimator, we initially construct simulated streaming data using three different mixture distributions, drawing 4000, 2000 and 2000 samples from them, respectively. Note that, since the benchmark estimator (the approach proposed by Caudle and Wegman, 2009) used for performance comparisons requires an initial estimate of the underlying density, it is necessary to increase the number of timestamps evaluated for the first
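Purely as an illustration of how such a non-stationary stream can be assembled (the mixtures below are hypothetical placeholders, not the distributions actually used in the experiment), the following sketch concatenates 4000, 2000 and 2000 samples from three different Gaussian mixtures, so that the underlying density changes twice over the stream.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_mixture(n, weights, means, stds, rng):
    """Draw n samples from a univariate Gaussian mixture."""
    comp = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(np.asarray(means)[comp], np.asarray(stds)[comp])


# Hypothetical mixtures (illustration only): the underlying distribution
# changes at timestamps 4000 and 6000, as in the experimental set-up.
segments = [
    sample_mixture(4000, [0.5, 0.5], [-2.0, 2.0], [0.5, 0.5], rng),
    sample_mixture(2000, [0.3, 0.7], [-1.0, 3.0], [0.8, 0.4], rng),
    sample_mixture(2000, [1.0],      [0.0],       [1.0],      rng),
]
stream = np.concatenate(segments)   # 8000-item non-stationary data stream
```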

Final remarks

In recent years, there has been a significant emergence of applications involving streaming data. For these applications, an important issue to be addressed is the estimation of the underlying probability density of the data. In this paper, the problem of density estimation in the context of data streams is investigated following the concept of sliding windows and wavelet-based orthogonal series estimators. In that sense, a novel online algorithm for density estimation of data streams is

Acknowledgement

The first author gratefully acknowledges the financial support from the National Council on Science and Technology (CONACYT) Mexico.

References (39)

  • K.A. Caudle et al., Discounting older data, Wiley Interdisciplinary Reviews: Computational Statistics (2011)
  • N. Céncov, Evaluation of an unknown distribution density from observations, Soviet Mathematics Doklady (1962)
  • I. Daubechies et al., Two-scale difference equations II. Local regularity, infinite products of matrices and fractals, SIAM Journal on Mathematical Analysis (1992)
  • P. Domingos et al., A general framework for mining massive data streams, Journal of Computational and Graphical Statistics (2003)
  • D. Donoho et al., Minimax estimation via wavelet shrinkage, The Annals of Statistics (1998)
  • D. Donoho et al., Density estimation by wavelet thresholding, The Annals of Statistics (1996)
  • L. Golab, M.T. Ozsu, Issues in data stream management, in: ACM Special Interest Group on Management of Data... (2003)
  • P. Hall, On the rate of convergence of orthogonal series density estimators, Journal of the Royal Statistical Society. Series B (Methodological) (1986)
  • P. Hall et al., Formulae for mean integrated squared error of nonlinear wavelet-based density estimators, The Annals of Statistics (1995)