Copyright © 2005 Elsevier B.V. All rights reserved.
Online clustering of parallel data streams
Received 10 March 2005;
Abstract
In recent years, the management and processing of so-called data streams has become a topic of active research in several fields of computer science, such as distributed systems, database systems, and data mining. A data stream can roughly be thought of as a transient, continuously increasing sequence of time-stamped data. In this paper, we consider the problem of clustering parallel streams of real-valued data, that is to say, continuously evolving time series. In other words, we are interested in grouping data streams whose evolution over time is similar in a specific sense. To maintain an up-to-date clustering structure, the incoming data must be analyzed in an online manner, tolerating no more than a constant time delay. For this purpose, we develop an efficient online version of the classical K-means clustering algorithm. Our method’s efficiency is mainly due to a scalable online transformation of the original data, which allows for a fast computation of approximate distances between streams.
Keywords: Data mining; Clustering; Data streams; Fuzzy sets
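The idea of computing approximate distances between streams from a transformed representation can be illustrated with a small sketch. It assumes, based only on the section titles in the outline, that each stream is represented by a normalized sliding window and that the transformation is a discrete Fourier transform truncated to its first few coefficients. The function names (normalize, dft_prefix, approx_distance, exact_distance) and the parameter m are hypothetical and are not taken from the article. The sketch recomputes the coefficients from scratch and does not show an incremental update.

```python
import cmath
import math


def normalize(window):
    """Rescale a window to zero mean and unit standard deviation (assumed scheme)."""
    n = len(window)
    mean = sum(window) / n
    var = sum((x - mean) ** 2 for x in window) / n
    std = math.sqrt(var) or 1.0  # a constant window would otherwise divide by zero
    return [(x - mean) / std for x in window]


def dft_prefix(window, m):
    """Return coefficients 1..m of the unitary DFT of a real-valued window.

    Coefficient 0 is skipped because it is zero for a zero-mean window.
    This is the naive O(w * m) computation, not an online update.
    """
    w = len(window)
    scale = 1.0 / math.sqrt(w)
    return [
        scale * sum(x * cmath.exp(-2j * math.pi * f * j / w)
                    for j, x in enumerate(window))
        for f in range(1, m + 1)
    ]


def approx_distance(coeffs_x, coeffs_y):
    """Approximate Euclidean distance from truncated DFT coefficients.

    For real signals, coefficient w - f is the complex conjugate of
    coefficient f, so each retained difference counts twice. With
    m < w / 2 the result never exceeds the exact distance (Parseval).
    """
    return math.sqrt(2.0 * sum(abs(a - b) ** 2
                               for a, b in zip(coeffs_x, coeffs_y)))


def exact_distance(x, y):
    """Plain Euclidean distance between two equally long windows."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Two illustrative windows of length 64 (synthetic data, for demonstration only).
x = normalize([math.sin(0.2 * t) for t in range(64)])
y = normalize([math.sin(0.2 * t + 0.5) + 0.1 * math.cos(3.0 * t) for t in range(64)])

print("approximate:", approx_distance(dft_prefix(x, 4), dft_prefix(y, 4)))
print("exact:      ", exact_distance(x, y))
```

Under these assumptions, a K-means step could compare the m retained coefficients per stream instead of all w window values, which is where the saving in distance computation would come from.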
Article Outline
- 1. Introduction
- 2. Background
- 2.1. The data stream model
- 2.2. Clustering
- 2.3. Related work
- 3. Preprocessing and maintaining data streams
- 3.1. Data streams and sliding windows
- 3.2. Normalization
- 3.3. Discrete Fourier transform
- 3.4. Computation of DFT coefficients
- 3.5. Distance approximation and smoothing
- 4. Clustering data streams
- 5. Fuzzy clustering
- 6. Implementation
- 7. Experimental validation
- 7.1. Synthetic data
- 7.1.1. Efficiency
- 7.1.2. Quality
- 7.2. Real-world data: stock rates
- 8. Summary
- Acknowledgements
- References
- Vitae












