Copyright © 2005 Elsevier B.V. All rights reserved.
A framework for mining evolving trends in Web data streams using dynamic learning and retrospective validation
Available online 27 December 2005.
Abstract
The expanding and dynamic nature of the Web poses enormous challenges to most data mining techniques that try to extract patterns from Web data, such as Web usage and Web content. While scalable data mining methods are expected to cope with the size challenge, coping with evolving trends in noisy data in a continuous fashion, without any unnecessary stoppages and reconfigurations, is still an open challenge. This dynamic, single-pass setting can be cast within the framework of mining evolving data streams. The harsh restrictions imposed by the “you only get to see it once” constraint on stream data call for different computational models, which may furthermore bring some interesting surprises when it comes to the behavior of some well-known similarity measures during clustering, and even validation. In this paper, we study the effect of similarity measures on the mining process and on the interpretation of the mined patterns under the harsh single-pass requirement. We propose a simple similarity measure (MinPC) that has the advantage of explicitly coupling the precision and coverage criteria to the early learning stages. Even though the cosine similarity and its close relatives, such as the Jaccard measure, have been prevalent in the majority of Web data clustering approaches, they may fail to explicitly seek profiles that achieve high coverage and high precision simultaneously. We also formulate a validation strategy and adapt several metrics rooted in information retrieval to the challenging task of validating a learned stream synopsis in dynamic environments.
Our experiments confirm that the performance of the MinPC similarity is generally better than that of the cosine similarity, and that this advantage can be expected to be more pronounced for data sets that are more challenging in terms of the amount of noise and/or overlap, and in terms of the level of change in the underlying profiles/topics (known sub-categories of the input data) as the input stream unfolds. In our simulations, we study the task of mining and tracking trends and profiles in evolving text and Web usage data streams in a single pass, and under different trend sequencing scenarios.
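The contrast between the cosine similarity and a measure that couples precision and coverage can be illustrated with a minimal sketch. The abstract does not define MinPC, so the code below assumes, as the name suggests, that it is the minimum of precision and coverage computed on binary item sets. The function names, the set-based representation, and the direction of the precision and coverage definitions are all illustrative assumptions, not the paper's exact formulation.

```python
import math


def cosine(session: set, profile: set) -> float:
    """Cosine similarity between two binary item sets."""
    if not session or not profile:
        return 0.0
    return len(session & profile) / math.sqrt(len(session) * len(profile))


def precision(session: set, profile: set) -> float:
    """Assumed definition: fraction of the profile's items found in the session."""
    return len(session & profile) / len(profile) if profile else 0.0


def coverage(session: set, profile: set) -> float:
    """Assumed definition: fraction of the session's items found in the profile."""
    return len(session & profile) / len(session) if session else 0.0


def min_pc(session: set, profile: set) -> float:
    """Assumed MinPC: high only when precision AND coverage are both high."""
    return min(precision(session, profile), coverage(session, profile))


# Hypothetical example: a long session matched against a very short profile.
session = {"a", "b", "c", "d", "e", "f", "g", "h"}
profile = {"a", "b"}

print(cosine(session, profile))     # 2 / sqrt(8 * 2)
print(precision(session, profile))  # every profile item occurs in the session
print(coverage(session, profile))   # the profile explains only 2 of 8 session items
print(min_pc(session, profile))     # limited by the weaker of the two criteria
```

In this toy case the profile is fully precise but covers only a quarter of the session. Under the assumed definition, the minimum is bounded by the weaker criterion, whereas the cosine similarity averages the two geometrically and so scores the match more generously.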
Keywords: Mining evolving data streams; Web clickstreams; Web mining; Text mining; User profiles
Article Outline
- 1. Introduction
- 2. Tecno-streams (tracking evolving clusters in noisy streams)
- 2.1. Cloning in the dynamic immune system
- 2.2. Learning new data and relation to emerging trend detection
- 2.3. Tecno-streams: tracking evolving clusters in noisy data streams with a scalable immune system learning model
- 2.4. Example of learning an evolving synopsis from a noisy data stream
- 3. Mining evolving user profiles from noisy Web clickstream data
- 3.1. Similarity measures used in the learning phase of single-pass mining of clusters in Web data
- 3.2. Validation metrics for single-pass mining of evolving Web data streams
- 3.3. Validation methodology for single-pass mining of evolving Web data streams
- 4. Single-pass mining of evolving topics in text data
- 4.1. Simulation results on the 20 newsgroups data
- 4.2. Simulation results with single-pass mining of user profiles from real Web clickstream data
- 5. Conclusions
- Acknowledgements
- References
- Vitae













