Computer Networks
Volume 50, Issue 10, 14 July 2006, Pages 1488-1512
I. Web Dynamics

doi:10.1016/j.comnet.2005.10.021
Copyright © 2005 Elsevier B.V. All rights reserved.

A framework for mining evolving trends in Web data streams using dynamic learning and retrospective validation

Olfa Nasraoui (a, corresponding author), Carlos Rojas (a) and Cesar Cardona (b, 1)

(a) Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY 40292, United States
(b) Magnify Inc., Chicago, United States

Available online 27 December 2005.


Abstract

The expanding and dynamic nature of the Web poses enormous challenges to most data mining techniques that try to extract patterns from Web data, such as Web usage and Web content. While scalable data mining methods are expected to cope with the size challenge, coping with evolving trends in noisy data continuously, without unnecessary stoppages and reconfigurations, is still an open challenge. This dynamic, single-pass setting can be cast within the framework of mining evolving data streams. The harsh restrictions imposed by the “you only get to see it once” constraint on stream data call for different computational models, which may furthermore bring some interesting surprises in the behavior of some well-known similarity measures during clustering, and even validation. In this paper, we study the effect of similarity measures on the mining process and on the interpretation of the mined patterns under the harsh single-pass requirement. We propose a simple similarity measure, MinPC, that has the advantage of explicitly coupling the precision and coverage criteria to the early learning stages. Even though the cosine similarity and its close relatives, such as the Jaccard measure, have been prevalent in the majority of Web data clustering approaches, they may fail to explicitly seek profiles that achieve high coverage and high precision simultaneously. We also formulate a validation strategy and adapt several metrics rooted in information retrieval to the challenging task of validating a learned stream synopsis in dynamic environments.
Our experiments confirm that the MinPC similarity generally performs better than the cosine similarity, and that this advantage can be expected to be more pronounced for data sets that are more challenging in terms of the amount of noise and/or overlap, and in terms of the level of change in the underlying profiles/topics (known sub-categories of the input data) as the input stream unfolds. In our simulations, we study the task of mining and tracking trends and profiles in evolving text and Web usage data streams in a single pass, and under different trend sequencing scenarios.
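
The abstract names MinPC but does not define it. As a rough illustration only, the sketch below assumes (from the name and from the stated goal of coupling precision and coverage) that MinPC is the minimum of precision and coverage computed on binary item sets, for example the URLs in a session and in a profile. The function names, the direction of the precision/coverage ratios, and the example data are all hypothetical and are not taken from the paper. On binary sets the cosine similarity equals the geometric mean of these two ratios, so it can stay moderately high when one of them is low, whereas a minimum cannot.

```python
import math


def precision(session: set, profile: set) -> float:
    # Assumed definition: fraction of the profile's items present in the session.
    return len(session & profile) / len(profile) if profile else 0.0


def coverage(session: set, profile: set) -> float:
    # Assumed definition: fraction of the session's items present in the profile.
    return len(session & profile) / len(session) if session else 0.0


def min_pc(session: set, profile: set) -> float:
    # Assumed form of MinPC: high only when precision AND coverage are both high.
    return min(precision(session, profile), coverage(session, profile))


def cosine(session: set, profile: set) -> float:
    # Cosine similarity of the two sets viewed as binary vectors.
    if not session or not profile:
        return 0.0
    return len(session & profile) / math.sqrt(len(session) * len(profile))


def jaccard(session: set, profile: set) -> float:
    # Jaccard measure: shared items over all items in either set.
    union = session | profile
    return len(session & profile) / len(union) if union else 0.0


if __name__ == "__main__":
    # Hypothetical data: a short session fully contained in a broad profile.
    session = {"a", "b"}
    profile = set("abcdefgh")
    print("precision:", precision(session, profile))
    print("coverage: ", coverage(session, profile))
    print("MinPC:    ", min_pc(session, profile))
    print("cosine:   ", cosine(session, profile))
    print("Jaccard:  ", jaccard(session, profile))
```

In this toy case the session is perfectly covered by the profile, yet the profile is four times larger than the session. Under the assumed definitions, the cosine value is twice the minimum-based value, which illustrates why a measure that averages the two criteria may not penalize an imbalance between them as strongly.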

Keywords: Mining evolving data streams; Web clickstreams; Web mining; Text mining; User profiles

Article Outline

1. Introduction
1.1. Contributions of this paper
1.2. Organization of this paper
2. Tecno-streams (tracking evolving clusters in noisy streams)
2.1. Cloning in the dynamic immune system
2.2. Learning new data and relation to emerging trend detection
2.3. Tecno-streams: tracking evolving clusters in noisy data streams with a scalable immune system learning model
2.4. Example of learning an evolving synopsis from a noisy data stream
3. Mining evolving user profiles from noisy Web clickstream data
3.1. Similarity measures used in the learning phase of single-pass mining of clusters in Web data
3.2. Validation metrics for single-pass mining of evolving Web data streams
3.3. Validation methodology for single-pass mining of evolving Web data streams
4. Single-pass mining of evolving topics in text data
4.1. Simulation results on the 20 newsgroups data
4.2. Simulation results with single-pass mining of user profiles from real Web clickstream data
5. Conclusions
Acknowledgements
References
Vitae