Abstract
Various data mining approaches are now available that help in handling large static data sets despite limited computational resources. However, these approaches fall short when mining high-speed, endless streams: their learning procedures, though simple, require the entire training process to be repeated for each newly arriving instance. The main challenges in dealing with continuous data streams are that they are many times larger than the available memory, they arrive in real time, and each new instance should be inspected at most once while predictions must still be made. A further issue with continuous real-time data is that the underlying concepts change over time, a phenomenon often called concept drift. This paper addresses these problems by proposing a real-time, scalable, and robust architecture. It is a general-purpose architecture, based on online machine learning, which efficiently logs and mines stream data in a fault-tolerant manner. It consists of two frameworks: (1) an event aggregation framework, which reliably collects events and messages from multiple sources and ships them to a destination for processing; and (2) a real-time computation framework, which processes streams online to extract information patterns. The architecture guarantees reliable processing of billions of messages per second. Furthermore, it facilitates the evaluation of stream learning algorithms and offers change detection strategies to detect concept drift.
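The single-pass constraint and the drift detection described above can be illustrated with a minimal sketch. It is not the architecture proposed in the paper: the synthetic stream `event_stream`, the toy `MajorityLearner`, the function `run_prequential`, and all parameter values are hypothetical, and the Page-Hinkley test is used here only as one well-known change detector, since the abstract does not name the strategy the authors adopt. Each instance is inspected once, used first for prediction and then for training (test-then-train evaluation), and the model is discarded when the detector signals a change.

```python
from collections import defaultdict


def event_stream(n, drift_at):
    """Synthetic stand-in for an aggregated event stream (hypothetical).

    Yields (key, label) pairs. The labelling rule flips at index
    `drift_at`, which simulates an abrupt concept drift.
    """
    for i in range(n):
        key = i % 4
        label = key % 2 if i < drift_at else (key + 1) % 2
        yield key, label


class MajorityLearner:
    """Toy incremental learner: predicts the majority label seen per key."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])

    def predict(self, key):
        c0, c1 = self.counts[key]
        return 1 if c1 > c0 else 0

    def learn(self, key, label):
        self.counts[key][label] += 1


class PageHinkley:
    """Page-Hinkley test on a stream of 0/1 prediction errors.

    `delta` and `threshold` are illustrative values, not tuned settings.
    """

    def __init__(self, delta=0.005, threshold=10.0):
        self.delta = delta
        self.threshold = threshold
        self.reset()

    def reset(self):
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0
        self.cum_min = 0.0

    def update(self, x):
        """Feed one observation; return True when a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold


def run_prequential(n=2000, drift_at=1000):
    """Single pass over the stream: predict, check for drift, then train."""
    learner = MajorityLearner()
    detector = PageHinkley()
    correct = 0
    detections = []
    for i, (key, label) in enumerate(event_stream(n, drift_at)):
        error = int(learner.predict(key) != label)  # test first
        correct += 1 - error
        if detector.update(error):
            detections.append(i)
            learner = MajorityLearner()  # discard the outdated model
            detector.reset()
        learner.learn(key, label)  # then train on the same instance
    return correct / n, detections


if __name__ == "__main__":
    accuracy, detections = run_prequential()
    print(f"prequential accuracy: {accuracy:.4f}")
    print(f"drift signalled at stream indices: {detections}")
```

Memory use in this sketch depends on the number of distinct keys rather than on the length of the stream, which is the property a stream learner needs when the data are far larger than the available memory. Resetting the learner on detection is the simplest possible reaction to drift; an ensemble or an adaptive window would be an alternative design.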
Copyright information
© 2016 Springer Science+Business Media Singapore
About this paper
Cite this paper
Hussain, A.R., Hameed, M.A., Fatima, S. (2016). A Proposal: High-Throughput Robust Architecture for Log Analysis and Data Stream Mining. In: Saini, H., Sayal, R., Rawat, S. (eds) Innovations in Computer Science and Engineering. Advances in Intelligent Systems and Computing, vol 413. Springer, Singapore. https://doi.org/10.1007/978-981-10-0419-3_36
DOI: https://doi.org/10.1007/978-981-10-0419-3_36
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-0417-9
Online ISBN: 978-981-10-0419-3
eBook Packages: Engineering, Engineering (R0)