Methods to Investigate Concept Drift in Big Data Streams

Nidhi; Mangat, Veenu; Gupta, Vishal; Vig, Renu

doi:10.1007/978-981-10-6680-1_3


Abstract

The explosion of information from social networking sites, Web clickstreams, information retrieval, customer records, user reviews, business transactions, network event logs, and similar sources results in a continuous deluge of data arriving at different rates, called streaming data. Organizing, indexing, analyzing, or mining hidden knowledge from such a data deluge is a critical functionality for a broad range of content analysis tasks, including emerging topic detection, interesting content identification, user interest profiling, and real-time Web search. Managing such ‘Big Data’ becomes even more challenging when streaming data must be analyzed and results produced in real time. Streaming data may contain numeric, categorical, or mixed values. Most current research addresses numeric data streams by exploiting the statistical properties of numeric data, but categorical and textual data streams have also gained researchers’ interest because of the high availability of textual data on the Internet. Classification is an unrealistic approach to managing data streams because not every incoming data item has a class label, so clustering is applied to unlabeled data streams instead. One property that can affect the results of any clustering algorithm is concept drift. Detecting and managing concept drift over time therefore poses a great challenge for cluster analysis. This chapter provides an in-depth critique of various algorithms that have been introduced to handle concept drift in a real environment. A framework for examining concept drift in big data streams is also proposed.
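As a rough illustration of what concept drift can look like in an unlabeled categorical stream, the sketch below compares the category frequencies of consecutive fixed-size windows and flags a window when the distribution shifts sharply. It is not the framework proposed in the chapter. The function names, the window size, the threshold, and the choice of total variation distance are all assumptions made for this example.

```python
from collections import Counter
from typing import Hashable, Iterable, List


def total_variation(p: Counter, q: Counter) -> float:
    """Total variation distance between two empirical category distributions."""
    n_p, n_q = sum(p.values()), sum(q.values())
    categories = set(p) | set(q)
    # Counter returns 0 for unseen categories, so missing keys are safe.
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)


def detect_drift(stream: Iterable[Hashable], window: int, threshold: float) -> List[int]:
    """Return the start index of each window whose category distribution
    differs from the previous window's by more than `threshold`.

    Hypothetical helper for illustration only; `window` and `threshold`
    are assumed tuning parameters, not values from the chapter.
    """
    drift_points: List[int] = []
    previous = None
    buffer = []
    for i, item in enumerate(stream):
        buffer.append(item)
        if len(buffer) == window:
            current = Counter(buffer)
            if previous is not None and total_variation(previous, current) > threshold:
                drift_points.append(i - window + 1)
            previous = current
            buffer = []  # non-overlapping windows; trailing partial window is ignored
    return drift_points


# Synthetic stream: the categories change abruptly at index 20.
stream = ["a", "b"] * 10 + ["c", "d"] * 10
print(detect_drift(stream, window=10, threshold=0.5))  # prints [20]
```

Non-overlapping windows keep the sketch short. They detect abrupt changes such as the one in the synthetic stream, but a gradual drift spread over many windows could stay below the threshold at every step.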



Author information

Authors and Affiliations

UIET, Panjab University, Chandigarh, India
Nidhi, Veenu Mangat, Vishal Gupta & Renu Vig


Corresponding author

Correspondence to Nidhi.

Editor information

Editors and Affiliations

Computer Science and Engineering, VIT University, Vellore, Tamil Nadu, India
S. Margret Anouncia
The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Odense, Denmark
Uffe Kock Wiil


About this chapter

Cite this chapter

Nidhi, Mangat, V., Gupta, V., Vig, R. (2018). Methods to Investigate Concept Drift in Big Data Streams. In: Margret Anouncia, S., Wiil, U. (eds) Knowledge Computing and Its Applications. Springer, Singapore. https://doi.org/10.1007/978-981-10-6680-1_3

DOI: https://doi.org/10.1007/978-981-10-6680-1_3
Published: 16 February 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6679-5
Online ISBN: 978-981-10-6680-1
eBook Packages: Computer Science, Computer Science (R0)
