Methods to Investigate Concept Drift in Big Data Streams

  • Chapter
Knowledge Computing and Its Applications

Abstract

The explosion of information from social networking sites, Web clickstreams, information retrieval, customer records, user reviews, business transactions, network event logs, and similar sources generates a continuous deluge of data arriving at different rates, called streaming data. Organizing, indexing, analyzing, or mining hidden knowledge from such a deluge is a critical capability for a broad range of content analysis tasks, including emerging topic detection, interesting content identification, user interest profiling, and real-time Web search. Managing such ‘Big Data’ becomes even more challenging when the stream must be analyzed and results produced in real time. Streaming data may contain numeric, categorical, or mixed values. Most existing research addresses numeric data streams by exploiting their statistical properties, but categorical and textual data streams have also attracted researchers’ interest because so much data on the Internet is available in textual form. Classification is often impractical for managing data streams because not every incoming record has a class label, so clustering is applied to unlabeled data streams instead. One property that can affect the results of any clustering algorithm is concept drift. Detecting and managing concept drift over time therefore poses a great challenge for cluster analysis. This chapter provides an in-depth critique of algorithms that have been introduced to handle concept drift in real environments. A framework for examining concept drift in big data streams is also proposed.

Author information

Corresponding author

Correspondence to Nidhi.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Nidhi, Mangat, V., Gupta, V., Vig, R. (2018). Methods to Investigate Concept Drift in Big Data Streams. In: Margret Anouncia, S., Wiil, U. (eds) Knowledge Computing and Its Applications. Springer, Singapore. https://doi.org/10.1007/978-981-10-6680-1_3

  • DOI: https://doi.org/10.1007/978-981-10-6680-1_3

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-6679-5

  • Online ISBN: 978-981-10-6680-1

  • eBook Packages: Computer Science, Computer Science (R0)
