Abstract
The explosion of information from social networking sites, Web clickstreams, information retrieval, customers’ records, users’ reviews, business transactions, network event logs, and similar sources results in a continuous deluge of data generated at different rates, called streaming data. Organizing, indexing, analyzing, or mining hidden knowledge from such a data deluge is a critical functionality for a broad range of content analysis tasks, including emerging topic detection, interesting content identification, user interest profiling, and real-time Web search. Managing such ‘Big Data’ becomes even more challenging when streaming data must be analyzed and results produced in real time. Streaming data may include numeric, categorical, or mixed values. Most current research has addressed numeric data streams by exploiting the statistical properties of numeric data. Categorical and textual data streams have also gained researchers’ interest because of the high availability of textual data on the Internet. Applying classification to manage data streams is unrealistic because not every incoming data item carries a class label; for such unlabeled data streams, a clustering technique is applied instead. One property that can affect the results of any clustering algorithm is concept drift. Detecting and managing concept drift over time therefore poses a great challenge for cluster analysis. This chapter provides an in-depth critique of various algorithms that have been introduced to handle concept drift in a real environment. A framework for examining concept drift in big data streams is also proposed.
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Nidhi, Mangat, V., Gupta, V., Vig, R. (2018). Methods to Investigate Concept Drift in Big Data Streams. In: Margret Anouncia, S., Wiil, U. (eds) Knowledge Computing and Its Applications. Springer, Singapore. https://doi.org/10.1007/978-981-10-6680-1_3
DOI: https://doi.org/10.1007/978-981-10-6680-1_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6679-5
Online ISBN: 978-981-10-6680-1
eBook Packages: Computer Science (R0)