Abstract
Usually, during data stream classifier learning, we assume that the labels of all incoming examples are available without any delay and are used to update the employed predictive model. Unfortunately, this assumption about access to all class labels is naive, as it requires a relatively high labeling budget. Therefore, methods that can train data stream classifiers on the basis of partially labeled data are highly desirable. Among them, active learning [1] seems to be a promising direction; it focuses on selecting only the most valuable learning examples to be labeled and used to produce an accurate predictive model. However, when designing such a system, we have to ensure that the chosen active learning strategy is able to handle changes in the data distribution and to adapt to them quickly. In this work, we focus on novel active learning strategies designed to tackle such changes effectively. We propose a novel active data stream classifier learning method based on the query-by-clustering approach. The experimental evaluation proves the usefulness of the proposed approach for reducing the labeling cost of classifiers for drifting data streams.
1 Introduction
In a nutshell, this work addresses one of the active learning approaches to decreasing the labeling cost of data stream classifiers. We propose to employ the so-called query-by-clustering approach in classifier training. The main contributions of this work are as follows:
- Presentation of a new active data stream classifier learning method.
- Experimental evaluation of the discussed approaches on the basis of diverse benchmark datasets.
The structure of this article is as follows. Firstly, we describe the proposition of a novel active learning algorithm dedicated to the classification of drifting data streams. Then we focus on the experimental evaluation of the proposed approaches. The final conclusions and propositions for future work are given thereafter.
1.1 Classification
Classification is an important task among those studied in the machine learning field. It is a supervised learning problem that consists in assigning classes to observations based on their attributes. This problem has been studied for quite a long time. Nevertheless, the applicability of classical classification in some areas is questionable. In the contemporary world, data is generated continuously and cannot be treated as in classical classification problems, where the whole dataset is known in advance. Classification systems should create models that are able to adapt to the state of the stream, which can change over time.
1.2 Data Streams
From the statistical point of view, data streams can be described as stochastic processes. An important remark is that such processes are continuous and their events are independent [2]. Because of the nature of data streams, one has to keep in mind that a different approach is needed than for traditional datasets. Samples arrive on-line and their number is potentially unbounded, so the built model has to be updated incrementally. Mechanisms for keeping the model up to date include, for instance, windowing or forgetting techniques [2, 3]. Another important remark is that data can arrive at different velocities; when a stream generates data at high velocity, sampling algorithms are needed [4].
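As a minimal illustration of the windowing mechanism mentioned above, a fixed-size sliding window can be kept with a bounded buffer. The class and method names here are ours, for illustration only, and not part of any particular stream-mining framework.

```python
from collections import deque

class SlidingWindow:
    """Keep only the n most recent samples; older ones are forgotten."""

    def __init__(self, n):
        self.buffer = deque(maxlen=n)

    def add(self, sample):
        # When the buffer is full, the oldest sample drops out automatically.
        self.buffer.append(sample)

    def contents(self):
        return list(self.buffer)
```

A model trained only on `contents()` after each arrival implicitly forgets old concepts, which is the simplest way to track a changing stream.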
Keeping the model fresh is an important task, but sudden changes, called concept drifts, can appear in the data stream. When such an event occurs, the model has to be rebuilt, as the existing one reflects the behavior of the data in the past and is no longer applicable.
1.3 Concept Drift
The aforesaid behavior, when changes in the data distribution appear, is called a concept drift [2, 5]. It can lead to a situation where the existing model is no longer relevant, or only partially accurate. Many real problems face such behavior, such as sensor network analysis [6], fraud detection [7], news categorization [7], or spam detection [7].
There are different ways of classifying concept drift. One can differentiate concept drifts based on the velocity of change: sudden and gradual [5]. Another taxonomy distinguishes drifts by their impact on the posterior probabilities: a real concept drift changes the posterior class probabilities, while a virtual one affects only the input data distribution.
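The two velocity-based drift types can be illustrated with a toy stream generator. This is a sketch of ours, not one of the benchmark generators used in the experiments; the decision boundary on a single feature flips at `drift_at`, either instantly (sudden) or with a growing probability (gradual).

```python
import random

def drifting_stream(length, drift_at, gradual_width=0, rng=random):
    """Yield (x, y) pairs; the concept switches at drift_at.

    gradual_width == 0 gives a sudden drift; a positive width lets the
    new concept take over progressively over that many steps.
    """
    for t in range(length):
        x = rng.random()
        if gradual_width == 0:
            concept_new = t >= drift_at              # sudden switch
        else:
            p = min(max((t - drift_at) / gradual_width, 0.0), 1.0)
            concept_new = rng.random() < p           # gradual takeover
        # Old concept: class 1 iff x > 0.5; new concept: the inverse.
        y = int(x <= 0.5) if concept_new else int(x > 0.5)
        yield x, y
```

A classifier trained before `drift_at` becomes systematically wrong afterwards, which is exactly why the model rebuilding discussed above is necessary.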
1.4 Labeling
In usual classification problems, the cost of labeling is not considered. Typically, labels are obtained from an oracle, for instance, a human expert. Data streams can have different velocities; when data arrives too fast, the expert may not keep pace and some samples need to be discarded. Thus, a system is needed that can recognize which incoming samples can be omitted from labeling [8]. Different strategies can be employed to achieve this task, e.g., query synthesis or selective sampling [1].
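A minimal sketch of uncertainty-based selective sampling under a fixed labeling budget, assuming a hypothetical `classifier` callable that returns the confidence of its most likely class (names are illustrative, not from [1] or [8]):

```python
def selective_sampling(stream, classifier, threshold, budget):
    """Query the oracle only for samples the classifier is unsure about,
    until the labeling budget is exhausted."""
    queried = []
    for x, true_label in stream:
        if len(queried) >= budget:
            break                               # budget exhausted
        confidence = classifier(x)              # confidence of top class
        if confidence < threshold:              # uncertain -> worth a label
            queried.append((x, true_label))     # "ask the oracle"
    return queried
```

Samples far from the decision boundary are skipped, so the expert's effort is concentrated on the examples the current model finds ambiguous.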
2 Related Works
There are many active learning algorithms. One example is ACLStream, presented in [9]. It is a clustering-based approach where clustering is performed on every chunk of incoming data. The learning procedure is divided into two steps, the macro and the micro step. The first one ranks clusters and the latter ranks instances inside the clusters in order to extract the most representative instances for labeling. After the labeling procedure, the clusters are discarded. This algorithm uses a fixed number of clusters.
Another algorithm is MINAS [10]. Classes are represented by micro-clusters, which can be updated incrementally. This algorithm has two phases: initial training, which uses supervised learning to build a decision model, and online learning, which uses the current decision model. Unknown examples are held in a short-term memory, and when a sufficient number of examples is available, they are clustered, creating new micro-clusters.
Applying a clustering technique is similar to our approach; however, we aim at using any clustering algorithm (by parameterizing it) and at using an incrementally updated model to extract the most representative samples.
3 Method
The ALCC algorithm introduces an active learning approach for regular classification algorithms. The approach is based on using clustering algorithms to initially process the incoming samples. It is worth mentioning that the clustering algorithm \(\Lambda \) must be able to train incrementally. Samples from the data stream arrive in data chunks \(DS_{i}\), where i is the chunk index. The clustering algorithm \(\Lambda \) is trained incrementally with each chunk. After the whole chunk has been processed, clusters are extracted from the evaluated model. Then the samples from the chunk are assigned to the clusters, because the extracted clusters contain only summaries consisting of a gravity center and a radius. Some samples can fall outside all clusters because of their dissimilarity to the other samples from the chunk. After the assignment step, cluster weights can be computed. This computation is based on the average distance between points in every cluster; the idea was inspired by the point connectivity measure presented in [11]. After computing the distances, they are normalized. From each cluster, samples are randomly selected according to the cluster weight and the budget b (e.g., 10% randomly selected samples from each cluster) and their labels are requested from the oracle. Then the classification algorithm \(\Psi \) is trained with this selection.
The presented algorithm is parameterizable and has the following parameters:
- chunk length n,
- budget of samples to learn b,
- clustering algorithm \(\Lambda \),
- option to evaluate points outside of the clusters p,
- option to evaluate cluster weights d,
- classification algorithm \(\Psi \).
The idea of the algorithm is presented in Algorithm 1.
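The weighting and selection steps described above can be sketched in Python as follows. This is an illustrative rendering under our own naming, assuming clusters are already given as lists of points after the assignment step; it is not the actual MOA implementation.

```python
import math
import random

def cluster_weights(clusters):
    """Weight each cluster by the average pairwise distance of its samples
    (inspired by the point-connectivity measure of [11]); weights are
    normalized so they sum to one."""
    weights = []
    for samples in clusters:
        if len(samples) < 2:
            weights.append(0.0)
            continue
        dists = [math.dist(a, b)
                 for i, a in enumerate(samples) for b in samples[i + 1:]]
        weights.append(sum(dists) / len(dists))
    total = sum(weights)
    return [w / total if total else 1.0 / len(clusters) for w in weights]

def select_for_labeling(clusters, budget, rng=random):
    """Randomly draw samples from each cluster, in proportion to its
    weight and the overall labeling budget (a fraction in (0, 1])."""
    weights = cluster_weights(clusters)
    chunk_size = sum(len(c) for c in clusters)
    queries = []
    for samples, w in zip(clusters, weights):
        k = min(len(samples), round(budget * w * chunk_size))
        queries.extend(rng.sample(samples, k))  # ask the oracle for these
    return queries
```

The selected samples are then labeled by the oracle and passed to the classifier \(\Psi \) for training; since the weights sum to one, the total number of queries stays close to the budget fraction of the chunk.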
4 Experiments
In this section, we describe the details of the experimental study used to verify the usefulness of the proposed methods. The following subsections present the goals, the benchmark data streams used, the experimental set-up, and a discussion of the obtained results.
4.1 Objective
The main goal of the experimental evaluation is to verify the impact of the given budget and the clustering algorithm on classification accuracy, measured for each processed chunk. Naive Bayes was selected as the classification algorithm. The implementation was done using the MOA framework [12], which is written in the Java programming language. The source code of the implementation, along with the experimental results, is available on-line in the article repository (Footnote 1).
4.2 Benchmark Data Streams
Unfortunately, there are not many benchmark data streams that may be interpreted as non-stationary. We decided to use both real-life data streams and artificially generated ones. Their details are presented in Table 1.
It is worth mentioning that Zliobaite discusses in [17] the problem of autocorrelated data. Among the datasets used in this paper, one is known to be temporally autocorrelated: Electricity (elecNormNew).
To evaluate the proposed methods, we employ the test-then-train framework [12], i.e., every classifier is trained on recent data, but its evaluation (i.e., error estimation and training time) is done on the basis of the following chunk.
4.3 Algorithm Parameter Setup
For all datasets, the classifier presented in this paper was evaluated with different sets of parameters:
- budget – from 10% to 80%, with a step of 10%,
- option to compute distances between points in clusters – on and off,
- option to evaluate samples outside of the clusters – on and off,
- clustering algorithm – the algorithm performing the initial data analysis – Clustream [18], ClusTree [19], Dstream [20].
These combinations form a set of 96 different parameter groups.
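The grid size follows directly from the listed options (8 budget values × 2 distance options × 2 outside-cluster options × 3 clustering algorithms); a quick check, with variable names of our choosing:

```python
from itertools import product

budgets = [b / 100 for b in range(10, 90, 10)]   # 10% .. 80%
distance_options = [True, False]                 # intra-cluster distances
outside_options = [True, False]                  # samples outside clusters
clusterers = ["Clustream", "ClusTree", "Dstream"]

grid = list(product(budgets, distance_options, outside_options, clusterers))
```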
4.4 Results
Due to the stochastic nature of the examined algorithms, all experiments were repeated 5 times and the averaged results are presented.
The statistical analysis was done in the KEEL software [21]. The Friedman \(N\times N\) test was used with the Shaffer post-hoc method. Results are presented for two groups: with and without computing distances between points in clusters. Only the top 5 of both algorithm groups are presented in this paper. One can see that the variants using Dstream do not appear in the presented results, as they did not outperform the best parameter groups. More results are available in the paper's repository, mentioned in Sect. 4.1.
The results of the Friedman test, where the last column shows the rank, are presented in Tables 2 and 3 (Table 4).
The results of the Shaffer post-hoc test between the presented algorithm and the baseline method are depicted in Table 5.
4.5 Analysis
For every dataset, one can see that the performance is very similar to the baseline algorithm. However, the presented algorithm uses less labeled data to train the model.
Unfortunately, the statistical analysis yields sobering conclusions. For the variant without computing distances between points in clusters, no statistically significant differences with respect to the baseline algorithm are present.
For the latter variant, there are statistically significant differences. However, the ranked results show that this variant of the algorithm does not outperform the baseline algorithm.
The absence of statistically significant differences does not necessarily mean bad results. Using less labeled data to train the classifier, as specified by the budget for labeling instances, can produce a classifier very similar to the baseline one (Fig. 1).
4.6 Lessons Learnt
To sum up, a few observations can be drawn:
- Employing active learning techniques can lead to a model similar to the one created by learning from all incoming samples, maintaining similar accuracy but with a reduced number of labeled instances.
- Data from a stream can be pre-processed in many ways; in this paper, the query-by-clustering method was used. The created clusters can be used to group samples. However, this is not the same as labeling the data. Highly dense clusters contain potentially similar samples that could share the same label, whereas less dense clusters, with greater distances between points, can potentially contain samples of different classes.
5 Conclusions
A novel active learning classification algorithm has been proposed. The main idea is to pre-process the data with a clustering algorithm, which gathers similar objects into groups. Of course, one has to keep in mind that such grouping is not the same as labeling the data. The results of the experiments show that using fewer labeled instances to train the classifier can yield a model very similar to one trained on the whole data stream. However, considering cluster constraints, such as the distance between points in clusters in this case, did not show an improvement in the number of requested labels while maintaining an accuracy similar to the baseline classifier.
The presented algorithm is open for modifications; in future research, we will focus on:
- developing a method that automatically adjusts the budget according to changes in the accuracy of the classifier,
- employing more sophisticated methods to detect incoming concept drifts.
References
Settles, B.: Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 6, no. 1, pp. 1–114 (2012)
Gama, J.: Knowledge Discovery from Data Streams, 1st edn. Chapman & Hall/CRC, Boca Raton (2010)
Gama, J., Zliobaite, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on concept drift adaptation. ACM Comput. Surv. 46(4), 44:1–44:37 (2014). http://doi.acm.org/10.1145/2523813
Domingos, P., Hulten, G.: Mining high-speed data streams, pp. 71–80. ACM Press (2000)
Tsymbal, A.: The problem of concept drift: definitions and related work. Technical report, Trinity College Dublin (2004)
Gama, J., Gaber, M.: Learning from Data Streams: Processing Techniques in Sensor Networks. Springer, Heidelberg (2007). https://doi.org/10.1007/3-540-73679-4
Zliobaite, I., Pechenizkiy, M., Gama, J.: An overview of concept drift applications. In: Japkowicz, N., Stefanowski, J. (eds.) Big Data Analysis: New Algorithms for a New Society, pp. 91–114. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-26989-4_4
Gaber, M.M., Zaslavsky, A., Krishnaswamy, S.: Mining data streams: a review. SIGMOD Rec. 34(2), 18–26 (2005). http://doi.acm.org/10.1145/1083784.1083789
Ienco, D., Bifet, A., Žliobaitė, I., Pfahringer, B.: Clustering based active learning for evolving data streams. In: Fürnkranz, J., Hüllermeier, E., Higuchi, T. (eds.) DS 2013. LNCS (LNAI), vol. 8140, pp. 79–93. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40897-7_6
de Faria, E.R., de Leon Ferreira Carvalho, A.C.P., Gama, J.: MINAS: multiclass learning algorithm for novelty detection in data streams. Data Min. Knowl. Discov. 30(3), 640–680 (2016). https://doi.org/10.1007/s10618-015-0433-y
Kremer, H., et al.: An effective evaluation measure for clustering on evolving data streams. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2011, pp. 868–876. ACM, New York (2011). http://doi.acm.org/10.1145/2020408.2020555
Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B.: MOA: massive online analysis. J. Mach. Learn. Res. 11, 1601–1604 (2010). http://portal.acm.org/citation.cfm?id=1859903
Dheeru, D., Taniskidou, E.K.: UCI machine learning repository (2017). http://archive.ics.uci.edu/ml
Blackard, J.A., Dean, D.J.: Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Comput. Electron. Agricult. 24, 131–151 (1999)
Harries, M., Wales, N.S.: Splice-2 comparative evaluation: electricity pricing. Technical report (1999)
Zhu, X.H.: Stream data mining repository (2010). http://www.cse.fau.edu/~xqzhu/stream.html
Zliobaite, I.: How good is the electricity benchmark for evaluating concept drift adaptation. CoRR, abs/1301.3524 (2013). http://arxiv.org/abs/1301.3524
Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for clustering evolving data streams. In: Proceedings of the 29th International Conference on Very Large Data Bases, VLDB 2003, vol. 29, pp. 81–92. VLDB Endowment (2003). http://dl.acm.org/citation.cfm?id=1315451.1315460
Kranen, P., Assent, I., Baldauf, C., Seidl, T.: The ClusTree: indexing micro-clusters for anytime stream mining. Knowl. Inf. Syst. 29(2), 249–272 (2011)
Chen, Y., Tu, L.: Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2007, pp. 133–142. ACM, New York (2007). http://doi.acm.org/10.1145/1281192.1281210
Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S.: Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Multiple-Valued Log. Soft Comput. 17(2–3), 255–287 (2011). http://dblp.uni-trier.de/db/journals/mvl/mvl17.html#Alcala-FdezFLDG11
Acknowledgment
This work is supported by the statutory funds of the Department of Systems and Computer Networks, Faculty of Electronics, Wrocław University of Science and Technology.
Zgraja, J., Gama, J., Woźniak, M. (2019). Active Learning by Clustering for Drifted Data Stream Classification. In: Monreale, A., et al. ECML PKDD 2018 Workshops. ECML PKDD 2018. Communications in Computer and Information Science, vol 967. Springer, Cham. https://doi.org/10.1007/978-3-030-14880-5_7