
1 Introduction

In a nutshell, this work addresses an active learning approach to decrease the labeling cost of data stream classifiers. We propose to employ so-called query by clustering in the training of a new classifier. The main contributions of this work are as follows:

  • Presentation of a new active data stream classifier learning method.

  • Experimental evaluation of the discussed approaches on the basis of diverse benchmark datasets.

The structure of this article is as follows. Firstly, we describe the proposition of a novel active learning algorithm dedicated to the classification task of drifting data streams. Then we focus on the experimental evaluation of the proposed approaches. The final conclusions and propositions of future works are given thereafter.

1.1 Classification

Classification is an important task among those studied in the machine learning field. It is a supervised learning problem that consists in assigning classes to observations based on their attributes. This problem has been a field of study for quite a long time. Nevertheless, the application of classification in some areas can be questioned. In the contemporary world, data is generated continuously and cannot be treated as in classical classification problems, where the whole data set is known in advance. Classification systems should create models that are able to adapt to the state of the streams, which can change over time.

1.2 Data Streams

From the statistical point of view, data streams can be described as stochastic processes. An important remark is that such processes are continuous and events are independent [2]. Because of the nature of data streams, one has to keep in mind that a different approach is needed than for traditional data sets. Samples arrive on-line and their number is potentially unbounded, so the built model has to be updated incrementally to stay current. Mechanisms for keeping the model up to date include, for instance, windowing or forgetting techniques [2, 3]. Another important remark is that data can arrive at different velocities; when a stream generates data at high velocity, sampling algorithms are needed [4].
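As an illustration of such a forgetting mechanism, a fixed-size sliding window can be sketched in a few lines. This is a minimal Python sketch, not part of any stream mining library; the class name and window size are arbitrary illustrative choices:

```python
from collections import deque

class SlidingWindow:
    """Fixed-size window over a stream: newly arriving samples
    automatically push out (i.e., forget) the oldest ones."""
    def __init__(self, max_size):
        self.buffer = deque(maxlen=max_size)

    def add(self, sample):
        # deque with maxlen silently drops the oldest element when full
        self.buffer.append(sample)

    def contents(self):
        return list(self.buffer)

window = SlidingWindow(max_size=3)
for x in range(5):          # five samples arrive from the stream
    window.add(x)
print(window.contents())    # only the three most recent remain: [2, 3, 4]
```

A model trained only on the window contents thus "forgets" old concepts as the stream evolves.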

Keeping the model fresh is an important task, but sudden changes, called concept drifts, can appear in a data stream. When such an event occurs, the model has to be rebuilt, as the existing one reflects the past behavior of the data and is no longer applicable.

1.3 Concept Drift

The aforesaid behavior, when changes in the data appear, is called concept drift [2, 5]. It can lead to a situation where the existing model is no longer relevant, although sometimes it can remain partially accurate. Many real problems face such behavior, such as sensor network analysis [6], fraud detection [7], news categorization [7], or spam detection [7].

There are different methods of classifying concept drift. One can differentiate concept drifts based on the velocity of change: sudden and gradual [5]. Another taxonomy distinguishes two types of drift by their impact on the posterior probabilities: real concept drift, which changes them, and virtual concept drift, which affects only the input data distribution.

1.4 Labeling

In usual classification problems, the cost of labeling is not considered. Typically, labels are obtained from an oracle, for instance, a human expert. Data streams can have different velocities; when data arrives too fast, the expert may not keep pace and some samples need to be discarded. Thus, there is a need for a system that can recognize which incoming samples can be omitted from labeling [8]. Different strategies can be employed to achieve this task, e.g., query synthesis or selective sampling [1].
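As a toy illustration of selective sampling, an uncertainty-based rule can decide which samples are worth sending to the oracle. This is a hedged Python sketch; the function name and the 0.6 confidence threshold are arbitrary illustrative choices, not a specific method from [1]:

```python
def selective_sampling(probabilities, threshold=0.6):
    """Query the oracle only for samples whose top predicted class
    probability falls below a confidence threshold (illustrative rule)."""
    return [i for i, p in enumerate(probabilities) if max(p) < threshold]

# four incoming samples with their predicted class distributions
probs = [(0.9, 0.1), (0.55, 0.45), (0.7, 0.3), (0.5, 0.5)]
to_label = selective_sampling(probs)
print(to_label)  # indices of the uncertain samples: [1, 3]
```

Confident predictions (indices 0 and 2) are skipped, reducing the number of labeling requests.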

2 Related Works

There are many active learning algorithms. One example is ACLStream, presented in [9]. It is a clustering-based approach in which clustering is performed on every chunk of incoming data. The learning procedure is divided into two steps, macro and micro. The former ranks clusters and the latter ranks instances inside clusters in order to extract the most representative instances for labeling. After the labeling procedure, the clusters are discarded. This algorithm uses a fixed number of clusters.

Another algorithm is MINAS [10], in which classes are represented by micro-clusters that can be incrementally updated. This algorithm has two phases: the first is initial training, which uses supervised learning to build a decision model; the second consists of online learning using the current decision model. Unknown examples are held in a short-term memory and, when there is a sufficient number of them, they are clustered, creating new micro-clusters.

Applying a clustering technique is similar to our approach; however, we aim at allowing any clustering algorithm (by parameterizing it) and at using an incrementally updated model to extract the most representative samples.

3 Method

The concept of the ALCC algorithm introduces an active learning approach for regular classification algorithms. The approach is based on using clustering algorithms to initially process incoming samples. It is worth mentioning that the clustering algorithm \(\Lambda \) must be able to train incrementally. Samples from the data stream arrive in data chunks \(DS_{i}\), where i is the chunk index. The clustering algorithm \(\Lambda \) is trained incrementally with each chunk. After the whole chunk has been processed, clusters are extracted from the evaluated model. Then the samples from the chunk are assigned to the clusters, because the extracted clusters contain only summaries consisting of a gravity center and a radius. Some samples can fall outside any cluster because of their dissimilarity to the other samples in the chunk. After the assignment, cluster weights can be computed. This computation is based on the average distance between points in every cluster; the idea was inspired by the point connectivity measure presented in [11]. After computing the distances, they are normalized. From each cluster, samples are randomly selected according to the cluster weight and the budget b (e.g., 10% randomly selected samples from each cluster), and their labels are requested from the oracle. Then the classification algorithm \(\Psi \) is trained with this selection.
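The weight computation described above can be sketched as follows. This is an illustrative Python sketch of the average-distance idea with sum-to-one normalization; the exact normalization used by ALCC may differ:

```python
import math
from itertools import combinations

def avg_pairwise_distance(cluster):
    """Mean Euclidean distance between all pairs of points in one cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def cluster_weights(clusters):
    """Normalize the per-cluster average distances so they sum to 1."""
    dists = [avg_pairwise_distance(c) for c in clusters]
    total = sum(dists)
    if total == 0:
        return [1.0 / len(clusters)] * len(clusters)
    return [d / total for d in dists]

dense = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]    # tightly packed points
sparse = [(5.0, 5.0), (7.0, 5.0), (5.0, 7.0)]   # widely spread points
w = cluster_weights([dense, sparse])
```

Under this scheme, the sparse cluster receives the larger weight, since its points lie further apart on average.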

The presented algorithm is parameterizable and has the following parameters:

  • chunk length n,

  • budget of samples to learn b,

  • clustering algorithm \(\Lambda \),

  • option to evaluate points outside of the clusters p,

  • option to evaluate cluster weights d,

  • classifying algorithm \(\Psi \).

The idea of the algorithm is presented in Algorithm 1.

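The budget-driven selection step can be sketched as follows. This is a hypothetical Python reading of the selection rule, where each cluster's sample count is scaled by the budget and its normalized weight; the exact scaling in ALCC may differ, and the authors' implementation is written in Java on top of MOA:

```python
import random

def select_for_labeling(clusters, weights, budget, rng=random):
    """Randomly pick, from each cluster, a subset whose size grows with
    the labeling budget and the normalized cluster weight. This scaling
    is one possible reading of the selection rule, not the exact ALCC
    formula."""
    selected = []
    for cluster, weight in zip(clusters, weights):
        k = round(len(cluster) * budget * weight * len(clusters))
        k = min(max(1, k), len(cluster))    # label at least one sample
        selected.extend(rng.sample(list(cluster), k))
    return selected

clusters = [list(range(10)), list(range(100, 120))]
queried = select_for_labeling(clusters, [0.5, 0.5], budget=0.1,
                              rng=random.Random(0))
print(len(queried))  # 1 sample from the first cluster, 2 from the second
```

The selected samples are the ones sent to the oracle; the classifier \(\Psi \) is then trained on the returned labels.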

4 Experiments

In this section, we describe the details of the experimental study used to verify the usefulness of the proposed methods. The following subsections present the goals, the benchmark data streams used, the experimental set-up, as well as a discussion of the obtained results.

4.1 Objective

The main goal of the experimental evaluation is to verify the impact of the given budget and clustering algorithm on classification accuracy, measured for each processed chunk. Naive Bayes was selected as the classification algorithm. The classifier was implemented using the MOA framework [12], which is written in the Java programming language. The source code of the implementation, along with the experimental results, is available on-line in the article repository (see Footnote 1).

4.2 Benchmark Data Streams

Unfortunately, there are not many benchmark data streams that may be interpreted as non-stationary ones. We decided to use both real-life data streams and artificially generated ones. Their details are presented in Table 1.

Table 1. Data streams used for evaluation

It is worth mentioning that Zliobaite presents in [17] the problem of autocorrelated data. The present paper includes a dataset which is known to be temporally autocorrelated: Electricity (elecNormNew).

To evaluate the proposed methods, we employ the test-then-train framework [12], i.e., every classifier is trained on recent data, but its evaluation (i.e., error estimation and training time) is done on the basis of the following chunk.

4.3 Algorithm Parameter Setup

For all datasets, the classifier presented in this paper was evaluated with different sets of parameters:

  • budget – from 10% to 80%, with a step of 10%,

  • option to compute distances between points in clusters – on and off,

  • option to evaluate samples outside of the clusters – on and off,

  • clustering algorithm – the algorithm used to perform the initial data analysis: Clustream [18], ClusTree [19], Dstream [20].

These combinations created a set of 96 different parameter groups.
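The size of this grid is easy to verify with a few lines (illustrative Python; the option names are shorthand for the parameters listed above):

```python
from itertools import product

budgets = [round(0.1 * i, 1) for i in range(1, 9)]  # 10% .. 80%
distance_option = [True, False]                      # distances in clusters
outside_option = [True, False]                       # samples outside clusters
clusterers = ["Clustream", "ClusTree", "Dstream"]

grid = list(product(budgets, distance_option, outside_option, clusterers))
print(len(grid))  # 8 * 2 * 2 * 3 = 96 parameter groups
```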

4.4 Results

Due to the stochastic nature of the examined algorithms, all experiments were repeated 5 times and the averaged results are presented.

The statistical analysis was done in the KEEL software [21]. The Friedman \(N\times N\) test was used, with the Shaffer post-hoc method. Results are presented below for two groups: with and without computing distances between points in clusters. Only the top 5 of both algorithm groups are presented in this paper. One can see that variants which used Dstream do not appear in the presented results, as they did not outperform the best parameter groups. More results are available in the paper's repository, mentioned in Sect. 4.1.

The results of the Friedman test, where the last column shows the rank, are presented in Tables 2 and 3 (Table 4).

Table 2. Ranked tests of the presented algorithm, along with parameter groups. Variant with computing distances between points in clusters.
Table 3. Ranked tests of the presented algorithm, along with parameter groups. Variant without computing distances between points in clusters.
Table 4. Presentation of the results for the best combinations of parameters

Results of the Shaffer post-hoc test between the presented algorithm and the baseline method are depicted in Table 5.

Table 5. Shaffer post-hoc comparison between the ALCC and baseline algorithm.
Fig. 1. Stream evaluations for the best set of parameters of the ALCC algorithm versus the Naive Bayes evaluation (Color figure online)

4.5 Analysis

For every dataset, one can see that the evaluation is very similar to that of the baseline algorithm. However, the presented algorithm uses less labeled data to train the model.

Unfortunately, the statistical analysis yields disturbing conclusions. For the variant without computing distances between points in clusters, there are no statistically significant differences with respect to the baseline algorithm.

For the variant with computing these distances, there are statistically significant differences. However, the ranked results show that this variant of the algorithm does not outperform the baseline algorithm.

The lack of statistically significant differences does not necessarily mean bad results. Using less labeled data to train the classifier, as specified by the budget for labeling instances, can produce a classifier very similar to the baseline one (Fig. 1).

4.6 Lessons Learnt

To sum up, a few observations can be drawn:

  • Employing active learning techniques can lead to a model similar to the one created by learning from all incoming samples, maintaining similar accuracy while reducing the number of labeled instances.

  • Data from a stream can be pre-processed in many ways; in this paper, the query by clustering method was used. The created clusters can be used to group samples. However, this is not the same as labeling the data. Highly dense clusters potentially contain similar samples which could share the same label, while less dense clusters, with larger distances between points, can potentially contain samples of different classes.

5 Conclusions

A novel active learning classification algorithm has been proposed. The main idea of the data pre-processing is to use a clustering algorithm which gathers similar objects into groups. Of course, one has to keep in mind that such groups are not the same thing as labeled data. The results of the experiments show that using fewer labeled instances to train the classifier can yield a model very similar to one trained on the whole data stream. However, considering cluster constraints, in this case the distance between points in clusters, did not exhibit an improvement in the number of requested labels while maintaining accuracy similar to the baseline classifier.

The presented algorithm is open to modifications; in future research we will focus on:

  • Developing a method which will automatically adjust the budget according to changes in the classifier's accuracy,

  • Employing more sophisticated methods to detect incoming concept drifts.