1 Introduction

With the digital transformation of the Industry 4.0 era, the availability of time series data has increased dramatically thanks to process sensing. The rise of real-time connected data presents a great opportunity along with technical challenges. One of these is modeling industrial processes whose data change constantly, where detecting behavioral shifts is crucial. Industrial processes are non-stationary dynamic processes due to various factors such as degradation or operational failures. Those process changes are reflected in the data, but are rarely detected by the model. Models are most commonly trained offline on historical data, which normally causes their performance to decrease under changing conditions.

According to recent studies, between 80% and 87% of big data projects fail to generate sustainable solutions [1]. Machine learning (ML) algorithms in the real world, and particularly in industry, operate in dynamic environments with non-stationary data, where they need to be able to detect any drift or change in the data distribution and adapt or update the model in order to maintain performance [2, 3].

The change in the data distribution over time is known as concept drift and as defined in [4], a concept drift means that the statistical properties of the target variable change over time in unforeseen ways. The concept drift occurs when the input data changes, which could correspond to a change in the distribution of the data or to a change in the relation between the model input and output. As a consequence, the old data-driven model’s performance may decay, and hence it may not be suitable for the new data.

There are different types of concept drift according to [4, 5]:

  • Sudden/Abrupt drift: a new concept occurs suddenly

  • Gradual drift: a new concept gradually replaces the old one over a period of time

  • Incremental drift: an old concept incrementally changes to a new concept

  • Recurring concept: an old concept may occur after some time

In this context, [6] provides a comprehensive survey that discusses the research constraints and the current state of the art, as well as an updated overview of the different stream mining tasks, such as classification, regression, clustering and frequent pattern mining.

As stated by Barros et al. [7], it is quite usual to use a concept drift detection method together with a base learner. Using the predictions made by the classifier, the drift detector decides whether or not there has been any change in the data distribution. In the last few years, several concept drift detection methods have been proposed in the literature. Based on the classification made by Lu et al. [4], there are three main categories depending on the applied statistical test, namely: error rate-based drift detection, data distribution-based drift detection and multiple hypothesis test drift detection. The most well-known concept drift detectors according to the literature [4, 7, 8] are: Drift Detection Method (DDM) [9], Early Drift Detection Method (EDDM) [10], Adaptive Windowing (ADWIN) [11], Statistical Test of Equal Proportions (STEPD) [12], Paired Learners (PL) [13] and EWMA for concept drift detection (ECDD) [14]. In this work, an original approach, consisting of the use of Common Spatial Pattern (CSP) along with ML for discriminating between different concepts, is proposed for sudden drift detection in multivariate time series. CSP is a widely used method in electroencephalography (EEG) systems to optimally distinguish different classes [15].

Taking CSP as a basis, the proposed approach attempts to differentiate between two concepts in a multivariate signal where a concept drift has occurred. CSP was presented as an extension of Principal Component Analysis (PCA), but while PCA maximizes the variance of the projected data, CSP maximizes the difference between the variances of two classes. In the case of PCA, the principal component corresponds to the direction of the maximum variability of the data, but it does not assure maximum discrimination between classes, whereas CSP is based on a matrix decomposition that maximizes the difference between classes.

The presented approach—CatSight—aims to apply this technique in order to effectively detect sudden drifts in industrial process data. This type of drift is the most common type of concept drift in industrial processes, along with gradual drift [16]. As a matter of fact, the objective of the article is to apply the proposed method to industrial data, thus showing the applicability of CatSight to current problems that may arise in the behavioral changes of machines currently used in industry.

In order to verify how adequate the proposed approach is, a comparison is made with PCA-ML, another feature extractor, and with the conventional ML algorithms. Moreover, to check the suitability of the method, the presented approach has been tested on two publicly available databases, and obtained results compared with state of the art ones; then it has been used on real world industrial data.

Six different base classifiers have been used in the study, and a comparison is made between the conventional approach and the new approach presented in this paper (CatSight). The obtained results confirm the adequacy of the proposed method.

This paper is organized as follows. Section 2 briefly introduces some drift detection methods found in the literature. Section 3.2 introduces the theoretical aspects of the Common Spatial Pattern method. Section 4 details the new approach presented in this work. Section 5 describes the preprocessing steps followed for data preparation, the ML algorithms used for classification and the datasets used. Section 6 presents and evaluates the accuracies obtained on the different datasets used in the study. Section 7 presents the results obtained with real industrial data, which is used to validate the proposed approach. Section 8 summarizes the results obtained. Finally, Sect. 9 draws the conclusions and suggests future work.

2 Related work

Concept drift is a phenomenon that leads to degradation of machine learning performance due to changes in the input data and/or the target variable. In recent years, several works that try to deal with this drawback can be found in the literature.

2.1 Machine learning: supervised classification

Zenisek et al. [17] present a machine learning approach that analyzes a continuous data flow to detect drift behavior, and thereby identify the degradation and malfunction of a system. To do this, a regression model is used, and the estimate made by the model is compared with the real value, based on the idea that an increased prediction error could indicate a change in concept.

Saurav et al. [18] describe a model based on Recurrent Neural Networks (RNNs) for the detection of anomalies in time series. RNNs are used to make multi-step predictions of the time series, and the prediction errors are used to update the RNN model, as well as to detect anomalies and points of change. Thus, a prediction with a large error indicates anomalous behavior or a change of concept in the normal behavior of the time series. Veloso et al. [19] present Single-pass Self Parameter Tuning (SSPT), an extension of Self Parameter Tuning (SPT), a methodology in which the hyperparameters of the model are automatically readjusted when there is a change in concept.

Other authors present classifier ensembles to deal with concept drift detection; in [20] a comparison of several ensemble algorithms is presented, in which 10 different detectors are used; on the other hand, Babüroğlu et al. [21] propose an on-line real-time detector by means of a hybridization of detectors; Wang et al. [22] present a robust novelty detection framework based on ensemble learning.

2.2 Machine learning: unsupervised classification

Liu et al. [23] propose equal intensity k-means space partitioning (EI-kMeans), a heuristic method to improve the sensitivity of drift detection. This approach consists of three components: a greedy equal intensity cluster initialization algorithm, an intensity-based cluster amplify-shrink algorithm to unify the cluster intensity ratio, and a Pearson’s chi-square test-based concept drift detection algorithm. This method is a modification of the K-means algorithm that acts as a drift detector on multi-cluster data. Santos et al. [24] present an empirical method, based on differential evolution, to tune concept drift detectors in order to improve the obtained accuracy.

Sethi et al. [2] propose the unsupervised Margin Density Drift Detection (MD3) algorithm which tracks the number of samples in the uncertainty region of the classifier, as a metric to detect drift. If a variation is found, the algorithm is retrained.

2.3 Statistics based approaches

In [7], a concept drift detection method based on the Wilcoxon rank sum statistical test is presented. De Lima Cabral et al. [25] also present a concept drift detector based on statistical tests; in this case, the proposed approach is based on Fisher’s Exact Test.

Liu et al. [26] propose a drift detector based on Angle Optimized Global Embedding (AOGE) and Principal Component Analysis (PCA). AOGE and PCA analyze the projection angle and variance consecutively, which are subsequently used to detect changes in the objective function.

PCA has been used successfully in studies of dimensionality reduction for multivariate time series analysis with highly correlated variables. For instance, [27] use the PCA to project the high-dimensional data into a principal component projection space before feeding the data to a GAN model. Similarly [28] use PCA to convert multivariate time series datasets into a univariate time series.

The Common Spatial Patterns (CSP) algorithm, first presented in [29] as Fukunaga–Koontz Transform, is a mathematical technique used in signal processing, mainly used in Brain Computer Interface (BCI) applications for electroencephalography (EEG) systems [30,31,32,33,34]. CSP is based on a matrix decomposition method that maximizes the power difference of the two-class signal. This is achieved by maximizing the variance in one class while minimizing the variance of the other class.

Several studies have been conducted by means of CSP. Rodríguez-Moreno et al. [35] present the application of CSP to people action recognition in social robotics. Using this method, a better discrimination between two actions is obtained; this technique allows the signal components that differentiate the actions the most to be extracted. The same authors also present CSP as a feature extraction method that improves the classification task of video activity recognition [36].

In this work, a new approach is presented to deal with temporal data, a new concept drift detection method by means of Common Spatial Pattern. The aim of this algorithm is to filter the data belonging to two populations using the variances to discriminate the signals corresponding to two different targets, finding an optimum spatial filter which reduces the dimensionality of the original signals.

A brief introduction of the Common Spatial Pattern is presented in the following section; a more in-detail presentation can be found in [29].

3 Theoretical aspects

In this paper a new approach to deal with Concept Drift detection is presented. Two main concepts need to be used: Time Series and Common Spatial Patterns.

3.1 Time series

When a variable is measured sequentially in time at a fixed interval, known as the sampling interval, the resulting data form a time series. The term univariate time series refers to a time series that consists of a single variable recorded sequentially in time, and multivariate time series is used when observations of multiple dependent variables are received at each time step. A time series of length n can be represented by \(\lbrace x_t: t=1, \dots , n \rbrace =\lbrace x_1,x_2, \dots , x_n\rbrace\), which consists of n values sampled at discrete times \(1,2,\dots , n\). When all the observations between a specific start and end time are extracted from a time series, the term time (series) window is used to refer to them.

The special structure of time series produces unique challenges for machine learning researchers. A consideration due to the special nature of time series is the fact that individual observations are typically highly related with their neighbours in time. Indeed, it is this property that makes most time series excellent candidates for dimensionality reduction.

The main features of many time series are trends and seasonal variations that can be modeled deterministically with mathematical functions of time. A systematic change in a time series that does not appear to be periodic is known as the trend, and the repeating pattern within any fixed period is called seasonality. A stationary time series is one whose statistical properties, such as the mean and the variance, do not change over time.

It is worth mentioning that Time Windows are the most common way to deal with Time Series analysis and classification. In this paper two very different types of windows are used:

  • Short Time Windows (5 or 10 time intervals each) which reflect the temporal evolution are used as individual cases for the Machine Learning classification task.

  • A set of the previous time windows is used as a whole dataset; in this case, n Time Windows before and n after the Concept Drift are selected, and labeled as before and after for classification purposes. As it will be shown in the Experimental Setup subsection, it is important to notice the difference between Time Period and Time Window. Both concepts are to be used during the experimental phase, and they can be briefly described as follows:

    • Time Window: this refers to the consecutive time points that are considered in order to characterize a short time slot; it is used as a summary of the slot itself. Each point could be used individually (1 size Time Window), but it could be computationally expensive, and it may not be that appropriate to show the whole tendency of the series.

    • Time Period: this is the time elapsed from the beginning to the end of a time series; it refers to all the time steps considered in the performed experiment, and it is composed of several time windows before and the same amount after.

    In this paper, a set of small time windows immediately before and immediately after the concept drift is selected, and used to verify the adequateness of the proposed approach by means of a classification process in which the main characteristics of the small time windows are used as descriptors of each of them. The main novelty of the paper is to use Common Spatial Pattern to obtain a new vision of the data.
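The window selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `windows_around_drift`, its parameters and the toy data are hypothetical, windows are taken as non-overlapping, and each window is stored as an \(F \times N\) matrix to match the notation of Sect. 3.2.

```python
import numpy as np


def windows_around_drift(series, drift_idx, period, window):
    """Cut `period` samples before and after `drift_idx` into short windows.

    series: array of shape (n_samples, n_features).
    Returns X with shape (n_windows, n_features, window) and labels y,
    where 0 means "before the drift" and 1 means "after the drift".
    """
    series = np.asarray(series, dtype=float)
    if drift_idx - period < 0 or drift_idx + period > len(series):
        raise ValueError("time period does not fit inside the series")
    n_per_side = period // window  # leftover samples on each side are dropped
    X, y = [], []
    for label, start in ((0, drift_idx - period), (1, drift_idx)):
        for k in range(n_per_side):
            chunk = series[start + k * window:start + (k + 1) * window]
            X.append(chunk.T)  # store each window as features x time steps
            y.append(label)
    return np.stack(X), np.array(y)


# Toy series (made up): 1000 time steps, 4 variables, drift assumed at t = 500.
rng = np.random.default_rng(0)
toy = rng.normal(size=(1000, 4))
X, y = windows_around_drift(toy, drift_idx=500, period=360, window=5)
print(X.shape, int(y.sum()))
```

Each row of `X` is then one case of the classification problem, and `y` carries the before/after label.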

3.2 Common spatial pattern description

Common Spatial Pattern method is based on matrix decomposition that maximizes the power difference of the two-class signal. The CSP algorithm requires the information of the class to which the samples belong to calculate the transformation matrix. Thus, CSP tries to find the optimum spatial filters, considering two classes, which maximize the variances of the filtered signals of one of the classes while keeping the variances constant for the other, this way maximizing the difference of the variances between targets. Let \(X_{1i}\), \(i=1,...,n_1\) and \(X_{2i}\), \(i=1,...,n_2\) be the signals belonging to two different targets and each element of those lists is a \(F \times N\) matrix, having the value of F signals for \(t=1,\dots ,N\) time periods. The CSP algorithm calculates a matrix W with optimum spatial filters to transform the original signals \(X_{ki}\) (1), where \(k=1,2\).

$$\begin{aligned} Z = W^TX_{ki} \end{aligned}$$
(1)

The first vector of Z contains high variance for the first class signals (\(k=1\)) and low variance for the second class signals (\(k=2\)), while the last vector contains the opposite, low variance for the first class signals and high variance for the second class signals.

First, to obtain the W matrix, the mean non-centered covariance matrices are calculated (2).

$$\begin{aligned} R_1 = \dfrac{1}{n_1}\sum ^{n_1}_{i=1} X_{1i}X^T_{1i}; \quad R_2 = \dfrac{1}{n_2}\sum ^{n_2}_{i=1} X_{2i}X^T_{2i} \end{aligned}$$
(2)

Then, applying the generalized eigen decomposition of the covariance matrices (3), the \(W = ({\textbf {w}}_1,\dots ,{\textbf {w}}_F) \in \mathbb {R}^{F\times F}\) projections are calculated, which maximize the function indicated in (4).

$$\begin{aligned} R_1{\textbf {w}} = \lambda R_2{\textbf {w}} \end{aligned}$$
(3)
$$\begin{aligned} \max _{W} \quad & Tr(W^TR_1W) \\ \text {subject to} \quad & W^T(R_1+R_2)W=I \end{aligned}$$
(4)
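The link between the eigenproblem (3) and the constraint in (4) can be made explicit with a short derivation. It is added here as a clarification; the auxiliary symbol \(\mu\) is not part of the original notation. Adding \(\lambda R_1{\textbf {w}}\) to both sides of (3) gives \((1+\lambda )R_1{\textbf {w}} = \lambda (R_1+R_2){\textbf {w}}\), so that, for \(\lambda \ge 0\),

$$\begin{aligned} R_1{\textbf {w}} = \mu (R_1+R_2){\textbf {w}}, \qquad \mu = \frac{\lambda }{1+\lambda } \end{aligned}$$

Hence the generalized eigenvectors of the pair \((R_1, R_2)\) are also those of the pair \((R_1, R_1+R_2)\). Under the normalization \({\textbf {w}}^T(R_1+R_2){\textbf {w}}=1\), it follows that \({\textbf {w}}^TR_1{\textbf {w}}=\mu\) and \({\textbf {w}}^TR_2{\textbf {w}}=1-\mu\), so a vector with \(\mu\) close to 1 captures high variance for the first class and low variance for the second, and a vector with \(\mu\) close to 0 does the opposite.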

The first and last q vectors are chosen, \(W_{CSP} = ({\textbf {w}}_1,\dots , {\textbf {w}}_q, {\textbf {w}}_{F-q+1},\dots ,{\textbf {w}}_F)\), where the first q vectors (\(j=1,\dots ,q\)) obtain large variability for signals that belong to class \(k=1\) (\(X^T_{1i}{\textbf {w}}_j\)) and low variability for signals that belong to class \(k=2\) (\(X^T_{2i}{\textbf {w}}_j\)), and the opposite is obtained with the last q vectors (\(j=F-q+1,\dots ,F\)). Once the dimensionality of the original signals has been reduced using the \(W_{CSP}\) filters, the features are extracted by calculating the variance of each of the output signals Z. Usually the logarithm of the variances is used; hence, the feature vector value for the p-th component of the i-th trial is the logarithm of the normalized variance (5). The feature vector has dimensionality 2q, where q indicates how many vectors from each end of the spatial filter matrix are used in the projection. Specifically, the first q and last q of the aforementioned generalized eigenvectors are used, which yield the largest variance for one class and, simultaneously, the smallest variance for the other class.

$$\begin{aligned} f^i_p = \log \left( \frac{var({\textbf {w}}_p^TX_{ki})}{\sum ^{2q}_{j=1}var({\textbf {w}}_j^TX_{ki})} \right) \end{aligned}$$
(5)
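The steps in (1) to (5) can be sketched with NumPy and SciPy as follows. This is an illustrative sketch under stated assumptions rather than the implementation used in the study: the function names `fit_csp` and `csp_features` and the toy data are hypothetical, and the eigenproblem is solved for the pair \((R_1, R_1+R_2)\), which shares its eigenvectors with (3) and lets `scipy.linalg.eigh` enforce the normalization in (4).

```python
import numpy as np
from scipy.linalg import eigh


def fit_csp(X1, X2, q):
    """Return W_CSP with shape (F, 2q).

    X1, X2: arrays of shape (n_trials, F, N), one array per class.
    """
    # Eq. (2): mean non-centered covariance matrix of each class.
    R1 = (X1 @ X1.transpose(0, 2, 1)).mean(axis=0)
    R2 = (X2 @ X2.transpose(0, 2, 1)).mean(axis=0)
    # Generalized symmetric eigenproblem R1 w = mu (R1 + R2) w.
    # eigh scales the eigenvectors so that W^T (R1 + R2) W = I.
    mu, W = eigh(R1, R1 + R2)
    W = W[:, np.argsort(mu)[::-1]]  # largest class-1 variance first
    return np.hstack([W[:, :q], W[:, -q:]])  # first q and last q filters


def csp_features(W_csp, X):
    """Eq. (1) and Eq. (5): project each trial, then log normalized variance."""
    Z = np.einsum("fp,tfn->tpn", W_csp, X)  # Z = W^T X for every trial
    var = Z.var(axis=2)
    return np.log(var / var.sum(axis=1, keepdims=True))


# Toy data (made up): class 1 is noisier on channel 0, class 2 on channel 3.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(30, 4, 10)) * np.array([3.0, 1.0, 1.0, 1.0])[None, :, None]
X2 = rng.normal(size=(30, 4, 10)) * np.array([1.0, 1.0, 1.0, 3.0])[None, :, None]
W_csp = fit_csp(X1, X2, q=1)
f1, f2 = csp_features(W_csp, X1), csp_features(W_csp, X2)
print(f1.mean(axis=0), f2.mean(axis=0))
```

On such data the first feature should on average be larger for class 1 and the last feature larger for class 2, which is the separation a downstream classifier exploits. Note that \(R_1+R_2\) must be positive definite for the solver to work, which requires enough samples relative to the number of signals F.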

4 Proposed approach

In this paper a new approach is proposed to deal with Concept Drift detection: in the first step, relevant temporal features are projected into a new space by means of a statistical approach called Common Spatial Patterns, and the best of those projected variables are selected and used in the second step to identify the concept drift.

4.1 New approach: CatSight

We present this new approach, called CatSight, as a feature extractor of multivariate time series, and it aims to extract the features that help in the task of differentiating two consecutive time series, having as the final objective the detection of the concept drift in temporal data as is presented in Fig. 1.

Fig. 1 An overview of the proposed approach. As can be seen, previous changes are projected using CSP and then used to learn to detect new ones

Fig. 2 Multivariate time series with a selected time period containing data of two different labels. The temporal data is divided into very small time windows, which are then used for the learning process

The problem is posed as a supervised classification problem in which the label of all the data is known. As explained in Sect. 3.2, the CSP method is based on matrix decomposition that maximizes the power difference of two class signals. In order to achieve that, first of all, a time period \(T_{ t }= \big \{{T_{ t_{0} }, \dots , T_{t_{n}}}\big \}\) in which a change in the distribution is observed is selected from the multivariate time series. The data on either side of this change are labeled with two different classes, last concept and new concept, and the instant at which the change occurs is called “concept drift”. Once the time period is selected, the two resulting multivariate time series, \(T_{Last}\) and \(T_{New}\), are each divided into small time windows. Each time window is represented as an \(N \times F\) matrix, \(w_{ N \times F }\), where \(F\) is the number of features of the multivariate time series and \(N\) is the number of samples (time steps) in each window, as shown in Fig. 2.

When the data is prepared, the CSP method is used to transform the data and extract the most relevant features in order to maximize the distance between the two classes. For this, the variance of the transformed data is computed. Then, these features are used as the input to the classifier.

5 Experimental setup

This section presents the different steps performed during the experimental setup.

5.1 Data collection

The main objective of this work is to detect the change produced between two states in time series. Therefore, it has been posed as a binary classification problem, in which the objective is to detect if there is any difference between the two states, aiming at differentiating the condition which best fits each section of the evolving data.

This last characteristic is necessary due to the fact that the detection of the concept drift is expected to be between two time intervals named time period, i.e., it is aimed at studying whether the characteristics of the signals vary once a concept drift is detected. Consequently, temporal windows are required to compare one state with another; it is worth mentioning that the used windows –composed of only a few time intervals– belong to one class or another, and these are the cases of the classification problem.

Considering those requirements, two external datasets have been selected from the UCI Machine Learning Repository and UEA & Kaggle; they are summarized in Table 1, together with the industrial data.

  • EEG Eye State Data Set: the data was collected by Oliver Roesler [37] in 2013. In this case, the data has been obtained from UCI, a machine learning repository. The dataset belongs to a continuous EEG measurement obtained with the Emotiv Neuroheadset. The subject is recorded with open and closed eyes during 117 s. A total of 14,980 observations were made with 15 attributes (14 electrode measurements and eye states) [38].

  • Water pump sensor data (footnote 1): data obtained from Kaggle, an online community of data scientists and machine learning practitioners, and composed of water pump data recorded from 52 sensors, which includes system failures.

  • Real-world case study: an industrial use case has been evaluated in order to experimentally validate the proposed method in a real environment. The description is presented in Sect. 7.

Table 1 List of external datasets and industrial actual data

5.2 Data preprocessing

In this section, the preprocessing steps adopted to transform the data into a form suitable for the data mining procedure are described. The preprocessing consisted of MinMax normalization followed by Pearson correlation filtering (threshold 0.8) to select the most relevant features. In order to compare the CatSight methodology with its feature extractor counterpart, PCA was applied to the multivariate time series before splitting the data into different time periods. PCA is a statistical method that converts a set of correlated variables into a much smaller set of k uncorrelated variables, the principal components.

After that, the pre-processed data is split in different time periods as it is explained in Sect. 4.
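The normalization and correlation filtering steps of this subsection could be sketched as below. This is only an illustration: the helper name `preprocess` and the toy columns are hypothetical, and the rule used to decide which member of a highly correlated pair is dropped (the later column) is an assumption, since the text does not specify it.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def preprocess(df, corr_threshold=0.8):
    """MinMax-scale every column, then drop highly correlated columns."""
    scaled = pd.DataFrame(
        MinMaxScaler().fit_transform(df), columns=df.columns, index=df.index
    )
    corr = scaled.corr(method="pearson").abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # Assumption: of each correlated pair, the later column is removed.
    to_drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return scaled.drop(columns=to_drop)


# Toy sensors (made up): "b" is almost a copy of "a", "c" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + 0.01 * rng.normal(size=200),
    "c": rng.normal(size=200),
})
clean = preprocess(demo)
print(list(clean.columns))
```

In a streaming setting the scaler would normally be fitted on historical data only; fitting it on the whole series, as done here, is a simplification.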

5.3 Base learners

Six different classifiers have been selected to perform the experimental phase; they are used to classify the streaming data using both the original sensor data and the projected data obtained with CSP. This work aims to study the applicability of CSP as a drift detector rather than creating a fine-tuned classifier. For this reason, no fine-tuning methodology has been used and the default values of the classifiers in scikit-learn (footnote 2), a Python module for machine learning, are used. For each data collection these six classifiers are trained and evaluated by 5-fold cross validation.

  • RandomForest Classifier (RF): A RF is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The RF classifier was first defined by Tin Kam Ho in 1995 [39].

  • Support Vector Machine Classifier (SVM): This is a supervised learning algorithm developed by Vladimir Vapnik et al. [40]. It constructs a hyperplane in a multidimensional space to separate different classes. The SVM generates an optimal hyperplane iteratively, which is used to minimize an error. The central idea of SVM is to find a maximum marginal hyperplane that best divides the data set into classes.

  • Linear Discriminant Analysis (LDA): A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule. The model fits a Gaussian density to each class, assuming that all classes share the same covariance matrix [41].

  • K-Nearest Neighbors Classifier (KNN): K-Nearest Neighbors is a supervised instance-based machine learning algorithm. It classifies a new point by looking for the most similar (closest) data points stored in the training stage and assigning a class based on them [42].

  • Classification trees (C4.5): The goal of this classifier is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features [39].

  • Naive Bayes (NB): Performs a classic Bayesian prediction under the assumption that all inputs are independent. Bayesian classifiers use statistical theorems to predict the probabilities of class memberships. It is based on the assumption that the values of the attributes are independent of each other in the calculation of the probabilities of the class known as the class conditional independence [43, 44].
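The evaluation protocol with default scikit-learn classifiers and 5-fold cross validation can be sketched as follows. The mapping from the classifiers listed above to scikit-learn classes is an assumption: `SVC` is used for SVM, `GaussianNB` for Naive Bayes, and `DecisionTreeClassifier` (which implements an optimized CART rather than C4.5) for classification trees. The helper names, the fixed `random_state` (added only for reproducibility) and the synthetic data are likewise hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def make_base_learners(seed=0):
    """Six base learners with library defaults (seed fixed for repeatability)."""
    return {
        "RF": RandomForestClassifier(random_state=seed),
        "SVM": SVC(),
        "LDA": LinearDiscriminantAnalysis(),
        "KNN": KNeighborsClassifier(),
        "Tree": DecisionTreeClassifier(random_state=seed),
        "NB": GaussianNB(),
    }


def evaluate(X, y, cv=5):
    """Mean cross-validated accuracy of every base learner."""
    return {
        name: cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()
        for name, clf in make_base_learners().items()
    }


# Synthetic stand-in for the window features (made up for illustration).
X_demo, y_demo = make_classification(n_samples=144, n_features=4, random_state=0)
scores = evaluate(X_demo, y_demo)
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

The same `evaluate` call can be applied to raw window descriptors, PCA components or CSP features, which is how the three methodologies are compared on equal terms.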

5.4 Comparison

The proposed approach is compared with the conventional method in which the raw data is used after performing the steps described in Sect. 5.2 and with PCA. Data time periods are selected for the training process, explained in Sect. 4, and evaluated by a 5-fold cross validation.

It is worth mentioning that no special computation devices (e.g. GPUs) are needed, and that the whole experimental process has been carried out on a conventional PC.

6 Experimental results

In this section, the obtained results for the first phase of the experimentation are presented.

Table 2 Final datasets

6.1 Preprocessing results

Several data preprocessing techniques were performed, such as removing outliers and standardizing data. Correlation filtering was computed to remove the highly correlated features, obtaining the dataset shown in Table 2.

The data has been adapted so that it can be used as input in the classifiers. For this purpose, in each selected dataset, we have obtained the minimum amount of samples that a class has, selecting this minimum value (the minimum value of instances that a class has) as the maximum selected time period. Based on this, and after several experimental tests, we selected a set of window lengths that we considered suitable for the study, which are summarized in Table 2.

As can be observed, the EEG Eye State dataset has 14,980 cases, 9 variables and 2 classes. Each selected time period is composed of two parts (\(T_{Last}\) and \(T_{New}\)): the first selected time period is 360 \(\times\) 2, which means that 360 time units are selected before and 360 time units after the Concept Drift occurs, and the second is 180 \(\times\) 2 time units. Regarding the Pump sensor data, there are 1680 cases, 15 variables and 2 classes; the time periods used are 840, 360 and 180, each covering twice that number of time units (before and after the drift).

With each of the datasets, three different time window sizes are used: 5, 10 and 15 time units. Therefore, if a 360 time period is used, for instance, we have 720 time units, which means that 144 different time windows (size 5) are used in each experiment. Proportionally, 72 windows (size 10) and 48 windows (size 15) are used in each experiment.
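The window counts quoted above follow from dividing the total number of time units by the window size. A tiny sketch of this arithmetic, with a hypothetical helper name `n_windows`:

```python
def n_windows(time_period, window_size):
    """Number of non-overlapping windows for a period taken before AND after."""
    total_units = 2 * time_period  # e.g. 360 before + 360 after = 720
    return total_units // window_size


counts = {w: n_windows(360, w) for w in (5, 10, 15)}
print(counts)  # {5: 144, 10: 72, 15: 48}
```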

6.2 Accuracy results and comparison

Tables 3 and 4 present the accuracy scores obtained from the application of the presented approach (CatSight) in selected datasets. In order to assess the effectiveness of the CatSight, a comparison is made with the conventional way of classification, which means, in this case, without applying CSP on data, and PCA-ML. In all the studied methods, a 5-fold validation is performed in order to train the base learners.

In the CatSight method, the parameter \(q\) determines the number of features selected for the classification process; that is, \(2 \times q\) features obtained with CSP are considered in the training process. The selected time period corresponds to the number of samples selected, \(T \in \{T_{Last}, T_{New}\}\), and the time window corresponds to the sub-window \(w_{ N \times F }\) obtained by breaking the multivariate time series into a number of equal size time windows that are used to feed the CSP method.

For each combination of the different parameters of the presented method, it has been evaluated whether the application of CSP may improve the classification results. For this reason, the best results are highlighted in boldface. For every data time window, the best result of the conventional method is colored in blue, the best results obtained with PCA-ML are colored in red, whereas, the best results of the new approach are colored in gray. Finally, the green colored values represent the best results obtained from the three methodologies compared in the study.

6.2.1 Water pump sensor data results

In this case, the goal is to analyze whether the new approach is able to identify and improve the classification of the two states of the machine so that it can be established that there has been a change in the operation of the pump.

As can be observed in Fig. 3, the system failed five times during the last year; the data are labeled as normal, broken and recovering. The locations of the breakage points have been identified and established as reference points for the study, so that different time periods were selected around these points. As mentioned before, the minimum amount of data of each class was obtained, selecting this size as the maximum time period.

Fig. 3 Water pump sensor data with five system failures. A time period is presented for each failure, delimited by the blue and red lines. The red dot represents the point at which the change in the system has occurred

Fig. 4 Example stretch of the water pump sensor data containing a system failure. Three different time periods are represented; in each of them, the change point is denoted with a red dot

Each break point is analyzed separately in a different stretch; different time periods and windows are established in all the stretches, as shown in Fig. 4. First of all, the data was processed in order to apply CatSight. Then, the six classification algorithms were applied with the three compared methodologies in all the identified stretches; as a result, the accuracy values of each classifier with each time period and window were obtained. Table 3 presents the mean accuracy of each classifier over all stretches.

Table 3 Pump sensor data CSP accuracy results

The results in Table 3 show that the best outcome is obtained with CSP, in particular when \(q=2\) or \(q=3\) and with a time period of 840 grouped with a time window of 5, indicating that this configuration is enough to perform the classification. If we compare the results obtained with CatSight with those of the conventional way of classification and of PCA-ML, it can be observed that the use of CSP to filter and select temporal variables improves the results of the classification. As a matter of fact, the best result has been obtained using CSP combined with SVM using only 2 (\(\times 2\)) or 3 (\(\times 2\)) variables, with an accuracy of 0.983 for a time period of 840 samples. It is worth noting that the result obtained by SVM in the same experiment is 0.834 with the conventional way of classification and 0.833 with PCA-ML. The best result obtained using conventional classifiers without CSP is 0.893, using RF with a time period of 360 samples, and the best result with PCA-ML is 0.969, using SVM with a time period of 360 samples as well.

Regarding the performance of each classifier, SVM is the best classifier for both CSP and PCA-ML, whereas in the conventional case the best value was obtained by both RF and KNN; on closer analysis, RF obtained better results more often than KNN.

In addition, a statistical analysis was performed to compare all the algorithms against each other. For this purpose, the R package scmamp\(^3\) was used, which is mainly focused on non-parametric methods and implements Shaffer's static [45] and Bergmann and Hommel's dynamic [46] corrections for pairwise tests.

As we wanted to compare multiple classifiers, a post-hoc technique was used to visually represent the comparison of the performance of the different algorithms: the critical difference (CD) plot. The methodology is based on determining whether the performance difference between two algorithms is greater than the critical difference; if so, the two algorithms are regarded as significantly different [47].

The position of each method in the CD diagram represents its mean rank across all outcomes of the observations, where a lower rank indicates that the algorithm performs better more often than its competitors with higher ranks. If two or more algorithms are connected, there is not enough statistical evidence to say that they perform differently, whereas those that are not connected can be said to perform differently.
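As an illustration of how the quantities behind a CD diagram can be computed, the sketch below derives mean ranks from an accuracy table and evaluates a Nemenyi-style critical difference, \(CD = q_\alpha \sqrt{k(k+1)/(6N)}\), for \(k\) algorithms and \(N\) stretches. This is an assumption about the usual construction of such diagrams; it does not reproduce the Shaffer or Bergmann and Hommel corrections applied with scmamp. The function names and the toy accuracy values are hypothetical, and \(q_\alpha\) must be supplied from a studentized range table.

```python
import math


def average_ranks(scores):
    """Mean rank per algorithm; scores[i][j] is the accuracy of algorithm j on stretch i.

    Rank 1 is the best (highest) accuracy; tied algorithms share the average rank.
    """
    n_alg = len(scores[0])
    totals = [0.0] * n_alg
    for row in scores:
        order = sorted(range(n_alg), key=lambda j: -row[j])
        pos = 0
        while pos < n_alg:
            end = pos
            while end + 1 < n_alg and row[order[end + 1]] == row[order[pos]]:
                end += 1
            tied_rank = (pos + end) / 2 + 1  # average of 1-based positions pos+1 .. end+1
            for j in order[pos:end + 1]:
                totals[j] += tied_rank
            pos = end + 1
    return [t / len(scores) for t in totals]


def critical_difference(q_alpha, n_algorithms, n_datasets):
    """Nemenyi-style critical difference for mean ranks."""
    return q_alpha * math.sqrt(n_algorithms * (n_algorithms + 1) / (6.0 * n_datasets))


# Toy example: three algorithms evaluated on three stretches (made-up accuracies).
toy_scores = [
    [0.90, 0.80, 0.70],
    [0.95, 0.85, 0.85],
    [0.60, 0.70, 0.50],
]
toy_ranks = average_ranks(toy_scores)
toy_cd = critical_difference(q_alpha=2.343, n_algorithms=3, n_datasets=3)
```

Two algorithms would be joined by a bar in the diagram when the absolute difference of their mean ranks is below the critical difference. The value 2.343 is the commonly tabulated Nemenyi constant for three algorithms at a 0.05 significance level; other settings need the corresponding table entry.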

The results are shown in Fig. 5, where each algorithm is placed according to its average ranking. On average, SVM.csp and KNN.csp were the best algorithms over all the stretches of the dataset. The horizontal bold lines group the classifiers that show no significant difference. Moreover, the critical difference diagram shows that all of the algorithms in which the CSP method was applied before the training process improved their accuracy results over the other two methodologies.

Fig. 5

Statistical comparison of the accuracy of the classifiers using the critical difference diagram on Water Pump data with three methods

6.2.2 Eye state detection results

In this experiment, we wanted to study whether the CatSight approach would improve on the accuracy of the conventional method in predicting the open/closed state of the eyes.

Following the process carried out in the previous experiment, four drift points are established, as shown in Fig. 6. As before, each stretch is analyzed separately and divided into different time periods, as in Fig. 7. The three methodologies were applied in each time period to verify whether the proposed approach improves the accuracy of the classifiers.

Fig. 6

EEG record example. Every change in the state of the eyes is identified with a red dot and a time period is selected delimited by the red and blue lines

The results are summarized in Table 4, which presents the average accuracy over the four defined stretches. As Table 4 shows, the application of CSP improves the accuracy of the classification models in all the cases studied. The best parameter configuration is obtained with a time period of 360 for both the CatSight method and the conventional way of classification, whereas with PCA-ML the best result is obtained with a time period of 180. Regarding the value of \(q\), that is, how many new features obtained by the CSP transformation are needed to perform the classification, the best result in this case is obtained when \(q=3\). The best accuracy is obtained by CatSight using SVM as base classifier (0.955), while the best conventional classifier is KNN, which obtains an accuracy of 0.744 in both cases. As shown, the use of CSP improves the results of all the base classifiers.

Table 4 Eye state data CSP accuracy results; best results are highlighted in boldface. In each section, the best accuracy results are colored in blue, red and gray, with the best ones highlighted in green

On the other hand, Fig. 8 displays the critical difference plots of the accuracy of the multiple algorithms. According to the critical difference diagram, using CSP in most cases improves on the performance of the base classifiers and of the methods where PCA is applied, with the best results obtained with SVM, LDA and KNN. Note that these results also concur with the data presented in Table 4.

Fig. 7

EEG trace example with different time periods. The selection of an appropriate time period is an important issue

Fig. 8

Accuracy statistical comparison of the methods using the critical difference diagram on Eye state detection data, with three methods

7 Case study: experimental results with a real industrial dataset

After testing CatSight on publicly available time series data, the proposed methodology was also applied to data generated from a real-world industrial case study. The studied use case is a centrifugal end suction pump in charge of cooling various components of a metallurgical plant. As shown in Fig. 9, the pump is powered by a coupled induction motor. The motor was a three-phase two-pole induction motor with a rated power of 75 kW working at 400 V and a rated speed of 3000 r.p.m. at 50 Hz. We would like to point out that this is a practical application of the proposed method to an industrial problem on which the authors are currently working.\(^4\)

Fig. 9

Long coupled centrifugal pumps\(^4\)

The cooling pump is part of a critical system in the plant, as production depends on its correct operation. A machine bearing failure can cause unscheduled production downtime, resulting in economic costs. In the worst case, inadvertent failures can lead to catastrophic damage. For this reason, the system is continuously monitored by several sensors. As shown in Fig. 10, a data logger captures electrical parameters at the inverter input. An accelerometer is also fitted to the motor housing to capture vibration data. Both electrical parameters and vibration data were recorded every six minutes.

Fig. 10

Setup diagram of the case study

The ground truth consists of a bearing ball that becomes damaged around a certain time, as shown in Fig. 11, where the red dot corresponds to the moment of the bearing damage. The bearings ensure that the motor shaft is centered; if any of the balls breaks, the motor shaft may become off-center, which reduces the remaining useful life (RUL) of the motor. In addition, with an off-center shaft, the motor generates vibrations that propagate towards the load. All this causes the performance of the motor to deteriorate, and the sooner the behavioral change is detected, the sooner the necessary actions can be taken. For this reason, we set out to demonstrate that the presented methodology can be applied in real scenarios and makes it possible to detect the behavioral change in the data.

Fig. 11

Industrial actual data feature examples, with the two states delimited by red and blue colors

Fig. 12

Current per phase in the industrial actual data with different time period sizes

Taking this break point as a reference, the data is divided into two states, normal and damaged (see Fig. 11). As in the previous experiments, the size of the smaller class is used to establish the stretch size, and time periods are selected accordingly, as explained before and depicted in Fig. 12. We created temporal sequences to feed the three methodologies and, using the six classifiers presented, computed the three methods in order to compare them and to verify whether CatSight improves the accuracy results.
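The construction of balanced stretches around a break point can be sketched as follows. This is only one possible reading of the procedure, assuming NumPy: here the "time period" is taken as the number of samples kept on each side of the break point (capped by the smaller class) and the "window" as the length of each temporal sequence. The function name `balanced_windows` and its parameters are hypothetical.

```python
import numpy as np


def balanced_windows(data, change_idx, period, window):
    """Cut equally sized pre-change and post-change stretches into labelled sequences.

    data: array of shape (n_samples, n_channels)
    returns: trials of shape (n_trials, n_channels, window) and labels (0 = before, 1 = after)
    """
    n_before = change_idx
    n_after = data.shape[0] - change_idx
    # The smaller class bounds the stretch size so that both states stay balanced.
    half = min(period, n_before, n_after)
    half -= half % window  # keep only whole windows
    if half == 0:
        raise ValueError("not enough samples on one side of the break point")
    before = data[change_idx - half:change_idx]
    after = data[change_idx:change_idx + half]

    def cut(block):
        # (half, n_channels) -> (half // window, n_channels, window)
        return block.reshape(-1, window, block.shape[1]).transpose(0, 2, 1)

    trials = np.concatenate([cut(before), cut(after)])
    labels = np.repeat([0, 1], half // window)
    return trials, labels
```

For instance, with 100 samples, a break point at index 30, a requested period of 50 and a window of 5, only 30 samples are available before the break, so 30 samples are kept on each side and 12 sequences (6 per state) are produced.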

Analyzing the results in Tables 5 and 6 and comparing the three methodologies, it can be seen that the CSP-based method generally improves the accuracies obtained by the different algorithms. The best result is obtained by the CSP+TREE algorithm (0.996) using a time period of 2800. The best conventional classifier result is 0.958, obtained by NB; nevertheless, the results of the base classifiers are always improved by the CSP approach.

Table 5 Industrial actual data accuracy results part 1
Table 6 Industrial actual data accuracy results part 2

In order to assess the performance of the multiple algorithms under the analyzed methodologies, we present a post-hoc analysis using the critical difference diagram. In this case, Fig. 13 illustrates that, for the classification task, all CSP-based methods obtained better accuracies. Moreover, according to the obtained ranks, the best ranked method is SVM-CSP, which is consistent with the observations made from the accuracies in Tables 5 and 6.

Fig. 13

Accuracy statistical comparison of the methods using the critical difference diagram on industrial actual data, with three methods

8 Comparison between the two external datasets and the industrial actual data

This section summarizes the best values obtained on the three datasets with the three methodologies: CatSight, the conventional way of classification and PCA-ML. The best results are summarized in Table 7. In general, it can be concluded that the new approach outperforms the previously proposed methodologies. CSP projects the multivariate temporal data into a clearer space, increasing the separation between the two states and thus achieving better accuracy. The results obtained using CatSight are more robust as well. Regarding the base classifiers, when CSP is applied, the best performance was obtained using SVM in two of the three datasets and with NB in the industrial actual data, followed very closely by SVM. With the conventional methods there is no clear winner: RF, KNN and NB achieved the best results, respectively. It is worth noting that CatSight outperforms the conventional classifiers in all cases. The results of the PCA-ML method are better than those obtained with the conventional way of classification, but even so, CSP still outperforms the PCA-ML methods.

Table 7 Summary of three dataset best accuracy results

As can be seen, the new approach outperforms the standard approaches in the mean over the six machine learning classifiers in the three cases used in this paper. On the water pump data, CatSight obtains 0.977, while the standard ML approaches obtain a mean of 0.848 and CSP 0.945. On the eye state detection data, the best mean is again achieved by CatSight (0.948), followed by the standard ML classifiers (0.885), with CSP the worst at 0.707. On the industrial data, the best is CatSight (0.989), far ahead of the standard ML classifiers (0.850) and CSP (0.841).

9 Conclusion and further work

In this paper, a new approach to deal with concept drift in temporal data, named CatSight, is presented. It combines two steps: (i) the use of the Common Spatial Pattern method to project the multivariate temporal data into a subspace in order to select the most relevant features that separate the two classes more clearly, and (ii) the use of conventional machine learning classification algorithms to detect the change in the data. The drift detector is based on the results of the classifier, using the classification accuracy as a metric.
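The idea of using classification accuracy as the drift metric can be illustrated with a small sketch, assuming NumPy: if a classifier can reliably tell the features of one window from those of the next, the two windows are treated as different states. The nearest-centroid classifier, the hold-out split and the 0.9 threshold are placeholder assumptions chosen to keep the example self-contained; they stand in for the base classifiers evaluated in the paper and are not part of the published method.

```python
import numpy as np


def holdout_accuracy(features, labels, train_fraction=0.7, seed=0):
    """Accuracy of a nearest-centroid classifier on a random hold-out split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    n_train = int(train_fraction * len(labels))
    train, test = idx[:n_train], idx[n_train:]
    # One centroid per class, estimated on the training part only.
    classes = np.unique(labels[train])
    centroids = [features[train][labels[train] == c].mean(axis=0) for c in classes]
    # Distance of every test sample to every centroid: shape (n_test, n_classes).
    dists = np.stack(
        [np.linalg.norm(features[test] - centroid, axis=1) for centroid in centroids],
        axis=1,
    )
    predictions = classes[dists.argmin(axis=1)]
    return float((predictions == labels[test]).mean())


def drift_detected(features, labels, threshold=0.9):
    """Flag a drift when the two windows are separable with high accuracy."""
    return holdout_accuracy(features, labels) >= threshold
```

When the two windows come from the same distribution, the accuracy is expected to stay near chance level and no drift is flagged; a clear shift between the windows pushes the accuracy towards 1.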

In order to assess the effectiveness of the method, CatSight has been compared with the conventional way of classification and the PCA-ML method on three different datasets (two publicly available datasets and one real-world industrial dataset). Experiments show that CatSight performs better on the tested datasets. Generally, higher accuracy rates are obtained with the proposed approach: an average increase of 10.5% is observed when comparing CatSight with the conventional way of classification. Among the CatSight combinations, the best is CSP-SVM, which obtains better average accuracy scores than the other methods.

To conclude, this work has shown that the application of CatSight obtains better discrimination rates between two states in multivariate time series data where a concept drift is observed. The authors believe that this improvement can be applied to a range of industrial problems, as the drift detection capability of the proposed CatSight method has been demonstrated on the tested datasets.

Nevertheless, the approach is to be further investigated in order to overcome some of its limitations. To continue this research line, the following further work is envisaged:

On the one hand, the combination of several approaches is to be analysed, similar to the approach proposed in [48], in order to use the appropriate classifier ensemble for each concept drift detection problem. A different approach is also to be tried to help automatically select the set of classifiers to be used [49].

On the other hand, streaming data analysis [50] and more real industrial data [51] are also to be investigated. There are also new research lines, such as Change-Point Detection [52], which are to be studied in order to apply them to industrial data.

In this paper, the drift detection problem was posed as a balanced classification problem, in which two time windows of the same length were selected. As future work, the problem can be analyzed as an unbalanced classification problem, in which different time lengths are selected, so that the change of concept can be detected as close to the inflection point as possible.