1 Introduction

The manufacturing industry has gone through several paradigm changes over the years. Industrie 4.0, also referred to as smart industry, is a new paradigm that proposes the integration of information and communication technologies (ICT) into decentralised production. With manufacturing machines fully networked to share data and controlled by advanced computational intelligence techniques, this paradigm seeks to improve productivity, quality and sustainability while reducing costs [1, 2].

The estimation of the remaining useful life (RUL) of industrial components is an important task in smart manufacturing. Early detection of cutting tool degradation helps reduce failures, thereby decreasing manufacturing costs and improving productivity. It can also help maintain the quality of the workpiece, as a correlation between the surface roughness of the workpiece and the cutting tool wear has been demonstrated [3]. Real-time tool wear measurement is difficult to put into practice, as the tool is continuously in contact with the workpiece during machining. For this reason, a plethora of indirect approaches for tool wear estimation (also referred to as prognosis) have been proposed, utilising sensor signals such as cutting forces, vibrations, acoustic emissions and power consumption [4].

Prognostic approaches can be divided into two categories: model-based and data-driven. The former rely on a priori knowledge of the underlying physical laws and probability distributions that describe the dynamic behaviour of a system [5,6,7,8]. Although these approaches have proven successful, they require an in-depth understanding of, and expertise in, the physical processes that lead to tool failure.

On the other hand, data-driven approaches model the data by means of a learning process, avoiding assumptions about its underlying distribution. Most data-driven methods that have been used for tool wear prediction are based on machine learning, particularly artificial neural networks (ANN), support vector machines (SVM) and decision trees (DT) [9]. However, these techniques are limited in their ability to process raw (i.e. unstructured or unformatted) data, which has a negative effect on their generalisation capabilities [10].

The large amounts of data in smart manufacturing impose challenges such as the proliferation of multivariate data, the high dimensionality of the feature space and multicollinearity among data measurements [2, 11]. This paper presents in detail the methodology of a novel approach to tool wear classification recently used in [12] as a component of an on-line monitoring framework. Despite the large volumes of data required, the automatic feature learning and high-volume processing capabilities of deep learning make it a viable advanced analytics method for tool wear classification. The proposed classification methodology is based on two components: an imaging step and a deep learning step. The imaging technique encodes sensor signals so that their complex features, as well as the temporal correlations they exhibit, are captured by the deep learning step without manual feature selection. An analysis of the challenges and strategies involved in building a big data classification approach is performed through a set of experiments using the PHM 2010 challenge dataset [13], for which the technical procedures of data generation and collection are not entirely known. This enables an unbiased blind test and a demonstration of the generalisation capabilities of the methodology.

The rest of the manuscript is organised as follows: Section 2 presents details of how machine learning has been applied to tool wear prediction. Section 3 introduces the proposed approach, giving details of the signal imaging and the deep learning methodology. The experimental setup, results and discussion are presented in Section 4. Section 5 compares the proposed approach with previous work. Finally, conclusions and future work are presented in Section 6.

2 Related work

Tool wear has been widely studied as it is a very common phenomenon in manufacturing processes such as milling, drilling and turning. It is well known that different machining parameters such as spindle speed, feed rate and cutting tool characteristics as well as the workpiece material have an effect on tool wear progression [14]. Although this progression can be mathematically estimated [15, 16], these models rarely capture the stochastic properties of real machining processes and tool-to-tool performance variation [17]. Over the last two decades, it has been demonstrated that data-driven models can achieve higher accuracy, although these have also shown some drawbacks [10].

Some of the most common data-driven methods are based on traditional machine learning algorithms. SVMs, for example, have been successfully applied to tool condition monitoring (TCM) in [18], where the authors use automatic relevance determination (ARD) on acoustic emission data to select nine features as inputs for classification. ANNs have also been extensively applied to tool wear prediction, commonly using a combination of cutting parameters, such as cutting speed, feed rate and axial cutting length, and statistical features of forces, vibrations and acoustic emission [19,20,21,22]. In applications such as drilling and milling, it has been shown that ANNs can outperform regression models. In [9], a tool wear prediction method based on random forests is proposed. Although this approach has outperformed ANN- and SVM-based methods, it relies on the manual selection of features to build the internal classification structures.

Manual feature selection is a significant problem when dealing with large amounts of shop floor–generated sensory data, whose distribution and number of available features may change over time. Cloud-based architectures recently proposed for collecting and managing sensory data [2, 23] present new challenges to current TCM solutions. To be more general, forthcoming approaches should be able to cope not only with high volumes of heterogeneous data but also with the constant evolution of high-dimensional features. Most classical machine learning techniques have been designed to work with data features that do not change with time (static data). As a result, several of these techniques either have been extended to handle temporal changes or rely on a prior selection of features using other algorithms [24].

Deep learning has offered better solutions for dealing with high-dimensional, evolving features. It has made major advances in fields such as image recognition [25, 26], speech recognition [27] and natural language processing [28, 29], to name a few. Its capability to process data with highly complex features has led to an emerging body of work on deep learning applications for smart manufacturing. For instance, recurrent neural networks (RNN) have been applied successfully to the long-term prognosis of rolling bearing health status [30]. In [31], a local feature-based gated recurrent unit network is applied to tool wear prediction, gearbox fault diagnosis and bearing fault detection. The bi-directional recurrent structure proposed by the authors can access the sequential data in two directions, forward and backward, so that the model can fully explore the ‘past and future’ of each state.

Another successful deep learning architecture, and the one addressed in this work, is the convolutional neural network (CNN) [32]. CNNs have become a de facto standard for deep learning, achieving state-of-the-art performance in image recognition. The architecture of a CNN is based on that of the ANN, extended with a combination of convolutional and sub-sampling layers that allow the discovery of relevant features; this is explained in more detail in Section 3.2. CNNs were developed primarily for 2D signals such as images and video frames. Successful applications include the detection of vehicles in complex satellite images [33], the classification of galaxy morphology [34] and brain tumour segmentation from MRI images [35], among others. Their success in the classification of two-dimensional data has led to the further development of CNNs for time series classification (one-dimensional data). Applications include the classification of electrocardiogram beats for detecting heart failure [36] and the use of accelerometer readings for human activity recognition [37].

CNNs have also been applied to manufacturing problems. For example, they have been used for the detection of faulty bearings [38,39,40] by feeding raw vibration data directly to the CNN, achieving good accuracy and avoiding the computational cost of extracting fixed features. In [41], real-time structural health monitoring is performed using 1D CNNs. The authors use vibration signals from damaged and undamaged joints of a girder to train several CNNs, one for each joint, with the objective of detecting structural damage (if any) and identifying the location of the damaged joint(s) in the girder. The authors report outstanding performance and computational efficiency when dealing with large-scale experiments.

Previous work on tool wear prediction has combined a CNN with bi-directional long short-term memory (LSTM) networks [42]. The proposed approach is able to extract local features of the data, achieving good accuracy compared with other deep learning techniques such as RNNs. However, the method performs a substantial size reduction of the original data, losing information at the flute level; this is further discussed in Section 5.

Manual feature selection thus remains a limitation for tool wear prediction approaches seeking generalisation. To address this, this paper extends preliminary experiments on a novel deep learning–based method that automatically discovers intricate structures in sensor signals related to the tool condition and, from these, provides a classification of the tool state. The approach is blind to the type of signals given and their underlying distribution, so no assumptions or manual feature selection are needed. The model is likewise blind to the type of wear being classified: although flank wear is used in this work as the measure of tool condition, the proposed methodology could be applied to other types of tool wear as well.

3 Methodology

This section presents the two main steps of the methodology: the imaging of sensor signals using Gramian Angular Summation Fields [43] and the classification using CNNs. The idea behind this approach is to visually recognise, classify and learn structures and patterns intrinsic to sensory data without loss of information.

3.1 Time series imaging

There has been recent interest in reformulating features of time series to improve their identification and hence their classification. Eckmann et al. introduced recurrence plots to visualise the repetitive patterns of dynamical systems [44]. Silva et al. built on this method, proposing a compression distance approach to compare recurrence plots of time series as a way to measure similarity [45]. Methods that map time series to networks and use the network topology to characterise the series have also been proposed [46, 47]. Most of these methods do not provide a way to reconstruct the original data, making it unclear how the topological properties relate to the time series. Wang et al. proposed three techniques for imaging time series, two based on Gramian Angular Fields (GAF) and one on Markov Transition Fields (MTF) [43]. They argue that, unlike with previous techniques, the original time series can be reconstructed, allowing the user to understand how the features introduced by the encoding improve classification. They reported that GAF encoding achieved competitive results on a series of baseline problems spanning domains such as medicine, entomology, engineering and astronomy. Furthermore, this method has been found to perform well compared with other time series encoding techniques in applications such as the classification of future trends in financial data [48].

As a pre-processing step, our approach uses the GAF imaging technique proposed in [43], particularly the variant based on the summation of angular fields, the Gramian Angular Summation Field (GASF). This encoding method consists of two steps. First, the time series is represented in a polar coordinate system instead of the typical Cartesian coordinates. Given a time series X = x1, x2, ..., xn of n real-valued observations, X is rescaled so that all values fall in the interval [−1, 1] by:

$$ \tilde{x}_{i}=\frac{(x_{i} - \max(X))+(x_{i}-\min(X))}{\max(X)-\min(X)} $$
(1)

The rescaled time series $\tilde{X}$ can then be represented in polar coordinates by encoding the value as the angular cosine and the time stamp as the radius, applying Eqs. 2 and 3:

$$ \phi=\arccos(\tilde{x}_{i}), \quad -1\leq\tilde{x}_{i}\leq1, \quad \tilde{x}_{i} \in \tilde{X} $$
(2)
$$ r=\frac{t_{i}}{N}, \quad t_{i} \in \mathbb{N} $$
(3)

In Eq. 3, ti is the time stamp and N is a constant factor that regularises the span of the polar coordinate system. Figure 1 shows an example of forces in the y-axis and their representation in polar coordinates.
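To make the encoding concrete, the following minimal Python/NumPy sketch implements Eqs. 1–3; the function and variable names are illustrative and not part of the original implementation.

```python
# Minimal sketch of Eqs. 1-3: min-max rescaling to [-1, 1] followed by the
# polar encoding (value -> angular cosine, time stamp -> radius).
import numpy as np

def to_polar(x):
    x = np.asarray(x, dtype=float)
    # Eq. 1: rescale so that all values fall in [-1, 1]
    x_tilde = ((x - x.max()) + (x - x.min())) / (x.max() - x.min())
    # Eq. 2: encode the value as the angular cosine
    phi = np.arccos(x_tilde)
    # Eq. 3: encode the time stamp as the radius, N regularising the span
    N = len(x)
    r = np.arange(1, N + 1) / N
    return phi, r
```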

Fig. 1
figure 1

Forces in the y-axis acquired from a dynamometer are encoded as polar coordinates by applying Eqs. 2 and 3. As time increases, the corresponding values of the signal in polar coordinates warp among different angular points on the spanning circles, preserving the temporal relations

As time increases, corresponding values on the polar coordinate system warp among different angular points on the spanning circles. This representation preserves the temporal relations and can easily be exploited to identify the temporal correlation within different time intervals. This temporal correlation is represented as:

$$ G = \left[\begin{array}{ccc} \cos(\phi_{1}+\phi_{1}) & {\dots} & \cos(\phi_{1}+\phi_{n}) \\ \cos(\phi_{2}+\phi_{1}) & {\dots} & \cos(\phi_{2}+\phi_{n}) \\ {\vdots} & {\ddots} & {\vdots} \\ \cos(\phi_{n}+\phi_{1}) & {\dots} & \cos(\phi_{n}+\phi_{n}) \end{array}\right] $$
(4)
$$ \cos(\phi_{i}+\phi_{j}) = \tilde{X}^{\prime}\cdot\tilde{X}-\sqrt{I-\tilde{X}^{2}}^{\prime}\cdot\sqrt{I-\tilde{X}^{2}} $$
(5)

where I is the unit row vector [1, 1, ..., 1]. Figure 2 shows the image that results from applying the encoding method to the time series presented in Fig. 1.
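The GASF matrix of Eqs. 4 and 5 can be computed directly from the rescaled series without materialising the angles, using the identity cos(φi + φj) = cos φi cos φj − sin φi sin φj. The sketch below is illustrative, not the authors' code.

```python
# GASF (Eqs. 4-5) for a series already rescaled to [-1, 1]: since
# x_tilde = cos(phi), sqrt(1 - x_tilde^2) = sin(phi), so the matrix of
# pairwise cos(phi_i + phi_j) values follows from two outer products.
import numpy as np

def gasf(x_tilde):
    x_tilde = np.asarray(x_tilde, dtype=float)
    sin_part = np.sqrt(np.clip(1.0 - x_tilde**2, 0.0, 1.0))  # clip for safety
    return np.outer(x_tilde, x_tilde) - np.outer(sin_part, sin_part)
```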

Fig. 2
figure 2

Example of the encoding of forces in the y-axis as an image using GASF. The colour represents the intensity of the relative correlation between two points in the time series, a value between −1 and 1. No PAA smoothing is applied to the resulting image, so its resolution (300 × 300 pixels) matches the length of the original signal

The GASF image provides a way to preserve the temporal dependency: time increases as the position in the image moves from top-left to bottom-right. The entries G(i, j) with |i − j| = k represent the relative correlation by superposition of directions with respect to time interval k; the main diagonal G(i, i) is the special case k = 0, which contains the original value/angular information. The resulting GASF image has dimension n × n for a time series of length n. To reduce the size of the image, piecewise aggregate approximation (PAA) is applied to smooth the time series while keeping its trends [49]. As explained in the Experiments section, the amount of time series data acquired from the sensors is large (more than 200,000 measurements), so PAA is fundamental to keeping the images at a reasonable size without losing time coherence.
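As an illustration of the reduction step, a minimal PAA implementation simply replaces equal-width segments of the series with their means:

```python
# Piecewise aggregate approximation (PAA) sketch: split the series into
# n_segments equal-width chunks and keep the mean of each, so e.g. a
# 2,000-point window shrinks to 512 points before GASF encoding.
import numpy as np

def paa(x, n_segments):
    x = np.asarray(x, dtype=float)
    # array_split tolerates lengths that are not exact multiples of n_segments
    return np.array([seg.mean() for seg in np.array_split(x, n_segments)])
```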

To label the images, three regions were identified as defined in [50]. According to the literature, the tool life in milling operations is typically divided into three stages/classes: a break-in region with a rapid wear rate, a steady-state region with a uniform wear rate, and a failure region where the wear rate is again rapid [51]. Figure 3 presents an example tool degradation curve with the classes used to label the images.
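In code, the labelling reduces to a threshold rule on the measured flank wear. The thresholds below are hypothetical placeholders used only to illustrate the rule; the actual class boundaries follow the degradation curve of each cutter (Fig. 3).

```python
# Illustrative labelling of a cut event by flank wear region.
# The thresholds are assumed values, not those used in the experiments.
BREAK_IN_MAX_UM = 80.0   # hypothetical boundary, in micrometres
STEADY_MAX_UM = 140.0    # hypothetical boundary, in micrometres

def wear_class(flank_wear_um):
    if flank_wear_um < BREAK_IN_MAX_UM:
        return "break-in"
    if flank_wear_um < STEADY_MAX_UM:
        return "steady-state"
    return "failure"
```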

Fig. 3
figure 3

Tool flank wear as a function of cutting time (cut events of cutter c6 used in the experiments). For each region, a sample image of forces in the y-axis is provided

3.2 Deep learning for time series classification

To identify the current state of wear of a tool from sensor signals, the approach applied needs to be capable of picking up the temporal dependencies present in the signals. Sensor signals are expected to show changes in their temporal structure as the tool wears out, and a classification tool should be capable of identifying those changes and mapping them to a predefined wear class.

Time series classification methods are generally divided into two categories: sequence-based and feature-based. Across both categories, k-nearest neighbour (k-NN), a sequence-based method, has proven very difficult to beat, especially when paired with dynamic time warping (DTW). The drawback of this approach is its lengthy computation time: as the training set grows, the computation time, and hence the prediction time, increases linearly.

Deep learning, by contrast, offers constant prediction time as well as a way to extract relevant features automatically. CNNs in particular have been successful in handling large volumes of data. Although they have been used primarily for visual tasks, voice recognition and language processing, recent developments have turned towards time series classification.

CNNs are inspired by the way the visual cortex in the human brain works. Neurons of the visual cortex have a local receptive field, reacting to visual stimuli located in a limited region of the visual field [52]. These receptive fields may overlap, together tiling the whole visual field. Some neurons have larger receptive fields that react to more complex patterns formed by combinations of lower level patterns. The discovery of these basic functionalities of the human brain inspired the idea of an artificial neural network architecture in which higher level neurons build on the outputs of neighbouring lower level neurons to detect complex patterns. In 1998, LeCun et al. [32] proposed the LeNet-5 architecture, which contains the main building blocks of a CNN: the convolution layer and the pooling layer.

A convolution layer is formed by a series of neurons that are connected to neurons of the previous layer according to their receptive field. In the first convolution layer, for example, each neuron is connected not to every individual pixel of the input image but only to those pixels within its receptive field; each neuron in the second convolution layer is then connected to neurons within a small rectangle of the first layer. The first convolution layer is responsible for detecting the lower level features, and further convolutions assemble these features into higher level ones. The set of weights (i.e. the filter) of a neuron in each convolution layer depends on the type of feature it is “looking” for: one filter might detect vertical lines, for example, while another detects horizontal ones. During the convolution, the filter is compared with different areas of the image, producing a feature map that highlights the areas most similar to the filter (see Fig. 4a). As images possess a variety of different features, each convolution layer has more than one set of weights, or filters, and the training process enables the CNN to find the most useful filters for the particular classification task. In the case of the force classification addressed here, the training process finds filters that first recognise features at a flute level, regardless of where in the image they are located; higher level convolutions then allow the state of the tool to be determined considering all flutes.
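The convolution itself can be illustrated in a few lines: sliding a small filter over the image and taking the dot product at each position yields the feature map described above. The toy example below uses a hand-written vertical-edge filter; in a CNN the filter weights are learned during training.

```python
# Toy 2D cross-correlation (the operation CNNs actually compute): slide a
# kernel over the image and record the response at each position.
import numpy as np

def feature_map(image, kernel):
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A filter that responds strongly to vertical edges
vertical_edge = np.array([[1.0, 0.0, -1.0],
                          [1.0, 0.0, -1.0],
                          [1.0, 0.0, -1.0]])
```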

Fig. 4
figure 4

Low-level features of forces are picked up by the first layer, which are then assembled into higher level features in the following layers

The pooling layer is another important building block of the CNN. This layer downscales the output of the convolution, reducing dimensionality, the local sensitivity of the network and the computational complexity (see Fig. 4b) [32]. A typical CNN architecture stacks several convolution layers (possibly including a rectified linear unit (ReLU) step to speed up training) and pooling layers, which reduce the size of the image as the network gets deeper. Finally, at the top of the stack, a multilayer neural network is connected to the last convolution/pooling layer to perform the classification.
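Max pooling, the most common pooling variant, can be sketched in the same spirit: each window of the feature map is replaced by its largest activation.

```python
# Max pooling sketch: a size x size window with the given stride keeps the
# strongest activation in each region, downscaling the feature map.
import numpy as np

def max_pool(fmap, size=2, stride=2):
    h = (fmap.shape[0] - size) // stride + 1
    w = (fmap.shape[1] - size) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = fmap[i * stride:i * stride + size,
                             j * stride:j * stride + size].max()
    return out
```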

In this paper, the CIFAR-10 architecture from TensorFlow has been used [53]. This is an off-the-shelf CNN architecture that has proven to achieve high accuracy in the classification of 3-channel images (see Fig. 5). The architecture has two convolution layers, each stacked with its corresponding ReLU and pooling layers, and each convolution applies 64 filters. As presented in the next section, the implemented CNN takes 3-channel images generated from the force sensors and uses them for training; the deep learning structure is then able to pick up the relevant features that relate to the tool wear condition. Figure 6 shows a schematic of how the approach has been implemented.
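The shape of this architecture can be sketched as follows. The sketch uses the Keras API rather than the original TensorFlow CIFAR-10 code; the 5 × 5 kernels, 3 × 3 pooling and 384/192-unit dense layers mirror the reference CIFAR-10 model and should be read as assumptions, with only the two 64-filter convolution/ReLU/pooling stages taken from the description above.

```python
# Keras sketch in the spirit of the TensorFlow CIFAR-10 architecture:
# two convolution/ReLU/pooling stages with 64 filters each, then a dense head.
import tensorflow as tf

def build_model(input_shape=(512, 512, 3), n_classes=3):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu",
                               input_shape=input_shape),
        tf.keras.layers.MaxPooling2D(3, strides=2, padding="same"),
        tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(3, strides=2, padding="same"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(384, activation="relu"),
        tf.keras.layers.Dense(192, activation="relu"),
        tf.keras.layers.Dense(n_classes),  # logits; softmax applied in the loss
    ])
```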

Fig. 5
figure 5

CNN architecture based on the TensorFlow implementation for the CIFAR-10 dataset (adapted from [53])

Fig. 6
figure 6

Proposed framework combining time series imaging and deep learning for tool wear classification. Forces in the three dimensions are individually encoded using GASF and combined into 3-channel images. Of those images, 70% are used for training a CNN model and 30% for testing

4 Experiments and results

Tool wear classification was performed using a dataset originally made available by the PHM 2010 Data Challenge [13]. The dataset contains sensory data of six 3-flute cutters (labelled c1, ..., c6) used in a high-speed CNC machine (Röders Tech RFM760) under dry milling conditions until a significant wear stage was reached. The experiment with each cutter was carried out as follows. The workpiece surface was machined line by line along the x-axis with a 6-mm three-flute cutter. After finishing one pass along the x-axis (axial depth of 0.2 mm and radial depth of 0.125 mm), the tool was retracted to start a new pass; this was repeated until the complete surface was removed. The tool was then removed from the tool holder and taken to a LEICA MZ12 microscope, where the flank wear (Vb) of each individual flute was measured. To capture cutting forces throughout the experiment, a Kistler quartz 3-component platform dynamometer was mounted between the workpiece and the machining table; a schematic of this setup is shown in Fig. 7. To measure vibrations, three Kistler piezo accelerometers were mounted on the workpiece. Finally, an acoustic emission sensor was mounted on the workpiece to monitor the high-frequency stress waves generated by the cutting process. For each cutter, the seven signal channels (forces in the x-, y- and z-axes, vibrations in the x-, y- and z-axes and acoustic emission) were recorded while removing 315 layers of the stainless steel workpiece (see Table 1). Table 2 shows the process conditions during the cutting tests. The dataset for each cutter is about 3.2 GB, nearly 20 GB in total for all cutters. In this work, only three of the six cutters (c1, c4 and c6) were used, as these are labelled with their corresponding tool wear measurements. More details on the machining setup can be found in [54].

Fig. 7
figure 7

Schematic of the experimental setup used in [54] to collect the forces, vibrations and high-frequency stress waves of the cutting process

Table 1 Signal channels and measurement data of the complete dataset
Table 2 Operating conditions during dry milling

Initial experiments were carried out with a data subset comprising a single cutting tool for the training and test sets, giving a total dataset size of 1 GB. In this case, the cutter labelled c6, for which 315 cuts and tool wear measurements are available, was used. Force signals were selected as the only input to the CNN to avoid a computationally expensive training process for this proof of concept.

To prepare the dataset for the training and testing of the CNN, the cutting forces Fx, Fy and Fz corresponding to each removed layer were encoded as three separate images. Since the time series corresponding to one layer can be as long as 219,000 measurements, a representative portion of the complete series was taken by selecting a subsequence of 2,000 measurements from the middle of the layer, thus capturing different material hardness. Applying the GASF method explained in Section 3, an image for each force (Fx, Fy and Fz) was obtained. These were then reduced from a size of 2k × 2k pixels to 512 × 512 pixels using PAA and combined into a 3-channel image. The wear class associated with this image is determined by the flank wear value measured when the layer was removed. Although this experimental setup is particular to flank wear, the images could be labelled using other types of wear, such as crater wear; regardless of the wear measure used, the training process should be able to capture the features of the input that relate to it.
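Putting the pieces together, the preparation of one training image can be sketched as below, reusing the paa and gasf sketches from Section 3.1. Note one simplification: the paper describes reducing the 2k × 2k image with PAA, whereas here PAA is applied to the 2,000-sample window before encoding, which produces the 512 × 512 image directly and is the usual way this reduction is realised.

```python
# Sketch: from the three raw force channels of one cut event to one
# 512 x 512 x 3 GASF image (window length and sizes follow the text).
import numpy as np

def rescale(x):
    x = np.asarray(x, dtype=float)
    return ((x - x.max()) + (x - x.min())) / (x.max() - x.min())  # Eq. 1

def cut_to_image(fx, fy, fz, window=2000, out_size=512):
    channels = []
    for force in (fx, fy, fz):
        mid = len(force) // 2
        seg = np.asarray(force[mid - window // 2: mid + window // 2], float)
        seg = paa(seg, out_size)             # 2,000 -> 512 points
        channels.append(gasf(rescale(seg)))  # 512 x 512 GASF matrix
    return np.stack(channels, axis=-1)       # 512 x 512 x 3 image
```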

As an example, Fig. 8 shows forces in the x-axis at different stages of the milling experiment. It can be observed that the forces tend to become more uniform (i.e. the shapes tend to become more circular) as the tool wears out. The size reduction does not affect the time coherence of the data, so the temporal information of each individual flute is still kept after PAA.

Fig. 8
figure 8

Sample images of rescaled forces in the x-axis at different stages of flank wear. It can be observed how the shapes in the image become more circular as the signal becomes smoother, and how the information of each individual flute is kept

In total, the pre-processing step produced 315 3-channel images, one for each cutting event. This set was divided into 70% for training and 30% for testing. The CNN was trained using the softmax regression method, which applies a softmax nonlinearity to the output of the network and computes the cross-entropy between the normalised predictions and the actual labels. The parameters used for the training process are shown in Table 3.

Table 3 Parameters used for the training process
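A hedged sketch of this training setup follows, reusing the build_model sketch from Section 3.2. Folding the softmax into the cross-entropy loss matches the description above; the optimiser and learning rate are placeholders standing in for the actual values of Table 3.

```python
# Training sketch: softmax is applied inside the cross-entropy loss
# (from_logits=True), as in the softmax regression method described above.
import tensorflow as tf

model = build_model(input_shape=(512, 512, 3), n_classes=3)
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),  # assumed values
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
# model.fit(train_images, train_labels, batch_size=100, epochs=...)
```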

Once the model was trained, it was tested on the remaining 95 images. Table 4 presents a confusion matrix of the results. Based on the test set, the estimated accuracy of the model is 90%: break-in wear was correctly classified in 82% of cases, steady wear in 94% and failure wear in 75%. The number of incorrect predictions suggests that the number of cases for the break-in and failure regions may need to be increased.

Table 4 Confusion matrix summarising the results on the test set

As can be observed in Fig. 3, the number of cuts that fall in the break-in region is 50, while the number in the steady-state region is 200. This means that roughly two-thirds of the available data is categorised as steady-state, so if the training set is generated by randomly sampling from the complete dataset, it is likely that two-thirds of those samples belong to the steady-state class. This class imbalance problem is well documented in the literature [55,56,57]. Failure cases tend to be considerably less abundant than steady wear cases, and less represented classes are more likely to be misclassified than the majority examples due to the design principles of the learning process: the training optimises the overall classification accuracy, which results in the misclassification of minority classes. Several techniques could therefore be applied to balance the number of samples per class. Because the time series corresponding to one layer of the workpiece can be as long as 220,000 measurements, the data can be resampled to generate more than one sample from each layer, particularly for the break-in and failure cases. At the same time, undersampling can be performed by adding another class for cases approaching the failure region. A fourth class identifying this region could in fact be more useful, as the low-wear region currently covers a wide range of tool wear values. It is important to remark that tool wear progresses differently depending on the type of tool, the material, the cutting parameters and other cutting conditions; it is not possible to identify the degree of class imbalance for a tool for which no prior data has been collected. Class imbalance therefore needs to be detected and acted upon as part of the data preparation prior to model training.

A balanced number of cases across all classes will be crucial to achieving homogeneous accuracy across all wear regions. The overall results are nevertheless promising, showing that the CNN successfully captured the intrinsic structures of the sensory data. The method is thus scalable to include the remaining cut data.

A second experiment was performed by adding a fourth class corresponding to the area prior to entering the failure region (Fig. 9). This area is of particular interest to this study, as it represents a point in time where decisions could be taken to extend the life of the tool. The number of instances per case was also increased by taking two more subsequences from each layer, for a total of three 2,000-sample subsequences from the middle of each layer (cut event); enough for the experiment to remain a short proof of concept. The sequences were again encoded into images and labelled according to the wear value and the new classes. A total of 954 images were produced, of which 70% were used for training and 30% for testing. The results are shown in Table 5.

Fig. 9
figure 9

Four stages of tool wear for cutters c1, c4 and c6, and sample images of forces in the y-axis that correspond to those regions

Table 5 Confusion matrix summarising the results with four classes on the test set

The overall accuracy of the classification was 89%, about the same as in the first experiment. However, the percentage of correctly classified cases improved per class. For example, the break-in wear region improved on the 82% of the previous experiment, whereas the steady wear region remained at 94%. The severe wear region, introduced in this round of experiments, was correctly classified 82% of the time; despite this, only 6 cases (9%) of the severe region were classified as steady wear, with the other 6 misclassified cases labelled as failure due to their proximity to the failure values. Finally, the failure region cases were accurately classified 82% of the time, again an improvement over the first experiment. From the number of cases, it can still be observed that a class imbalance could be affecting the training process.

In a third experiment, the class imbalance was addressed using a stratified undersampling technique. In the previous experiments, the datasets used for training were kept small to avoid a high computational load for the proof of concept. However, it is possible to sample more subsequences from each of the 315 cuts: for the c6 tool, up to 95 subsequences can be sampled from each cut, generating a total of 29,925 3-channel images. An undersampling strategy is suitable in this case, as the dataset is large enough to avoid losing critical features. Using strata based on the defined wear classes, each class was sampled individually, ensuring that classes such as steady state were undersampled to achieve an equal number of samples across all classes. After the undersampling, a training set of 14,000 images and a test set of 6,000 images were produced and used to train and validate a new model.
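The stratified undersampling can be sketched as follows: every class is sampled down to the size of the smallest class before the training/test split, so all classes contribute equally. The function names are illustrative.

```python
# Stratified undersampling sketch: balance classes by sampling each one
# down to the size of the least represented class.
import numpy as np

def stratified_undersample(images, labels, seed=0):
    # images: ndarray of shape (n, H, W, 3); labels: ndarray of shape (n,)
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    n_per_class = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), n_per_class, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return images[keep], labels[keep]
```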

As the size of the training set had increased considerably, the images were reduced to 256 × 256 pixels. It was also decided to move from the generic TensorFlow architecture to a more tuned one, changing the filter size from 5 × 5 to 16 × 16 in the first convolution layer and from 5 × 5 to 8 × 8 in the second. Given that the GASF images typically capture 7 complete revolutions of the tool (21 cycles of the signal, as the tool has 3 flutes), the kernel of the first convolution was set to a size of 16, which allows a complete signal cycle to be captured. This means the convolution searches for features at the flute level. The stride of the kernel was set to 4 due to the size of the image, reducing each side of the feature map to a quarter of its size. The pooling layer that follows uses a kernel of size 3, further reducing the feature map to 32 × 32. This is enough to keep the detected low-level features, which are grouped into higher level ones by the following convolution.
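The tuned architecture can be sketched by adapting the earlier model: a 16 × 16 first convolution with stride 4, a pooling kernel of 3, and an 8 × 8 second convolution, on 256 × 256 × 3 inputs. The filter counts and dense head are carried over from the earlier sketch and remain assumptions.

```python
# Sketch of the tuned CNN of the third experiment (four wear classes).
import tensorflow as tf

def build_tuned_model(n_classes=4):
    return tf.keras.Sequential([
        # 16 x 16 kernel ~ one signal cycle (one flute); stride 4 reduces
        # each side of the feature map to a quarter (256 -> 64)
        tf.keras.layers.Conv2D(64, 16, strides=4, padding="same",
                               activation="relu", input_shape=(256, 256, 3)),
        tf.keras.layers.MaxPooling2D(3, strides=2, padding="same"),  # -> 32 x 32
        tf.keras.layers.Conv2D(64, 8, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(3, strides=2, padding="same"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(384, activation="relu"),
        tf.keras.layers.Dense(n_classes),
    ])
```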

Results with the newly trained model are shown in Fig. 10, where the model is labelled M6 as it corresponds to cutter c6. Overall, M6 achieved 96.4% accuracy on the test set. When tested on c6, the classification accuracy increased for both the break-in and failure regions, to 99.7% and 97.5% respectively. The lowest accuracy was obtained in the severe region, with 92.6% of cases correctly classified.

Fig. 10
figure 10

Confusion matrices summarising the results of the M6 model (cutter c6) with four classes using the stratified undersampling technique

To understand the capabilities and limitations of the approach when a different set of data is available, a similar sampling and training process was carried out with cutters c1 and c4, generating two additional models, M1 and M4, respectively. Each of these models was validated against the same cutter as well as against the other two cutters. Accuracy results per class are shown in Fig. 11 and the overall results in Table 6. All experiments were carried out on a 2.80 GHz Intel Core i7-7600C CPU with 32 GB RAM. The average training time for one batch (100 images) is 7.6 s, so a complete epoch takes approximately 16.5 min for any model; the testing time for one sample using any model is 0.2727 s. Although training is computationally expensive, testing is not, which makes the approach applicable to real-time monitoring. Training time can be improved by using a higher specification processor or a GPU, as well as by parallelising the code and/or training one-class classifiers in parallel.

Fig. 11
figure 11

Confusion matrices summarising the accuracy results (0–100 %) for M1 (top row) and M4 (bottom row) across c1, c4 and c6 using four tool wear classes and the stratified undersampling for training/testing

Table 6 Summary of the accuracy (in %) of each model (labelled M1, M4 and M6) when validated against the same cutter and other cutters

As can be observed in Table 6, no single model works best when validated against all cutters. However, the model developed with c1 (M1) achieves the highest accuracy across the three models when validated against the other cutters (89.3% on c4 and 80.4% on c6). M1 particularly struggles to correctly classify the failure cases of c6 (see Fig. 11, first row). Looking at Fig. 9, it can be seen that c1 wears out at a very high rate during the first 20 cuts, reaching the steady state earlier than the other two cutters and developing lower tool wear after 315 cuts. This may explain why a model developed with this tool performs badly on highly worn cutters: it does not provide enough examples of the degree of wear developed by c6. Unfortunately, a more in-depth analysis of these differences in wear degradation cannot be performed, as no additional data or metadata is available regarding the conditions of the PHM 2010 experiments. These results nevertheless suggest that a better model could be built if the data of both cutters were combined in the training process.

When analysing the results obtained with M6, it was observed that this model is very good at identifying failure cases when tested on c4, correctly classifying tool failure 95.8% of the time. The model again shows a weakness in identifying the severe region (see Fig. 10). Most of the incorrectly classified cases are identified as failure, which could be explained by the abrupt change in the wear rate of c4 when approaching failure. M4 did not show particularly good results when identifying tool failure, achieving 75% and 70% accuracy when tested on c1 and c6, respectively. Interestingly, M4 is particularly good at identifying the severe region of c6, achieving 97.7% accuracy. This again highlights the importance of ensuring that a training dataset is a good representation of the search space in order to achieve generalisation.

In general, the results of the three models show the ability of the architecture to learn force patterns and relate them to wear classes. The architectural setup of the CNN used in this last experiment allowed relevant features to be found at the flute level, which is necessary for detecting the current maximum wear regardless of which flute is developing it. This is important, as it ensures that the technique can achieve good results regardless of the tool used. The accuracy obtained for particular classes shows the importance of presenting the CNN with samples representative of the whole input space during training; a more robust model would need to be enriched with data from different cutters to ensure this.

5 Comparison of the proposed approach to previous work

The proposed approach has advantages and disadvantages compared with other approaches. Making a fair comparison in terms of accuracy is not straightforward for several reasons. First, to compare against classical machine learning, the best set of features would need to be found rather than chosen arbitrarily; there is a wide range of algorithms for selecting and fusing features [58], but exploring these is out of the scope of this paper. In addition, each approach has an “ideal” parametrisation depending on the problem and the specific instantiation of the methodology, for example selecting the right number of hidden layers and nodes per layer in an ANN. For this reason, the comparison is approached differently: by describing the power of GASF as a tool to automatically encode raw signals into images, whose features are ultimately exploited by an off-the-shelf CNN implementation that outputs the different stages of wear.

Most published works on tool wear prediction or tool wear classification perform some type of specific data pre-processing, such as statistical feature selection using the mean, maximum, standard deviation and median. Wu et al., for example, use these four features across multiple sensor channels to perform tool wear prediction using ANNs, SVMs and random forests, the latter achieving the lowest root mean square error (RMSE) [9]. Zhao et al. present a deep learning approach that uses a convolutional bi-directional LSTM (CBLSTM) network for tool wear prediction; sensor signals are reduced from 200,000 measurements to 100 maximum and mean values, which are fed into the CBLSTM model. Of three different configurations of the approach, the authors report that CBLSTM with dropout achieves the lowest RMSE [42]. The main disadvantage of manual feature extraction is that, unless it is continuously re-applied to update the models, it does not consider changes in the data distribution related to either noise or the tool wear phenomenon itself, making it unreliable in some cases. An example can be seen in cutter c1. Inspecting the data of this cutter, it was found that, although the mean, maximum and median statistics generally follow the same trend, tending to increase with every cutting event, there is a peculiar change in these statistics for c1, as seen in Fig. 12. The figure shows a sudden increase in the maximum force along the x-axis (which also applies to the mean, median and standard deviation) around cutting events 225 to 250, after which the values return to their normal trend. Although the wear measurements changed little during this period (from 131.25 to 136.9 µm), the force values changed markedly. This suggests that some conditions of the experiment changed and were reflected in the sensor readings without actually being related to changes in tool wear. From the results reported by Zhao et al., it is interesting to note that the highest RMSE is obtained on cutter c1, particularly during cutting events 225 to 250. This strongly suggests a sensitivity to the maximum and mean values, as the highest errors occur during those cutting events. Although the method in [42] employs a deep learning approach, the results suggest that the model is not picking up how one measurement changes in relation to another in the time series, as a typical deep neural network would. Their dimensionality reduction averages 2,000 measurements, corresponding to nearly 7 revolutions of the tool, therefore losing the details of each individual flute. As each flute might wear at a different rate, retaining flute-level information is relevant, as it provides a better understanding of how the tool wears through time. Figure 13 shows two force samples and their corresponding GASF images between cutting events 225 and 250; visually inspecting the images, it can be inferred that the force patterns changed little during these cutting events. The GASF encoding provides the CNN with the right level of information to learn how the tool erodes at the flute level, as well as how the patterns change from one flute to another, regardless of the actual force magnitude measured. From the results shown in Fig. 11, it can be seen that M4 achieves an accuracy of 83% on the severe cases of c1. Taking into account that a third of the mean force measurements show a significant increase (Fig. 12), the CNN is still quite reliable in classifying these as severe (with only 17% classified as failure).

Fig. 12
figure 12

Maximum force in newtons (N) in the x-axis at each cutting event for cutters c1, c4 and c6

Fig. 13
figure 13

Sample images of rescaled forces in the x-axis during cutting events: a 225 and b 250. Although there is a sudden increase in the mean force during cutting events 225 to 250 (which is not visible after normalisation), the wear does not increase at the same rate. In fact, the GASF images suggest there is little change in the wear, as the force patterns are very similar

A similar comparison with the work of Wu et al. [9] is not straightforward, as their results are presented as total accuracy on the test set, with no detail of which tools were used for training and which for testing. As a result, it is not possible to determine from the reported results how the proposed approaches deal with noise or changes in the data distribution.

A current disadvantage of the GASF representation is the loss of the magnitude information of the measurements during normalisation, as the normalisation is performed on each image individually, without taking into account the maximum value over all observations. A combination of GASF and actual magnitude encoding could potentially be more effective, particularly for cases like c1, where conditions can change suddenly.

6 Conclusions and future work

This paper presents an approach to tool wear classification by means of sensory data imaging and deep learning. The GASF encoding keeps the temporal correlations for each flute, which is an advantage over classification methods based on statistical features, where flute-level information is lost. Experimental results show the ability of the CNN to capture and learn features of the raw data to correctly classify the tool wear condition. Overall, the percentage of accurately classified cases on the test set is high, in most cases above 80% when testing on a new cutter. The moment prior to the transition from critical wear to failure is in most cases correctly identified, and the incorrectly classified cases were generally labelled as failure, which from an application standpoint means the tool would still be replaced. These results show the importance of using a training sample set that represents the whole input space; in this case, the training set needs to be enriched with samples from multiple cutters to ensure the successful detection of the transition from severe wear to failure. The application of this work will allow the remaining useful life of the tool to be extended, cut quality to be improved and machining elements to be replaced before failure.

Future work will include parallelisation of the architecture and its implementation on GPUs, as well as incorporating the approach into a cloud architecture. Techniques for partially retraining the architecture will also be explored to study its adaptation capabilities when new data becomes available. Additional work will include experimenting with more input channels on the GASF image to feed in multiple sensor data and improve classification accuracy. Finally, further enhancements to the encoding technique, such as incorporating magnitude information, will be investigated.