1 Introduction

An electrocardiogram (ECG) is a non-invasive examination that is widely used to reveal potential heart conditions: the electrical activity of the patient's heart is recorded directly and reflects its rhythm status. The ECG is appropriate for the diagnosis and treatment of various types of heart disease [9], and its automatic classification can provide an objective diagnosis and reduce diagnosis time [6]. Therefore, the detection and classification of ECG signals are of great clinical significance and also help to advance clinical research on cardiovascular diseases [8, 15]. Various feature classifiers have been developed to automatically detect ECG signals [27]. However, most methods extract features manually and then classify heartbeats with traditional classifiers [18], so achieving high accuracy requires considerable time to find and compute the best combination of features.

Deep learning has achieved considerable success in computer vision [30] owing to its robust feature extraction ability. In particular, the convolutional neural network (CNN) is the most widely used deep learning model [2, 13, 16, 19], attaining robust results in medical imaging, gene recognition, speech processing, sleep apnea detection and other applications [4, 28]. However, in ECG classification and detection, existing methods still exhibit three shortcomings: (1) the complexity of algorithms for QRS waves; (2) the complex rhythmic or morphological changes of irregular heartbeats, which make ECG feature recognition difficult; and (3) the need for large training samples and long training time to achieve the desired recognition accuracy [1, 22]. Given these problems in automatic ECG classification, this study proposes an automated arrhythmia classification approach based on a representative CNN whose initial parameters are optimized by differential evolution (DE).

The contributions of this study are as follows: (1) applying deep learning to ECG signal detection and classification, which removes the manual feature extraction required by traditional machine learning methods; (2) using the DE algorithm to optimize the CNN parameters, which improves the CNN's classification accuracy and training time; (3) enabling more accurate and automatic detection of arrhythmia with deep learning.

The remainder of this paper is organized as follows. Section 2 describes the studies related to CNN in ECG recognition. Section 3 presents the methodology of our proposed one-dimensional CNN (1D-CNN) and its optimization method, and Sect. 4 describes the experimental setup. Section 5 analyses and discusses the experimental results. Section 6 presents the conclusions and future work.

2 Background

Early detection is beneficial for diagnosing and treating various heart diseases and ensures a high survival rate of patients [5]. Examples of heart diseases are atrial premature beat, ventricular premature beat, left bundle branch block (LBBB) and right bundle branch block (RBBB). Many effective methods for automatically detecting cardiovascular diseases (CVDs) from ECG signals have been developed in the past few decades. For example, in the automatic detection of atrial fibrillation (AF), several detectors identify CVDs based on the absence of the P wave or the variation of the RR interval (the interval between successive R peaks, which mark the beginning of ventricular depolarization). Dash et al. proposed an automatic AF detection algorithm based on the randomness, variability and complexity of the heartbeat interval time series. Lian et al. presented an AF detection algorithm based on RR interval scatter plots and their variations [11]. In the method of [11], morphological features are extracted by a wavelet transform and independent component analysis. The wavelet features consist of fourth-order approximation coefficients and third- and fourth-order detail coefficients, while the features extracted by independent component analysis comprise a set of independent source signals recovered from the observed samples [14].

Although various feature classifiers have been developed, most involve manual extraction and traditional classification. These methods require considerable time to determine the best feature combination for high accuracy. Moreover, feature extraction in ECG signal processing requires expert knowledge of digital signal processing, so feature extraction or selection presents a challenge to researchers inside and outside the medical field. To overcome this drawback, several researchers have begun using neural networks to extract heartbeat features automatically. Escalona-Morán et al. presented a classification method based on the convolution of 2D heartbeats [10]: a series of three adjacent beats is converted into a 2D coupling matrix, enabling the convolution filter to capture the continuous waveform of adjacent heartbeats and the correlation between beats. This method achieved a final sensitivity of 76.8%, with positive predictive values of 74.0% for supraventricular ectopic beats (SVEB) and 93.8% for ventricular ectopic beats (VEB).

Deep learning, which exhibits strong feature extraction capability, has achieved considerable success in computer vision in recent years. In particular, the CNN model is the most widely used deep learning model, demonstrating robustness in its applications in medical imaging, gene recognition, speech processing, sleep apnea detection and other aspects. Accordingly, the use of deep neural networks (DNN) to automatically detect ECG signals has gained research interest. Salem et al. developed an automatic AF detection method based on CNN [23], where AF features are automatically learned and applied to the classification module. This method simplifies the feature extraction without requiring expert feature engineering to determine the suitability and criticality of features.

Bhagyalakshmi et al. developed a genetic bat-assisted support vector neural network (GB-SVNN) to classify ECG arrhythmia [3], obtaining a final accuracy of 0.9696 and sensitivity of 0.99. Zhang et al. proposed a multi-scale CNN (MCNN) that performs a timescale transformation of the input signals and detects AF from the transformed input [31], finding that network depth strongly correlates with detection performance.

Although the above methods are experimentally effective for specific CVD detection problems, their good performance typically relies on carefully selected clean data or a small number of subjects, so their applicability may be limited. Achieving a generalization capability that reliably detects CVDs from limited single-lead ECG records therefore remains a considerable challenge. Such generalization depends on how the CNN model is trained and usually requires a longer training time. In the present study, we propose using DE to optimize the initial weights of the 1D-CNN to ensure its generalization and reduce its training time.

3 Methodology

This section covers (1) the data source and signal preprocessing, (2) the implementation of the 1D-CNN and (3) the optimization of the 1D-CNN by the standard differential evolution (SDE) algorithm. These three aspects introduce the implementation and optimization methods for ECG classification. The specific process is shown in Fig. 1.

Fig. 1
figure 1

Flow chart of ECG classification and recognition based on CNN

Figure 1 shows the entire process of ECG classification with the CNN, which starts from preprocessing the hexadecimal raw ECG files. After preprocessing, we obtain intuitive waveforms and extract the heartbeat segments to build the data set. With sufficient data extracted from the databases, the data set is divided into a 75% training set and a 25% test set. We then train the CNN repeatedly and obtain the required classification results.

3.1 Data source and signal preprocessing

The experimental data for this study are single-lead ECG signals from the MIT-BIH arrhythmia and Sudden Cardiac Death Holter (SCDH) databases [12], which are used to evaluate the proposed method. The MIT-BIH database contains ECG records sampled at 360 Hz in 30-min units, and the SCDH contains records sampled at 250 Hz. Unlike the equal-length records in MIT-BIH, the record lengths in the SCDH vary: for example, record No. 30 spans approximately 24 h, whereas record No. 52 spans approximately 7 h. This study needs data organized by heartbeat, so we extract the heartbeats from the records before the experiment. The process is divided into signal preprocessing and heartbeat location and capture.

3.1.1 Signal preprocessing

The ECG signals are preprocessed so that the experimental data retain only the frequencies and characteristics related to arrhythmia. Such preprocessing also eliminates noise and other interference and thus isolates the waveform mode. Accurate heartbeat capture therefore requires preprocessing of the ECG signals.

The preprocessing consists of four steps: band-pass filtering, ‘double slope’ preprocessing, waveform smoothing and window sliding. In signal filtering, the ECG signal is filtered by a 40-order FIR band-pass filter with a passband of 15–25 Hz, as suggested in the literature [25]. After band-pass filtering, the ‘double slope’ preprocessing is applied to make the waveform more prominent. Then, waveform smoothing is performed with a low-pass filter with a cut-off frequency of 5 Hz. Finally, a sliding window is applied to the waveform to increase its amplitude and smooth it further. The width of the sliding window is set to 17 sampling points.
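To make the pipeline concrete, the following is a minimal Python sketch of the four steps (assuming numpy/scipy; the ‘double slope’ step here is a hypothetical variant of slope-based QRS enhancement, since the paper does not give its exact formula):

```python
import numpy as np
from scipy.signal import firwin, lfilter

FS = 360  # MIT-BIH sampling rate in Hz; SCDH records use 250 Hz

def preprocess(ecg, fs=FS):
    """Band-pass -> 'double slope' -> smoothing -> sliding-window integration."""
    # 1) 40-order FIR band-pass filter with a 15-25 Hz passband (41 taps)
    bp = firwin(41, [15, 25], pass_zero=False, fs=fs)
    x = lfilter(bp, 1.0, ecg)

    # 2) 'Double slope': emphasize steep QRS flanks by combining the largest
    #    upward and downward slopes around each sample (assumed formulation)
    n = max(int(0.015 * fs), 1)  # ~15 ms slope window; assumed value
    y = np.zeros_like(x)
    for i in range(n, len(x) - n):
        left = (x[i] - x[i - n:i]).max() / n
        right = (x[i] - x[i + 1:i + n + 1]).max() / n
        y[i] = max(left + right, 0.0)

    # 3) Waveform smoothing with a 5 Hz low-pass filter
    lp = firwin(41, 5, fs=fs)
    y = lfilter(lp, 1.0, y)

    # 4) Sliding-window integration over 17 sampling points
    return np.convolve(y, np.ones(17) / 17, mode='same')
```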

Figure 2 compares the waveforms before and after preprocessing of the raw ECG data. After preprocessing, the original ECG signal shows single-mode peaks, each corresponding to a QRS wave. Compared with the original signal, the preprocessed signal is easier to locate and detect, thereby attaining the purpose of preprocessing.

Fig. 2
figure 2

Waveform comparison of record No. 100 in MIT-BIH before and after preprocessing

3.1.2 Heartbeat location and capture

Adaptive double threshold for QRS peak location

This study uses the double-threshold method [26] for the QRS peak detection logic. The detection starts when a wave peak exceeds the low threshold. The thresholds are lowered when the peak lies between the high and low thresholds and raised when the peak exceeds the high threshold. A peak below the low threshold is assumed to be noise.
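A minimal sketch of this adaptive logic in Python follows; the `decay` and `grow` factors are assumed values for illustration, not taken from the paper:

```python
import numpy as np

def detect_qrs(sig, th_low, th_high, decay=0.9, grow=1.1):
    """Adaptive double-threshold peak detection (sketch of the logic above)."""
    peaks = []
    for i in range(1, len(sig) - 1):
        # candidate peak: local maximum above the low threshold
        if sig[i] > th_low and sig[i] >= sig[i - 1] and sig[i] >= sig[i + 1]:
            peaks.append(i)
            if sig[i] > th_high:
                # confident detection: raise both thresholds
                th_low, th_high = th_low * grow, th_high * grow
            else:
                # weak detection (between thresholds): lower the thresholds
                th_low, th_high = th_low * decay, th_high * decay
        # peaks below the low threshold are treated as noise and ignored
    return peaks
```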

The sensitivity (SE) and positive predictive rate (P +) are used to evaluate the QRS detection algorithm. The evaluation indicators are defined as follows [21]:

$$\left\{\begin{array}{c}SE=\frac{TP}{TP+FN}\\ P+=\frac{TP}{TP+FP}\end{array}\right. ,$$
(1)

where \(TP\) represents the number of correctly detected beats, \(FN\) the number of missed beats and \(FP\) the number of falsely detected beats.

Heartbeat capture

Based on the located QRS peak, 100 sampling points are taken to its left and 150 sampling points to its right, so a single heartbeat with a total length of 250 sampling points is intercepted.
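A short sketch of this segmentation step (assuming numpy arrays and the peak indices from the detector above):

```python
import numpy as np

def capture_beats(sig, r_peaks, left=100, right=150):
    """Cut one 250-sample heartbeat around each located QRS peak:
    100 points to the left, 150 to the right, as in the text."""
    beats = []
    for r in r_peaks:
        if r - left >= 0 and r + right <= len(sig):
            beats.append(sig[r - left:r + right])
    return np.asarray(beats)  # shape: (n_beats, 250)
```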

The manually annotated type codes of the various heartbeats are stored in the MIT-BIH and SCDH arrhythmia databases. The heartbeat types comprise the normal beat (Normal), left bundle branch block beat (LBBB), right bundle branch block beat (RBBB), premature ventricular contraction (PVC), aberrated atrial premature beat (Aab), left or right bundle branch block (BBB), R-on-T premature ventricular contraction (RONT), nodal (junctional) premature beat (NPC) and premature or ectopic supraventricular beat (SVPB).

Tables 1 and 2 show the final heartbeat statistics. LBBB, RBBB, PVC and Normal are the four categories with the largest sample sizes among the 41 heartbeat types in the MIT-BIH arrhythmia database. In the SCDH database, Normal, PVC, SVPB and BBB are the four categories with the largest sample sizes. For the accuracy evaluation, we select these two groups of heartbeats as the training and testing data of the 1D-CNN.

Table 1 Summary of heartbeat data in MIT-BIH
Table 2 Summary of heartbeat data in SCDH

3.2 Implementation of 1D-CNN

We use a 1D-CNN because it is commonly applied to classify electrodermal activity signals and works similarly to the two-dimensional CNN (2D-CNN) [24]. The 1D-CNN receives lower-dimensional input elements, which leads to a simpler architecture [24]. Therefore, this study adopts the 1D-CNN model as the classifier according to the characteristics and data scales of the ECG signals in the MIT-BIH and SCDH arrhythmia databases. Designed on the basis of the 2D-CNN and its backpropagation (BP) training, the 1D-CNN predicts and classifies the ECG signals of different patients. This section briefly reviews the above process and explains the modifications needed to transform a 2D-CNN into a 1D-CNN.

The convolutional and pooling layers are stacked alternately to simplify the CNN and its network parameters, as shown in Table 3.

Table 3 Parameter design of 1D-CNN network

3.2.1 Forward propagation

Forward propagation is divided into four situations, as follows:

  1. (1)

    Forward propagation from the input layer to the convolution layer

In the CNN, forward propagation of the input layer is the first step. Generally, the input layer connects to a convolution layer, and the dimension of the convolution kernel corresponds to the input dimensionality. For example, if the input is a black-and-white image, then the corresponding convolution kernel is a 2D square matrix.

  2. (2)

    Forward propagation from the hidden layer to the convolution layer

The forward propagation of the hidden layer to the convolution layer is very similar to that of the input layer to the convolution layer. The only difference is that the input is from the hidden layer rather than the matrix formed by the original image samples.

  3. (3)

    Forward propagation from the hidden layer to the pooling layer

The processing logic of the pooling layer is relatively simple: its purpose is only to reduce and summarize the input matrix. For example, if the input is an N × N square matrix and the pooling size is k × k, then the size of the output matrix is \(\frac{N}{k}\times \frac{N}{k}\). The two pooling layers of the 1D-CNN use average pooling with sizes of 5 and 3, respectively (see the sketch at the end of this subsection).

  4. (4)

    Forward propagation from the hidden layer to the fully connected layer

The forward propagation from the hidden layer to the fully connected layer puts the multiple outputs together as the input of the fully connected layer. For example, if the outputs of the hidden layers are three N × N square matrices, then the input length of the fully connected layer is 3 × N × N. After computing the fully connected layer, the final output is obtained by applying the \(softmax\) activation function.
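A minimal numpy sketch of the two layer types just described (hypothetical helper names; `W` and `b` stand for the fully connected layer's weight matrix and bias):

```python
import numpy as np

def avg_pool1d(x, k):
    """Non-overlapping 1D average pooling: length N -> length N // k.
    The two pooling layers of this 1D-CNN use k = 5 and k = 3 (Sect. 3.2)."""
    n = len(x) // k
    return x[:n * k].reshape(n, k).mean(axis=1)

def fc_softmax(feature_maps, W, b):
    """Fully connected layer plus softmax: the hidden-layer outputs are put
    together into one vector, passed through the linear layer and turned
    into class probabilities."""
    x = np.concatenate([np.ravel(f) for f in feature_maps])
    z = W @ x + b
    e = np.exp(z - z.max())  # numerically stable softmax
    return e / e.sum()
```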

3.2.2 Backpropagation

BP uses the gradient descent algorithm to update the weights and offsets of the CNN. Since the output error must be propagated backward step by step, each layer must keep the intermediate error variables (delta) and the derivative of its activation function.

3.2.3 Update the weights and offsets of each layer

The update is divided into two situations: the hidden layers and the fully connected layer. In the first situation, given that the pooling layer has no weights or biases, only the convolutional layer is updated.

3.2.4 Changes needed for 1D-CNN implementation

The most crucial difference between the 1D-CNN and the 2D-CNN is that the weights and the input and output elements are 1D vectors rather than matrices. Therefore, the matrix operations must be changed accordingly: convolution, pooling, full connection and error calculation all need to be reformulated for the 1D case. At the same time, \(conv2D\) in the forward propagation and backpropagation must be changed to \(conv1D\).
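For illustration, a ‘valid’ 1D convolution replacing \(conv2D\) could look as follows (a sketch, not the paper's Matlab implementation):

```python
import numpy as np

def conv1d(x, kernel):
    """'Valid' 1D convolution: both the input and the kernel are 1D vectors.
    Equivalent to np.convolve(x, kernel, mode='valid')."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel[::-1])  # flipped kernel
                     for i in range(len(x) - k + 1)])
```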

3.3 Optimization of 1D-CNN based on SDE

The DE algorithm is another excellent optimizer alongside genetic algorithms, particle swarm optimization and other evolutionary algorithms. It has a simple structure, few control parameters, easy real-valued coding and a fast convergence speed, which has been proven theoretically [20]. According to how the parameters are set, DE can be divided into SDE and adaptive DE. Considering algorithm complexity and running-time cost, this study selects the SDE algorithm, which has a relatively simple structure, to optimize the initial parameters of the 1D-CNN. In SDE, the vector containing the D optimization variables is called an individual and the \(i\mathrm{th}\) individual is expressed as:

$${X}_{i,G}=\left[{x}_{1,i,G},\cdots ,{x}_{j,i,G},\cdots ,{x}_{D,i,G}\right] ,$$
(2)

where \(i=1,2,\cdots ,NP\), \(NP\) is the population size, \(G\) is the evolution generation and \(j\) indexes the \(j\mathrm{th}\) optimization variable.

Figure 3 shows the optimization flow chart of the SDE for 1D-CNN and its basic operations include initialization, mutation, crossover and selection. Individuals are randomly generated in the search space by initialization and new individuals are generated by mutation and crossover. The selection determines the individuals entering the next generation. This process is repeated until the termination condition is reached.

Fig. 3
figure 3

Flow chart of SDE in optimizing the initial weights of the 1D-CNN

3.3.1 Mutation

When the population evolves to the \(G\)th generation, the mutation operation is carried out on the parent individual \({X}_{i,G}\) to obtain the mutated individual:

$${V}_{i,G+1}={X}_{r1,G}+F\cdot \left({X}_{r2,G}-{X}_{r3,G}\right) .$$
(3)

The subscripts \(r1\), \(r2\) and \(r3\) are mutually different integers randomly selected between 1 and \(NP\) and different from \(i\). \({X}_{r1,G}\) is called the base vector, \(\left({X}_{r2,G}-{X}_{r3,G}\right)\) is the difference vector and \(F\) is the mutation operator. If a parameter of the mutated individual exceeds the boundary, its value is replaced by the boundary value.

The mutation strategy used by the SDE is usually called DE/rand/1, where ‘rand’ means that the base vector is randomly selected from the population and ‘1’ is the number of difference vectors. The DE/rand/1 strategy has good global convergence but the disadvantage of a slow convergence speed.
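A minimal sketch of DE/rand/1 in Python (the boundary values lo and hi are assumed; Sect. 4.2 initializes weights in (−1, 1)):

```python
import numpy as np

def mutate(pop, i, F, lo=-1.0, hi=1.0):
    """DE/rand/1 mutation (Eq. 3): V = X_r1 + F * (X_r2 - X_r3), with r1, r2,
    r3 mutually different and different from i; out-of-range entries are
    replaced by the boundary value as described in the text."""
    NP = len(pop)
    r1, r2, r3 = np.random.choice([j for j in range(NP) if j != i],
                                  size=3, replace=False)
    v = pop[r1] + F * (pop[r2] - pop[r3])
    return np.clip(v, lo, hi)
```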

3.3.2 Crossover

The trial individuals generated by the crossover operation are as follows:

$${U}_{i,G+1}=\left[{u}_{1,i,G+1},\cdots ,{u}_{j,i,G+1},\cdots ,{u}_{D,i,G+1}\right] ,$$
(4)
$${u}_{j,i,G+1}=\left\{\begin{array}{ll}{v}_{j,i,G+1}&\mathrm{if}\;{r}_{j}\left[0,1\right)\le CR\;\mathrm{or}\;j=r\left(i\right)\\{x}_{j,i,G}&\mathrm{otherwise}\end{array}\right. ,$$
(5)

where \({r}_{j}\left[0,1\right)\) is the random number drawn for the \(j\)th variable and \(CR\) is the crossover operator. \(r(i)\) is a randomly selected integer between 1 and \(D\), which ensures that \({U}_{i,G+1}\) obtains at least one variable from \({V}_{i,G+1}\).
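A sketch of this binomial crossover, continuing the mutation sketch above:

```python
import numpy as np

def crossover(x, v, CR):
    """Binomial crossover (Eq. 5): each variable comes from the mutant v with
    probability CR; one index r(i) is forced so that at least one variable
    of the trial vector is taken from v."""
    D = len(x)
    mask = np.random.rand(D) < CR
    mask[np.random.randint(D)] = True  # the forced index j == r(i)
    return np.where(mask, v, x)
```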

3.3.3 Selection

For the minimization problem, the individual with the smaller objective function value is selected from the trial individual \({U}_{i,G+1}\) and the parent individual \({X}_{i,G}\) to enter the population of the next generation:

$${X}_{i,G+1}=\left\{\begin{array}{ll}{U}_{i,G+1}&\mathrm{if}\;f\left({U}_{i,G+1}\right)<f\left({X}_{i,G}\right)\\{X}_{i,G}&\mathrm{otherwise}\end{array}\right. ,$$
(6)

where \(f\left(X\right)\) represents the objective function (written as \(f\) to avoid confusion with the mutation operator \(F\)). In the experiments below, we use the accuracy of heartbeat type prediction to evaluate the optimization effect of the model; thus, the positive prediction rate P + of the 1D-CNN is used as the fitness function, to be maximized.
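Combining the three operators, one SDE generation can be sketched as follows (using the `mutate` and `crossover` sketches above; since the P + fitness is maximized here, the comparison of Eq. (6) is flipped):

```python
def evolve(pop, fitness, F=0.5, CR=0.9):
    """One SDE generation. `fitness` evaluates a chromosome, e.g. the P+ of
    the 1D-CNN initialized with these weights; higher is better."""
    scores = [fitness(x) for x in pop]
    for i in range(len(pop)):
        u = crossover(pop[i], mutate(pop, i, F), CR)
        fu = fitness(u)
        if fu > scores[i]:  # greedy selection (Eq. 6, maximization form)
            pop[i], scores[i] = u, fu
    return pop, scores
```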

3.3.4 Control parameters

The control parameters of the SDE mainly include the population size NP, the mutation operator F and the crossover operator CR, which remain unchanged during the evolution [7]. Based on experience, the control parameters of the SDE are selected according to the rules below (a configuration sketch follows the list).

  1. (1)

    Population size. According to experience, the population size is commonly set between \(5D\) and \(10D\) to ensure that the algorithm has enough different mutation vectors.

  2. (2)

    Mutation operator. The mutation operator \(F\in \left[0,2\right]\) is a real constant factor that determines the amplification of the difference vector. The size of \(F\) is negatively related to the convergence rate; if the population converges prematurely, \(F\) should be increased. In this study, the initial value of \(F\) is set to 0.5.

  3. (3)

    Crossover operator. The crossover operator \(CR\in [0,1]\) is a real constant factor whose size is positively related to the convergence speed. Its initial value is set to 0.9 to prevent overfitting.

  4. (4)

    Maximum evolutionary generation. The initial value of this parameter is 30, and control experiments with 30 and 50 generations are carried out to ensure reliability while shortening the experimental time.
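Collecting these rules, the control parameters used in this study can be summarized in one configuration sketch (values taken from the text; Sect. 4.2 sets NP = 30):

```python
# SDE control parameters (fixed during the evolution); values from the text.
SDE_PARAMS = {
    "NP": 30,     # population size (rule of thumb 5D-10D; 30 used in Sect. 4.2)
    "F": 0.5,     # mutation operator, F in [0, 2]
    "CR": 0.9,    # crossover operator, CR in [0, 1]
    "G_max": 30,  # maximum evolutionary generations (30 or 50 in experiments)
}
```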

3.3.5 Conversion of weight into the chromosome representation

A matrix reshaping operation splits the organizational structure of the corresponding weights in the 1D-CNN, which are then flattened and mapped into the chromosome representation of DE. The structure mapping from chromosome to weights is shown in Fig. 4, where \({\widehat{\mathrm{X}}}_{\mathrm{s}*d}\) represents the mapping of the chromosome to a weight, the subscript s represents the size of the weight and d represents the weight dimension.

Fig. 4
figure 4

Mapping from chromosome to weights
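A minimal sketch of this two-way mapping (hypothetical helper names; `shapes` lists the shape of each weight array of the 1D-CNN):

```python
import numpy as np

def weights_to_chromosome(weights):
    """Flatten every weight array of the 1D-CNN into one chromosome vector."""
    return np.concatenate([w.ravel() for w in weights])

def chromosome_to_weights(chrom, shapes):
    """Split the chromosome back into arrays of the original shapes
    (the mapping of Fig. 4)."""
    out, pos = [], 0
    for s in shapes:
        n = int(np.prod(s))
        out.append(chrom[pos:pos + n].reshape(s))
        pos += n
    return out
```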

4 Experimental setup

In this section, we introduce ECG detection and classification before and after using SDE to optimize the initial weights of 1D-CNN.

4.1 Experimental process before DE optimization

4.1.1 Simulation

The experiments are simulated on the MatlabR2017b platform. All records in the MIT-BIH and SCDH databases are two-lead ECG records, but our simulation experiment only needs the ECG signal of the first lead and the labels marking the corresponding heartbeats.

The most significant advantage of the CNN over traditional machine learning lies in feature extraction and selection, because it automatically acquires features from large data sets without manual design and selection. The data of the input layer take the QRS peak as the midpoint, and each heartbeat is sampled to 250 points around it so that the network learns the morphological structure of the heartbeat waveform.

In the MIT-BIH arrhythmia database, the four types of manually labeled beats, Normal, LBBB, RBBB and PVC, account for over 88% of the total heartbeats in the whole database. To ensure the training effect of the CNN and improve the classification accuracy, we use these four heartbeat types as the first group of data to train the CNN for ECG recognition. For the same reason, the four types of beats in the SCDH database, namely, Normal, PVC, SVPB and BBB, are used to train the 1D-CNN.

The experimental process is as follows:

  1. (1)

    Binary files in the MIT-BIH and SCDH databases are decoded and read based on records.

  2. (2)

    The 48 records of the MIT-BIH arrhythmia database and the 23 records of the SCDH database are scanned globally using the manual labels to extract Normal, LBBB, RBBB and PVC beats from the former and Normal, PVC, SVPB and BBB beats from the latter.

  3. (3)

    From each of the four data types obtained in Steps 1 and 2, 5000 records are taken from MIT-BIH and 1000 records from SCDH and combined into a data set. At the same time, the corresponding label set is constructed.

  4. (4)

    The data set processed in Step 3 is randomly divided into a 75% training set and a 25% test set.

  5. (5)

    According to the parameter settings in Table 3, a 7-layer CNN framework is designed. The initial learning rate is set to 0.01, the batch size to 16 and the epoch to 30.

  6. (6)

    CNN is established and initialized.

  7. (7)

    According to the batch size set in Step 5, the input matrices are constructed and fed into the CNN, and training is started.

4.1.2 Algorithm performance analysis

Evaluation index setting

  1. (1)

    Confusion matrix

The confusion matrix presents the classification results in matrix form: each column represents a predicted category, whose total gives the number of predictions in it, and each row represents an actual category, whose total gives the number of actual instances.

  2. (2)

    Average accuracy

Given that the training and test samples are randomly selected from Tables 1 and 2, each classification result may contain contingencies. The average accuracy evaluates the classification performance of the algorithm and avoids the impact of such contingency on the analysis of results. Ten independent experiments are repeated for the classification algorithm, with 30 epochs per experiment, and the average classification accuracy over the 10 runs is calculated.

  3. (3)

    Loss function

Cross-entropy is used as the loss function to evaluate the gap between the predictions and the labels after each training pass. The simplified formula of cross-entropy is:

$$Loss=-\sum_{i=1}^{n}{y}_{i}{\mathrm{log}}_{2}\left({a}_{i}\right),$$
(7)

where \({y}_{i}\) is the expected output and \({a}_{i}\) is the actual output of the neuron.
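A direct implementation of Eq. (7), with a small epsilon added to avoid log(0) (the epsilon is an implementation detail, not from the paper):

```python
import numpy as np

def cross_entropy(y, a, eps=1e-12):
    """Eq. (7): Loss = -sum_i y_i * log2(a_i), where y is the expected output
    (one-hot label) and a the actual output of the softmax layer."""
    return -np.sum(y * np.log2(np.clip(a, eps, 1.0)))
```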

Performance evaluation

The following experimental results are obtained on the MIT-BIH test set under different activation functions according to the above process. Tables 4, 5 and 6 show the classification results of the 1D-CNN described in Sect. 3.2. These results are the averages of 10 independent experiments.

Table 4 Confusion matrix of 5000 sample test set with sigm as the activation function
Table 5 Confusion matrix of 5000 sample test set with tanh as the activation function
Table 6 Confusion matrix of 5000 sample test set with ReLU as the activation function

The results in Table 4, with sigm as the activation function, show an average classification accuracy of 96.0%, derived from per-class accuracies of Normal 99.8%, LBBB 97.5%, RBBB 97.4% and PVC 89.5%.

The results in Table 5, with tanh as the activation function, show an average classification accuracy of 94.3%, derived from per-class accuracies of Normal 99.5%, LBBB 93.3%, RBBB 97.3% and PVC 87.1%.

The results in Table 6, with ReLU as the activation function, show an average classification accuracy of 98.9%, derived from per-class accuracies of Normal 99.9%, LBBB 99.2%, RBBB 99.5% and PVC 96.9%.

The following experimental results are obtained on the SCDH test set under different activation functions according to the above process. Tables 7, 8 and 9 show the classification results of the 1D-CNN described in Sect. 3.2. These results are the averages of 10 independent experiments.

Table 7 Confusion matrix of 1000 sample test set with sigm as activation function
Table 8 Confusion matrix of 1000 sample test set with tanh as activation function
Table 9 Confusion matrix of 1000 sample test set with ReLU as activation function

The above results on the two databases show that the 1D-CNN described in Sect. 3.2 achieves good ECG recognition with 30 epochs, except when using the sigm function on the SCDH database. Comparing the three activation functions, ReLU yields the highest accuracy, reaching 98.9% and 89.3% on the two databases, respectively.

4.2 Experimental process after DE optimization

4.2.1 Simulation

The simulation experiment based on the SDE is implemented on MatlabR2017b. The optimization objective is the set of initial parameters of the 1D-CNN model built in Sect. 3.2. The 1D-CNN is still trained on the two-lead ECG records from the MIT-BIH and SCDH databases with the corresponding heartbeat labels. Given the vast amount of computation needed, the ECG data set is reduced in this section.

For both the MIT-BIH and SCDH arrhythmia databases, 1000 samples are taken for each of the four signal types, giving 4000 heartbeat samples in total, of which 75% are selected as the training set and 25% as the testing set [29].
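A small sketch of this random 75/25 split (assuming numpy arrays of beats and labels; the seed is an assumption for reproducibility):

```python
import numpy as np

def split_dataset(beats, labels, train_frac=0.75, seed=0):
    """Random 75/25 split of the 4000-beat data set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(beats))
    n_train = int(train_frac * len(beats))
    tr, te = idx[:n_train], idx[n_train:]
    return beats[tr], labels[tr], beats[te], labels[te]
```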

The experimental process is as follows:

  1. (1)

    A data set with 4000 heartbeats in total is imported and the corresponding label set is built.

  2. (2)

    The data set is randomly divided into a 75% training set and a 25% test set.

  3. (3)

    According to the parameter design in Sect. 3.2, a 1D-CNN is established and initialized, and the parameters of the SDE are set: the population size is set to 30, the maximum evolutionary generation to 30 or 50, the initial value of the mutation operator to 0.5 and the initial value of the crossover operator to 0.9.

  4. (4)

    The total number of parameters to be optimized is calculated and the parameter vector, called the chromosome, is generated.

  5. (5)

    According to the data in Step 4, the population elements are initialized to random numbers in (−1, 1), a population random matrix of size [30, 732] is generated and the corresponding fitness values are initialized.

  6. (6)

    The chromosome variables in DE are split and transformed into the organizational structure of the corresponding weights in the 1D-CNN and then assigned to those weights.

  7. (7)

    The optimal fitness and the optimal parameter vector of the population in this iteration are calculated through mutation, crossover and selection. In addition, the positive prediction rate P + of the 1D-CNN is used as the fitness function, as mentioned in Sect. 3.3.3.

  8. (8)

    Step 7 is repeated until the end of the iteration. The final fitness and parameter vector are recorded and the parameter vector is split and assigned to 1D-CNN.

4.2.2 Optimization analysis of standard differential evolution algorithm

Given that the fitness function is the positive prediction rate of the 1D-CNN, it can be used to analyse the optimization performance of the algorithm. To keep the problem general, no parameters are adjusted other than the mutation operator \(F\) and the crossover operator \(CR\).

  1. (1)

    Optimization analysis based on 30 evolutionary generations

According to the literature, mutation operator values below 0.4 or above 1.2 are only occasionally effective [17]. Therefore, in the experiment, we choose F = 0.5 and \(CR\) = 0.9 as the initial values and test the optimization results step by step with the parameter combinations F = 0.5, 0.8, 1.0, 1.2 and \(CR\) = 0.9, 0.8.

Although the final evaluation standard is the P + result of the 1D-CNN, the average accuracy is still used to evaluate the classification performance of the algorithm. Ten independent experiments are carried out with each DE parameter combination, with 10 epochs per experiment. Table 10 shows the optimization results on the MIT-BIH arrhythmia database, reported as the average P + over the 10 repeated experiments. Table 11 shows the corresponding optimization results on the SCDH database, also averaged over 10 repeated experiments.

Table 10 Statistics of SDE optimization results based on 30 generations with MIT-BIH
Table 11 Statistics of SDE optimization results based on 30 generations with SCDH
  2. (2)

    Optimization analysis based on 50 evolutionary generations

Consistent with the parameter settings for 30 evolutionary generations, F = 0.5 and CR = 0.9 are selected as the initial values and the optimization results are tested step by step with the parameter combinations F = 0.5, 0.8, 1.0, 1.2 and CR = 0.9, 0.8. Tables 12 and 13 show the corresponding optimization results on the MIT-BIH and SCDH arrhythmia databases.

Table 12 Statistics of SDE optimization results based on 50 generations with MIT-BIH
Table 13 Statistics of SDE optimization results based on 50 generations with SCDH

As shown in the figures below, for 30 generations, Fig. 5 shows the fitness lines of the different SDE parameter configurations under the three activation functions on the MIT-BIH database and Fig. 6 shows the corresponding fitness lines on the SCDH database. Similarly, for 50 generations, Fig. 7 shows the fitness lines of the different DE parameter combinations under the three activation functions on the MIT-BIH database and Fig. 8 shows the fitness lines on the SCDH database.

Fig. 5
figure 5

Fitness line based on 30 generations with MIT-BIH. a sigm, b tanh, c ReLU

Fig. 6
figure 6

Fitness line based on 30 generations with SCDH. a sigm, b tanh, c ReLU

Fig. 7
figure 7

Fitness line based on 50 generations with MIT-BIH. a sigm, b tanh, c ReLU

Fig. 8
figure 8

Fitness line based on 50 generations with SCDH. a sigm, b tanh, c ReLU

The four parameter configurations are represented by four colors in the figures. Comparing the fitness lines measured by P + between 30 and 50 generations shows further improvements in convergence at later generations. For example, with the sigm activation function, parameter configuration 4 improves convergence at 48 generations, whereas the remaining configurations have already converged. Configuration 1 shows improved convergence for tanh, while configuration 2 shows improved convergence for ReLU at later generations. The optimal fitness associated with each activation function is thus contributed by different parameter configurations under different maximum generations.


5 Results and comparative analysis of 1D-CNN before and after optimization

According to the statistics of the SDE optimization results in Tables 10, 11, 12 and 13, the optimal SDE parameter configurations associated with each activation function for 30 and 50 generations are used to determine the initial weights of the 1D-CNN. Then, the performances of the optimized and unoptimized 1D-CNNs at the selected epochs are recorded. Tables 14 and 15 show the P + of the 1D-CNN at different training stages before and after optimization by the SDE for 30 and 50 generations on the MIT-BIH and SCDH databases, respectively. The better results between the optimized and unoptimized 1D-CNNs are bolded.

Table 14 P + comparison of different stages before and after DE with MIT-BIH
Table 15 P + comparison of different stages before and after DE with SCDH

Referring to Tables 14 and 15, comparing the unoptimized 1D-CNN after 30 epochs of training with the optimized 1D-CNN after 10 epochs shows that the latter is equally good or better regardless of the number of generations. The reason is that the optimized initial parameters of the 1D-CNN have structural features, so when they are used in training, the gradient descent algorithm in BP converges faster on the loss function, resulting in fewer training epochs and higher accuracy. The results in Tables 14 and 15 also show that the final optimization result of 50 generations is slightly better than that of 30 generations under the same parameters. Therefore, the initial weights produced by the optimal parameters of 50 generations are used to analyse the processing time of the SDE-optimized 1D-CNN. With a total of 4000 samples, Tables 16 and 17 show the processing times of the optimized and unoptimized 1D-CNNs at the selected epochs on the two databases. The results are the averages of 10 independent experiments.

Table 16 Time statistics of different stages before and after DE with MIT-BIH
Table 17 Time statistics of different stages before and after DE with SCDH

On the MIT-BIH and SCDH databases, the accuracy of the optimized 1D-CNN after 10 epochs equals or surpasses that of the unoptimized 1D-CNN after 30 epochs. Therefore, we mainly compare the times consumed to reach these two accuracies; the shorter time is bolded. According to the corresponding columns in Tables 16 and 17, the required time decreases significantly. Moreover, this reduction grows with the model complexity and the total number of samples; in other words, a higher model complexity and a larger sample size make the effect of DE optimization more apparent.

The results in Tables 18 and 19 show that on both MIT-BIH and SCDH, the different combinations of the mutation operator F and the crossover operator CR determine the convergence rate and optimization effect in the SDE stage: the value of F correlates negatively and the value of CR positively with the convergence rate. However, the best classification accuracy of the 1D-CNN does not necessarily come from the best optimization result of the SDE stage. The reason is the difference between the two optimization strategies: DE performs stochastic optimization based on an evolutionary process, while the 1D-CNN performs gradual optimization based on gradient descent. Therefore, when the classification accuracy of the 1D-CNN is not very high, practical development must avoid blindly pursuing a larger DE optimization effect, which only wastes computing resources. Repeated experiments and cyclic training over the DE parameter combinations are a reasonable way to determine the better combinations and achieve better final classification accuracy and convergence.

Table 18 Parameter comparison of different activation functions with MIT-BIH
Table 19 Parameter comparison of different activation functions with SCDH

6 Conclusions and future works

In this study, we propose a 1D-CNN method for ECG classification with initial weights optimized by the SDE. Comparing the experimental results before and after optimization shows the feasibility of using DE to optimize the initial parameters of the 1D-CNN. The specific results are as follows:

With a total of 4000 samples, the accuracy of the optimized 1D-CNN after 10 epochs reaches or exceeds that of the unoptimized 1D-CNN after 30 epochs of training. The 1D-CNN using the ReLU activation function shows the highest accuracy. After 10 epochs, the accuracy improves from 97.6% to 99.5% on the MIT-BIH arrhythmia database and from 80.2% to 88.5% on the SCDH. The training time decreases from 28.12 s to 9.22 s and from 28.96 s to 10.35 s, respectively. Overall, on the two databases the optimized 1D-CNN improves accuracy by 1.9% and 8.3% and reduces training time by 67.2% and 64.2%, respectively.

Therefore, the original 1D-CNN has a faster convergence speed and less training time and the classification accuracy improves further through SDE optimization. One of the experimental findings in this study shows that different parameter configurations of SDE affect the 1D-CNN's accuracy differently with the active functions. Future research can then focus on adaptive differential optimization to determine the parameters corresponding to the active functions of 1D-CNN and improve the effectiveness and efficiency of ECG classification.