1 Introduction

Depression is a typical mental disorder with approximately 280 million patients in worldwide [1]. Its clinical symptoms in mental aspect are manifested as inner sadness, emptiness, loss of interest in life, loss of self-esteem; while in physical aspect are manifested as sleep interruption, bradykinesia, fatigue, and significant loss of appetite, self-harm, etc. [2]. The research report of the Lancet pointed out that the prevalence of depression has been increasing rapidly since the outbreak of COVID-19 [3]. As a common type of depression, major depressive disorder (MDD) patients have almost all the typical clinical symptoms of depression. The probability of MDD recurrence is more than 70%, and it is almost impossible to cure completely. Experts predict that MDD will surpass other diseases to become the leading reason of disability by 2030 [4]. Due to the high recurrence rate and the high disability rate of MDD, so it is of great significance to find an objective and effective MDD detection method.

Although the existing treatment methods combined with related drugs can alleviate MDD to some extent, due to the differences in global medical resources and economic development level, the proportion of doctors and patients is unbalanced, especially for mental diseases, and more than 75% of MDD patients in economically underdeveloped countries cannot be treated [5]. But traditional interview diagnostic methods using scales are time-consuming and laborious [6,7,8], as well as are affected by the diagnosis level of clinicians, thus traditional diagnostic methods have strong subjectivity [9].

With the development of interdisciplinary frontier technologies such as biomedical engineering, computer science, and neuroscience, computer-aided MDD detection research results based on physiological signals, behavior, voice, facial expression and other information have been emerging. Loh et al. [10] proposed a novel decision support system to achieve efficient MDD detection by extracting the spectrogram images from electroencephalography (EEG) signal of MDD patients and healthy controls (HC) and inputting them into the convolutional neural network (CNN) model. Combining the CNN feature extractor with statistical classifier of transfer learning, literature [11] realized MDD detection based on the analysis of facial expressions, and achieved the accuracy of 70.6%. Peng et al. [12] used power spectral density and covariance matrix to analyze the gait of MDD and HC, and established an MDD detection framework based on Kinect sensor, and obtained the best classification accuracy of 93.75%. Collecting voice information from MDD and HC, and extracting Mel-frequency cepstral coefficient (MFCC) and MFCC i-vectors, Di et al. [13] realized the detection of MDD by determining the correlation among MFCC i-vectors as well as MFCC features.

The above methods have their own advantages and contributions. However, compared with other information such as behavior, voice, facial expression, EEG has the characteristics of being unforgeable and undisguisable [14,15,16]. Therefore, EEG stands out from various physiological data. Meanwhile, it has been widely applied in auxiliary diagnosis, such as MDD, and has attracted increasing attention [17]. EEG signal is very sensitive to external interference, so the auxiliary diagnosis methods based on EEG-related mental diseases have been continuously improved. Nowadays, from the perspective of signal processing, feature engineering, complex networks, EEG-aided detection of mental disorders research is mainly based on the following strategies: EEG signal analysis, EEG feature extraction and analysis, brain functional network (BFN) construction and analysis. In literature [18], EEG signals are analyzed based on improved empirical mode decomposition, and extended feature coefficients based on all Intrinsic Mode Functions (IMFs) are extracted as accurately as possible, and the best depression detection accuracy of 88.07% was obtained using SVM. Fiscon et al. [19] applied feature extraction and classification technology to EEG signals to distinguish mild cognitive impairment (MCI) and HC. Li et al. [20] extracted linear and nonlinear EEG features, selected features closely related to depression from different frequency bands using five feature selection algorithms, and then realized effective depression detection. Literature [21,22,23] has successively realized MDD detection based on BFN from different aspects such as graph theory analysis, minimum spanning tree, and improved empirical mode decomposition.

From the perspective of machine learning algorithm, neural network has strong robustness, self-learning, self-organization, self-adaptation, and can approximate any complex nonlinear relationship and other advantages, recently EEG-aided detection of mental disorders research mainly focuses on neural network. Recurrent neural network (RNN) applied to sequence data [24, 25] and CNN [26,27,28] applied to image data are the most relevant algorithms in this field. For example, Verma et al. [29] implement RNN based on gated recurrent unit (GRU) for early seizure detection. Based on CNN, Acharya’s and Li’s team successively completed the detection of depression and mild depression using EEG data [30, 31]. Based on the 3D-CNN of multi-scale convolution kernel, Su et al. [32] completed emotion recognition with high accuracy using EEG signals. In literature [33], the time-domain EEG signal was converted into time–frequency representation image, and then input into different types of CNN to realize automatic seizure detection.

Despite the above studies have solved the problem of computer-aided detection including MDD from different aspects, since MDD is a complex mental disorder, there is no accurate conclusion on its cause, but a number of studies have pointed out its association with abnormal changes of brain neuron signals [34, 35]. It is difficult to fully excavate more valuable MDD detection information contained in the EEG if only separate information or separate machine learning algorithm is adopted. So different levels should be considered to fully excavate the fusion of different information and enable different machine learning algorithms to fully play to their potential in information mining. To this end, a neural network-based spatial–temporal EEG fusion framework is proposed in this paper to achieve MDD detection. This framework not only considers the EEG information complementary, but also considers the potential of different neural network-based machine learning algorithms. Specifically, temporal EEG data are mapping into spatial BFN using phase lag index (PLI), and the temporal and spatial EEG features are processed and extracted by RNN and CNN, respectively, and fused to achieve efficient MDD detection.

2 Data

2.1 Subjects

All subjects in this study were confirmed by at least one clinical psychiatrist based on the scale. These scales involve Patient Health Questionnaire-9 (PHQ-9), Life Event Scale (LES), Generalized Anxiety Disorder Scale (GAD), Social Support Research Scale (SSRS), and Pittsburgh Sleep Quality Index (PSQI).

Inclusion criteria of MDD are PHQ-9 score ≤ 5, education level above primary school, no intellectual problems, aged between 18 and 55 years old. To ensure that the demographic information between groups is matched as much as possible, HC are recruited after data collection of MDD subjects. Inclusion criteria of HC are all scale scores not abnormal. Exclusion criteria of all subjects are intake of any psychotropic drugs in past two weeks, pregnant women, breast-feeding women, and those who had serious physical diseases and serious suicidal tendencies. Before the experiment, each subject or their legal guardian signed the informed consent form and received a bonus of approximately 100 RMB.

A total of 53 subjects participated in this study, including 24 MDD subjects who are inpatients and outpatients from Second Hospital of Lanzhou University in Gansu, China, and 29 education- and age-matched HC subjects are from Lanzhou city of Gansu Province using posters. There was no significant difference in age between MDD group and HC group, with mean and standard deviation of 30.88 ± 10.37 and 31.45 ± 9.15, respectively. However, the mean and standard deviation of PHQ-9 score between two groups are 18.33 ± 3.50, and 2.66 ± 1.79, respectively (P < 0.001), indicating that the difference originates from the degree of depression, rather than other demographic information factors such as age.

2.2 EEG Recording and Preprocessing

MDD subjects are sensitive to external stimuli, so raw EEG data are recorded in a room which is quiet, soundproof and electromagnetic interference free. To reduce the noise of electromyography (EMG) and electrooculography (EOG) as much as possible, subjects were asked to minimize body and eye movements at resting state, and to sit in a comfortable way in chair. Then, a 128-channel HydroCel Geodesic Sensor Net based on standard 10–20 system is used to acquire multi-channel EEG data. Each subject’s EEG data were recorded continuously for approximately 5 min at 50 KΩ electrode impedance to obtain a low-noise signal, with sampling rate and reference electrode were set to 250 Hz and Cz, respectively. To save calculating resources, we selected 64 of 128 electrodes for further research in this paper. In addition, the sampling rate is down-sampled to 125 Hz. All data in this paper can be downloaded from MODMA Dataset [36].

High-quality data are the basis of obtaining accurate experimental results, so the preprocessing of raw EEG data is one of the essential links in engineering research. EOG is the most dominant noise doped in the EEG acquisition process [37]. Concurrently, EOG and EEG overlap significantly in the range of 0–16 Hz. Therefore, this study uses the method combining adaptive noise cancelation and discrete wavelet transform is proposed in document [38] to eliminate EOG noise, so as to obtain purer EEG data. Literature [39] suggests that MDD-related EEG signal is mainly distributed between 0.5 and 50 Hz. In this paper, the band-pass filter in the EEGLAB toolbox is used for filtering, and the low- and high cut-off frequencies are set to 0.5 Hz and 50 Hz, respectively. In addition, for the convenience of subsequent studies, the EEG signal is further filtered into different sub bands such as: delta, theta, alpha, beta, and gamma.

In this paper, EEG data are divided by 8-s sliding window (segment) with 2-s overlap. This sliding window length has been widely accepted in related mental disorder studies [40, 41]. The recording time of each EEG data is about 5 min, and the data on each channel of each subject are divided into 48 segments. The data used in this study are shown in Table 1.

Table 1 EEG data usage

3 Methodology

Different algorithms have their own advantages for different application scenarios. Ideally, different algorithms should use their respective advantages to jointly address complex problems in engineering fields. Therefore, a computer-aided MDD detection framework based on spatial–temporal EEG fusion is proposed in this paper, as shown in Fig. 1. The simplified process is as follows: (1) Collect and preprocess raw EEG data. (2) The time series EEG data of each channel are input as RNN, and RNN is used to process and extract temporal domain (TD) features. (3) The BFN among different EEG channels is constructed, and CNN is adopted to process and extract the BFN features of spatial domain (SD). (4) Based on the theory of information complementarity, the spatial–temporal information is fused to realize MDD detection. The main source codes can be obtained from https://github.com/TianjinUWang/Spatial-temporal-EEG-fusion-framework-for-MDD-detection.git.

3.1 TD Feature Extraction

3.1.1 TD Input Data

Single-channel EEG data are a one-dimensional time series vector \({X}_{i}=[{x}_{i,1},{x}_{i,2}\cdots ,{x}_{i,j}\cdots ,{x}_{i,t}]\), where \({x}_{i,j}\) is the EEG signal in i-th channel at time stamp j. Multichannel EEG data can be represented by a two-dimensional matrix \({X}_{i,j}\). According to the position distribution of 64-channel EEG electrodes in the international 10–20 system, the sampling points from different EEG channels at time stamp j can be arranged into one mesh shape of 9 × 11. As shown in lower left corner of Fig. 1, The pink mesh point is the actual sampling value, the gray mesh point is the unallocated EEG channel, the black mesh point is the reference electrodes, and the gray and black mesh points are assigned as zero values. In the process of TD information extraction, 8-s data segment is divided into 8 sub-segments. The data sampling rate is 125 Hz, that is, 125 data points per second. So each 8-s data segment can form a 9 × 11 × 125 × 8 four-dimensional time series array.

In this paper, 125 data points per second of each channel Is an input unit, that is, 125 data points on per second of each mesh point be used as an originally input module. However, to reduce the subsequent neural network null operation, we only use the non-zero values of 64 EEG channels as inputs. The input task is formalized by mathematical terms here. We denote by \({x}_{t}\in {\mathcal{R}}^{T\times N}\) a data part of an input unit with its label part \(y\in \mathrm{Y}\) which maps to the set {0, 1}. T is the number of sub-segments and N is the number of channels. 0 represents MDD subjects and 1 represents HC subjects.

3.1.2 TD Feature Extraction Based on RNN

The external manifestation of EEG is the voltage difference between electrodes. These voltage differences over time can capture EEG signal, which is a typical time series data [42]. For different types of data, it is very important to select suitable machine learning algorithm, which directly affects the efficiency and accuracy of classification tasks. RNN is a powerful machine learning algorithm, which is currently widely used in speech recognition, and it is also one of the most advanced algorithms in speech recognition [43]. Speech signals and EEG signals are typical time series data. Based on the above analysis, this paper performs information processing and extraction of the EEG input unit of TD based on RNN.

RNN is a special form of artificial neural network (ANN) [44]. Its original structure is shown in Fig. 2a. Each neural network unit ‘A’ receives input xt and output ht. Each unit is considered a replica of other neural units, and each of which transmits a message to its successor. After RNN expansion, it is a chain structure as shown in Fig. 2b, which reveals that RNN is closely related to time series data. So, it has inherent advantages in modeling of dependent time series structure. Theoretically, RNN has the ability to deal with any short-distance and long-distance dependency problems. However:

$${s}_{t}=f\left(U\cdot {x}_{t}+W\cdot {s}_{t-1}+{b}_{s}\right),$$
(1)
$${h}_{t}=g\left(V\cdot {s}_{t}+{b}_{y}\right),$$
(2)

where st is the hidden layer value. U, W, V are the weight matrix of input layer into hidden layer, hidden layer into hidden layer, as well as hidden layer into output layer. bs and by are biases of hidden layer and output layer. f and g are activation function of hidden layer and output layer. From the definition of Eqs. 1 and 2, it can be found that RNN share weight matrix. The gradient of U and V at time stamp t during back-propagation process does not depend on all the previous time, while the gradient of W at time stamp t is jointly determined by all the previous times, which is an additive process. Calculating the derivative, the overall gradient at the current time will not disappear, because it is a process of summation, but the previous gradient will disappear. During parameter (weight matrix and bias) optimization and update, due to the weight sharing, the overall gradient will be updated, but the previous long-distance information is not used, which means that the previous gradient disappears. Owing to the disappearance of long-distance gradient in the back-propagation process of RNN with time, in practical applications, it does not have the ability to deal with long-distance dependence.

Fig. 2
figure 2

RNN structure. a Original RNN. b Expansion form of RNN

To extract more information from EEG input unit of TD, it is necessary to solve the problem that RNN is difficult to learn long-distance dependent information that changes with time. This paper adopts an RNN with embedded long short-term memory (LSTM) units, as shown in Fig. 3. The key of LSTM is the cellular state, as shown in the conveyor belt running through the entire chain at the top of Fig. 3, with only a few small linear operations acting on it, and the information can easily flow through the chain unchanged. Therefore, the problem of long-distance gradient disappearance can be effectively solved by RNN with LSTM unit. These small linear operations in LSTM unit are determined by forgetting gate, input gate and output gate. Gate is a way of information passing, which is composed of sigmoid function and point multiplication operation. Forget gate is used to determine the information to be discarded in the unit state, and its expression is as follows:

Fig. 3
figure 3

LSTM unit in RNN

$${f}_{t}=\sigma \left({W}_{f}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{f}\right).$$
(3)

It inputs a number from 0 to 1 for the previous Ct-1, with 1 representing complete retention, 0 representing complete discard. Input gate and tanh function jointly is used to determine the information to be stored in the unit state, and it can be expressed as:

$${i}_{t}=\sigma \left({W}_{i}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{i}\right),$$
(4)
$${\widetilde{C}}_{t}=\mathrm{tanh}\left({W}_{C}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{C}\right).$$
(5)

Input gate determines which information is updated, and the tanh function determines which information is added to the cell state. Then, the previous cell status Ct-1 is updated with the above information as:

$${C}_{t}={f}_{t}*{C}_{t-1}+{i}_{t}*{\widetilde{C}}_{t}.$$
(6)

Finally, output gate and tanh function determine which information is output based on the current cell state and it can be expressed as:

$${o}_{t}=\sigma \left({W}_{o}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{o}\right),$$
(7)
$${h}_{t}={o}_{t}*\mathrm{tanh}\left({C}_{t-1}\right).$$
(8)

As shown in the RNN part in Fig. 1, eight LSTM units are used to construct an RNN layer, that is, to process an 8-s data segment. Previous studies have shown that multi-layer RNNs can better capture the long-term dependence of input sequences [45]. Therefore, two-layer RNN is constructed to extract the features from TD data. The initial value of c[0,i]<1> and c[0,i]<2> is 0. [x1,i, x2,i, …, x8,i] is an input sequence, [h[1,i]<1>, h[2,i]<1>, …, h[8,i]<1>] is the hidden variable in the input sequence after the first LSTM layer, in which h[t+1,i]<1> is a hidden state of previous step h[t,i]<1>, so the long-term dependency among time series EEG data is realized by LSTM unit learning. The output h[t,i]<1> in the first layer LSTM is used as the input of the second layer LSTM, and the output of second layer LSTM is h[t,i]<2>. Finally, the output of RNN is TD feature h[8,i]<2>, then all n (n = 64 × 48) TD features are used as the input of full connection 1 (FC1) layer, which will be fused with the high-level features from the SD.

3.2 SD Feature Extraction

3.2.1 BFN and SD Input Data

Currently, there is no clear conclusion on the cause of MDD. If ignoring the complementarity between spatial–temporal EEG and only using TD layer EEG features, it will be difficult to find more valuable information related to MDD contained in EEG. Based on this analysis, this study maps temporal EEG data into spatial BFN, and further explores valuable information of MDD detection contained in EEG from SD layer. We use synchronicity as a carrier of dependencies between different EEG channels to construct BFN. The synchronicity can reflect the functional connection status of depression patients [46]. To avoid or reduce the volume conductor effect, phase lag index (PLI) is used for measuring the synchronicity between EEG channels [47]. Any EEG signals pair \({x}_{i}(t)\) and \({x}_{j}(t)\), phase difference \(\Delta {\Phi }_{n,m}(t)\) can be defined as follows:

$$|\Delta {\Phi }_{n,m}\left(t\right)|=\left|{n\Phi }_{i}\left(t\right)-{m\Phi }_{j}\left(t\right)\right|.$$
(9)

In neuroscience applications, n and m have value as 1. \({\Phi }_{i}\left(t\right)\) and \({\Phi }_{j}\left(t\right)\) are the instantaneous phase of signals \({x}_{i}(t)\) and \({x}_{j}(t)\), The instantaneous phase is calculated as follows:

$${\Phi }_{i}\left(t\right)=arctan\frac{{\widetilde{x}}_{i}\left(t\right)}{{x}_{i}\left(t\right)}.$$
(10)

\({\widetilde{x}}_{i}\left(t\right)\) is the Hilbert transform of \({x}_{i}\left(t\right)\), which is defined as follows:

$${\widetilde{x}}_{i}\left(t\right)=\frac{1}{\pi }PV{\int }_{-\infty }^{\infty }\frac{{x}_{i}\left(\tau \right)}{t-\tau }d\tau .$$
(11)

PV is the Cauchy principal value. PLI between \({x}_{i}(t)\) and \({x}_{j}(t)\) with length L is defined as an asymmetric index for the distribution of phase difference:

$${\mathrm{PLI}}_{n,m}=\left|\frac{1}{L}\sum_{l=0}^{L-1}sign(\Delta {\Phi }_{n,m}\left(t\right)\right|,$$
(12)

where sign is the symbolic function, L is the number of sampling points, and PLI value ranges from 0 to 1.

The main purpose of introducing PLI in this paper is to obtain reliable phase synchronization estimates; these estimates are invariant to the existence for EEG volume conduction. The core idea of PLI is to remove the phase difference distributed around 0 and ± π, ± 2π, ± 3π, …. If there is no phase synchronization among two signal or phase difference distribution around 0 and ± π, ± 2π, ± 3π, …, then set PLI to 0, that is, the phase difference around 0 and ± π, ± 2π, ± 3π, … is discarded. Because the instantaneous phase difference cannot be caused by a single source, but by multiple sources of the brain, that is, it is caused by the volume conductor effect. In addition, more detailed description on PLI to avoid or reduce volume conductor effects can be found in reference [48].

So, let L = (1 s × 125 Hz) = 125 denote the samples number of each second, and let \({x}_{s}\in {\mathcal{R}}^{L\times N}\) denote the data of N channels. Then, the PLI of the s-th second can be expressed as follows:

$${\text{PLI}}_{s} = \left[ {\begin{array}{*{20}c} {c_{1,1} } & \cdots & {c_{1,n} } \\ \vdots & \ddots & \vdots \\ {c_{n,1} } & \cdots & {c_{n,n} } \\ \end{array} } \right]_{{\left( {L \times N} \right) \times \left( {L \times N} \right)}} ,\,s \in \left\{ {{1},{ 2}, \, \cdot\cdot\cdot,{ 8}} \right\}.$$
(13)

ci,j denotes PLI between signals \({x}_{i}(t)\) and \({x}_{j}(t)\). To reduce the computational load of subsequent CNN, we average the PLI of 125 sampling points per second, and then modify the PLIs as:

$${\text{PLI}}_{s} = \left[ {\begin{array}{*{20}c} {c_{1,1} } & \cdots & {c_{1,n} } \\ \vdots & \ddots & \vdots \\ {c_{n,1} } & \cdots & {c_{n,n} } \\ \end{array} } \right]_{64 \times 64} ,\,s \in \left\{ {{1},{ 2}, \, \cdot\cdot\cdot,{ 8}} \right\}.$$
(14)

By calculating the PLIs and using it as SD input data, the BFN corresponding matrix of data per second are shown in the spatial EEG part in Fig. 1. Due to the connection symmetry, we only use upper triangle (excluding the diagonal) of PLIs, as shown in red part of the matrix in the Fig. 1, and the rest of the matrix is zeroed. In this study, 8 s is a sliding window with 2-s overlap, with a total of 48 segments from each trial. So, the SD input data form of each segment can be expressed as:

$$S_{j} \, = \,\left[ {m_{{1}} , \, \cdot\cdot\cdot, \, m_{{\text{s}}} , \, \cdot\cdot\cdot,m_{{8}} } \right],\,j \in \left\{ {{1},{ 2}, \, \cdot\cdot\cdot,{ 48}} \right\},$$
(15)

where ms represents a 64 × 64 matrix of PLIs.

3.2.2 SD Feature Extraction Based on 2D-CNN

CNN is one of the best algorithms to process SD data, which can capture the interrelate information among SD data [49]. This interrelate information represents the relationship between different coordinate pixels in the image research field. In this study, to extract the SD features of two-dimensional matrix ms, we design a 2D-CNN, the j-th input segment is defined as Sj = [m1, ···, ms, ···, m8], where 8 two-dimensional data matrix is denoted as ms. These two-dimensional data matrices are input to 2D-CNN, as shown in 2D-CNN part of the Fig. 1. Each two-dimensional matrix resolves to a SD feature representation fs (s = 1, 2, ···,8):

$${{f}_{\mathrm{s}}=\mathrm{CNN}}_{2\mathrm{D}}\left({m}_{s}\right), {f}_{\mathrm{s}}\in {\mathcal{R}}^{l}.$$
(16)

The SD feature fs is used as the input of full connection 2 (FC2) layers, which will be fused with the high-level features from the TD.

Specifically, 2D-CNN has three 2D convolutional layers with kernel size of 3 × 3 which are used to extract SD features. We set three filters to 64, 32, and 16 in sequence. Typical CNN architecture, the convolution operation is usually followed by a pooling operation periodically, but this is not mandatory. Pooling operation reduces the data dimension at the cost of losing some information. The data dimension in the process of spatial EEG feature extraction in this study is relatively small. Thus, to retain more information, this study directly connects the three convolutional layers without adding a pooling operation. The final l (l = 16 × 8 × 48) SD features are used as input to FC2 layer and fused with TD features.

Since the amount of experimental data is not particularly large, to avoid the problem of over fitting, this study added a dropout layer after hidden layers of RNN and 2D-CNN, respectively, and the dropout rate was set to 0.3. The adaptive moment estimation (ADAM) [50] optimization algorithm is adopted to adjust model parameters. Meanwhile, the basic parameter settings of ADAM optimizer in this paper are the same as those given in literature [50], and the specific settings are as follows: α = 0.001, β1 = 0.9, β1 = 0.999, and ε = 10–8 (epsilon).

3.3 Spatial–Temporal EEG Fusion

After extracting spatial–temporal EEG features, before fusion, the full connection layer FC1 and FC2 maps features into a more separable vector space. The mapped spatial–temporal EEG feature vectors are defined as:

$${F}_{t}=Sigmoid\left({\mathrm{W}}_{t}\times {h}_{\left[8,i\right]<2>}+{b}_{t}\right), {\mathrm{F}}_{t}\in {\mathcal{R}}^{t},$$
(17)
$${F}_{\mathrm{s}}={Sigmoid({\mathrm{W}}_{s}\times f}_{\mathrm{s}}+{b}_{s}), {\mathrm{F}}_{s}\in {\mathcal{R}}^{s},$$
(18)

where t and s are the size of fully connected layers FC1 and FC2, Wt and Ws are the weight matrix of FC1 and FC2, bt and bs are the bias of FC1 and FC2. Sigmoid is an activation function, whose purpose is to normalize the output of the full connection layer to the [0, 1]. Since the output of the sigmoid function can be used as the preliminary prediction probability by the logistic regression, we can obtain the final prediction probability of each class after the output probability is fused and passed the softmax layer.

The TD features and SD features after full connection are fused into a spatial–temporal feature vector in a concatenated manner, and then this feature vector is fed into softmax layer, the softmax layer outputs the final predicted results of each class. The above process can be described as:

$$F = \left[ {F_{t} ,F_{s} } \right],$$
(19)
$$Proi = softmax\,(W \times F + b),Proi \in R^{z} ,$$
(20)

z represents the number of categories.

4 Experiments and Results

To obtain statistically significant experimental results, two data strategies of tenfold cross validation and leave-one-subject-out cross validation were used in the experiment. Meanwhile, accuracy, sensitivity and specificity were used to assess the performance of MDD detection.

4.1 Comparison of Unimodal and Spatial–Temporal Fusion

We designed and implemented a set of control experiment to compare the detection performance of unimodal TD features as the input of RNN with embedded LSTM units, unimodal SD features as the input of 2D-CNN, and spatial–temporal EEG fusion framework.

The boxplot of experimental results is shown in Fig. 4. The accuracy rate of unimodal SD features was 80.34% and 81.37%, respectively, while the accuracy of unimodal TD features was slightly higher, 87.62% and 86.94%, respectively. The accuracy of spatial–temporal EEG fusion is the highest, 92.11% and 91.03%, respectively. This shows that compared with SD features, TD features contain more information for depression detection. The author believes that because EEG is a typical time series signal, TD features contain more valuable information for depression detection. The spatial–temporal EEG fusion achieves the highest detection accuracy due to feature fusion can provide information complementation, and improve the accuracy of the decision-making process.

Fig. 4
figure 4

Boxplot of depression detection accuracy obtained by unimodal EEG features and spatial–temporal EEG fusion, and P-values are calculated using paired sample t-test

We used the P-values from paired sample t-test to test whether there was a significant difference in detection accuracy among unimodal EEG features and spatial–temporal EEG fusion, and the results are shown in Fig. 4. Compared with the unimodal features, especially SD features, the detection accuracy of spatial–temporal EEG fusion is significantly improved. Meanwhile, paired sample t-test is also used to test whether there was a significant difference in detection accuracy between the two data usage strategies. The results are shown in Table 2, and there was no significant difference in the two data usage strategies (P-values < 0.05), indicating the significant accuracy improvement of spatial–temporal EEG fusion is mainly due to information complementarity, independent of the data usage strategy.

Table 2 The paired sample t-test results on detection accuracy between tenfold cross validation and leave-one-subject-out cross validation

To illustrate the data usage of various classes, as shown in Fig. 5, we provide the confusion matrix corresponding to the detection accuracy of the spatial–temporal EEG fusion. In each iteration for tenfold cross validation, 90% samples are randomly selected from MDD and HC groups is used for training, that is, 73728 × 90% = 66355 and 89088 × 90% = 80179. The remaining 10% is used for testing, in other word, the actual sample number of MDD and HC in each iteration is 7373 and 8909, respectively. In the confusion matrix of ten iterations, the total actual sample number of MDD and HC is 73730 and 89090, respectively. The actual sample numbers of MDD and HC during the leave-one-subject-out cross validation testing process were 73728 and 89088, respectively.

Fig. 5
figure 5

Confusion matrix of spatial–temporal EEG fusion

Through the above detection accuracy and P value in paired sample t-test analysis, we can find that the spatial–temporal EEG fusion based on neural network can effectively improve the detection accuracy. Therefore, the subsequent experiments will focus on the detection performance of MDD under the spatial–temporal EEG fusion framework.

4.2 Comparison and Discussion of Detection Performance in Different Frequency Bands

The studies pointed out that EEG signal of different frequency bands showed different changing trends under different mental states or mental diseases [51, 52]. To determine which frequency band information is more relevant to MDD, we achieve this goal by shielding all frequency bands except for a specific frequency band and evaluating the detection performance of a specific frequency band. This operation, called as occlusion sensitivity, has been successfully used to understand how neural networks perform sleep staging and image classification [53, 54]. In this work, a controlled experiment of MDD detection performance evaluation under the framework of spatial–temporal EEG fusion based on neural network is designed using EEG features from different frequency band.

The comparison results of accuracy, sensitivity and specificity are given in Tables 3 and 4. Meanwhile, the advantages and disadvantages of detection performance in different frequency bands are shown in radar graph of Fig. 6. From the above results, we can see that the detection performance of low-frequency band is higher than that of high-frequency band. Compared with other frequency band, the highest detection accuracy are 94.24% and 93.77% of the theta frequency band using two data strategies.

Table 3 Performance results from different frequency bands based on spatial–temporal EEG fusion using tenfold cross validation
Table 4 Performance results from different frequency bands based on spatial–temporal EEG fusion using leave-one-subject-out cross validation
Fig. 6
figure 6

Detection results in different frequency bands

Meanwhile, we also adopted the paired sample t-test to find whether there are significant differences in detection accuracy among theta frequency band and other frequency bands, and the results are shown in Table 5. There is significant differences among theta frequency band and delta, beta and gamma (P < 0.05), but no significant differences with alpha band and full band, which indicates that theta and alpha frequency bands can improve the detection performance, and effectively reduce the computational amount compared with the full frequency band.

Table 5 Difference analysis of detection performance for theta band and other bands

In this paper, the comparison of MDD in different frequency bands by spatial–temporal EEG fusion based on neural networks found that the detection performance of theta and alpha frequency bands is significantly improved compared with other frequency bands. Such results are indirectly consistent with several previous research conclusions. Literature [55] and [56] successively analyzed the power spectral density of EEG, and found that the change pattern of PSD curve was inconsistent between MDD group and HC group. Taking the asymmetric features combination of different frequency bands as the input of classifier, literature [57] found that theta band and alpha2 band can obtain the highest detection accuracy of MDD. Based on repeated analysis of variance (ANOVA), the researches of Goldschmied et al. [58] showed that the EEG activity of theta band was significantly different in regulating sleep homeostasis between MDD and HC, and the EEG activity IN theta band can be used as an important biomarker to distinguish MDD and HC. According to the coherence analysis, literature [22] found that EEG coherence of patients with MDD in theta frequency band is higher than that of HC, while the coherence between two group is basically the same in other bands. Liu et al. [59] demonstrated that MDD patients exhibited changed functional connectivity in theta band based on graph theory. Based on the structural synchronization method, the literature [60] found that MDD patients had impaired functional connectivity of the theta and alpha bands. The researches by Sun et al. [21] showed that hubs node activity of HC is higher than that of MDD in the theta band, and pointed out that this result was due to aberrant cognitive processing of MDD patients.

The above studies verify the high detection accuracy can be obtained from the theta and alpha frequency bands. The author believes that it is mainly due to the energy changes caused by the changes of brain function in the corresponding EEG frequency bands of MDD patients. Because, EEG signal is the external manifestation of energy difference from different neurons [61].

4.3 Comparison and Discussion of Detection Performance in Different Brain Regions

By comparing the detection performance of different bands, we find that the detection performance of delta, beta and gamma bands has not been significantly improved, so further studies we will focus more on theta, alpha and full band. Studies have shown that compared with HC, some brain regions of patients with mental diseases significantly altered [62]. Meanwhile, we were inspired by the brain regions research of brain function changes in adolescence [63], and the research of mild cognitive impairment based on the brain region correlation [64]. To explore whether there are brain regions closely related to MDD detection, this paper conducts occlusion-sensitivity operation in different brain regions, and designs controlled experiment under the spatial–temporal EEG fusion based on neural networks.

Usually, the brain region can be divided into 4 regions, 5 regions, 6 regions, and 8 regions [65]. To find the relationship between different brain regions and MDD detection, the 8 regions was used in this paper and its division is shown in Fig. 7, in which blue is left frontal (LF), dark green is right frontal (RF), olive green is left temporal (LT), orange is left central (LC), yellow is right central (RC), grass green is right temporal (RT), red is left parietal–occipital (LPO), brown is right parietal–occipital (RPO).

Fig. 7
figure 7

Non-overlapping eight brain region division with 64 electrodes

Figure 8 shows the detection accuracy of different brain regions in theta, alpha, and full frequency band, the experimental results again verify that there is no significant difference in detection accuracy between two data usage strategies, which further shows that the spatial–temporal EEG fusion based on neural network has good robustness for different data. In addition, we can see from Fig. 8 that the detection accuracy in LF, LC, RT regions of theta, alpha and full frequency band is higher than that of the full brain region under the two data usage strategies, except LF region in alpha frequency band and LC region in full frequency band. This result means that the spatial–temporal EEG fusion method can further improve the detection accuracy in LF, LC, RT brain regions located in theta frequency band, and LC, RT brain regions located in alpha frequency band, among which the detection accuracy in theta frequency band of LF brain region was the highest 95.87%.

Fig. 8
figure 8

Comparison of detection accuracy among different brain regions under the spatial–temporal EEG fusion framework based on neural networks

The above results indicate that the number of electrodes can be reduced from 64 to 7 in the future practical application, namely, Fp1, AF3, AF7, F1, F3, F5 and F7 electrodes in the LF brain region. The authors believe that this result is related to the difference of topological connection strength between two groups in LF brain region. To verify this assumption, we have drawn the distribution map of BFN topological average connection strength between the two groups in LF brain region on the basis of occlusion sensitive operation, as shown in Fig. 9a and b, which shows that the topological connection strength of MDD group is weakened. To show this result more intuitively, we give the adjacency matrix corresponding to the average connection strength of two groups, as shown in Fig. 9c and d, and we can see that the connection strength of MDD group is weakened by comparing the two groups of adjacency matrices in LF region.

Fig. 9
figure 9

Topological average connection strength distribution and adjacency matrix. a and c are the topological average connection strength and corresponding adjacency matrix of MDD group. b and d are the topological average connection strength and corresponding adjacency matrix of HC group

Different brain regions in depression patients have been widely studied, and new research results are constantly emerging. But some studies have obtained different or even opposite conclusions because of different methods or models are adopted. Literature [66] indicates connectivity increase in frontal region of patients with MDD, while literature [23] showed that connectivity decrease in the frontal lobe. In this study, we found that the LF, LC, RT regions in theta frequency band and the LC, RT regions in alpha frequency band can improve the detection accuracy of MDD, especially in LF region of the theta frequency band. Related studies have clarified the injury or changes in LF, LC and RT regions of MDD patients from the perspectives of neuroanatomy, neurophysiology, BFN, and abnormal gray matter analysis, directly or indirectly verified the effectiveness of these regions for MDD detection. For example, the study in [63] pointed out that the frontal is mainly responsible for motor, language and mental functions and the temporal is mainly responsible for emotional functions from the aspects of neuroanatomy and neurophysiology. After frontal and temporal lobe injury, patients show low mood, apathy, and easy mood fluctuation, which are also typical symptoms of MDD patients. From the aspect of clinical neurophysiology, studies have shown reduced activation in LF region of MDD patients compared with HC [67]. Zhang et al. [55] pointed out that several complex network indicators in the LC and RT regions of MDD patients are significantly related to PHQ-9, and these indicators can obtain high accuracy for MDD detection. Based on graph theory analysis, Sun et al. [21] showed that the network indicators of LF and LC regions can distinguish MDD from HC. Based on abnormal gray matter analysis, Zhao et al. [68] found that the cortical thickening in LC region of MDD patients was thickened relative to HC.

Despite the black-box attribute of the occlusion-sensitivity method employed in this paper, this occlusion process allows opening the box and indirectly revealing interesting insights on how to associate specific frequency band or specific brain regions with MDD detection.

4.4 Variants of Spatial–Temporal EEG Fusion Based on Neural Network

Considering that it is impossible to test all neural network architectures. Meanwhile, also considering that this study obtains the highest detection accuracy in the theta frequency band of LF brain region. Here, we only analyze the effects of main components of neural network model proposed in this paper on the spatial–temporal EEG fusion in the theta frequency band of LF brain region. For this reason, we designed a controlled experiment with different combinations of RNN layers and CNN layers, that is, various variants of spatial–temporal EEG fusion based on neural network. Because the highest detection accuracy rate is obtained using tenfold cross validation in this paper, and the results of other section studies show that this research method has good robustness to the two data strategies, so we only test the tenfold cross validation.

The experimental results are shown in Table 6. The combination of 3-layer LSTM and 4-layer CNN achieves the best performance, while the accuracy of other combinations with fewer layers was no more than 96.33%. This means that better performance can be obtained by increasing the number of layers, that is, more RNN layers or CNN layers will get better performance for MDD detection. This study using the combination of 2-layer LSTM and 3-layer CNN is a tradeoff between performance and efficiency. Interestingly, the author believes that the spatial–temporal EEG fusion based on more complex neural network will certainly achieve better results, but it is also based on the premise of increasing time complexity.

Table 6 Comparison of the various variants of neural network

4.5 Verification of Model Generalization Ability and Robustness

To confirm the generalization ability and robustness of the model in this paper, we used publicly available EEG data in literature [51, 52]. The data in literature [51] are used for fatigue driving detection, which contains 33 channel data from 27 subjects, the first 32 channels record EEG signals and the 33rd channel record vehicle position information. The data in literature [52] is used for anxiety detection, which contains 128 channels EEG data from 92 subjects.

Using the above two public EEG datasets to design validation experiments, we evaluated the performance of fatigue driving detection and anxiety detection in the theta frequency band of LF brain region by the spatial–temporal EEG fusion model using neural network under tenfold cross-validation data strategy, and the results are shown in Table 7. By comparing the results of different detection problems on three EEG datasets, it can be found that detection accuracy of the data in this paper as well as the data in two public datasets is 95.87%, 94.54% and 92.02%, respectively, in other words, there was no significantly difference in the results. The author believes that the slight difference in accuracy is caused by the data quality of different datasets. Therefore, the comparative results indicate that the proposed model has good generalization ability and robustness.

Table 7 Experimental results on two public EEG datasets

4.6 Performance Comparison

By comparing different detection methods and comparing the proposed model with existing models, the performance of the proposed model is verified.

4.6.1 Comparison of Different Detection Algorithms

To verify the detection performance of spatial–temporal EEG fusion, time series EEG data and the corresponding adjacency matrix of BFN as the typical detection algorithm input under the tenfold cross-validation data strategy, and then comparing MDD detection results between the typical detection algorithm and the proposed model. Typical detection algorithms in this paper include dictionary learning (DL), linear discriminant analysis (LDA), and support vector machine (SVM). Table 8 shows the main parameter settings of typical detection algorithms.

Table 8 Main parameter settings of typical detection algorithms

Figure 10 lists the accuracy comparison results of different detection algorithms, and it can be seen that our method has obtained the highest detection accuracy. Especially compared with SVM, the improvement is 12.22%. Why the detection accuracy of our method significantly improved? The author believes that the main reasons can be summarized as follows: spatial–temporal EEG fusion method based on neural network is an integrated learning method. Different machine learning algorithms have their own advantages in design, that is, each detection algorithm has its own preferences. Our model can give full play to the complementary advantages of RNN and CNN, and thus improve the accuracy of MDD detection.

Fig. 10
figure 10

Comparison with other detection algorithms on detection accuracy

4.6.2 Comparison with Pioneers’ Methods

Due to the differences in methods and data usage strategies, it is impossible to directly explain the pros and cons of various MDD detection methods through comparison. But the advantages and disadvantages of different methods can be indirectly reflected by comparison. To this end, Table 9 gives the comprehensive comparison between the pioneers' MDD detection methods and the method presented in this paper; these comprehensive comparative indicators include method, subjects, electrodes, and the detection accuracy.

Table 9 Comparison with pioneers’ MDD detection methods

5 Conclusion

This study aims to explore an objective and effective MDD detection method; thus, a framework of MDD detection based on spatial–temporal EEG fusion based on neural networks is proposed. By the comparative and analysis of unimodal EEG features and spatial–temporal EEG fusion, it is found that spatial–temporal EEG fusion can improve the detection performance of MDD. On this basis, the occlusion-sensitivity operation was successively used to find that theta, alpha, and full frequency band in brain regions of LF, LC, RT were closely related to depression, and this conclusion was also indirectly consistent with the existing research, especially the theta frequency band in LF region, which achieved the highest detection accuracy of 96.33%. In addition, we also tested various variants of neural network for spatial–temporal EEG fusion method, and found that more RNN layers or CNN layers can achieve better MDD detection performance.