Speech emotion recognition using deep 1D & 2D CNN LSTM networks
Introduction
Speech emotion recognition has attracted much attention over the last few decades. Emotions are specific and intense mental activities that can be expressed outwardly through many behaviors. Speech, facial expressions, body gestures, and brain signals are among the cues of whole-body emotional phenomena [[1], [2], [3]]. Speech is a fast, efficient, and essential pathway of human communication, so recognizing emotion from speech is naturally one of the important research directions in emotion detection and recognition [4,5].
To recognize the emotional state of a speaker, distinguishing paralinguistic features that do not depend on the speaker or the lexical content need to be extracted from the speech. In general, speech carries two types of information: linguistic and paralinguistic. Linguistic information refers to the context or meaning of the speech, while paralinguistic information refers to implicit messages, such as the emotion contained in the speech [4,[6], [7], [8]].
Many distinguishing acoustic features are commonly used to recognize speech emotion: continuous features, qualitative features, and spectral features [[9], [10], [11], [12], [13]]. Numerous features have been investigated, and some researchers have weighed the pros and cons of each, but no one has yet been able to identify which category is best [4,6,14,15].
To learn high-level features from emotional utterances and form a hierarchical representation of speech, many deep learning architectures have been introduced in speech emotion recognition. The classification accuracy of handcrafted features extracted from certain emotional utterances is relatively high, but extracting handcrafted features requires expensive manual labor and depends on expert knowledge [6,16,17]. Handcrafted feature extraction also tends to overlook high-level features, which are derived from lower-level ones. Hierarchical learning, better known as deep learning, is therefore introduced to model high-level abstractions of the data.
Speech signal processing has been revolutionized by deep learning. More and more researchers have achieved excellent results in specific applications using deep belief networks (DBNs), convolutional neural networks (CNNs), and long short-term memory (LSTM) networks [[18], [19], [20],32]. Deep neural networks are typical “black box” approaches, because it is extremely difficult to understand how their final output is arrived at. Two modeling cultures have been proposed for studying such problems: compared with the “data models” used largely by statisticians, deep networks focus on finding an algorithm that predicts well, and are therefore called “algorithmic models” [55,56]. The interpretability of how highly abstracted features are learned by deep neural networks (DNNs) is poor [57], but DNNs perform dramatically better than traditional approaches (see Fig. 1) in some experiments [21,22].
We constructed two combined convolutional neural network and long short-term memory (CNN LSTM) networks by stacking four designed local feature learning blocks (LFLBs) and other building layers to extract emotional features. Speech is a time-varying signal that requires special processing to reflect its time-varying properties, so an LSTM layer is introduced to extract long-term contextual dependencies. The 1D CNN LSTM network is intended to recognize speech emotion from raw audio clips (see Fig. 2a); the 2D CNN LSTM network mainly focuses on learning global contextual information from handcrafted features (see Fig. 2b). Most traditional feature extraction algorithms reduce data dimension dramatically: the amount of extracted low-level features, such as spectrum features [23,24], is much smaller than that of the raw data, and a significant advantage of learning from a small set of low-level features is reduced training time. The experimental results show that the designed CNN LSTM networks recognize speech emotion effectively. Moreover, the designed 2D CNN LSTM network not only achieves high emotion recognition accuracy but also generalizes better. High recognition rates and good generalization can support applications of the designed networks in disease prevention, health care, medical diagnosis, social interaction, etc.
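To make the 2D network's input concrete, the following is a minimal sketch of computing a log-mel spectrogram with librosa; the file name, sampling rate, FFT size, hop length, and number of mel bands are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
import librosa

# Load a speech clip (hypothetical file) and resample to 16 kHz.
y, sr = librosa.load("utterance.wav", sr=16000)

# Mel-scaled power spectrogram: assumed 512-point FFT, 256-sample hop, 128 mel bands.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=256, n_mels=128)

# Log compression yields the log-mel spectrogram, a (n_mels, n_frames) "image"
# that the 2D network can treat either as a grid or as a frame sequence.
log_mel = librosa.power_to_db(S, ref=np.max)
print(log_mel.shape)  # e.g., (128, 63) for a one-second clip
```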
The original contributions of this work are as follows: 1) a local feature learning block (LFLB), which consists of one convolutional layer, one batch normalization (BN) layer, one exponential linear unit (ELU) layer, and one max-pooling layer, is designed to extract local features; 2) to learn long-term dependencies from a sequence of local features, an LSTM layer is placed after the LFLBs to build the CNN LSTM networks; 3) it is demonstrated experimentally, for the first time, that the 1D CNN LSTM network can learn rich emotional features directly from raw audio utterances. In our experiments, the 2D CNN LSTM network achieves better results: it captures both local correlations and global contextual information from the log-mel spectrogram, a representation of how the frequency content of a signal changes with time. When the log-mel spectrogram is treated as a grid or as a sequence, it can be processed by LFLBs or by the LSTM layer.
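As a concrete illustration of contributions 1) and 2), here is a minimal PyTorch sketch of an LFLB and of a 1D CNN LSTM network that stacks four LFLBs before an LSTM layer; the channel widths, kernel sizes, pooling sizes, and LSTM hidden size are illustrative assumptions, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class LFLB1d(nn.Module):
    """Local feature learning block: Conv1d -> BatchNorm -> ELU -> MaxPool."""
    def __init__(self, in_ch, out_ch, kernel_size=3, pool_size=4):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
            nn.BatchNorm1d(out_ch),
            nn.ELU(),
            nn.MaxPool1d(pool_size),
        )

    def forward(self, x):
        return self.block(x)

class CNNLSTM1d(nn.Module):
    """Four stacked LFLBs followed by an LSTM layer and a linear classifier."""
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            LFLB1d(1, 64), LFLB1d(64, 64), LFLB1d(64, 128), LFLB1d(128, 128),
        )
        self.lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
        self.classifier = nn.Linear(256, n_classes)

    def forward(self, x):                # x: (batch, 1, n_samples)
        z = self.features(x)             # (batch, 128, n_frames) local features
        z = z.transpose(1, 2)            # (batch, n_frames, 128) for the LSTM
        _, (h_n, _) = self.lstm(z)       # h_n: (1, batch, 256), last hidden state
        return self.classifier(h_n[-1])  # emotion class logits

model = CNNLSTM1d(n_classes=7)
logits = model(torch.randn(8, 1, 16000))  # a batch of 1-second clips at 16 kHz
```

A 2D variant along the same lines would swap in Conv2d, BatchNorm2d, and MaxPool2d and flatten the frequency axis into the LSTM's input features before the recurrent layer.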
Section snippets
Related work
Distinguishing features are essential for recognizing speech emotion. Among the many paralinguistic features, spectrum features are widely used in speech emotion recognition. A. B. Kandali et al. presented a method based on MFCC features and a Gaussian mixture model classifier to recognize emotion from Assamese speech [25]. A. Milton et al. used a 3-stage support vector machine classifier to classify the seven emotions present in the Berlin Emotional Database (Berlin EmoDB) [26]. VB
Methods and materials
Extracting more distinguishing emotion features is one of the main tasks in recognizing speech emotion. According to the feature extraction method, speech features can be classified as handcrafted features or learned features. Most handcrafted features are carefully designed using ingenious strategies, and how they work and what they capture can be explained in detail. Learned features are extracted by different deep networks, such as RBM-based DNNs [
Experimental results
Speaker-dependent and speaker-independent experiments are conducted on the Berlin EmoDB and IEMOCAP databases. Each experiment consists of two parts: the first is conducted on raw audio clips, and the second on log-mel spectrograms. The 1D CNN LSTM network is used to learn emotional features from raw audio clips, and the 2D CNN LSTM network to learn high-level features from log-mel spectrograms.
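The snippet does not spell out the split protocol, but a common convention for the speaker-independent setting is leave-one-speaker-out cross-validation; the sketch below, using hypothetical placeholder arrays, shows one way to set it up.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical placeholders: one feature row, emotion label, and speaker ID
# per utterance (Berlin EmoDB has 535 utterances from 10 speakers).
X = np.random.randn(535, 128)            # placeholder utterance features
y = np.random.randint(0, 7, size=535)    # placeholder emotion labels
speakers = np.random.randint(0, 10, size=535)

# Each fold trains on nine speakers and tests on the held-out one,
# so no test speaker is ever seen during training.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # ... fit the network on (X_train, y_train), evaluate on (X_test, y_test)
```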
Deep networks are generally considered to be “black box” approaches; how
Discussion
In this work, one 1D and one 2D CNN LSTM network, each consisting of four LFLBs and one LSTM layer, are built to learn both local and global emotion-related features. Speech is a time-varying signal that needs sophisticated analysis to reflect its time-varying properties. The designed networks, combining the strengths of CNN and LSTM, are used to recognize the speaker’s emotional state.
The experiments have accomplished the task of learning more emotional information from the experimental data, but how
Conclusion
This paper presents 1D and 2D CNN LSTM networks for recognizing speech emotion. How to learn local correlations and global contextual information from raw audio clips and log-mel spectrograms is investigated. An LFLB, which consists of one convolutional layer, one BN layer, one exponential linear unit layer, and one max-pooling layer, is designed to learn local features. Local features learned by the LFLBs are reshaped and fed into an LSTM layer. The LSTM layer can learn
Acknowledgements
Part of this work was done while the first author was a visiting scholar at the Advanced Analytics Institute (AAI), University of Technology Sydney. The work of Jianfeng Zhao, Xia Mao, and Lijiang Chen was supported in part by the National Natural Science Foundation of China under Grant No. 61603013 and by the Fundamental Research Funds for the Central Universities (Grant No. YWF-18-BJ-Y-181).
References (67)
- et al., Bi-modal emotion recognition from expressive face and body gestures, J. Netw. Comput. Appl. (2007)
- et al., Detection of emotions in Parkinson’s disease using higher order spectral features from brain’s electrical activity, Biomed. Signal Process. Control (2014)
- et al., Automatic analysis of speech F0 contour for the characterization of mood changes in bipolar patients, Biomed. Signal Process. Control (2015)
- et al., Study of empirical mode decomposition and spectral analysis for stress and emotion classification in natural speech, Biomed. Signal Process. Control (2011)
- et al., Automatic speech emotion recognition using modulation spectral features, Speech Commun. (2011)
- et al., Acoustic feature selection and classification of emotions in speech using a 3D continuous emotion model, Biomed. Signal Process. Control (2012)
- et al., Weighted spectral features based on local Hu moments for speech emotion recognition, Biomed. Signal Process. Control (2015)
- et al., Speech emotion recognition: features and classification models, Digit. Signal Process. (2012)
- et al., Implementation of wavelet packet transform and non linear analysis for emotion classification in stroke patient using brain signals, Biomed. Signal Process. Control (2017)
- et al., Survey on speech emotion recognition: features, classification schemes, and databases, Pattern Recognit. (2011)
- Recognizing emotions induced by affective sounds through heart rate variability, IEEE Trans. Affect. Comput.
- Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011, Artif. Intell. Rev.
- Cooperative learning and its application to emotion recognition from speech, IEEE/ACM Trans. Audio Speech Lang. Process.
- Speech Emotion Recognition Using CNN
- Extraction of adaptive wavelet packet filter-bank-based acoustic feature for speech emotion recognition, IET Signal Process.
- Application of fuzzy C-means clustering algorithm to spectral features for emotion classification from speech, Neural Comput. Appl.
- Text-independent phoneme segmentation combining EGG and speech data, IEEE/ACM Trans. Audio Speech Lang. Process.
- Hybrid speech recognition with deep bidirectional LSTM, Autom. Speech Recognit. Underst.
- Speech emotion recognition using deep neural network and extreme learning machine, Conference of the International Speech Communication Association
- Unsupervised feature learning for audio classification using convolutional deep belief networks, Neural Inf. Process. Syst.
- An experimental study of speech emotion recognition based on deep convolutional neural networks, International Conference on Affective Computing and Intelligent Interaction, IEEE
- Learned vs. hand-crafted features for pedestrian gender recognition, ACM International Conference on Multimedia, ACM
- Speech emotion recognition using Fourier parameters, IEEE Trans. Affect. Comput.
- Emotion recognition from Assamese speeches using MFCC features and GMM classifier, TENCON 2008 - 2008 IEEE Region 10 Conference, IEEE
- SVM scheme for speech emotion recognition using MFCC feature, Int. J. Comput. Appl.
- Emotion recognition system from artificial Marathi speech using MFCC and LDA techniques, International Conference on Advances in Communication, Network, and Computing
- Feature extraction from speech data for emotion recognition, J. Adv. Comput. Netw.
- Speech emotion recognition using residual phase and MFCC features, Int. J. Eng. Technol.
- Acoustic emotion recognition using linear and nonlinear cepstral coefficients, Int. J. Adv. Comput. Sci. Appl.
- Music emotion recognition: the combined evidence of MFCC and residual phase, Egypt. Inf. J.
- Reducing the dimensionality of data with neural networks, Science
- Deep neural networks for acoustic emotion recognition: raising the benchmarks, IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE