Speech emotion recognition using deep 1D & 2D CNN LSTM networks
Introduction
Speech emotion recognition has attracted much attention over the last few decades. Emotions are specific and intense mental activities that can be expressed outwardly through many behaviors. Speech, facial expressions, body gestures, and brain signals are among the cues of whole-body emotional phenomena [[1], [2], [3]]. Speech is a fast, efficient, and essential pathway of human communication, so recognizing emotion from speech is naturally one of the important research directions in emotion detection and recognition [4,5].
To recognize the emotional state of a speaker, distinguishing paralinguistic features that do not depend on the speaker or the lexical content need to be extracted from the speech. In general, speech carries two types of information: linguistic and paralinguistic. Linguistic information refers to the context or meaning of the speech, while paralinguistic information refers to implicit messages, such as the emotion contained in the speech [4,[6], [7], [8]].
Many distinguishing acoustic features are commonly used to recognize speech emotion: continuous features, qualitative features, and spectral features [[9], [10], [11], [12], [13]]. Numerous features have been investigated, and some researchers have weighed the pros and cons of each, but no one has yet been able to identify which category is best [4,6,14,15].
To learn high-level features from emotional utterances and form a hierarchical representation of speech, many deep learning architectures have been introduced in speech emotion recognition. The classification accuracy of handcrafted features extracted from certain emotional utterances is relatively high, but extracting handcrafted features requires expensive manual labor and depends on expert knowledge [6,16,17]. Handcrafted feature extraction also tends to overlook high-level features, which are derived from lower-level ones. Hierarchical learning, better known as deep learning, is therefore introduced to model high-level abstractions of the data.
Speech signal processing has been revolutionized by deep learning. More and more researchers have achieved excellent results in specific applications using deep belief networks (DBNs), convolutional neural networks (CNNs), and long short-term memory (LSTM) networks [[18], [19], [20],32]. Deep neural networks are typical “black box” approaches, because it is extremely difficult to understand how their final output is arrived at. Two modeling cultures have been proposed for studying such problems: compared with the “data models” used largely by statisticians, deep networks focus on finding an algorithm that predicts well, and are therefore called “algorithmic models” [55,56]. The interpretability of how highly abstracted features are learned by deep neural networks (DNNs) is poor [57], but DNNs perform dramatically better than traditional approaches (see Fig. 1) in some experiments [21,22].
We constructed two combined convolutional neural network and long short-term memory (CNN LSTM) networks by stacking four designed local feature learning blocks (LFLBs) and other building layers to extract emotional features. Speech is a time-varying signal that requires special processing to reflect its time-varying properties, so an LSTM layer is introduced to extract long-term contextual dependencies. The 1D CNN LSTM network is intended to recognize speech emotion from raw audio clips (see Fig. 2a); the 2D CNN LSTM network mainly focuses on learning global contextual information from handcrafted features (see Fig. 2b). Most traditional feature extraction algorithms reduce data dimension dramatically: the amount of extracted low-level features, such as spectrum features [23,24], is much smaller than that of the raw data, and a significant advantage of learning from a small set of low-level features is reduced training time. The experimental results show that the designed CNN LSTM networks recognize speech emotion effectively. Moreover, the designed 2D CNN LSTM network not only achieves high emotion recognition accuracy but also generalizes better. High recognition rates and good generalization can support applications of the designed networks in disease prevention, health care, medical diagnosis, social interaction, etc.
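To make the 2D network's input concrete, the following is a minimal sketch of computing a log-mel spectrogram with librosa; the file name, sampling rate, FFT size, hop length, and number of mel bands are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
import librosa

# Load a speech clip (hypothetical file) and resample to 16 kHz.
y, sr = librosa.load("utterance.wav", sr=16000)

# Mel-scaled power spectrogram: assumed 512-point FFT, 256-sample hop, 128 mel bands.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=256, n_mels=128)

# Log compression yields the log-mel spectrogram, a (n_mels, n_frames) "image"
# that the 2D network can treat either as a grid or as a frame sequence.
log_mel = librosa.power_to_db(S, ref=np.max)
print(log_mel.shape)  # e.g., (128, 63) for a one-second clip
```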
The original contributions of this work are as follows: 1) a local feature learning block (LFLB), which consists of one convolutional layer, one batch normalization (BN) layer, one exponential linear unit (ELU) layer, and one max-pooling layer, is designed to extract local features; 2) to learn long-term dependencies from a sequence of local features, an LSTM layer is placed after the LFLBs to build the CNN LSTM networks; 3) it is demonstrated experimentally, for the first time, that the 1D CNN LSTM network can learn rich emotional features directly from raw audio utterances. In our experiments, the 2D CNN LSTM network achieves better results: it captures both local correlations and global contextual information from the log-mel spectrogram, a representation of how the frequency content of a signal changes with time. When the log-mel spectrogram is treated as a grid or as a sequence, it can be processed by LFLBs or by the LSTM layer.
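As a concrete illustration of contributions 1) and 2), here is a minimal PyTorch sketch of an LFLB and of a 1D CNN LSTM network that stacks four LFLBs before an LSTM layer; the channel widths, kernel sizes, pooling sizes, and LSTM hidden size are illustrative assumptions, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class LFLB1d(nn.Module):
    """Local feature learning block: Conv1d -> BatchNorm -> ELU -> MaxPool."""
    def __init__(self, in_ch, out_ch, kernel_size=3, pool_size=4):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
            nn.BatchNorm1d(out_ch),
            nn.ELU(),
            nn.MaxPool1d(pool_size),
        )

    def forward(self, x):
        return self.block(x)

class CNNLSTM1d(nn.Module):
    """Four stacked LFLBs followed by an LSTM layer and a linear classifier."""
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            LFLB1d(1, 64), LFLB1d(64, 64), LFLB1d(64, 128), LFLB1d(128, 128),
        )
        self.lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
        self.classifier = nn.Linear(256, n_classes)

    def forward(self, x):                # x: (batch, 1, n_samples)
        z = self.features(x)             # (batch, 128, n_frames) local features
        z = z.transpose(1, 2)            # (batch, n_frames, 128) for the LSTM
        _, (h_n, _) = self.lstm(z)       # h_n: (1, batch, 256), last hidden state
        return self.classifier(h_n[-1])  # emotion class logits

model = CNNLSTM1d(n_classes=7)
logits = model(torch.randn(8, 1, 16000))  # a batch of 1-second clips at 16 kHz
```

A 2D variant along the same lines would swap in Conv2d, BatchNorm2d, and MaxPool2d and flatten the frequency axis into the LSTM's input features before the recurrent layer.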
Section snippets
Related work
Distinguishing features are essential for recognizing speech emotion. Among the many paralinguistic features, spectrum features are widely used in speech emotion recognition. A. B. Kandali et al. presented a method based on MFCC features and a Gaussian mixture model classifier to recognize emotion from Assamese speech [25]. A. Milton et al. used a 3-stage support vector machine classifier to classify the seven emotions present in the Berlin Emotional Database (Berlin EmoDB) [26]. VB
Methods and materials
Extracting more distinguishing emotion features is one of the main tasks in recognizing speech emotion. According to the feature extraction method, speech features can be classified as handcrafted features or learned features. Most handcrafted features are carefully designed using ingenious strategies, and how they work and what they capture can be explained in detail. Learned features are extracted by different deep networks, such as RBM-based DNNs [
Experimental results
Speaker-dependent and speaker-independent experiments are conducted on the Berlin EmoDB and IEMOCAP databases. Each experiment consists of two parts: the first is conducted on raw audio clips, and the second on log-mel spectrograms. The 1D CNN LSTM network is used to learn emotional features from raw audio clips, and the 2D CNN LSTM network to learn high-level features from log-mel spectrograms.
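The snippet does not spell out the split protocol, but a common convention for the speaker-independent setting is leave-one-speaker-out cross-validation; the sketch below, using hypothetical placeholder arrays, shows one way to set it up.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical placeholders: one feature row, emotion label, and speaker ID
# per utterance (Berlin EmoDB has 535 utterances from 10 speakers).
X = np.random.randn(535, 128)            # placeholder utterance features
y = np.random.randint(0, 7, size=535)    # placeholder emotion labels
speakers = np.random.randint(0, 10, size=535)

# Each fold trains on nine speakers and tests on the held-out one,
# so no test speaker is ever seen during training.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # ... fit the network on (X_train, y_train), evaluate on (X_test, y_test)
```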
Deep networks are generally considered to be “black box” approaches; how
Discussion
In this work, one 1D and one 2D CNN LSTM network, each consisting of four LFLBs and one LSTM layer, are built to learn both local and global emotion-related features. Speech is a time-varying signal that needs sophisticated analysis to reflect its time-varying properties. The designed networks, combining the strengths of CNN and LSTM, are used to recognize the speaker’s emotional state.
The experiments have accomplished the task of learning more emotional information from the experimental data, but how
Conclusion
This paper presents 1D and 2D CNN LSTM networks for recognizing speech emotion. How to learn local correlations and global contextual information from raw audio clips and log-mel spectrograms is investigated. An LFLB, which consists of one convolutional layer, one BN layer, one exponential linear unit layer, and one max-pooling layer, is designed to learn local features. Local features learned by the LFLBs are reshaped and fed into an LSTM layer. The LSTM layer can learn
Acknowledgements
Part of this work was done while the first author was a visiting scholar at the Advanced Analytics Institute (AAI), University of Technology Sydney. The work of Jianfeng Zhao, Xia Mao, and Lijiang Chen was supported in part by the National Natural Science Foundation of China under Grant No. 61603013 and by the Fundamental Research Funds for the Central Universities (Grant No. YWF-18-BJ-Y-181).
References (67)
- et al., Bi-modal emotion recognition from expressive face and body gestures, J. Netw. Comput. Appl. (2007)
- et al., Detection of emotions in Parkinson’s disease using higher order spectral features from brain’s electrical activity, Biomed. Signal Process. Control (2014)
- et al., Automatic analysis of speech F0 contour for the characterization of mood changes in bipolar patients, Biomed. Signal Process. Control (2015)
- et al., Study of empirical mode decomposition and spectral analysis for stress and emotion classification in natural speech, Biomed. Signal Process. Control (2011)
- et al., Automatic speech emotion recognition using modulation spectral features, Speech Commun. (2011)
- et al., Acoustic feature selection and classification of emotions in speech using a 3D continuous emotion model, Biomed. Signal Process. Control (2012)
- et al., Weighted spectral features based on local Hu moments for speech emotion recognition, Biomed. Signal Process. Control (2015)
- et al., Speech emotion recognition: features and classification models, Digit. Signal Process. (2012)
- et al., Implementation of wavelet packet transform and non linear analysis for emotion classification in stroke patient using brain signals, Biomed. Signal Process. Control (2017)
- et al., Survey on speech emotion recognition: features, classification schemes, and databases, Pattern Recognit. (2011)
- Recognizing emotions induced by affective sounds through heart rate variability, IEEE Trans. Affect. Comput.
- Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011, Artif. Intell. Rev.
- Cooperative learning and its application to emotion recognition from speech, IEEE/ACM Trans. Audio Speech Lang. Process.
- Speech Emotion Recognition Using CNN
- Extraction of adaptive wavelet packet filter-bank-based acoustic feature for speech emotion recognition, IET Signal Process.
- Application of fuzzy C-means clustering algorithm to spectral features for emotion classification from speech, Neural Comput. Appl.
- Text-independent phoneme segmentation combining EGG and speech data, IEEE/ACM Trans. Audio Speech Lang. Process.
- Hybrid speech recognition with deep bidirectional LSTM, Autom. Speech Recognit. Underst.
- Speech emotion recognition using deep neural network and extreme learning machine, Conference of the International Speech Communication Association
- Unsupervised feature learning for audio classification using convolutional deep belief networks, Neural Inf. Process. Syst.
- An experimental study of speech emotion recognition based on deep convolutional neural networks, International Conference on Affective Computing and Intelligent Interaction, IEEE
- Learned vs. hand-crafted features for pedestrian gender recognition, ACM International Conference on Multimedia, ACM
- Speech emotion recognition using Fourier parameters, IEEE Trans. Affect. Comput.
- Emotion recognition from Assamese speeches using MFCC features and GMM classifier, TENCON 2008 - 2008 IEEE Region 10 Conference, IEEE
- SVM scheme for speech emotion recognition using MFCC feature, Int. J. Comput. Appl.
- Emotion recognition system from artificial Marathi speech using MFCC and LDA techniques, International Conference on Advances in Communication, Network, and Computing
- Feature extraction from speech data for emotion recognition, J. Adv. Comput. Netw.
- Speech emotion recognition using residual phase and MFCC features, Int. J. Eng. Technol.
- Acoustic emotion recognition using linear and nonlinear cepstral coefficients, Int. J. Adv. Comput. Sci. Appl.
- Music emotion recognition: the combined evidence of MFCC and residual phase, Egypt. Inf. J.
- Reducing the dimensionality of data with neural networks, Science
- Deep neural networks for acoustic emotion recognition: raising the benchmarks, IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE