Abstract
In this paper, a psychologically-inspired binary cascade classification schema is proposed for speech emotion recognition. Performance is enhanced because commonly confused pairs of emotions are explicitly distinguished from one another at dedicated stages of the cascade. The extracted features are related to statistics of the pitch, formant, and energy contours, as well as to spectral, cepstral, perceptual, and temporal features, autocorrelation, MPEG-7 descriptors, Fujisaki's model parameters, voice quality, jitter, and shimmer. The selected features are fed as input to a K-nearest neighbor classifier and to support vector machines; two kernels are tested for the latter, namely linear and Gaussian radial basis function. A recently proposed speaker-independent experimental protocol is applied to the Berlin emotional speech database, for each gender separately. The best emotion recognition accuracy, achieved by support vector machines with a linear kernel, equals 87.7%, outperforming state-of-the-art approaches. Statistical analysis is carried out first with respect to the classifiers' error rates and then with respect to the information conveyed by the classifiers' confusion matrices.
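To make the cascade idea concrete, the following is a minimal sketch of a binary cascade of linear-kernel support vector machines, assuming scikit-learn. The emotion groupings at each node (loosely arousal-based) and the feature matrix are illustrative assumptions, not the paper's exact psychologically-inspired hierarchy or feature set.

```python
# Minimal sketch of a binary cascade for speech emotion recognition.
# Assumes scikit-learn; the node groupings below are illustrative,
# not the paper's exact psychologically-inspired schema.
import numpy as np
from sklearn.svm import SVC

# Each node splits the surviving emotion set in two, so commonly
# confused pairs end up separated at a dedicated node of their own.
CASCADE = [
    ({"anger", "happiness", "fear"}, {"sadness", "boredom", "neutral"}),
    ({"anger"}, {"happiness", "fear"}),
    ({"happiness"}, {"fear"}),
    ({"sadness"}, {"boredom", "neutral"}),
    ({"boredom"}, {"neutral"}),
]

def train_cascade(X, y):
    """Train one linear-kernel SVM per binary node.

    X: (n_samples, n_features) array of selected acoustic features.
    y: array of emotion labels (strings).
    """
    X, y = np.asarray(X), np.asarray(y)
    nodes = []
    for left, right in CASCADE:
        mask = np.isin(y, sorted(left | right))
        targets = np.isin(y[mask], sorted(right)).astype(int)  # 0=left, 1=right
        nodes.append((left, right, SVC(kernel="linear").fit(X[mask], targets)))
    return nodes

def predict_cascade(nodes, x):
    """Route one feature vector down the cascade to a single emotion."""
    candidates = nodes[0][0] | nodes[0][1]
    for left, right, clf in nodes:
        if not (left & candidates and right & candidates):
            continue  # node is irrelevant to the surviving emotion set
        side = right if clf.predict(x.reshape(1, -1))[0] else left
        candidates &= side
        if len(candidates) == 1:
            break
    return next(iter(candidates))
```

In the gender-dependent setting described in the abstract, one such cascade would presumably be trained per gender on speaker-independent folds, so that no speaker appears in both training and test data.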
Acknowledgements
M. Kotti would like to thank Associate Professor Constantine Kotropoulos for his valuable contribution to the extraction of part of the features described in Sect. 4.
Additional information
This work was carried out during the tenure of an ERCIM “Alain Bensoussan” Fellowship Programme.
Cite this article
Kotti, M., Paternò, F. Speaker-independent emotion recognition exploiting a psychologically-inspired binary cascade classification schema. Int J Speech Technol 15, 131–150 (2012). https://doi.org/10.1007/s10772-012-9127-7