ISCA Archive Interspeech 2016

Multilingual Speech Emotion Recognition System Based on a Three-Layer Model

Xingfeng Li, Masato Akagi

Speech Emotion Recognition (SER) systems currently focus on classifying emotions within a single language. Since optimal acoustic feature sets are strongly language-dependent, building a generalized SER system that works across multiple languages still faces the open challenges of selecting common features and avoiding per-language retraining. In this paper, we therefore present an SER system for a multilingual scenario from the perspective of human perceptual processing. The goal is twofold. First, to predict multilingual emotion dimensions as accurately as human annotators. To this end, a three-layer model consisting of acoustic features, semantic primitives, and emotion dimensions, together with a Fuzzy Inference System (FIS), is studied. Second, drawing on knowledge of how humans perceive emotion across languages in a dimensional space, we adopt direction and distance as common features to detect multilingual emotions. The system achieves estimation performance on emotion dimensions comparable to human evaluation, and classification rates close to those of a monolingual SER system.
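The cascade described in the abstract, acoustic features mapped to semantic primitives and then to emotion dimensions, with direction and distance computed in the resulting valence-arousal plane, can be sketched roughly as follows. This is a minimal illustration on synthetic data: plain least-squares regressors stand in for the paper's Fuzzy Inference System, and all dimensions, variable names, and the neutral reference point are assumptions, not the authors' actual setup.

```python
import numpy as np

# Hypothetical sketch of a three-layer mapping:
#   acoustic features -> semantic primitives -> emotion dimensions.
# Least-squares regression is used here as a simple stand-in for the
# Fuzzy Inference System (FIS) described in the paper; data is synthetic.

rng = np.random.default_rng(0)

n, n_acoustic, n_primitives, n_dims = 200, 10, 5, 2  # dims: valence, arousal

X = rng.normal(size=(n, n_acoustic))      # layer 1: acoustic features
W1 = rng.normal(size=(n_acoustic, n_primitives))
P = X @ W1                                # layer 2: semantic primitives
W2 = rng.normal(size=(n_primitives, n_dims))
Y = P @ W2                                # layer 3: emotion dimensions

# Fit the two mappings layer by layer.
A1, *_ = np.linalg.lstsq(X, P, rcond=None)        # acoustic -> primitives
A2, *_ = np.linalg.lstsq(X @ A1, Y, rcond=None)   # primitives -> dimensions

Y_hat = (X @ A1) @ A2
rmse = float(np.sqrt(np.mean((Y - Y_hat) ** 2)))

# Direction and distance relative to an (assumed) neutral origin in the
# valence-arousal plane -- the language-independent features the paper
# proposes for multilingual classification.
delta = Y_hat - np.zeros(n_dims)
distance = np.linalg.norm(delta, axis=1)
direction = np.arctan2(delta[:, 1], delta[:, 0])  # angle in radians
```

Because the synthetic targets are exactly linear in the inputs, the fitted cascade recovers them almost perfectly; with real speech data, each stage would be a trained FIS and the error would reflect annotation variability.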


doi: 10.21437/Interspeech.2016-645

Cite as: Li, X., Akagi, M. (2016) Multilingual Speech Emotion Recognition System Based on a Three-Layer Model. Proc. Interspeech 2016, 3608-3612, doi: 10.21437/Interspeech.2016-645

@inproceedings{li16m_interspeech,
  author={Xingfeng Li and Masato Akagi},
  title={{Multilingual Speech Emotion Recognition System Based on a Three-Layer Model}},
  year=2016,
  booktitle={Proc. Interspeech 2016},
  pages={3608--3612},
  doi={10.21437/Interspeech.2016-645}
}