Improving Emotional TTS with an Emotion Intensity Input from Unsupervised Extraction

Schnell, Bastian; Garner, Philip N.

doi:10.21437/SSW.2021-11

We aim to provide controls for emotion in synthetic speech. Many emotions are not displayed continuously in an otherwise emotional utterance; rather, the intensity varies with time. We show that an emotion recogniser is capable of producing a measure of emotion intensity via attention or saliency; this measure is appropriate for labelling utterances subsequently used to train a speech synthesiser. We evaluate novel and published means to do this, showing that, whilst attention is no longer state of the art for emotion recognition, it is a good way to indicate emotion intensity for speech synthesis.

Cite as: Schnell, B., Garner, P.N. (2021) Improving Emotional TTS with an Emotion Intensity Input from Unsupervised Extraction. Proc. 11th ISCA Speech Synthesis Workshop (SSW 11), 60-65, doi: 10.21437/SSW.2021-11

@inproceedings{schnell21_ssw,
  author={Bastian Schnell and Philip N. Garner},
  title={{Improving Emotional TTS with an Emotion Intensity Input from Unsupervised Extraction}},
  year=2021,
  booktitle={Proc. 11th ISCA Speech Synthesis Workshop (SSW 11)},
  pages={60--65},
  doi={10.21437/SSW.2021-11}
}