Speaking Rate Control of end-to-end TTS Models by Direct Manipulation of the Encoder's Output Embeddings

Lenglet, Martin; Perrotin, Olivier; Bailly, Gérard

doi:10.21437/Interspeech.2022-759

Now that neural Text-To-Speech models achieve high naturalness, the focus of the field has gradually shifted toward gaining more control over the expressiveness of synthetic voices. One such lever is speaking rate, which has become harder for a human operator to control since neural attention networks were introduced to model speech dynamics. While numerous models have reintroduced explicit duration control (e.g., FastSpeech2), they generally rely on additional tasks during training. In this paper, we show that an acoustic analysis of the internal embeddings delivered by the encoder of an unsupervised end-to-end Tacotron2 TTS model is enough to identify and control some acoustic parameters of interest. Specifically, we compare this speaking rate control with the duration control offered by a supervised FastSpeech2 model. Experimental results show that the control provided by the embeddings reproduces behaviour closer to natural speech data.
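
As a rough illustration of what "direct manipulation of the encoder's output embeddings" could look like, the sketch below fits a linear direction in embedding space that predicts duration, then shifts every embedding along it. This is not the authors' procedure: the data are synthetic, the linear least-squares fit is an assumed stand-in for the acoustic analysis, and the names `rate_direction` and `shift_embeddings` are hypothetical.

```python
import numpy as np

def rate_direction(embeddings, durations):
    """Least-squares direction in embedding space that best predicts duration."""
    X = embeddings - embeddings.mean(axis=0)   # centre the embeddings
    y = durations - durations.mean()           # centre the durations
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def shift_embeddings(embeddings, w, delta):
    """Shift each embedding so its linear duration prediction changes by delta."""
    step = delta * w / np.dot(w, w)            # smallest shift achieving delta
    return embeddings + step

# Synthetic stand-in for encoder outputs (200 phones, 16 dimensions): assumption.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 16))
true_w = rng.normal(size=16)
dur = emb @ true_w + 0.01 * rng.normal(size=200)

w = rate_direction(emb, dur)
shifted = shift_embeddings(emb, w, delta=0.5)

# Under the fitted linear model, every phone's predicted duration moves by delta.
predicted_change = (shifted - emb) @ w
```

In a real system the shifted embeddings would be passed on to the decoder; whether the synthesized speech actually slows down or speeds up accordingly is an empirical question that this toy example does not address.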

Cite as: Lenglet, M., Perrotin, O., Bailly, G. (2022) Speaking Rate Control of end-to-end TTS Models by Direct Manipulation of the Encoder's Output Embeddings. Proc. Interspeech 2022, 11-15, doi: 10.21437/Interspeech.2022-759

@inproceedings{lenglet22_interspeech,
  author={Martin Lenglet and Olivier Perrotin and Gérard Bailly},
  title={{Speaking Rate Control of end-to-end TTS Models by Direct Manipulation of the Encoder's Output Embeddings}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={11--15},
  doi={10.21437/Interspeech.2022-759}
}