Voice Cloning: Training Speaker Selection with Limited Multi-Speaker Corpus

Guennec, David; Wadoux, Lily; Sini, Aghilas; Barbot, Nelly; Lolive, Damien

doi:10.21437/SSW.2023-27

Text-To-Speech synthesis with few data is a challenging task, in particular when choosing the target speaker is not an option. Voice cloning is a popular method to alleviate these issues using only a few minutes of target speech. To do this, the model must first be trained on a large corpus of thousands of hours and hundreds of speakers. In this paper, we tackle the challenge of cloning voices with a much smaller corpus, using both the speaker adaptation and speaker encoding methods. We study the impact of selecting our training speakers based on their similarity to the targets. We train models using only the training speakers closest/farthest to our targets in terms of speaker similarity from a pool of 14 speakers. We show that the selection of speakers in the training set has an impact on the similarity to the target speaker. The effect is more prominent for speaker encoding than adaptation. However, it remains nuanced when it comes to naturalness.

Cite as: Guennec, D., Wadoux, L., Sini, A., Barbot, N., Lolive, D. (2023) Voice Cloning: Training Speaker Selection with Limited Multi-Speaker Corpus. Proc. 12th ISCA Speech Synthesis Workshop (SSW2023), 170-176, doi: 10.21437/SSW.2023-27

@inproceedings{guennec23_ssw,
  author={David Guennec and Lily Wadoux and Aghilas Sini and Nelly Barbot and Damien Lolive},
  title={{Voice Cloning: Training Speaker Selection with Limited Multi-Speaker Corpus}},
  year=2023,
  booktitle={Proc. 12th ISCA Speech Synthesis Workshop (SSW2023)},
  pages={170--176},
  doi={10.21437/SSW.2023-27}
}