Investigating the Relation Between Voice Corpus Design and Hybrid Synthesis Under Reduction Constraint

Shamsi, Meysam; Lolive, Damien; Barbot, Nelly; Chevelu, Jonathan

doi:10.1007/978-3-030-31372-2_14

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11816)

Included in the following conference series:

International Conference on Statistical Language and Speech Processing

Abstract

Hybrid TTS systems generally optimise their cost function against a given voice in order to generate the best possible signal. The voice is built from a speech corpus that is usually designed for a specific purpose. We consider that voice creation is carried out through a corpus design step under reduction constraints: during this stage, a recording script is crafted to be optimal for the target TTS engine and its purpose. In this paper, we investigate the impact of sharing information between the corpus design step and the hybrid TTS optimisation step.

We start from a reduced voice optimised for a unit selection system using a CNN-based model. This baseline is compared to a hybrid TTS system that uses, as its target cost, a linguistic embedding built for the recording script design step. This approach is also compared to a standard hybrid TTS system that is trained only on the voice and therefore has no information about the corpus design process.

Objective measures and perceptual evaluations show that integrating the corpus design embedding as the target cost outperforms a classical hard-coded target cost. However, the feed-forward DNN acoustic model of the standard hybrid TTS system remains the best option. This highlights the importance of acoustic information in the TTS target cost, information that is not directly available before the voice is recorded.
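
To make the notion of an embedding-based target cost concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each target position and each candidate unit has already been mapped to a fixed-size linguistic embedding, and that the target cost is the Euclidean distance between the two vectors. The distance metric, the function names `target_cost` and `rank_candidates`, and the toy 3-dimensional vectors are all illustrative assumptions; the abstract does not specify them. A full unit selection system would also add a concatenation cost between consecutive units, which is omitted here.

```python
import math


def target_cost(target_emb, candidate_emb):
    """Euclidean distance between two embeddings (assumed metric, not from the paper)."""
    if len(target_emb) != len(candidate_emb):
        raise ValueError("embedding dimensions differ")
    return math.sqrt(sum((t - c) ** 2 for t, c in zip(target_emb, candidate_emb)))


def rank_candidates(target_emb, candidates):
    """Return candidate unit ids sorted by increasing target cost."""
    return sorted(candidates, key=lambda uid: target_cost(target_emb, candidates[uid]))


# Toy, made-up embeddings for one target position and three candidate units.
target = [0.0, 1.0, 0.0]
candidates = {
    "unit_a": [0.0, 1.0, 0.0],
    "unit_b": [3.0, 5.0, 0.0],
    "unit_c": [1.0, 1.0, 0.0],
}

ranking = rank_candidates(target, candidates)
print(ranking)
```

In this sketch the candidate whose embedding is closest to the target embedding is ranked first, which is the role a target cost plays during unit selection.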

Acknowledgements

This study was carried out under the ANR (French National Research Agency) project SynPaFlex (ANR-15-CE23-0015) and was also funded by the Région Bretagne and the Conseil Départemental des Côtes d’Armor.

Author information

Authors and Affiliations

Univ Rennes, CNRS, IRISA, Lannion, France
Meysam Shamsi, Damien Lolive, Nelly Barbot & Jonathan Chevelu

Corresponding author

Correspondence to Meysam Shamsi.

Editor information

Editors and Affiliations

Rovira i Virgili University, Tarragona, Spain
Carlos Martín-Vide
Queen Mary University of London, London, UK
Matthew Purver
Jožef Stefan Institute, Ljubljana, Slovenia
Senja Pollak

Cite this paper

Shamsi, M., Lolive, D., Barbot, N., Chevelu, J. (2019). Investigating the Relation Between Voice Corpus Design and Hybrid Synthesis Under Reduction Constraint. In: Martín-Vide, C., Purver, M., Pollak, S. (eds) Statistical Language and Speech Processing. SLSP 2019. Lecture Notes in Computer Science, vol 11816. Springer, Cham. https://doi.org/10.1007/978-3-030-31372-2_14

DOI: https://doi.org/10.1007/978-3-030-31372-2_14
Published: 27 September 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-31371-5
Online ISBN: 978-3-030-31372-2
eBook Packages: Computer Science, Computer Science (R0)