Investigating the Relation Between Voice Corpus Design and Hybrid Synthesis Under Reduction Constraint

  • Conference paper
Statistical Language and Speech Processing (SLSP 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11816)

Abstract

Hybrid TTS systems generally optimise their cost function against the voice they are given in order to generate the best possible signal. The voice is built from a speech corpus that is usually designed for a specific purpose. In this paper, we consider that the voice is created through a corpus design step under reduction constraints: a recording script is crafted to be optimal for the target TTS engine and its purpose. We investigate the impact of sharing information between this corpus design step and the hybrid TTS optimisation step.
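As a rough illustration of what designing a recording script under a reduction constraint can mean, the sketch below greedily picks sentences that add the most uncovered units per unit of length until a length budget is exhausted. This is a classical coverage heuristic used here only as a stand-in; it is not the CNN-based method of this paper. The names `greedy_script`, `units`, `lengths` and `budget` are hypothetical.

```python
from typing import Dict, List, Set


def greedy_script(
    units: Dict[str, Set[str]],   # sentence id -> set of units it contains (assumed input)
    lengths: Dict[str, int],      # sentence id -> length, e.g. in phones (assumed input)
    budget: int,                  # maximum total length of the script (the reduction constraint)
) -> List[str]:
    """Greedy coverage heuristic: repeatedly add the sentence with the best
    ratio of newly covered units to length, while staying within the budget."""
    covered: Set[str] = set()
    chosen: List[str] = []
    used = 0
    remaining = set(units)
    while remaining:
        best_id, best_gain = None, 0.0
        for sid in sorted(remaining):          # sorted for deterministic tie-breaking
            if used + lengths[sid] > budget:   # sentence would exceed the budget
                continue
            gain = len(units[sid] - covered) / lengths[sid]
            if gain > best_gain:
                best_id, best_gain = sid, gain
        if best_id is None:                    # nothing fits or nothing adds coverage
            break
        chosen.append(best_id)
        covered |= units[best_id]
        used += lengths[best_id]
        remaining.discard(best_id)
    return chosen
```

The budget plays the role of the reduction constraint: a smaller budget forces a shorter script and therefore a smaller voice.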

We start from a reduced voice optimised for a unit selection system using a CNN-based model. This baseline is compared with a hybrid TTS system that uses, as its target cost, a linguistic embedding built for the recording script design step. This approach is in turn compared with a standard hybrid TTS system that is trained only on the voice and therefore has no information about the corpus design process.
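To make the idea of an embedding-based target cost concrete, here is a minimal sketch, assuming the target cost is the Euclidean distance between the embedding predicted for a target position and the embedding of a candidate unit, combined with a join cost in a standard Viterbi search. The distance choice, the weights, and the names `target_cost`, `select_units` and `join_cost` are assumptions for illustration, not the paper's implementation.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Candidate = Tuple[str, Vector]  # (unit id, embedding of that unit)


def target_cost(target_emb: Vector, candidate_emb: Vector) -> float:
    """Assumed target cost: Euclidean distance in the embedding space."""
    return math.dist(target_emb, candidate_emb)


def select_units(
    targets: List[Vector],
    candidates: List[List[Candidate]],
    join_cost: Callable[[str, str], float],
    w_target: float = 1.0,
    w_join: float = 1.0,
) -> List[str]:
    """Viterbi search for the unit sequence minimising the weighted sum of
    target costs and join costs. candidates[i] lists the units usable at position i."""
    # best[i][j] = (cumulative cost of the best path ending in candidate j, backpointer)
    best = [[(w_target * target_cost(targets[0], emb), -1) for _, emb in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for unit_id, emb in candidates[i]:
            tc = w_target * target_cost(targets[i], emb)
            prev = [
                best[i - 1][k][0] + w_join * join_cost(candidates[i - 1][k][0], unit_id)
                for k in range(len(candidates[i - 1]))
            ]
            k_best = min(range(len(prev)), key=prev.__getitem__)
            row.append((prev[k_best] + tc, k_best))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path: List[str] = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j][0])
        j = best[i][j][1]
    return path[::-1]
```

Swapping `target_cost` for a hard-coded linguistic feature comparison, or for a distance to acoustic features predicted by a DNN, would give the other two kinds of system compared in the abstract, under the same search.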

Objective measures and perceptual evaluations show that using the corpus design embedding as the target cost outperforms a classical hard-coded target cost. However, the feed-forward DNN acoustic model of the standard hybrid TTS system remains the best. This emphasises the importance of acoustic information in the TTS target cost, information that is not directly available before the voice is recorded.

Acknowledgements

This study was carried out under the ANR (French National Research Agency) project SynPaFlex ANR-15-CE23-0015 and was also funded by the Région Bretagne and the Conseil Départemental des Côtes d’Armor.

Author information

Corresponding author

Correspondence to Meysam Shamsi.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Shamsi, M., Lolive, D., Barbot, N., Chevelu, J. (2019). Investigating the Relation Between Voice Corpus Design and Hybrid Synthesis Under Reduction Constraint. In: Martín-Vide, C., Purver, M., Pollak, S. (eds) Statistical Language and Speech Processing. SLSP 2019. Lecture Notes in Computer Science, vol 11816. Springer, Cham. https://doi.org/10.1007/978-3-030-31372-2_14

  • DOI: https://doi.org/10.1007/978-3-030-31372-2_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-31371-5

  • Online ISBN: 978-3-030-31372-2

  • eBook Packages: Computer Science, Computer Science (R0)
