
Phoneme-guided dysarthric speech conversion with non-parallel data by joint training

  • Original Paper
  • Published in Signal, Image and Video Processing

Abstract

The phonetic structures of dysarthric speech are more difficult to discriminate than those of normal speech. In this paper, we therefore propose a novel voice conversion framework for dysarthric speech that learns disentangled audio-transcription representations. The novelty of this method is that it takes both the audio and its corresponding transcription as training inputs simultaneously. We constrain the linguistic representation extracted from the audio input to be close to the linguistic representation extracted from the transcription input, forcing the two to share the same distribution. As a result, the proposed model can generate appropriate linguistic representations without any transcripts at the testing stage. The results of objective and subjective evaluations show that the speech converted by the proposed method has higher intelligibility and better speaker similarity than that of the baseline approaches.
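To make the joint-training constraint concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it assumes mel-spectrogram inputs and frame-aligned phoneme labels, and all module names, dimensions, and loss weights are illustrative assumptions. It shows the core idea from the abstract: a reconstruction loss on the audio path plus a distance penalty that ties the audio-derived linguistic representation to the transcription-derived one, so that only the audio encoder is needed at test time.

```python
# Minimal sketch of joint audio/transcription training (assumed architecture).
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Maps a mel-spectrogram (B, T, n_mels) to linguistic features (B, T, d)."""
    def __init__(self, n_mels=80, d=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, d, batch_first=True)

    def forward(self, mel):
        out, _ = self.rnn(mel)
        return out

class TextEncoder(nn.Module):
    """Maps frame-aligned phoneme IDs (B, T) to linguistic features (B, T, d)."""
    def __init__(self, n_phonemes=70, d=128):
        super().__init__()
        self.emb = nn.Embedding(n_phonemes, d)
        self.rnn = nn.GRU(d, d, batch_first=True)

    def forward(self, phonemes):
        out, _ = self.rnn(self.emb(phonemes))
        return out

class Decoder(nn.Module):
    """Reconstructs the mel-spectrogram from linguistic features and a speaker embedding."""
    def __init__(self, d=128, spk_dim=64, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(d + spk_dim, 256, batch_first=True)
        self.proj = nn.Linear(256, n_mels)

    def forward(self, ling, spk):
        spk = spk.unsqueeze(1).expand(-1, ling.size(1), -1)  # broadcast speaker code over time
        out, _ = self.rnn(torch.cat([ling, spk], dim=-1))
        return self.proj(out)

def joint_loss(mel, phonemes, spk, audio_enc, text_enc, dec, alpha=1.0):
    """Reconstruction loss plus a penalty forcing the audio- and text-derived
    linguistic representations to share the same distribution."""
    z_audio = audio_enc(mel)
    z_text = text_enc(phonemes)
    recon = dec(z_audio, spk)
    loss_recon = nn.functional.l1_loss(recon, mel)
    loss_match = nn.functional.mse_loss(z_audio, z_text)  # tie the two representations
    return loss_recon + alpha * loss_match

# Toy usage with random tensors (batch of 2, 100 frames).
audio_enc, text_enc, dec = AudioEncoder(), TextEncoder(), Decoder()
mel = torch.randn(2, 100, 80)
phonemes = torch.randint(0, 70, (2, 100))
spk = torch.randn(2, 64)
loss = joint_loss(mel, phonemes, spk, audio_enc, text_enc, dec)
loss.backward()
```

At conversion time, under these assumptions only the audio encoder and decoder would be used, with a target-speaker embedding supplied in place of the source speaker's, which is why no transcripts are required during testing.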




Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Chen, X., Oshiro, A., Chen, J. et al. Phoneme-guided dysarthric speech conversion with non-parallel data by joint training. SIViP 16, 1641–1648 (2022). https://doi.org/10.1007/s11760-021-02119-6

