Abstract
Constructing a mapping between articulatory movements and the corresponding speech could significantly facilitate speech training and the development of speech aids for patients with voice disorders. In this paper, we propose a novel deep learning framework that constructs a bidirectional mapping between articulatory information and synchronized speech recorded with an ultrasound system. We created a dataset comprising six Chinese vowels and employed a bimodal deep autoencoder based on the Restricted Boltzmann Machine (RBM) to learn the correlation between speech and ultrasound images of the tongue, obtaining weight matrices that encode the learned data representations. Speech and ultrasound images were then reconstructed from the extracted features. The reconstruction error of the ultrasound images produced by our method was lower than that of an approach based on Principal Component Analysis (PCA), and the reconstructed speech closely approximated the original, as indicated by a small mean formant error (MFE). After acquiring the shared representations with the RBM-based deep autoencoder, we mapped between ultrasound images of the tongue and the corresponding acoustic signals using a Deep Neural Network (DNN) framework built on revised deep denoising autoencoders. The results indicate that our proposed method outperforms a Gaussian Mixture Model (GMM)-based method to which it was compared.
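The RBM is the building block that such bimodal deep autoencoders stack layer by layer. As a rough illustration of how one RBM layer is trained, the sketch below implements one-step contrastive divergence (CD-1) in NumPy; the layer sizes, learning rate, and toy data are placeholders for illustration only, not the configuration used in the paper.

```python
import numpy as np

class RBM:
    """Minimal Bernoulli RBM trained with one-step contrastive divergence (CD-1)."""

    def __init__(self, n_visible, n_hidden, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible-unit biases
        self.b_h = np.zeros(n_hidden)    # hidden-unit biases
        self.lr = lr
        self.rng = rng

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def hidden_probs(self, v):
        """P(h=1 | v) for a batch of visible vectors."""
        return self._sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        """P(v=1 | h) for a batch of hidden vectors."""
        return self._sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0):
        """One CD-1 parameter update on a batch; returns reconstruction MSE."""
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)   # one-step reconstruction
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        # Gradient approximation: positive phase minus negative phase
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)
        return float(np.mean((v0 - v1) ** 2))

# Toy usage: two complementary binary patterns, reconstruction error falls
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 10, dtype=float)
rbm = RBM(n_visible=4, n_hidden=2, lr=0.2, seed=0)
errors = [rbm.cd1_step(data) for _ in range(300)]
```

In a bimodal setup, one such RBM would be trained per modality (acoustic features, ultrasound images), and a further RBM on the concatenated hidden activations would learn the shared representation from which either modality can be reconstructed.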
Acknowledgments
This work was supported in part by the National Basic Research Program of China (No. 2013CB329305), and in part by grants from the National Natural Science Foundation of China (No. 61175016, No. 61304250).
Cite this article
Wei, J., Fang, Q., Zheng, X. et al. Mapping ultrasound-based articulatory images and vowel sounds with a deep neural network framework. Multimed Tools Appl 75, 5223–5245 (2016). https://doi.org/10.1007/s11042-015-3038-y