Abstract
Constructing a mapping between articulatory movements and the corresponding speech could significantly facilitate speech training and the development of speech aids for patients with voice disorders. In this paper, we propose a novel deep learning framework that constructs a bidirectional mapping between articulatory information and synchronized speech recorded with an ultrasound system. We created a dataset comprising six Chinese vowels and employed a bimodal deep autoencoder based on the Restricted Boltzmann Machine (RBM) to learn the correlation between speech and ultrasound images of the tongue, obtaining weight matrices that encode the learned data representations. Speech and ultrasound images were then reconstructed from the extracted features. The reconstruction error of the ultrasound images produced by our method was lower than that of an approach based on Principal Component Analysis (PCA), and the reconstructed speech closely approximated the original, as indicated by a small mean formant error (MFE). After acquiring the shared representations with the RBM-based deep autoencoder, we mapped between ultrasound images of the tongue and the corresponding acoustic signals using a Deep Neural Network (DNN) framework built on revised deep denoising autoencoders. The results indicate that our proposed method outperforms a Gaussian Mixture Model (GMM)-based method to which it was compared.
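The RBM is the building block that such bimodal deep autoencoders stack layer by layer. As a rough illustration of how one RBM layer is trained, the sketch below implements one-step contrastive divergence (CD-1) in NumPy; the layer sizes, learning rate, and toy data are placeholders for illustration only, not the configuration used in the paper.

```python
import numpy as np

class RBM:
    """Minimal Bernoulli RBM trained with one-step contrastive divergence (CD-1)."""

    def __init__(self, n_visible, n_hidden, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible-unit biases
        self.b_h = np.zeros(n_hidden)    # hidden-unit biases
        self.lr = lr
        self.rng = rng

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def hidden_probs(self, v):
        """P(h=1 | v) for a batch of visible vectors."""
        return self._sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        """P(v=1 | h) for a batch of hidden vectors."""
        return self._sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0):
        """One CD-1 parameter update on a batch; returns reconstruction MSE."""
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)   # one-step reconstruction
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        # Gradient approximation: positive phase minus negative phase
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)
        return float(np.mean((v0 - v1) ** 2))

# Toy usage: two complementary binary patterns, reconstruction error falls
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 10, dtype=float)
rbm = RBM(n_visible=4, n_hidden=2, lr=0.2, seed=0)
errors = [rbm.cd1_step(data) for _ in range(300)]
```

In a bimodal setup, one such RBM would be trained per modality (acoustic features, ultrasound images), and a further RBM on the concatenated hidden activations would learn the shared representation from which either modality can be reconstructed.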
Acknowledgments
This work was supported in part by the National Basic Research Program of China (No. 2013CB329305), and in part by grants from the National Natural Science Foundation of China (No. 61175016, No. 61304250).
Cite this article
Wei, J., Fang, Q., Zheng, X. et al. Mapping ultrasound-based articulatory images and vowel sounds with a deep neural network framework. Multimed Tools Appl 75, 5223–5245 (2016). https://doi.org/10.1007/s11042-015-3038-y