Improving Acoustic Models for Russian Spontaneous Speech Recognition

Prudnikov, Alexey; Medennikov, Ivan; Mendelev, Valentin; Korenevsky, Maxim; Khokhlov, Yuri

doi:10.1007/978-3-319-23132-7_29

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9319)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

This paper investigates ways to improve acoustic models for Russian spontaneous speech recognition. We applied the main steps of the Kaldi Switchboard recipe to a Russian dataset but obtained lower accuracy than that reported for English spontaneous telephone speech. Two methods proved especially useful for Russian spontaneous speech: i-vector based deep neural network adaptation and speaker-dependent bottleneck features, which provide 8.6% and 11.9% relative word error rate reduction over the baseline system, respectively.
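The relative reductions quoted in the abstract can be read as follows. This is a minimal sketch of the standard relative word error rate (WER) reduction arithmetic, not code from the paper. The function names and the 30.0% baseline WER are hypothetical, chosen only for illustration; the abstract does not state the absolute WER values.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Relative WER reduction of a system over a baseline, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


def wer_after_reduction(baseline_wer: float, relative_reduction_pct: float) -> float:
    """Absolute WER implied by a given relative reduction over the baseline."""
    return baseline_wer * (1.0 - relative_reduction_pct / 100.0)


if __name__ == "__main__":
    # Hypothetical baseline WER (percent); NOT a figure from the paper.
    baseline = 30.0
    # Relative reductions taken from the abstract.
    for name, rel in [("i-vector adaptation", 8.6),
                      ("speaker-dependent bottleneck features", 11.9)]:
        wer = wer_after_reduction(baseline, rel)
        print(f"{name}: {rel}% relative -> {wer:.2f}% absolute WER")
```

A relative reduction depends on the baseline: the same 8.6% relative gain corresponds to a larger absolute improvement when the baseline WER is higher.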

Acknowledgements

The work was financially supported in part by the Government of the Russian Federation, Grant 074-U01, and by the Ministry of Education and Science of the Russian Federation, contract 14.579.21.0057, ID RFMEFI57914X0057.

Author information

Authors and Affiliations

Speech Technology Center Ltd, St. Petersburg, Russia
Alexey Prudnikov, Valentin Mendelev & Maxim Korenevsky
ITMO University, St. Petersburg, Russia
Alexey Prudnikov, Ivan Medennikov & Maxim Korenevsky
STC-innovations Ltd, St. Petersburg, Russia
Ivan Medennikov & Yuri Khokhlov

Corresponding author

Correspondence to Ivan Medennikov.

Editor information

Editors and Affiliations

SPIIRAS, Saint-Petersburg, Russia
Andrey Ronzhin
Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova
University of Patras, Patras, Greece
Nikos Fakotakis

About this paper

Cite this paper

Prudnikov, A., Medennikov, I., Mendelev, V., Korenevsky, M., Khokhlov, Y. (2015). Improving Acoustic Models for Russian Spontaneous Speech Recognition. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds) Speech and Computer. SPECOM 2015. Lecture Notes in Computer Science, vol 9319. Springer, Cham. https://doi.org/10.1007/978-3-319-23132-7_29

DOI: https://doi.org/10.1007/978-3-319-23132-7_29
Published: 04 September 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23131-0
Online ISBN: 978-3-319-23132-7
eBook Packages: Computer Science, Computer Science (R0)
