
Enhancing accuracy of long contextual dependencies for Punjabi speech recognition system using deep LSTM

  • Published in: International Journal of Speech Technology

Abstract

Long short-term memory (LSTM) is a powerful model for building an ASR system, whereas standard recurrent networks are generally too inefficient to achieve good performance. Although the LSTM architecture addresses these issues, its performance still degrades on long contextual information. Recent experiments show that LSTM and its improved variants, such as Deep LSTM, require extensive tuning during training. In this paper, Deep LSTM models are built on long contextual sentences by selecting optimal values of batch size, number of layers, and activation function. A comparative study of train and test perplexity is also carried out, alongside computation of word error rate. Furthermore, we use hybrid discriminative approaches with different numbers of iterations, which show significant improvement with Deep LSTM networks. Experiments are mainly performed on single sentences or on one to two concatenated sentences. Deep LSTM achieves a performance improvement of 3–4% over conventional language models (LMs) and modelling classifier approaches, with an acceptable word error rate, on top of a state-of-the-art Punjabi speech recognition system.
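As a rough illustration of the mechanics the abstract refers to (not the authors' implementation, whose details are in the full paper), the sketch below implements a single LSTM time step in NumPy, showing the gating that lets the cell carry long-range context, together with the perplexity measure used to compare language models. All names and shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    x: input vector of shape (d,); h_prev, c_prev: previous hidden and
    cell states of shape (n,); W: (4n, d), U: (4n, n), b: (4n,) hold the
    stacked parameters for the input, forget, candidate, and output gates.
    """
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[:n])             # input gate
    f = sigmoid(z[n:2 * n])        # forget gate: retains long-range context
    g = np.tanh(z[2 * n:3 * n])    # candidate cell update
    o = sigmoid(z[3 * n:])         # output gate
    c = f * c_prev + i * g         # additive cell path eases gradient flow
    h = o * np.tanh(c)             # hidden state passed to the next layer/step
    return h, c

def perplexity(word_probs):
    """Perplexity over a sentence: exp of the mean negative
    log-probability the language model assigns to each word."""
    p = np.asarray(word_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(p))))
```

A "Deep" LSTM stacks several such cells so that each layer's hidden sequence becomes the next layer's input; a uniform model over a 4-word vocabulary, for example, has perplexity 4, and lower perplexity generally tracks lower word error rate.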




Author information

Corresponding author

Correspondence to Virender Kadyan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Kadyan, V., Dua, M. & Dhiman, P. Enhancing accuracy of long contextual dependencies for Punjabi speech recognition system using deep LSTM. Int J Speech Technol 24, 517–527 (2021). https://doi.org/10.1007/s10772-021-09814-2
