Abstract
Speech recognition in noisy environments is one of the long-standing research themes but remains a very important challenge nowadays. Therefore, there is much research into all techniques and approaches to improve the performance of speech recognition systems, even in poor conditions. This paper presents a comparative study under various conditions based on two architectures (GMM-HMM and DNN-HMM), the Hybrid GMM-HMM models using the CMU Sphinx tools and the Hybrid DNN-HMM using the KALDI toolkit in noise environment. In this study, we compare the Hybrid GMM-HMM models and the Hybrid DNN-HMM models to evaluate the performance of the proposed system. The novelty of this paper is to test if the presented tools could be, with good accuracy, recognize the Arabic speech principally in noisy environment. In addition, we adopted the noisy training theory in this paper based on GMM-HMM and DNN-HMM model. We use the public Arabic Speech Corpus for Isolated Words (20 words), three noise levels, and three noise types. The implementation of our system consists of two phases: Features extraction using Mel-frequency Cepstral Coefficient (MFCC) and the classification phase will use separately the previous two models. In order to test the performance of these methods a simulation will presented for different SNR and for different district type of noise.
Similar content being viewed by others
References
Abdel-Hamid, O., Mohamed, A., Jiang, H., Deng, L., Penn, G., & Yu, D. (2014). Convolutional neural networks for speech recognition. IEEE Transactions on Audio, Speech and Language Processing, 22(10), 1533–1545.
Abdou, S.M., & Moussa, A.M. (2017). Arabic Speech Recognition: Challenges and State of the Art. In: Computational linguistics, speech and image processing for arabic language, vol. 4, (pp. 1–27), World Scientific.
Abushariah, A.A.M., Gunawan, T.S., Khalifa, O.O., & Abushariah, M.A.M. (2010). English digits speech recognition system based on Hidden Markov Models. In: International conference on computer and communication engineering (ICCCE’10), (pp. 1–5).
Alalshekmubara, A.K., & Smith, L.S. (2014). On improving the classification capability of reservoir computing for Arabic speech recognition. In: Artificial neural networks and machine learning – ICANN 2014, (pp. 225–232).
Alalshekmubarak, A., & Smith, L.S. (2014). On improving the classification capability of reservoir computing for Arabic speech recognition. In: Artificial neural networks and machine learning – ICANN 2014, Cham, (pp. 225–232).
Alotaibi, Y.A. (2004). Spoken Arabic digits recognizer using recurrent neural networks. In: Proceedings of the fourth IEEE international symposium on signal processing and information technology, (pp. 195–199).
Alsharhan, E., & Ramsay, A. (2020). Investigating the effects of gender, dialect, and training size on the performance of Arabic speech recognition. Language Resources and Evaluation, 54(4), 975–998. https://doi.org/10.1007/s10579-020-09505-5.
Aymen, M., Abdelaziz, A., Halim, S., & Maaref, H. (2011). hidden Markov models for automatic speech recognition. In: 2011 International conference on communications, computing and control applications (CCCA), (pp. 1–6).
Barker, J., Marxer, R., Vincent, E., & Watanabe, S. (2017). The third “CHiME” speech separation and recognition challenge: Analysis and outcomes. Computer Speech & Language, 46, 605–626.
Chien, J.-T., & Misbullah, A. (2016). Deep long short-term memory networks for speech recognition. In: 2016 10th international symposium on Chinese spoken language processing (ISCSLP), (pp. 1–5).
Elmahdy, M., Gruhn, R., & Minker, W. (2012). Novel techniques for dialectal Arabic speech recognition. New York: Springer.
El-Ramly, S.H., Abdel-Kader, N.S., & El-Adawi, R. (2002). Neural networks used for speech recognition. In: Proceedings of the nineteenth national radio science conference, (pp. 200–207).
Eltiraifi, O., Elbasheer, E., & Nawari, M. (2018). A comparative study of MFCC and LPCC features for speech activity detection using deep belief network. In: 2018 international conference on computer, control, electrical, and electronics engineering (ICCCEEE), (pp. 1–5).
Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In: 2013 IEEE international conference on acoustics, speech and signal processing, (pp. 6645–6649).
Guerid, A., & Houacine, A. (2019). Recognition of isolated digits using DNN–HMM and harmonic noise model. IET Signal Processing, 13(2), 207–214.
Haton, J.-P. (1999). Neural networks for automatic speech recognition: A review. In G. Chollet, M.-G. Di Benedetto, A. Esposito, & M. Marinaro (Eds.), Speech processing, recognition and artificial neural networks (pp. 259–280). London: Springer.
Hinton, G., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82–97.
Huggins-Daines, D., Kumar, M., Chan, A., Black, A.W., Ravishankar, M., & Rudnicky, A.I. (2006). Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices. In 2006 IEEE international conference on acoustics speech and signal processing proceedings, vol. 1, (pp. I–I).
Juang, B. H., & Rabiner, L. R. (1991). Hidden Markov models for speech recognition. Technometrics, 33(3), 251–272.
Kadyan, V., Mantri, A., Aggarwal, R. K., & Singh, A. (2019). A comparative study of deep neural network based Punjabi-ASR system. International Journal of Speech Technology, 22(1), 111–119.
Khelifa, M. O. M., Elhadj, Y. M., Abdellah, Y., & Belkasmi, M. (2017). Constructing accurate and robust HMM/GMM models for an Arabic speech recognition system. International Journal of Speech Technology, 20(4), 937–949.
Le Prell, C. G., & Clavier, O. H. (2017). Effects of noise on speech recognition: Challenges for communication by service members. Hearing Research, 349, 76–89.
Manchanda, S., & Gupta, D. (2018). Hybrid approach of feature extraction and vector quantization in speech recognition. In: Proceedings of the second international conference on computational intelligence and informatics, (pp. 639–645). Singapore.
Novoa, J., Wuth, J., Escudero, J.P., Fredes, J., Mahu, R., & Yoma, N.B. DNN-HMM based Automatic Speech Recognition for HRI Scenarios. In Proceedings of the 2018 ACM/IEEE international conference on human–robot interaction, Chicago, USA, (pp. 150–159).
Oueslati, O., Cambria, E., HajHmida, M. B., & Ounelli, H. (2020). A review of sentiment analysis research in Arabic language. Future Generation Computer Systems, 112, 408–430. https://doi.org/10.1016/j.future.2020.05.034.
Pearce, D., Hirsch, H., & Gmbh, E.E.D. (2000). The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In: ISCA ITRW ASR2000, (pp. 29–32).
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., & Vesely, K. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding (No. CONF). IEEE Signal Processing Society.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286.
Roberts, W.J.J., & Willmore, J.P. (1999). Automatic speaker recognition using Gaussian mixture models. In: Proceedings 1999 information, decision and control. Data and information fusion symposium, signal processing and communications symposium and decision and control symposium. (Cat. No.99EX251), (pp. 465–470).
Senior, A., Heigold, G., Bacchiani, M., & Liao, H. (2014). GMM-free DNN acoustic model training. In 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5602–5606.
Senior, A., Heigold, G., Bacchiani, M., & Liao, H. (2014). GMM-free DNN acoustic model training. In: 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP), (pp. 5602–5606).
Shahin, M., Ahmed, B., McKechnie, J., Ballard, K., & Gutierrez-Osuna, R. (2014). Comparison of GMM-HMM and DNN-HMM based pronunciation verification techniques for use in the assessment of childhood apraxia of speech. In: 15th Annual conference of the international speech communication association (INTERSPEECH 2014): celebrating the diversity of spoken languages, (pp. 1583–1587).
Varga, A., & Steeneken, H. J. M. (1993). Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3), 247–251.
Vegesna, V. V. R., Gurugubelli, K., Vydana, H. K., Pulugandla, B., Shrivastava, M., & Vuppala, A. K. (2017). DNN-HMM acoustic modeling for large vocabulary telugu speech recognition. Mining Intelligence and Knowledge Exploration. https://doi.org/10.1007/978-3-319-71928-3_19.
Wahyuni, E.S. (2017). Arabic speech recognition using MFCC feature extraction and ANN classification. In 2017 2nd International conferences on information technology, information systems and electrical engineering, (pp. 22–25).
Yoshioka, T., Karita, S., & Nakatani, T. (2015). Far-field speech recognition using CNN-DNN-HMM with convolution in time. In: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), (pp. 4360–4364).
Yu, D., & Deng, L. (2015). Automatic speech recognition: A deep learning approach. London: Springer.
Zhang, W., Zhai, M., Huang, Z., Liu, C., Li, W., & Cao, Y. (2019). Towards end-to-end speech recognition with deep multipath convolutional neural networks. In W. Zhang, M. Zhai, Z. Huang, C. Liu, W. Li, & Y. Cao (Eds.), Intelligent robotics and applications. Cham: Springer.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Ouisaadane, A., Safi, S. A comparative study for Arabic speech recognition system in noisy environments. Int J Speech Technol 24, 761–770 (2021). https://doi.org/10.1007/s10772-021-09847-7
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10772-021-09847-7