
Speaker Modeling Using Emotional Speech for More Robust Speaker Identification

  • THEORY AND METHODS OF SIGNAL PROCESSING
  • Published in: Journal of Communications Technology and Electronics

Abstract

Automatic identity recognition in a fast, reliable, and non-intrusive way is one of the most challenging topics in today's digital world. A possible approach to identity recognition is identification by voice. The characteristics of speech relevant for automatic speaker recognition can be affected by external factors such as noise and channel distortions, but also by speaker-specific conditions such as emotional or health states. This paper addresses the improvement of a speaker recognition system through different model training strategies, aiming at the best performance with only a limited amount of neutral and emotional speech data. The models adopted are a Gaussian Mixture Model and i-vectors, both operating on Mel Frequency Cepstral Coefficients, and the experiments have been conducted on the Russian Language Affective speech database (RUSLANA). The results show that the appropriate use of emotional speech in speaker model training improves the robustness of a speaker recognition system, both when tested on neutral and on emotional speech.
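As a concrete illustration of the GMM baseline named in the abstract (the i-vector system is not sketched here), a minimal MFCC + GMM speaker-identification pipeline might look as follows. This is a hedged sketch, not the authors' implementation: the file layout, mixture count, and MFCC settings are illustrative assumptions.

```python
# Minimal sketch of GMM-based speaker identification on MFCC features.
# Paths, number of mixtures, and MFCC settings are illustrative assumptions,
# not the configuration used in the paper.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(path, sr=16000, n_mfcc=13):
    """Load an utterance and return a (frames, n_mfcc) MFCC matrix."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # frames as rows, coefficients as columns

def train_speaker_models(train_files, n_components=32):
    """Fit one diagonal-covariance GMM per speaker on pooled MFCC frames.

    train_files: dict mapping speaker id -> list of wav paths
    (e.g. a mix of neutral and emotional utterances per speaker).
    """
    models = {}
    for speaker, paths in train_files.items():
        frames = np.vstack([mfcc_features(p) for p in paths])
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag",
                              max_iter=200)
        models[speaker] = gmm.fit(frames)
    return models

def identify(test_file, models):
    """Return the speaker whose GMM gives the highest average log-likelihood."""
    frames = mfcc_features(test_file)
    scores = {spk: gmm.score(frames) for spk, gmm in models.items()}
    return max(scores, key=scores.get)
```

In such a setup, the training strategy studied in the paper would correspond to what is placed in each speaker's list of training files (neutral utterances only, or neutral plus emotional), while the identification step stays unchanged.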



ACKNOWLEDGMENTS

The authors would like to thank V. A. Petrushin for access to the RUSLANA database and V. Dellwo for access to the TVOID database.

Author information

Corresponding author

Correspondence to M. Milošević.


About this article


Cite this article

Milošević, M., Nedeljković, Ž., Glavitsch, U. et al. Speaker Modeling Using Emotional Speech for More Robust Speaker Identification. J. Commun. Technol. Electron. 64, 1256–1265 (2019). https://doi.org/10.1134/S1064226919110184

