Abstract—Automatic identity recognition in a fast, reliable, and non-intrusive way is one of the most challenging topics in today's digital world. A possible approach to identity recognition is identification by voice. The characteristics of speech relevant for automatic speaker recognition can be affected by external factors such as noise and channel distortions, but also by speaker-specific conditions such as emotional or health states. This paper addresses the improvement of a speaker recognition system through different model training strategies, with the goal of obtaining the best performance from only a limited amount of neutral and emotional speech data. The models adopted are a Gaussian Mixture Model (GMM) and i-vectors, whose inputs are Mel Frequency Cepstral Coefficients (MFCCs), and the experiments were conducted on the Russian Language Affective speech database (RUSLANA). The results show that the appropriate use of emotional speech in speaker model training improves the robustness of a speaker recognition system, both when tested on neutral and on emotional speech.
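The GMM branch of the approach described above can be illustrated with a minimal sketch: one Gaussian mixture is trained per enrolled speaker on that speaker's per-frame feature vectors, and a test utterance is attributed to the speaker whose model yields the highest average log-likelihood. This is a generic GMM speaker-identification sketch, not the paper's exact configuration; the synthetic 13-dimensional features stand in for real MFCCs, and the component count and speaker names are illustrative assumptions.

```python
# Hedged sketch of GMM-based speaker identification on MFCC-like
# features. Real systems extract MFCCs from audio; here we use
# synthetic, well-separated feature clouds purely for illustration.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Simulated per-frame feature vectors (frames x 13 coefficients)
# for two hypothetical enrolled speakers.
train = {
    "spk_a": rng.normal(loc=0.0, scale=1.0, size=(500, 13)),
    "spk_b": rng.normal(loc=3.0, scale=1.0, size=(500, 13)),
}

# One GMM per speaker; 4 diagonal-covariance components is an
# illustrative choice, not the paper's setting.
models = {
    spk: GaussianMixture(n_components=4, covariance_type="diag",
                         random_state=0).fit(feats)
    for spk, feats in train.items()
}

def identify(frames):
    """Score an utterance (frames x dims) against every speaker
    model; return the speaker with the highest average
    log-likelihood per frame."""
    return max(models, key=lambda spk: models[spk].score(frames))

# An utterance drawn from speaker B's feature distribution.
test_utt = rng.normal(loc=3.0, scale=1.0, size=(200, 13))
print(identify(test_utt))
```

Training strategies such as those studied in the paper would vary which data (neutral, emotional, or both) populates each speaker's training set before fitting the per-speaker models.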
ACKNOWLEDGMENTS
The authors would like to thank V. A. Petrushin for access to the RUSLANA database and V. Dellwo for access to the TVOID database.
Milošević, M., Nedeljković, Ž., Glavitsch, U. et al. Speaker Modeling Using Emotional Speech for More Robust Speaker Identification. J. Commun. Technol. Electron. 64, 1256–1265 (2019). https://doi.org/10.1134/S1064226919110184